Text Feature Extraction is the process of transforming text into numerical feature vectors that can be used by Machine Learning algorithms. Text documents (a corpus) cannot be fed directly into these algorithms, because they expect numerical feature vectors or matrices as input. There are many ways of converting a corpus to a feature vector, such as Bag of Words and TF-IDF Weighting, and this blog will focus on these two approaches using Python's Scikit-Learn library. Before digging deeper, let's first define some important terminology in the area of Text Feature Extraction.
IMPORTANT TERMINOLOGY:
Corpus:- This is the collection of all documents or text, usually stored as a list of strings with one string per document.
Document:- A single element in the corpus; a piece of text of any length.
Token:- A document usually first needs to be broken down into small chunks referred to as tokens, typically words. This process is called tokenization, and it is a rather comprehensive topic in its own right, as there are multiple tokenization strategies that take into account delimiters, n-grams and regular expressions. (The sketch after this list uses Scikit-Learn's default word tokenization.)
Count/Frequency:- The number of times a token occurs in a text document represents that token's frequency within the document.
Feature:- Each distinct token identified across the entire corpus is treated as a feature with a unique ID in the final feature vector. Keep in mind that the ordering of features in the vector carries no inherent meaning.
Feature Vector:- A vector (or matrix) representation of a corpus where each row represents a document and each column represents a token found across the entire corpus. This structure is often called a document-term matrix.
Vectorization:- This is the process of converting a corpus or collection of text documents into such a numerical feature vector, as shown in the sketch below.
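To tie these terms together, here is a minimal sketch of Bag of Words vectorization using Scikit-Learn's CountVectorizer; the two-document corpus is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus: a list of strings, one per document.
corpus = [
    "the cat sat on the mat",   # document 1
    "the dog chased the cat",   # document 2
]

vectorizer = CountVectorizer()        # defaults: lowercasing + word tokenization
X = vectorizer.fit_transform(corpus)  # learn the vocabulary, build a sparse matrix

# Each distinct token across the corpus becomes a feature
# (scikit-learn 1.0+; older versions use get_feature_names()):
print(vectorizer.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']

# Each row is a document; each column holds a token's count in that document:
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```

The TF-IDF Weighting mentioned above works through the same fit_transform interface via TfidfVectorizer, which fills the matrix with weighted values instead of raw counts.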