
Text Feature Extraction Using Python's Scikit-Learn

9/1/2019

Text Feature Extraction is the process of transforming text into numerical feature vectors that can be used by Machine Learning algorithms. Text documents, or a corpus, cannot be fed directly into Machine Learning algorithms because these algorithms expect numerical feature vectors or matrices as input. There are many ways of converting a corpus to a feature vector, such as Bag of Words and TF-IDF Weighting; this blog will focus on these two approaches using Python's Scikit-Learn library, with short code sketches after the definitions that follow. Before digging deeper, let's first define some important terminology in the area of Text Feature Extraction.
 
IMPORTANT TERMINOLOGY:

Corpus:- A collection of all documents or text, usually stored in Python as a list of strings.
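For instance, a toy corpus (the documents here are made up for illustration) is just a list in which each string is one document:

    # A corpus in Python: one string per document.
    corpus = [
        "The cat sat on the mat.",
        "The dog chased the cat.",
        "Dogs and cats make great pets.",
    ]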

Document:- Each element in the corpus; a single piece of text of any length.

Token:- A document or text usually first needs to be broken down into small chunks referred to as tokens or words. This process is called tokenization, and on its own it can be a rather comprehensive topic, as there are multiple strategies for tokenization that take into account delimiters, n-grams and regular expressions (see the sketch below).
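As a rough illustration, Scikit-Learn exposes its tokenization step through the build_analyzer() method of its vectorizers; the sketch below assumes the default settings, which lowercase the text and keep only word tokens of two or more characters:

    from sklearn.feature_extraction.text import CountVectorizer

    # Default analyzer: lowercases and keeps word tokens
    # of two or more characters.
    analyzer = CountVectorizer().build_analyzer()
    print(analyzer("The cat sat on the mat."))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']

    # The same text tokenized into unigrams and bigrams (n-grams).
    bigram_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
    print(bigram_analyzer("The cat sat"))
    # ['the', 'cat', 'sat', 'the cat', 'cat sat']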

Count/Frequency:- The number of times a token occurs in a text document represents that token's frequency within the document. 

Feature:- Each individual token identified across the entire corpus is treated as a feature with a unique ID in the final feature vector. Keep in mind that there is no meaningful ordering to the features in a vector; the word order of the original documents is not preserved.

Feature Vector:- A vector representation of a corpus where each row represents a document and the columns represent the tokens found across the entire corpus.

Vectorization:- The process of converting a corpus, or collection of text documents, into the numerical feature vector described above; a sketch of the full pipeline follows.
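Here is a minimal sketch of the two approaches mentioned at the start: Bag of Words with CountVectorizer and TF-IDF Weighting with TfidfVectorizer. The toy corpus is made up for illustration, and get_feature_names_out() requires Scikit-Learn 1.0 or later (older releases use get_feature_names() instead):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "The cat sat on the mat.",
        "The dog chased the cat.",
    ]

    # Bag of Words: each row is a document, each column a token's count.
    count_vec = CountVectorizer()
    bow = count_vec.fit_transform(corpus)       # sparse document-term matrix
    print(count_vec.get_feature_names_out())    # tokens found across the corpus
    print(bow.toarray())                        # per-document token counts

    # TF-IDF Weighting: counts are rescaled so tokens that appear in every
    # document contribute less than tokens distinctive to one document.
    tfidf_vec = TfidfVectorizer()
    weights = tfidf_vec.fit_transform(corpus)
    print(weights.toarray().round(2))

The column positions in these matrices are the unique feature IDs mentioned above; CountVectorizer stores the token-to-ID mapping in its vocabulary_ attribute.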


