VANAUDEL ANALYTIX

Text Feature Extraction Using Python's Scikit-Learn

9/1/2019


 
Text Feature Extraction is the process of transforming text into numerical feature vectors that can be used by Machine Learning algorithms. This blog will focus on two different ways in which text can be transformed into numerical values using Python's Scikit-Learn library. Text documents, or a corpus, cannot be fed directly into Machine Learning algorithms because these algorithms expect input in the form of numerical feature vectors or matrices. There are many ways of converting a corpus to a feature vector, such as Bag of Words and TF-IDF weighting. Before digging deeper, let's first define some important terminology in the area of Text Feature Extraction.
 
IMPORTANT TERMINOLOGY:

Corpus:- This is a collection of all documents or text, usually stored as a list of strings.

Document:- This represents each element in the corpus and it is a piece of text of any length.

Token:- A document or text usually first needs to be broken down into small chunks referred to as tokens or words. This process is called tokenization and on its own can be a rather comprehensive topic, as there are multiple strategies for tokenization that take into account delimiters, n-grams and regular expressions.

Count/Frequency:- The number of times a token occurs in a text document represents that token's frequency within the document. 

Feature:- Each individual token identified across the entire corpus is treated as a feature with a unique ID in the final feature vector. Keep in mind that there is no particular ordering to the features in a vector.

Feature Vector:- A vector representation of a corpus where each row represents a document and the columns represent the tokens found across the entire corpus.

Vectorization:- This is the process of converting a corpus or collection of text documents into a numerical feature vector where the columns represent the tokens found in the entire corpus and each row represents a document.
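As a quick illustration of vectorization, here is a minimal sketch (with a tiny made-up corpus) using Scikit-Learn's CountVectorizer for Bag of Words counts and TfidfVectorizer for TF-IDF weighting:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # A tiny illustrative corpus: each string is one document
    corpus = ["the cat sat on the mat",
              "the dog sat on the log"]

    # Bag of Words: one column per token, one row per document, cells hold raw counts
    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(corpus)
    print(count_vec.get_feature_names_out())  # the features (get_feature_names on older versions)
    print(counts.toarray())

    # TF-IDF: counts re-weighted to downplay tokens that appear in every document
    tfidf_vec = TfidfVectorizer()
    weights = tfidf_vec.fit_transform(corpus)
    print(weights.toarray().round(2))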



Handling Missing Data

8/30/2019


 
Know Your Missing Data:

Sometimes datasets contain missing data. Imagine a dataset with a column "age" where many of the ages are missing. This poses a problem because most statistical analyses cannot be performed reliably when values are missing. Missing data can appear in many forms, including blanks and NaN. It is important to correctly identify the nature of missing data in your datasets: values such as zero and "unknown" can easily be mistaken for missing data when, for the given use case or problem statement, they are not. Always know your missing data.
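As a minimal sketch of inspecting missing values with Pandas (the column names and values below are made up for illustration):

    import pandas as pd
    import numpy as np

    # Hypothetical data: one age is NaN, another is zero
    df = pd.DataFrame({"name": ["Ann", "Ben", "Cara"],
                       "age": [34, np.nan, 0]})

    print(df["age"].isna().sum())   # 1 -- only NaN is counted as missing
    print((df["age"] == 0).sum())   # 1 -- whether 0 means missing depends on the use case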


Pre-processing Text for NLP

8/24/2019


 
In the field of Natural Language Processing (NLP), preprocessing of text needs to occur before running any Machine Learning algorithms. This blog will provide some common examples of the types of preprocessing that need to be applied to text before any analysis can take place.
What is Text Preprocessing?

Text preprocessing involves the cleaning, normalization and standardization of text before the application of NLP techniques. Keep in mind that although there are several types of text preprocessing techniques, one must use the ones suitable for their use case in an order that makes sense. 
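As a minimal sketch (the steps and the tiny stop-word list below are illustrative, not a definitive pipeline), common steps such as lower-casing, removing digits and punctuation, tokenizing and filtering stop words can be chained together like this:

    import re
    import string

    def preprocess(text):
        text = text.lower()                       # normalize case
        text = re.sub(r"\d+", " ", text)          # drop digits
        text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
        tokens = text.split()                     # simple whitespace tokenization
        stopwords = {"the", "a", "an", "and", "is", "are", "in", "of"}  # tiny illustrative list
        return [t for t in tokens if t not in stopwords]

    print(preprocess("The 2 cats ARE sitting in the garden!"))
    # ['cats', 'sitting', 'garden']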


Creating Word Clouds in Python

8/24/2019


 
Given a particular text or string, it is possible to create an image containing the words of that text, where the size of each word is dependent on its frequency within the text. Such an image is referred to as a Word Cloud. Using Python's wordcloud and matplotlib libraries, a Data Scientist can easily generate a word cloud given an input string or text.
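For example, a minimal sketch (the input text here is the opening of a nursery rhyme, truncated for brevity):

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    text = """Hey, diddle, diddle, the cat and the fiddle,
    The cow jumped over the moon..."""  # truncated input text

    # Word size in the image reflects word frequency in the text
    wc = WordCloud(width=600, height=400, background_color="white").generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()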

The following is a Word Cloud generated for the Mother Goose poem "Hey, diddle, diddle".
[Word cloud image]


Pandas vs PySpark Cheat Sheet

8/21/2019


 

Data Scientists sometimes alternate between using PySpark and Pandas DataFrames, depending on the use case and the size of the data being analysed. It can get confusing and hard to remember the syntax for processing each type of DataFrame. The following cheat sheet provides a side-by-side comparison of the Pandas and PySpark syntax needed to accomplish some common programming tasks.
[Pandas vs PySpark cheat sheet image]
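As a flavor of that side-by-side comparison, here is a minimal sketch of the same read-and-filter task in both libraries (the file name and column are hypothetical):

    # Pandas
    import pandas as pd
    pdf = pd.read_csv("data.csv")            # hypothetical input file
    print(pdf[pdf["age"] > 30].head(5))      # filter rows, show the first 5

    # PySpark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
    sdf.filter(sdf["age"] > 30).show(5)      # the same filter in Spark syntax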


IBM NATURAL LANGUAGE UNDERSTANDING

8/17/2019


 
ABOUT IBM NLU
IBM Natural Language Understanding (NLU) can be used to analyze the semantic features of text, such as the content of web pages, raw HTML and text documents. NLU can also analyze target phrases in the context of the surrounding text for focused sentiment and emotion results.

The semantic features that can be extracted from URLs, raw HTML and text using NLU include:
  • Categories
  • Concepts
  • Emotion
  • Entities
  • Keywords
  • Metadata
  • Relations
  • Semantic Roles
  • Sentiment
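A minimal sketch of calling NLU from Python with the ibm-watson SDK (the credentials, service URL and version date are placeholders, and exact parameters may differ across SDK versions):

    from ibm_watson import NaturalLanguageUnderstandingV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
    from ibm_watson.natural_language_understanding_v1 import (
        Features, KeywordsOptions, SentimentOptions)

    authenticator = IAMAuthenticator("YOUR_API_KEY")    # placeholder credential
    nlu = NaturalLanguageUnderstandingV1(version="2019-07-12",
                                         authenticator=authenticator)
    nlu.set_service_url("YOUR_SERVICE_URL")             # placeholder URL

    # Extract keywords and overall sentiment from a web page
    response = nlu.analyze(
        url="https://www.example.com",
        features=Features(keywords=KeywordsOptions(limit=5),
                          sentiment=SentimentOptions()),
    ).get_result()
    print(response)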


GRAPH THEORY AND NETWORK ANALYSIS

8/17/2019


 
WHAT IS GRAPH THEORY AND NETWORK ANALYSIS
Graph Theory is the mathematical study of the properties and applications of graphs. Graphs are mathematical structures used to model pairwise relations between objects. Graphs are also referred to as networks and contain a set of vertices/nodes/points connected by edges/links/lines.
Graph Theory can be applied to Network Analysis, Link Analysis and Social Network Analysis. These types of analysis borrow notation from Graph Theory and focus on investigating social structures represented as networks by applying a variety of mathematical, computational and statistical techniques.
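A minimal sketch with the networkx library, using a small hypothetical network to compute a few common network-analysis measures:

    import networkx as nx

    # A small undirected graph: four nodes (vertices) joined by edges
    G = nx.Graph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])

    print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 4 edges
    print(nx.degree_centrality(G))                   # node importance by connections
    print(nx.shortest_path(G, "A", "D"))             # ['A', 'C', 'D']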

HISTORY OF GRAPH THEORY

Graph Theory was first introduced and studied in 1736 by Leonhard Euler, who was interested in solving the Königsberg Bridge Problem. Königsberg was a city in Prussia (present-day Kaliningrad, Russia) with the river Pregel flowing through it, creating two islands. The city and islands were connected by seven bridges. The goal of the Königsberg Bridge Problem was to devise a walk through the city that would cross each of the seven bridges once and only once, with no doubling back. Euler proved that no such walk exists.
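A minimal sketch modeling the bridges as a networkx multigraph and checking for such a walk (an Eulerian path); the land-mass labels A-D are a labeling convention, not from the original problem:

    import networkx as nx

    # Four land masses (A-D) joined by the seven bridges of Königsberg
    bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]
    G = nx.MultiGraph(bridges)

    # An Eulerian path exists only if 0 or 2 nodes have odd degree;
    # here all four land masses have odd degree, so no such walk exists
    print(dict(G.degree()))          # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
    print(nx.has_eulerian_path(G))   # False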


Fuzzy Matching Using Fuzzywuzzy

1/26/2019


 
Have you ever wanted to determine just how similar two strings are? The equality operator (==) only tells you whether two strings are identical or not. It gives no information or measure regarding how similar or dissimilar two strings are.
Python’s Fuzzywuzzy library contains a module called fuzz, which provides several methods that compare two strings and return a value from 0 to 100 as a measure of their similarity.
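A minimal sketch of a few fuzz methods (the example strings are made up):

    from fuzzywuzzy import fuzz

    print(fuzz.ratio("apple inc", "apple incorporated"))          # overall similarity
    print(fuzz.partial_ratio("apple inc", "apple incorporated"))  # best-matching substring
    print(fuzz.token_sort_ratio("inc apple", "apple inc"))        # word order ignored: 100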


    Author

    My name is Vanessa Afolabi, also known as @TheSASMom. I am a Data Scientist fluent in SAS, R, Python and SQL, with a passion for Machine Learning and Research.
