
text feature extraction using python's scikit-learn

9/1/2019

Text Feature Extraction is the process of transforming text into numerical feature vectors that Machine Learning algorithms can use. This blog focuses on two ways of transforming text into numerical values with Python's Scikit-Learn library. A corpus of text documents cannot be fed directly into Machine Learning algorithms, because these algorithms expect numerical feature vectors or matrices as input. There are many ways of converting a corpus to a feature vector, such as Bag of Words and TF-IDF weighting. Before digging deeper, let's first define some important terminology in the area of Text Feature Extraction.

IMPORTANT TERMINOLOGY:

Corpus:- This is a collection of all documents or texts, usually stored as a list of strings.

Document:- This represents each element in the corpus and it is a piece of text of any length.

Token:- A document or text usually first needs to be broken down into small chunks referred to as tokens or words. This process is called tokenization, and on its own it can be a rather comprehensive topic, as there are multiple tokenization strategies that take into account delimiters, n-grams and regular expressions.

Count/Frequency:- The number of times a token occurs in a text document represents that token's frequency within the document.

Feature:- Each individual token identified across the entire corpus is treated as a feature with a unique ID in the final feature vector. Keep in mind that the order of the features in a vector carries no meaning: it does not reflect the order in which the tokens appear in the text.

Feature Vector:- A numerical representation of a corpus, arranged as a matrix where each row represents a document and each column represents a token found across the entire corpus.

Vectorization:- This is the process of converting a corpus or collection of text documents into a numerical feature vector, where each column represents a token found in the entire corpus and each row represents a document.
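
To make the terminology above concrete, here is a minimal plain-Python sketch of Bag of Words vectorization. It is an illustration only, not Scikit-Learn's implementation: the two-sentence corpus, the `tokenize` and `vectorize` helpers and the simple regular-expression tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter

# A tiny made-up corpus: a list of documents (strings).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

def tokenize(document):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

# Each distinct token in the corpus becomes a feature with its own column.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def vectorize(document):
    """Return the token counts of one document as one row of the matrix."""
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]

# One row per document, one column per token in the vocabulary.
feature_matrix = [vectorize(doc) for doc in corpus]

print(vocabulary)
for row in feature_matrix:
    print(row)
```

Scikit-Learn's CountVectorizer builds a comparable count matrix, although it applies its own default tokenization rules, so its vocabulary may differ from the one produced here.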

1 Comment

Handling Missing data

8/30/2019

1 Comment

Know Your Missing Data:

Sometimes datasets contain missing data. Imagine a dataset with a column "age", where a lot of the ages are missing. This could pose a problem because many statistical methods and Machine Learning algorithms cannot handle missing values directly. Missing data can appear in many forms, including blanks and NaN. It is important to correctly identify the nature of missing data in your datasets. Values such as zeroes and "unknown" could easily be mistaken for missing data when, for the given use case or problem statement, they are not. Always know your missing data.
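
The distinction above can be sketched in a few lines of plain Python. The `ages` list and the `is_missing` helper are hypothetical, and the rule they encode (None, NaN and blanks are missing, while 0 and "unknown" are real values) is only one possible choice; the right rule depends on your use case.

```python
import math

# Hypothetical "age" column containing several kinds of suspicious values.
ages = [34, None, 0, float("nan"), "", 27, "unknown"]

def is_missing(value):
    """Treat None, NaN and blank strings as missing; keep 0 and "unknown"."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

# Positions in the column that count as missing under this rule.
missing_positions = [i for i, value in enumerate(ages) if is_missing(value)]
print(missing_positions)
```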


Pre-processing text for nlp

8/24/2019

1 Comment

In the field of Natural Language Processing (NLP), preprocessing of text needs to occur before running any Machine Learning algorithms. This blog will provide some common examples of the types of preprocessing that need to be applied to text before any analysis can take place.

What is Text Preprocessing?

Text preprocessing involves the cleaning, normalization and standardization of text before the application of NLP techniques. Keep in mind that although there are several types of text preprocessing techniques, one must use the ones suitable for their use case in an order that makes sense.
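
As a rough illustration of cleaning and normalization, the sketch below lowercases a sentence, strips punctuation, collapses whitespace and drops stop words, in that order. The `preprocess` helper and the tiny stop-word list are assumptions made for this example; a real project would pick the steps and their order to suit its use case.

```python
import string

# A deliberately tiny stop-word list, chosen for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

def preprocess(text):
    """Lowercase, strip punctuation, collapse whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() also collapses repeated whitespace
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = preprocess("The  quick, brown fox is  the BEST of foxes!")
print(cleaned)
```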


Creating Word clouds in python

8/24/2019

0 Comments

Given a particular text or string, it is possible to create an image containing the words of that text, where the size of each word is dependent on its frequency within the text. Such an image is referred to as a Word Cloud. Using Python's wordcloud and matplotlib libraries, a Data Scientist can easily generate a word cloud given an input string or text.
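
The idea that word size depends on frequency can be sketched without any plotting library. The example below counts the words in the opening lines of the rhyme (lowercased, punctuation removed) and maps each count to a font size. The `font_size` helper and the 10 to 40 point range are hypothetical; the wordcloud library applies its own stop-word filtering, scaling and layout.

```python
from collections import Counter

text = "hey diddle diddle the cat and the fiddle the cow jumped over the moon"

# How often each word occurs in the text.
frequencies = Counter(text.split())
max_count = max(frequencies.values())

# Hypothetical font-size range in points, picked only for this illustration.
MIN_SIZE, MAX_SIZE = 10, 40

def font_size(word):
    """Scale a word's font size linearly with its relative frequency."""
    return MIN_SIZE + (MAX_SIZE - MIN_SIZE) * frequencies[word] / max_count

for word, count in frequencies.most_common(3):
    print(word, count, font_size(word))
```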

The following is a Word Cloud generated for the Mother Goose nursery rhyme "Hey, diddle, diddle".


Pandas VS pyspark cheat sheet

8/21/2019

2 Comments

Data Scientists sometimes alternate between using Pyspark and Pandas dataframes depending on the use case and the size of data being analysed. It can sometimes get confusing and hard to remember the syntax for processing each type of dataframe. The following cheat sheet provides a side by side comparison of Pandas and Pyspark syntax needed to accomplish some common programming tasks.


IBM NATURAL LANGUAGE UNDERSTANDING

8/17/2019

0 Comments

ABOUT IBM NLU
IBM Natural Language Understanding (NLU) can be used to analyze the semantic features of text, such as the content of web pages, raw HTML and text documents. NLU can also analyze target phrases in the context of the surrounding text for focused sentiment and emotion results.

The semantic features that can be extracted from URLs, raw HTML and text using NLU include:

Categories
Concepts
Emotion
Entities
Keywords
Metadata
Relations
Semantic Roles
Sentiment


GRAPH THEORY AND NETWORK ANALYSIS

8/17/2019

0 Comments

WHAT IS GRAPH THEORY AND NETWORK ANALYSIS
Graph Theory is the mathematical study of the properties and applications of graphs. Graphs are mathematical structures used to model pairwise relations between objects. Graphs are also referred to as networks and contain a set of vertices/nodes/points connected by edges/links/lines.

Graph Theory can be applied to Network Analysis, Link Analysis and Social Network Analysis. These types of analysis borrow notations from Graph Theory and are focused on investigating social structures represented as networks, by applying a variety of mathematical, computational and statistical techniques.
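
The vertex and edge terminology can be made concrete with a small adjacency-list sketch. The four-person friendship network below is made up for illustration; each pair is treated as an undirected edge.

```python
from collections import defaultdict

# Hypothetical friendship network: each pair is an undirected edge.
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan")]

# Adjacency list: each vertex maps to the set of its neighbours.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Degree of a vertex = number of edges touching it.
degrees = {vertex: len(neighbours) for vertex, neighbours in adjacency.items()}
print(degrees)
```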

HISTORY OF GRAPH THEORY
Graph Theory was first introduced and studied in 1736 by Leonhard Euler, who was interested in solving the Konigsberg Bridge Problem. Konigsberg was a city in Prussia (present-day Kaliningrad, Russia) with the river Pregel flowing through it, creating two islands. The city and islands were connected by seven bridges. The goal of the Konigsberg Bridge Problem was to devise a walk through the city that would cross each of the seven bridges once and only once, with no doubling back, and end where it started.
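
The problem can be modelled as a graph with four land masses as vertices and seven bridges as edges. The sketch below counts how many bridges touch each land mass and applies Euler's criterion: in a connected graph, a closed walk that uses every edge exactly once exists only if no vertex has an odd degree. The land-mass labels are illustrative names chosen for this example.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# "island" is the central island, "east" is the second island.
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"),
    ("island", "east"),
]

# Count the bridges touching each land mass (its degree).
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

odd_vertices = [vertex for vertex, d in degree.items() if d % 2 == 1]

# A closed walk over every bridge exactly once needs zero odd-degree vertices.
has_closed_walk = len(odd_vertices) == 0
print(dict(degree), has_closed_walk)
```

Every land mass is touched by an odd number of bridges, so the walk the problem asks for does not exist.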


Fuzzy Matching using Fuzzywuzzy

1/26/2019

1 Comment

Have you ever wanted to determine just how similar two strings are? Comparing them with the equality operator (==) only tells you whether the two strings are identical or not. It gives no measure of how similar or dissimilar they are.

Python's Fuzzywuzzy library provides a module called fuzz, whose methods compare two strings and return a value from 0 to 100 as a measure of their similarity.
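
To show the difference between an equality check and a 0 to 100 similarity score, here is a minimal sketch that uses the standard library's difflib as a stand-in. It is not Fuzzywuzzy's own code, and the `similarity` helper is a hypothetical name; with Fuzzywuzzy installed, a method such as fuzz.ratio plays the same role.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score for two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Equality only says "same" or "not the same".
print("apple" == "appel")

# A similarity score says how close the two strings are.
print(similarity("apple", "appel"))
print(similarity("apple", "apple"))
```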

