Text preprocessing involves the cleaning, normalization and standardization of text before the application of NLP techniques. Keep in mind that although there are several types of text preprocessing techniques, you should apply only the ones suited to your use case, in an order that makes sense.
Tokenization:
Sometimes text needs to be segmented, or broken down into smaller chunks, before any in-depth analysis can occur. This process is called tokenization. Python's NLTK library contains several tokenizers, including Word Tokenizers and Regular Expression Tokenizers. Word Tokenizers break a string up into words, whereas Regular Expression Tokenizers split a string into substrings using regular expressions. Regular Expression Tokenizers also serve a dual purpose: by extracting only the patterns you specify, they can simultaneously discard non-alphanumeric characters.
Lower Casing:
Lower casing is probably one of the simplest forms of text preprocessing. It can be applied to text columns such as people's names, business names and addresses. By lower casing text, we remove distractions and inconsistencies in case that could affect Machine Learning algorithms, and ensure that different-cased versions of the same word are not fed to NLP algorithms as distinct tokens. Keep in mind that there are also situations where certain words should not be lower cased; knowing your use case and domain will influence whether or not all words are lower cased.
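The effect on a name column can be sketched in a couple of lines (the names are made up for illustration):

```python
# Three case variants of the same made-up name collapse to a single form
names = ["John SMITH", "john smith", "John Smith"]
normalized = [name.lower() for name in names]

print(normalized)  # ['john smith', 'john smith', 'john smith']
# Three distinct raw strings become one normalized value
print(len(set(names)), "->", len(set(normalized)))  # 3 -> 1
```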
Stop Words Removal:
Stop words are commonly used words in a particular language. These words are usually removed or filtered out before Natural Language Processing tasks such as Topic Modelling and Text Classification can occur. Some examples of English stop words are "the", "is", "my" and "on". These words are removed because they provide little value or useful information during NLP. By removing stop words, analysis can focus on the more important words related to the subject matter at hand. It is important to ensure that stop word lists only contain uninformative words that are not required for predictions. Some stop word lists are predefined and provided by Python libraries such as NLTK and Scikit-Learn. However, there is also the option to create a customized stop word list for your given NLP task.
Normalizing Text - Stemming:
Stemming is a process whereby text is stripped of any affixes: inflected words are reduced to their corresponding stem, base or root form. Stemming works by chopping off the ends of words to convert them to their stem or root, and in many cases the result is not itself a word. Python's NLTK library contains several stemming algorithms, including the Porter Stemmer, Snowball Stemmer and Lancaster Stemmer.
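The Porter Stemmer, for instance, can be used as follows; note that some of the resulting stems are not dictionary words, which is exactly the behaviour described above (the example words are arbitrary):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

words = ["running", "studies", "easily"]
# Suffixes are chopped off; 'studi' and 'easili' are not real words
print([porter.stem(w) for w in words])
# ['run', 'studi', 'easili']
```

The Snowball and Lancaster stemmers are used the same way via `SnowballStemmer("english")` and `LancasterStemmer()`, and can produce different (often more aggressive) stems.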
Normalizing Text - Lemmatization:
This process also strips words of any affixes, except that the resulting stem/root returned is always a dictionary word. Lemmatization doesn't just chop off the endings of inflected words to get to the root form. Instead, each word is assigned a Part-of-Speech tag before lemmatization occurs and is looked up in the WordNet dictionary, ultimately resulting in a root that is a real dictionary word.
Parts of Speech Tagging (POS Tagging):
Part-of-Speech tagging, also referred to as POS tagging or simply tagging, is the process of classifying words into their parts of speech. POS tagging classifies words as one of nine parts of speech, such as conjunctions, adverbs, nouns and verbs, although there are several more categories and sub-categories. In most cases, POS tagging follows tokenization in a typical NLP pipeline. One might ask, what role does POS tagging play in NLP? One such role is in building lemmatizers. POS tags can also be used for disambiguation when one word has different meanings, and can help to augment document classification.
Noise Removal:
Sometimes the text in question contains unwanted noise, such as extra spaces, non-alphanumeric characters, numbers and HTML formatting, that needs to be removed. What gets removed, though, is highly dependent on the domain of the use case at hand. Knowing what noise and distractions need to be removed from text is a crucial step in text preprocessing to ensure that inconsistent data is not used in text analysis.
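A minimal noise-removal sketch using Python's built-in `re` module; the noisy string is made up, and which patterns you strip (numbers, for instance) depends entirely on your domain:

```python
import re

# A made-up noisy string; which patterns to strip is domain-dependent
raw = "<p>Order #123 shipped!&nbsp;  Visit   http://example.com</p>"

text = re.sub(r"<[^>]+>", " ", raw)       # HTML tags
text = re.sub(r"&\w+;", " ", text)        # HTML entities such as &nbsp;
text = re.sub(r"http\S+", " ", text)      # URLs
text = re.sub(r"\d+", " ", text)          # numbers
text = re.sub(r"[^A-Za-z\s]", " ", text)  # remaining non-alphanumeric characters
text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(text)  # 'Order shipped Visit'
```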
Expanding Word Contractions:
In the English language, a word contraction is formed when two words are combined and made shorter by placing an apostrophe where letters have been omitted. Some common examples of contractions include words such as I'm, you're, he'll and can't. In Natural Language Processing, these contractions are expanded and returned to the original words from which they were created. The expansion of contractions is crucial for ensuring that only full words are analysed. For example, the word "I'm" is converted to "I am" before NLP occurs.
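One common approach is a lookup table of contractions applied with a regular expression. The mapping below is a small, hand-rolled sketch for illustration only; real projects typically use a much fuller list or a dedicated library:

```python
import re

# Hand-rolled illustrative mapping; a real list would be far longer
CONTRACTIONS = {
    "i'm": "i am",
    "you're": "you are",
    "he'll": "he will",
    "can't": "cannot",
}

def expand_contractions(text):
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    # Case is not preserved, which is usually fine if text is lower cased anyway
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure you're right"))  # 'i am sure you are right'
```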
Stripping, changing cases, tokenization, stop word removal, stemming and lemmatization are just a subset of the types of preprocessing that can be applied to text data during Natural Language Processing. It is up to the Data Scientist to pick which ones are required, and the order in which they should be executed, depending on the problem statement. Once preprocessing is completed, the text is ready for further analysis.