Corpus:- This is the collection of all documents or text in a dataset, usually stored as a list of strings.
Document:- This represents each element in the corpus and it is a piece of text of any length.
Token:- A document or text usually first needs to be broken down into small chunks referred to as tokens or words. This process is called tokenization and on its own can be a rather comprehensive topic, as there are multiple strategies for tokenization that take into account delimiters, n-grams and regular expressions.
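As a minimal sketch of one such strategy, the snippet below tokenizes a document by lowercasing it and matching runs of alphanumeric characters with a regular expression (the sample sentence is just an illustration):

```python
import re

def tokenize(text):
    # Lowercase, then treat every run of non-alphanumeric characters
    # as a delimiter by matching only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("The quick brown fox!"))  # ['the', 'quick', 'brown', 'fox']
```

Real-world tokenizers can be considerably more involved, e.g. handling punctuation-sensitive cases like contractions or producing n-grams instead of single words.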
Count/Frequency:- The number of times a token occurs in a text document represents that token's frequency within the document.
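Once a document has been tokenized, counting token frequencies is straightforward; a sketch using Python's standard library (the token list is a made-up example):

```python
from collections import Counter

# Tokens from a hypothetical document, after tokenization.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

freq = Counter(tokens)
print(freq["the"])  # 2 - "the" occurs twice in this document
```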
Feature:- Each unique token identified across the entire corpus is treated as a feature with a unique ID in the final feature vector. Keep in mind that the ordering of features in the vector carries no semantic meaning.
Feature Vector:- A vector representation of a corpus where each row represents a document and the columns represent the tokens found across the entire corpus.
Vectorization:- This is the process of converting a corpus or collection of text documents into a numerical feature vector where the columns represent the tokens found in the entire corpus and each row represents a document.
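Putting the terms above together, here is a minimal count-based vectorizer written from scratch (the two-document corpus is a made-up example; libraries such as scikit-learn provide a production-ready equivalent in `CountVectorizer`):

```python
import re
from collections import Counter

def tokenize(text):
    # Simple regex tokenization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(corpus):
    # Build the vocabulary: every unique token across the corpus
    # becomes a feature with its own column index.
    vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
    # One row per document; each column holds that token's count
    # (its frequency) within the document.
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows

corpus = ["the cat sat", "the cat sat on the mat"]  # hypothetical corpus
vocab, matrix = vectorize(corpus)
print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each row of `matrix` is the feature vector for one document, and each column corresponds to one token from the shared vocabulary; note the count of 2 for "the" in the second document.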