Linguistic Pre-Processing of Social Media Texts

December 03, 2022

Several linguistic pre-processing tools already exist, such as tokenizers, part-of-speech taggers, parsers, and named entity recognizers; the focus here is on their adaptation to social media data. The following steps are carried out during text pre-processing:

Tokenization is the process of breaking down a text into sentences and words. Punctuation is typically deleted and the words are lowercased.

Stemming is the process of reducing words to their root form. In this rudimentary procedure, letters are removed from the end of each word until the stem is reached. Because it relies on simple suffix rules, it copes poorly with the numerous exceptions in languages like English.

Lemmatization looks beyond word reduction and uses a language's whole vocabulary to apply a morphological analysis to words. Inflected forms are mapped to their dictionary form (the lemma): for instance, a third-person verb form such as "runs" and a past-tense form such as "ran" are both changed to the present-tense base form "run".
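
The three steps above can be sketched in a few lines of Python. This is only a toy illustration: the function names, the suffix list, and the small lemma table are made up for this example and are far simpler than what real stemmers and lemmatizers use.

```python
import re

def tokenize(text):
    """Split text into sentences, then into lowercase words without punctuation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

# Hypothetical, deliberately tiny rule set (real stemmers use many more rules).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a full vocabulary.
LEMMA_TABLE = {"was": "be", "were": "be", "ran": "run", "mice": "mouse"}

def naive_lemmatize(word):
    """Use the lookup table first and fall back to suffix stripping."""
    return LEMMA_TABLE.get(word, naive_stem(word))

tokens = tokenize("The cats were running. Mice ran away!")
print(tokens)
print([naive_stem(w) for w in tokens[0]])
print([naive_lemmatize(w) for w in tokens[1]])
```

Note how the suffix rules turn "running" into "runn" and leave "ran" untouched, while the lookup table maps "ran" to "run". That is the kind of exception that motivates lemmatization.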

Normalization is a process that converts a list of words to a more uniform sequence. Once the words are in a standard format, other operations can work with the data without having to deal with inconsistencies that might jeopardize the process. For instance, making all terms lowercase makes searching easier.
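
As a rough sketch of what normalization might look like for social media text, the function below lowercases the input and applies a few regular-expression rules. Apart from lowercasing, the rules are assumptions chosen for illustration (the `<url>` and `<user>` placeholders and the handling of elongated words are not prescribed by any standard).

```python
import re

def normalize(text):
    """Convert a raw post into a more uniform form (illustrative rules only)."""
    text = text.lower()                             # "HAPPY" -> "happy"
    text = re.sub(r"https?://\S+", "<url>", text)   # hypothetical URL placeholder
    text = re.sub(r"@\w+", "<user>", text)          # hypothetical mention placeholder
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)      # runs of 3+ identical chars -> 2
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(normalize("Sooooo   HAPPY!!! thx @Anna https://example.com/x"))
```

After this step, a search for "happy" matches regardless of how the original post was capitalized.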

Word embedding, a form of vectorization, is a technique used in natural language processing (NLP) to convert words or phrases from a lexicon into corresponding vectors of real numbers, which are then used for word prediction, word similarity, and semantics. Vectorization, more generally, is the process of turning words into numbers.
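
To make the idea of word similarity concrete, here is a small sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for this example; real embeddings are learned from large corpora and typically have hundreds of dimensions.

```python
import math

# Made-up vectors for illustration only; not taken from any trained model.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(TOY_VECTORS["king"], TOY_VECTORS["queen"]))
print(cosine_similarity(TOY_VECTORS["king"], TOY_VECTORS["apple"]))
```

With these toy values, "king" scores much closer to "queen" than to "apple", which is the behaviour one hopes for from a trained embedding.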

Text representation models in NLP:

1. Bag of words

The bag-of-words approach has had considerable success in problems like language modelling and document classification because it is straightforward to understand and put into practice. It is a simplifying representation used in natural language processing (NLP) and information retrieval (IR). The model ignores syntax and even word order but keeps multiplicity, representing a text as the bag (multiset) of its words. It is frequently employed in document classification, where the frequency of occurrence of each word is used as a feature for training a classifier.
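
A minimal sketch of the bag-of-words idea, assuming simple whitespace tokenization (the function name `bag_of_words` and the example sentences are made up for illustration):

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a multiset: word -> number of occurrences
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Because only counts are kept, "the dog sat" and "sat the dog" produce the same vector. Each vector can then be fed to a classifier as a feature vector.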

2. TF-IDF

Term frequency (TF) and inverse document frequency (IDF) are calculated for every word in the corpus, and multiplying TF by IDF gives the entries of the document vectors. TF-IDF is calculated as follows:

Term Frequency (TF) = (No. of times the word occurs in the sentence) / (No. of words in the sentence)

Inverse Document Frequency (IDF) = log[ (No. of sentences) / (No. of sentences containing the word) ]
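
The two formulas can be turned into a short sketch like the one below. It assumes whitespace tokenization, non-empty sentences, and the natural logarithm (the base of the log is not specified above); the function name `tf_idf` is made up for this example, and library implementations often use smoothed variants that give different numbers.

```python
import math

def tf_idf(sentences):
    """Return a sorted vocabulary and one TF-IDF vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n = len(tokenized)
    # IDF = log(no. of sentences / no. of sentences containing the word)
    idf = {
        w: math.log(n / sum(1 for tokens in tokenized if w in tokens))
        for w in vocabulary
    }
    # TF = occurrences of the word in the sentence / no. of words in the sentence
    vectors = [
        [tokens.count(w) / len(tokens) * idf[w] for w in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, vectors

vocabulary, vectors = tf_idf(["the cat sat", "the dog barked"])
print(vocabulary)
for vector in vectors:
    print([round(x, 3) for x in vector])
```

A word such as "the" that appears in every sentence gets IDF = log(1) = 0, so its TF-IDF weight is zero, while words that are specific to one sentence receive a positive weight.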
