[NLP Series][P1] A Journey Through NLP History and the Power of TF-IDF

Long time no see! I’ve been swamped with my master’s course in CS, but now I’m back to share some exciting NLP knowledge with you all. In this series, we’ll take a journey through the history of NLP, starting with TF-IDF and working our way through vector embeddings, CNNs, RNNs, LSTMs, and BERT. Buckle up, because we’re diving deep into the tech that powers modern language processing!

Introduction to NLP (Natural Language Processing)

Before diving into the details, let’s step back and understand why NLP (Natural Language Processing) is such an important field. At its core, NLP teaches machines to understand and process human language. Imagine talking to your device and it truly "gets" what you’re saying – that’s the power of NLP.

In the early days, NLP was built on classical methods like rule-based systems and statistical models. These approaches focused on the structure of language, syntax, and semantics. However, they often struggled with real-world challenges like ambiguity and complexity.

Classical NLP [1]

A breakthrough in traditional NLP came with techniques like TF-IDF (Term Frequency-Inverse Document Frequency), which allowed machines to quantify and rank words based on their importance. This was a game-changer, laying the foundation for modern NLP methods.


What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a way to score how important a word is to a particular document relative to an entire collection of documents (a corpus).

Let’s understand TF-IDF with a simple example based on how Google ranks pages. Google uses TF-IDF as one of the factors for ranking websites. To keep it simple, imagine Google only has three pages with the following content:

  • Page 1: "I love you and I love games"

  • Page 2: "I hate you"

  • Page 3: "I love Sekiro"

At first glance, it’s clear that words like "love," "hate," and "Sekiro" seem important for these pages. But how can we be sure? For example, "love" appears twice on Page 1, making it significant. On the other hand, common words like "I" and "you" appear across all pages, so they aren’t as important.

Let's calculate the Term Frequency (TF) table for this example. First, count how many times each term appears on each page:

|        | I | love | you | and | games | hate | Sekiro |
|--------|---|------|-----|-----|-------|------|--------|
| Page 1 | 2 | 2    | 1   | 1   | 1     | 0    | 0      |
| Page 2 | 1 | 0    | 1   | 0   | 0     | 1    | 0      |
| Page 3 | 1 | 1    | 0   | 0   | 0     | 0    | 1      |
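Here's a minimal sketch of this counting step in Python (the page strings and variable names are just illustrative, not from any particular library):

```python
from collections import Counter

# The three toy "pages" from the example above.
pages = {
    "Page 1": "I love you and I love games",
    "Page 2": "I hate you",
    "Page 3": "I love Sekiro",
}

# Raw term frequency: how many times each word appears on each page.
tf_counts = {page: Counter(text.split()) for page, text in pages.items()}

print(tf_counts["Page 1"])
# Counter({'I': 2, 'love': 2, 'you': 1, 'and': 1, 'games': 1})
```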

You can see that I used a short example for simplicity, but in reality the content can be much longer. For instance, Page 1 could contain 1,000 occurrences of the word "love," which would bias the raw counts toward long pages. Therefore, it's essential to normalize the data to a range between 0 and 1. Here, each count is divided by the highest count on its page:

|        | I | love | you | and | games | hate | Sekiro |
|--------|---|------|-----|-----|-------|------|--------|
| Page 1 | 1 | 1    | 0.5 | 0.5 | 0.5   | 0    | 0      |
| Page 2 | 1 | 0    | 1   | 0   | 0     | 1    | 0      |
| Page 3 | 1 | 1    | 0   | 0   | 0     | 0    | 1      |
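The same normalization can be sketched in a few lines of Python, assuming the max-count rule used in the table above (dividing by the page's total word count is another common variant):

```python
from collections import Counter

pages = {
    "Page 1": "I love you and I love games",
    "Page 2": "I hate you",
    "Page 3": "I love Sekiro",
}

# Normalize each page's raw counts by that page's highest count,
# so every value falls between 0 and 1.
tf_normalized = {}
for page, text in pages.items():
    counts = Counter(text.split())
    max_count = max(counts.values())
    tf_normalized[page] = {term: count / max_count for term, count in counts.items()}

print(tf_normalized["Page 1"])
# {'I': 1.0, 'love': 1.0, 'you': 0.5, 'and': 0.5, 'games': 0.5}
```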

Next, we calculate the Inverse Document Frequency (IDF), which helps weigh down common terms (like "I" and "you") that appear across multiple pages while giving more importance to rare terms (like "hate" and "Sekiro").

Now, let's compute the IDF table. For each term, IDF = log(total number of pages / number of pages containing the term), using a base-10 logarithm here:

|     | I            | love             | you              | and              | games            | hate             | Sekiro           |
|-----|--------------|------------------|------------------|------------------|------------------|------------------|------------------|
| IDF | log(3/3) = 0 | log(3/2) ≈ 0.176 | log(3/2) ≈ 0.176 | log(3/1) ≈ 0.477 | log(3/1) ≈ 0.477 | log(3/1) ≈ 0.477 | log(3/1) ≈ 0.477 |
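Here's a rough sketch of the IDF step for the same three toy pages, using a base-10 logarithm to match the table:

```python
import math

pages = {
    "Page 1": "I love you and I love games",
    "Page 2": "I hate you",
    "Page 3": "I love Sekiro",
}
n_pages = len(pages)

# IDF = log10(total pages / pages containing the term). Terms appearing on
# every page get 0; terms appearing on a single page get log10(3) ≈ 0.477.
vocabulary = {word for text in pages.values() for word in text.split()}
idf = {}
for term in vocabulary:
    doc_freq = sum(1 for text in pages.values() if term in text.split())
    idf[term] = math.log10(n_pages / doc_freq)

print(round(idf["I"], 3), round(idf["love"], 3), round(idf["Sekiro"], 3))
# 0.0 0.176 0.477
```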

Finally, we compute TF-IDF by multiplying the normalized TF values by their corresponding IDF scores for each term and page:

|        | I | love  | you   | and   | games | hate  | Sekiro |
|--------|---|-------|-------|-------|-------|-------|--------|
| Page 1 | 0 | 0.176 | 0.088 | 0.239 | 0.239 | 0     | 0      |
| Page 2 | 0 | 0     | 0.176 | 0     | 0     | 0.477 | 0      |
| Page 3 | 0 | 0.176 | 0     | 0     | 0     | 0     | 0.477  |
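Putting the pieces together, this small end-to-end sketch reproduces the table above (toy pages, max-count TF normalization, base-10 log; real implementations often make different choices for each of these):

```python
import math
from collections import Counter

pages = {
    "Page 1": "I love you and I love games",
    "Page 2": "I hate you",
    "Page 3": "I love Sekiro",
}
n_pages = len(pages)
vocabulary = sorted({word for text in pages.values() for word in text.split()})

# Normalized TF: each count divided by the page's highest count.
tf = {}
for page, text in pages.items():
    counts = Counter(text.split())
    max_count = max(counts.values())
    tf[page] = {term: count / max_count for term, count in counts.items()}

# IDF: base-10 log of (total pages / pages containing the term).
idf = {
    term: math.log10(n_pages / sum(1 for text in pages.values() if term in text.split()))
    for term in vocabulary
}

# TF-IDF: normalized TF times IDF; terms missing from a page score 0.
tf_idf = {
    page: {term: round(tf[page].get(term, 0) * idf[term], 3) for term in vocabulary}
    for page in pages
}

print(tf_idf["Page 2"])
# {'I': 0.0, 'Sekiro': 0.0, 'and': 0.0, 'games': 0.0, 'hate': 0.477, 'love': 0.0, 'you': 0.176}
```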

From the TF-IDF calculations, we can draw the following insights:

  • Significant Terms: Words like "love" and "hate" stand out on their respective pages because their Term Frequency (TF) is high, and they are not overly common across the corpus. Their Inverse Document Frequency (IDF) values ensure they are given importance.

  • Less Impactful Words: Common terms such as "I" and "you" carry minimal weight. This is because they appear frequently across all pages, resulting in very low IDF values that diminish their overall importance.

  • Rare and Unique Terms: Words like "Sekiro" and "games" have higher TF-IDF scores on the pages where they appear. Their rarity across the corpus leads to higher IDF values, emphasizing their significance.

This example illustrates how TF-IDF effectively identifies the most meaningful words in a document by balancing their frequency within a page and their rarity across the entire corpus. This balance allows for a more nuanced understanding of term relevance, which is crucial for applications like search engines and text analysis.
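In practice you would usually reach for a library implementation instead of rolling your own. Below is a sketch using scikit-learn's TfidfVectorizer. Note that its defaults differ from the hand calculation above (lowercasing, single-character tokens like "I" are dropped, a smoothed natural-log IDF, and L2-normalized rows), so the exact numbers won't match the tables, even though the overall behavior is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love you and I love games",
    "I hate you",
    "I love Sekiro",
]

# Default settings: lowercase tokens, smoothed natural-log IDF, L2-normalized rows.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))
```

Even with these different formula details, the pattern holds: "hate" and "sekiro" dominate the pages they appear on, while words shared across pages are pushed down.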

