
🟣 Terminology: NLP Basics

Tokenization: Splitting text into units. "I love cats" → ["I", "love", "cats"]. Modern models use subword tokenization (e.g., BPE, WordPiece), which splits unknown words into known pieces.
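
A minimal sketch of the idea. The vocabulary and the greedy longest-match rule below are illustrative; real subword tokenizers (BPE, WordPiece) learn their vocabularies from data:

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Word-level tokenization: just split on spaces."""
    return text.split()

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword split; unknown spans fall back to characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                    # nothing matched: emit one character
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"un", "happi", "ness", "cat", "s"}      # hypothetical learned vocabulary
print(whitespace_tokenize("I love cats"))        # ['I', 'love', 'cats']
print(subword_tokenize("unhappiness", vocab))    # ['un', 'happi', 'ness']
```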

TF-IDF: Term Frequency × Inverse Document Frequency. Words that appear frequently in one document but rarely across the rest of the corpus get high scores. Common words ("the", "is") get low scores. Simple but surprisingly effective baseline.
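
A from-scratch sketch of the basic weighting, tf(t, d) × log(N / df(t)), over a made-up three-document corpus (libraries such as scikit-learn use smoothed variants of the same formula):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc: list[str]) -> dict[str, float]:
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

# "cat" (specific to one document) outscores "the" (shared across documents).
print(tfidf(tokenized[0]))
```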

Word Embeddings (Word2Vec, GloVe): Map words to dense vectors. Similar words are close together. Famous: "king" - "man" + "woman" ≈ "queen".
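
A toy illustration of the analogy arithmetic. The 2-D vectors below are invented for clarity (dimension 0 ≈ "royalty", dimension 1 ≈ "gender"); real Word2Vec/GloVe embeddings have hundreds of dimensions learned from text:

```python
import numpy as np

emb = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands near queen in this (rigged) toy space.
target = emb["king"] - emb["man"] + emb["woman"]

best = max(
    (w for w in emb if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, emb[w]),
)
print(best, round(cosine(target, emb["queen"]), 3))   # queen 1.0
```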

Transformers/BERT: Contextual embeddings — same word gets DIFFERENT vectors depending on context. "bank" in "river bank" ≠ "bank" in "bank account." Foundation of modern NLP.
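
A hedged sketch of the "bank" example using the Hugging Face transformers library (assumes transformers and PyTorch are installed; "bert-base-uncased" is one common model choice, not something prescribed by these notes):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token 'bank' in this sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("I sat on the river bank.")
v_money = bank_vector("I deposited cash at the bank.")

# Same surface word, different contexts -> different vectors
# (a static embedding like Word2Vec would give one shared vector).
sim = torch.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine(river-bank, money-bank) = {sim.item():.3f}")
```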

Practice Questions

Q: What's the key advantage of BERT over Word2Vec?