🟣 Terminology: NLP Basics
Tokenization: Splitting text into units (tokens). "I love cats" → ["I", "love", "cats"]. Modern models use subword tokenization, which breaks rare or unknown words into smaller known pieces instead of mapping them to a single "unknown" token.
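A minimal sketch of the difference in Python: plain whitespace splitting next to a toy greedy, WordPiece-style subword tokenizer. The tiny vocabulary and helper names are hypothetical, purely for illustration.

```python
def whitespace_tokenize(text):
    """Word-level tokenization: unknown words stay as single opaque tokens."""
    return text.split()

def greedy_subword_tokenize(word, vocab):
    """WordPiece-style greedy longest-match-first splitting of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker, as in BERT
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical tiny vocabulary, only for demonstration.
vocab = {"token", "##ization", "##ize", "i", "love", "cats"}

print(whitespace_tokenize("I love cats"))              # ['I', 'love', 'cats']
print(greedy_subword_tokenize("tokenization", vocab))  # ['token', '##ization']
```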
TF-IDF: Term Frequency × Inverse Document Frequency. Words that appear often in one document but rarely across the rest of the corpus get high scores; very common words ("the", "is") get low scores. A simple but surprisingly effective baseline.
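A self-contained sketch of the score using the common log-IDF variant (exact weighting schemes differ across implementations; the tiny corpus here is made up):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: penalize terms found in many documents.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("cat", tokenized[0], tokenized))  # rare across corpus -> higher score
print(tf_idf("the", tokenized[0], tokenized))  # common word -> lower score
```

Running this, "cat" scores higher than "the" in the first document, matching the intuition above.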
Word Embeddings (Word2Vec, GloVe): Map words to dense vectors so that similar words end up close together. Famous example: "king" - "man" + "woman" ≈ "queen".
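An illustrative sketch of the analogy arithmetic with made-up 3-dimensional vectors; real Word2Vec or GloVe embeddings are learned from large corpora and typically have 100-300 dimensions.

```python
import numpy as np

# Hypothetical toy embeddings -- in practice these come from a trained model.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman ...
target = emb["king"] - emb["man"] + emb["woman"]

# ... should land closest to "queen" among the remaining words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```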
Transformers/BERT: Contextual embeddings, meaning the same word gets a DIFFERENT vector depending on context. "bank" in "river bank" ≠ "bank" in "bank account." Foundation of modern NLP.
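A hedged sketch of the "different vectors for the same word" point, assuming the Hugging Face transformers and torch packages are installed (bert-base-uncased is downloaded on first use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return BERT's contextual embedding for the token 'bank' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0, idx]

v_river = bank_vector("he sat on the river bank")
v_money = bank_vector("she deposited money in the bank account")

sim = torch.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {sim.item():.2f}")
```

On a real run the two vectors are clearly different (cosine similarity well below 1.0), whereas a static embedding like Word2Vec would assign both occurrences the identical vector.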
Practice Questions
Q: What's the key advantage of BERT over Word2Vec?