🟣 Terminology: CAP Theorem and MapReduce
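CAP Theorem
A distributed system can guarantee at most 2 of 3:
- Consistency: every read sees the most recent write, so all nodes appear to hold the same data at the same time
- Availability: every request to a working node gets a non-error response
- Partition tolerance: the system keeps working when the network drops or delays messages between nodes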
Network partitions WILL happen, so during a partition you actually choose between:
- CP (consistent, may reject requests during partitions): banks, financial systems
- AP (always responds, may serve stale data): social media, caching
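To make the CP/AP trade-off concrete, here is a toy sketch in Python. It is not a real replication protocol: the `Replica` class, its `mode` and `partitioned` fields, and the `Unavailable` exception are all hypothetical names invented for this illustration. It only shows how each style answers a read while cut off from its peers.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (labels assumed for this sketch)
    value: str                 # last value this replica has seen
    partitioned: bool = False  # True when cut off from the other replicas

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise Unavailable("cannot reach peers")
        # AP (or no partition): always answer, possibly with stale data.
        return self.value


cp = Replica(mode="CP", value="balance=100", partitioned=True)
ap = Replica(mode="AP", value="likes=41", partitioned=True)

print(ap.read())  # answers with its local, possibly stale, value
try:
    cp.read()
except Unavailable as err:
    print("CP replica rejected the read:", err)
```

When there is no partition, both replicas answer normally. The two modes only differ once `partitioned` is true.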
MapReduce
Processing huge datasets across many machines:
- Map: Apply a function to each record → key-value pairs
- Shuffle: Group values by key
- Reduce: Aggregate the values for each key
Word count: Map: each doc → a (word, 1) pair for every word in it. Shuffle: group the 1s by word. Reduce: sum per word → (word, total).
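The word count can be sketched on a single machine in plain Python. This is only an illustration of the three phases, not how a real framework distributes the work. The function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values for one key into a total."""
    return (key, sum(values))


def word_count(docs):
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in groups.items())


docs = ["the cat sat", "the dog sat"]
print(word_count(docs))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real cluster, each `map_phase` call could run on a different machine, and the shuffle would move every pair with the same key to the same reducer.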
Practice Questions
Q: You're building a real-time fraud detector. CP or AP system? Why?