Question Bank

148 questions

+ Add Question
sql Code Practice module-01

What happens if you write `WHERE total_sales > 100` and `total_sales` is an alias from `SELECT SUM(sales) AS total_sales`?

Answer & Edit
It fails. WHERE runs before SELECT, so the alias doesn't exist yet. Use HAVING instead (runs after GROUP BY/aggregation).
python Code Practice module-01

Which data structure would you use if you need to check whether an item exists in a collection of 10 million elements, and why?

Answer & Edit
A `set`. Lookup is O(1) via hashing, vs O(n) for a list scanning every element.
stats Free Text module-01

You roll two dice. What's P(at least one 6)?

Answer & Edit
1 - P(no sixes) = 1 - (5/6 × 5/6) = 1 - 25/36 = **11/36 ≈ 0.306**
terminology Free Text module-01

Your company stores raw JSON logs from 50 different services. Where does this go — warehouse or lake?

Answer & Edit
Data lake — it's raw, mixed-format data. After ETL/ELT processing, the cleaned structured version goes into the warehouse for analysis.
sql Code Practice module-01

You INNER JOIN a table of 100 customers with a table of 80 orders. Can you get more than 100 rows?

Answer & Edit
Yes! If a customer has multiple orders, they appear once per order. INNER JOIN produces one row per match, not per left-side row.
behavioral Star Practice module-01

Think of ONE specific project where you used data to influence a decision. Write down the S, T, A, R in 2-3 sentences each. You'll need this story in multiple interviews.

Answer & Edit
Use the STAR framework: Situation, Task, Action, Result. Keep under 2 minutes.
general Free Text module-01

What's the SQL execution order?

Answer & Edit
FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT
general Free Text module-01

Why is set lookup O(1) but list lookup O(n)?

Answer & Edit
Sets use hash tables (compute hash, jump to bucket). Lists scan every element sequentially.
general Free Text module-01

P(A or B) when A and B are NOT mutually exclusive?

Answer & Edit
P(A) + P(B) - P(A and B) — subtract the overlap to avoid double-counting.
general Free Text module-01

What does ETL stand for?

Answer & Edit
Extract, Transform, Load.
general Free Text module-01

How do you find customers with no orders using SQL?

Answer & Edit
LEFT JOIN customers to orders, then WHERE orders.id IS NULL.
general Free Text module-01

What does STAR stand for?

Answer & Edit
Situation, Task, Action, Result.
sql Code Practice module-02

You need the second-highest salary per department. Which ranking function and why?

Answer & Edit
`DENSE_RANK()`. If two people tie for #1, RANK would make the next person #3 (skipping #2). DENSE_RANK gives the next person #2, which is what "second highest" means.
python Code Practice module-02

You want to find customers in df1 that do NOT exist in df2. How?

Answer & Edit
terminology Free Text module-02

Your e-commerce app writes 10,000 orders per minute. Your analytics team runs daily reports aggregating all orders. Same database?

Answer & Edit
No — use OLTP for the app (fast writes) and replicate to an OLAP warehouse for analytics (fast reads/aggregations). Running heavy analytical queries on the production OLTP database would slow down the app.
general Free Text module-02

What's the difference between RANK() and DENSE_RANK() for values [50, 50, 40]?

Answer & Edit
RANK: 1, 1, 3 (skips 2). DENSE_RANK: 1, 1, 2 (no skip).
general Free Text module-02

In Pandas, what does `indicator=True` add to a merge?

Answer & Edit
A `_merge` column with values 'left_only', 'right_only', or 'both' — shows which rows matched.
general Free Text module-02

Work through Bayes: disease rate 1/500, test sensitivity 95%, false positive rate 2%. Someone tests positive — what's the probability they're sick? (Do the math.)

Answer & Edit
P = (0.95 × 0.002) / (0.95 × 0.002 + 0.02 × 0.998) = 0.0019 / (0.0019 + 0.01996) = 0.0019 / 0.02186 ≈ **8.7%**
general Free Text module-02

OLTP vs OLAP — which uses columnar storage and why?

Answer & Edit
OLAP uses columnar storage because analytical queries typically read few columns across many rows — columnar means you only load the columns you need.
general Free Text module-02

Why can't you filter a window function directly in WHERE?

Answer & Edit
WHERE executes before SELECT, and window functions are computed during SELECT. Need a CTE/subquery to filter.
sql Code Practice module-03

Write a query to find users whose spending THIS month is more than double their LAST month's spending.

Answer & Edit
python Code Practice module-03

You want to add a column showing what percentage of their department's total salary each employee represents. Which function?

Answer & Edit
`transform()`:
stats Free Text module-03

Average 3 emails per hour (Poisson). What's the variance? What distribution describes time between emails?

Answer & Edit
Variance = 3 (Poisson: mean = variance). Time between emails is Exponential with λ=3, so mean wait = 1/3 hour = 20 minutes.
general Free Text module-03

What does `LAG(revenue) OVER (ORDER BY month)` return for the first row?

Answer & Edit
NULL — there's no previous row to look back at.
general Free Text module-03

`transform()` vs `agg()` — which changes the number of rows?

Answer & Edit
`agg()` reduces rows (one per group). `transform()` keeps the same number of rows.
general Free Text module-03

Poisson distribution: mean = 7. What's the variance?

Answer & Edit
7 — Poisson's mean equals its variance.
general Free Text module-03

What does GAME stand for?

Answer & Edit
Goal, Actions, Metrics, Evaluate.
general Free Text module-03

What's a guardrail metric and why does it matter?

Answer & Edit
A metric that must NOT degrade while optimizing the primary metric. Catches unintended consequences (e.g., pushing engagement but increasing spam).
general Free Text module-03

Write CASE WHEN to compute a conversion rate from events with types 'view' and 'purchase'.

Answer & Edit
`ROUND(100.0 * SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) / NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0), 2) AS conversion_rate`
stats Free Text module-04

An interviewer asks: "Your A/B test got p = 0.04. What does that mean?" Answer in one clear sentence.

Answer & Edit
"If the new feature truly had no effect on the metric, there's only a 4% probability we'd observe a difference this large or larger in our experiment — so we reject the null hypothesis at the 5% significance level."
stats Free Text module-04

Your boss says "we can only detect a 5% lift — the actual lift is probably 2%. What do we need?"

Answer & Edit
More sample size. Power depends on effect size — a smaller expected effect needs a much larger sample to detect. You'd recalculate n using the power formula with the smaller minimum detectable effect.
sql Code Practice module-04

Table `flights` has columns: flight_id, origin, destination, departure_time. Write a query to find all pairs of flights where you can connect (flight 1's destination = flight 2's origin, and flight 2 departs after flight 1 arrives).

Answer & Edit
terminology Free Text module-04

You're designing a data warehouse for analytics. Normalized or denormalized? Why?

Answer & Edit
Denormalized (or a star schema). Analytics queries aggregate across many dimensions — pre-joining reduces the need for expensive runtime JOINs. Read performance matters more than write efficiency in a warehouse.
general Free Text module-04

State what a p-value means in one sentence.

Answer & Edit
The probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis is true.
general Free Text module-04

Type I vs Type II — which is the false positive?

Answer & Edit
Type I is the false positive (rejecting H₀ when it's true — "crying wolf").
general Free Text module-04

Write a self-join to find employees who report to the same manager.

Answer & Edit
`SELECT a.name, b.name FROM employees a JOIN employees b ON a.manager_id = b.manager_id AND a.employee_id < b.employee_id` (the < avoids duplicates and self-pairs)
general Free Text module-04

`df.loc[0:3]` vs `df.iloc[0:3]` — how many rows each?

Answer & Edit
`loc[0:3]` = 4 rows (labels 0,1,2,3 — inclusive). `iloc[0:3]` = 3 rows (positions 0,1,2 — exclusive end).
general Free Text module-04

What's the SettingWithCopyWarning and how do you fix it?

Answer & Edit
Chained indexing (`df[...][...] = val`) may modify a copy. Fix: use `df.loc[condition, column] = val`.
general Free Text module-04

Normalized vs denormalized database — which for analytics and why?

Answer & Edit
Denormalized — fewer JOINs means faster analytical queries. Read performance > write efficiency for analytics.
ml Free Text module-05

Your model gets 95% train accuracy and 72% test accuracy. What's the problem and what do you try first?

Answer & Edit
Overfitting (high variance) — 23% gap between train and test. Try: more training data, add regularization (L1/L2), reduce model complexity, or use dropout if neural net.
general Free Text module-05

Your model has high BOTH train and test error. Diagnosis and fix?

Answer & Edit
Underfitting (high bias). Fix: more complex model, add features, reduce regularization.
general Free Text module-05

RF reduces ___. XGBoost reduces ___.

Answer & Edit
RF reduces variance. XGBoost reduces bias.
general Free Text module-05

Write a 3-day rolling average window clause.

Answer & Edit
`AVG(x) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`
general Free Text module-05

Why should you switch in Monty Hall?

Answer & Edit
Initial pick = 1/3. The remaining door inherits the full 2/3 after the host reveals a goat.
general Free Text module-05

Expected value of rolling a fair die?

Answer & Edit
3.5
general Free Text module-05

How many people for >50% shared birthday probability?

Answer & Edit
23 people.
ml Free Text module-06

You have 500 features but suspect only ~20 matter. L1 or L2?

Answer & Edit
L1 (Lasso) — it will zero out the ~480 irrelevant features, effectively doing feature selection.
ml Free Text module-06

You're building a fraud detection system. What metric do you prioritize and why?

Answer & Edit
Recall — missing fraud (false negative) is far more costly than investigating a legitimate transaction (false positive). You'd also track precision to ensure you're not overwhelming the fraud team with false alerts.
stats Free Text module-06

Why randomize by user ID, not by session?

Answer & Edit
A user might have multiple sessions. If they see version A in one session and B in another, you're contaminating the experiment. Hashing user ID ensures they always see the same version.
python Code Practice module-06

What's the difference between `[x for x in range(n)]` and `(x for x in range(n))`?

Answer & Edit
Square brackets = list comprehension (stores all values, O(n) memory). Parentheses = generator expression (yields one at a time, O(1) memory).
sql Code Practice module-06

Table has 100 rows. Column `bonus` has 20 NULLs. What does `AVG(bonus)` divide by?

Answer & Edit
80 — it ignores the 20 NULL rows. This can be surprising if you expect it to treat NULLs as 0. If you want NULLs as 0: `AVG(COALESCE(bonus, 0))`.
general Free Text module-06

L1 vs L2: which one can zero out coefficients?

Answer & Edit
L1 (Lasso) — its absolute value penalty can drive coefficients to exactly zero.
general Free Text module-06

Can you apply L1/L2 to a Random Forest? Why or why not?

Answer & Edit
No — L1/L2 regularize coefficients. Trees have no coefficients. Trees use max_depth, min_samples_leaf, etc.
general Free Text module-06

Precision vs recall: which do you prioritize for a cancer screening model?

Answer & Edit
Recall — missing cancer (false negative) is far worse than a false alarm (false positive).
general Free Text module-06

Why is accuracy misleading for imbalanced data?

Answer & Edit
A model predicting only the majority class gets high accuracy while catching zero minority cases. 99% accuracy on 99/1 split = useless.
general Free Text module-06

In an A/B test, why not check results every day?

Answer & Edit
Peeking inflates false positive rates — a test designed for α=0.05 can have 20-30% actual error rates with repeated peeking. Pre-commit to end date or use sequential methods.
general Free Text module-06

`NULL = NULL` returns what?

Answer & Edit
NULL (not TRUE). Use `IS NULL` for NULL comparison.
general Free Text module-06

What's a generator and when would you use one?

Answer & Edit
A function/expression that yields values lazily one at a time. Use for large datasets that don't fit in memory.
stats Free Text module-07

You run an A/B test for 3 days, see p=0.02, and your PM wants to ship immediately. What do you say?

Answer & Edit
"We should wait. Three days isn't enough to capture weekly patterns (weekday vs weekend behavior differs). Also, early significance with a small sample is unreliable — the effect size estimate is noisy and may shrink. Let's run to our pre-committed duration."
general Free Text module-07

Name 3 A/B testing pitfalls and the fix for each.

Answer & Edit
Peeking (pre-commit to end date), multiple testing (Bonferroni correction or single primary metric), novelty effect (run longer). Also: Simpson's paradox (check segments), network effects (cluster randomization).
general Free Text module-07

What is CUPED and why does it help?

Answer & Edit
Reduces variance by adjusting for pre-experiment behavior, letting you detect smaller effects with the same sample size.
general Free Text module-07

"DAU dropped 10%." What's your first question?

Answer & Edit
"Is this compared to last week or same week last year? Is it all platforms/regions, or concentrated somewhere?" — clarify before decomposing.
general Free Text module-07

What does the sigmoid function do in logistic regression?

Answer & Edit
Squashes any real number to (0,1), making the output interpretable as a probability.
general Free Text module-07

When should you drop NaN vs fill it?

Answer & Edit
Drop if random + few rows. Fill if data is valuable, but check whether missingness itself is informative.
general Free Text module-07

Tell your "failure" STAR story in under 2 minutes. (Actually do this out loud.)

Answer & Edit
(Self-grade: Did you use "I"? Quantify? Show learning? Under 2 min?)
ml Free Text module-08

Your interviewer asks: "Why not just oversample the minority class by duplicating rows?"

Answer & Edit
Simple duplication doesn't add new information — the model just sees the same examples multiple times, which can lead to overfitting on those exact examples. SMOTE is better because it creates NEW synthetic points between existing minority samples, adding diversity. But even SMOTE can create unrealistic samples if the feature space is sparse.
terminology Free Text module-08

You're building a real-time fraud detector. CP or AP system? Why?

Answer & Edit
CP — you need consistency. A payment system can't serve stale data about account balances or fraud flags. It's better to briefly reject a transaction during a network partition than to approve a fraudulent one based on outdated info.
general Free Text module-08

Why is accuracy bad for imbalanced data? What metric do you use instead?

Answer & Edit
Predicting majority class always = high accuracy, zero usefulness. Use F1, PR-AUC, or recall depending on cost of false positives vs false negatives.
general Free Text module-08

What is SMOTE and what's the critical rule for using it?

Answer & Edit
Creates synthetic minority samples by interpolating between existing ones. ONLY apply to training data — never test/validation (data leakage).
general Free Text module-08

Explain the consecutive-days SQL trick in your own words.

Answer & Edit
Subtracting an incrementing row number from consecutive dates produces the same group value. Non-consecutive dates produce different values, creating separate groups you can count.
general Free Text module-08

When do you use pivot_table vs melt in Pandas?

Answer & Edit
Pivot: aggregate + reshape to see a metric across two dimensions. Melt: convert column headers into row values (wide → long format).
general Free Text module-08

CAP theorem: what do the three letters stand for?

Answer & Edit
Consistency, Availability, Partition tolerance.
general Free Text module-08

p = 0.08 at α = 0.05. Is the null hypothesis true?

Answer & Edit
No — "fail to reject H₀" ≠ "H₀ is true." It means insufficient evidence at this sample size.
ml Free Text module-09

You're predicting daily stock prices. Can you use standard 5-fold CV?

Answer & Edit
No — time series data has temporal dependencies. Random folds would leak future information into training. Use walk-forward or time-series split.
ml Free Text module-09

Your decision tree has 100% training accuracy. Good or bad?

Answer & Edit
Bad — almost certainly overfitting. An unconstrained tree can memorize every training example by creating a leaf for each one. Regularize with max_depth, min_samples_leaf.
ml Free Text module-09

"Total app downloads reached 50 million!" Is this a good success metric?

Answer & Edit
No — it's a vanity metric. It only goes up and says nothing about engagement. Many downloaded apps are never opened. Better: DAU, d7 retention, or MAU with an engagement threshold.
sql Code Practice module-09

Write a query to get the first day of each user's signup month.

Answer & Edit
`SELECT user_id, DATE_TRUNC('month', signup_date) AS signup_month FROM users`
general Free Text module-09

Why can't you use standard k-fold CV on time series data?

Answer & Edit
Random splits leak future data into training, violating temporal order. Use walk-forward/time-series split.
general Free Text module-09

What is Gini impurity? What value means maximum impurity for binary classification?

Answer & Edit
Gini = 1 - Σ(pᵢ²). For binary: max impurity = 0.5 (50/50 class split).
general Free Text module-09

Name 3 North Star metrics for 3 different companies.

Answer & Edit
Facebook: DAU. Airbnb: nights booked. Spotify: time listening. Slack: messages/user/day. Uber: rides completed.
general Free Text module-09

What does a Python decorator actually do mechanically?

Answer & Edit
It replaces the original function with a wrapper function that calls the original but adds behavior (logging, timing, caching, etc.).
general Free Text module-09

`DATE_TRUNC('month', '2024-07-18')` returns what?

Answer & Edit
2024-07-01 (first day of the month).
general Free Text module-09

Your decision tree has 100% train accuracy and 60% test accuracy. Diagnose.

Answer & Edit
Overfitting — 40% gap between train and test. Regularize: increase min_samples_leaf, decrease max_depth, or switch to Random Forest/boosting.
stats Free Text module-10

You sample 100 customer wait times. Mean = 5 min, SD = 3 min. What's the standard error of the mean?

Answer & Edit
SE = σ/√n = 3/√100 = 3/10 = **0.3 minutes.** Even though individual wait times vary widely (SD=3), the mean of 100 observations is estimated with much more precision.
stats Free Text module-10

Sample mean = 50, SD = 12, n = 36. What's the 95% CI?

Answer & Edit
SE = 12/√36 = 2. CI = 50 ± 1.96 × 2 = 50 ± 3.92 = **(46.08, 53.92)**
ml Free Text module-10

"What IS a loss function?" (Explain simply.)

Answer & Edit
A measure of how wrong the model's predictions are. Lower = better. MSE for regression (average squared error), log loss for classification (penalizes confident wrong predictions heavily). Gradient descent minimizes this function.
terminology Free Text module-10

You're building a dashboard that shows the CEO daily revenue summaries. Batch or streaming?

Answer & Edit
Batch — daily summaries don't need real-time processing. Run a daily ETL job overnight. Reserve streaming for things that need sub-second response (fraud detection, live pricing).
general Free Text module-10

State the CLT in one sentence.

Answer & Edit
Regardless of the underlying distribution, sample means approach a normal distribution as sample size increases, with mean = population mean and SD = σ/√n.
general Free Text module-10

Sample mean = 100, SD = 20, n = 64. What's the 95% CI?

Answer & Edit
SE = 20/√64 = 2.5. CI = 100 ± 1.96 × 2.5 = 100 ± 4.9 = **(95.1, 104.9)**
general Free Text module-10

What does a 95% confidence interval ACTUALLY mean?

Answer & Edit
If you repeated the experiment 100 times, ~95 of the resulting intervals would contain the true value. NOT "95% probability the true value is in this interval."
general Free Text module-10

EXISTS vs IN — when do you use each?

Answer & Edit
EXISTS for large tables/correlated subqueries (stops at first match). IN for small static value lists.
general Free Text module-10

What is a correlated subquery?

Answer & Edit
A subquery that references the outer query, re-executing for each outer row.
general Free Text module-10

Learning rate too high → what happens? Too low?

Answer & Edit
Too high → overshoots, oscillates, may diverge. Too low → very slow convergence.
general Free Text module-10

Batch vs streaming — which for a real-time fraud alert system?

Answer & Edit
Streaming — fraud detection needs millisecond response times.
python Code Practice module-11

A column has 2 million rows but only 5 unique string values. How do you reduce its memory?

Answer & Edit
`df['col'] = df['col'].astype('category')` — stores 5 unique values + integer codes instead of 2M full strings. Can reduce memory 95%+.
general Free Text module-11

Walk through a Fermi estimate for "How many pizza deliveries happen in Chicago per day?"

Answer & Edit
~2.8M population in city proper. Maybe 20% order pizza weekly = 560K orders/week ÷ 7 = ~80K/day. Of those, maybe 60% delivery = ~48K. Sanity check: there are ~2,000 pizza places × ~25 deliveries/day ≈ 50K. ✓
general Free Text module-11

Tabular data: deep learning or gradient boosting? Why?

Answer & Edit
Gradient boosting — faster, more interpretable, needs less data, usually outperforms DL on tabular data.
general Free Text module-11

How does converting a string column to category type save memory?

Answer & Edit
Stores unique values once + integer codes per row, instead of full string per row. For 2M rows with 5 unique values, goes from storing 2M strings to 5 strings + 2M tiny integers.
general Free Text module-11

Expected rolls to see all 6 die faces?

Answer & Edit
14.7 rolls. E = 6(1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6).
general Free Text module-11

What's a star schema? Name the two types of tables.

Answer & Edit
Central fact table (transactions/events) surrounded by dimension tables (who, what, where, when). Optimized for analytical queries with simple, predictable joins.
general Free Text module-11

Bayes — a defective item from a two-machine factory. Can you set up the formula without looking?

Answer & Edit
P(A|Def) = P(Def|A)×P(A) / [P(Def|A)×P(A) + P(Def|B)×P(B)]
terminology Free Text module-12

What's the key advantage of BERT over Word2Vec?

Answer & Edit
BERT produces contextual embeddings — the same word gets different representations depending on surrounding words. Word2Vec gives each word a single fixed vector regardless of context.
terminology Free Text module-12

A time series of daily ice cream sales has an upward slope and spikes every summer. Name the two components.

Answer & Edit
Trend (upward slope) and seasonality (annual summer spikes).
sql Code Practice module-12

When would you actually need UNION (not UNION ALL)?

Answer & Edit
When the two tables might have overlapping rows and you want each unique row exactly once. Example: combining a "customers who bought" list with a "customers who browsed" list, and you want a unique list of all customers who did either.
general Free Text module-12

BERT vs Word2Vec — key difference in one sentence?

Answer & Edit
BERT gives contextual embeddings (same word → different vector based on context); Word2Vec gives one fixed vector per word.
general Free Text module-12

What is stationarity and how do you test for it?

Answer & Edit
Statistical properties don't change over time. Test with ADF (Augmented Dickey-Fuller). Fix with differencing.
general Free Text module-12

What is model drift? Name the two types.

Answer & Edit
Model performance degrades as real-world data changes. Data drift (input distribution changes) and concept drift (input-output relationship changes).
general Free Text module-12

"DAU down, revenue up" — give 2 possible explanations.

Answer & Edit
(a) Churned users were low-value, remaining users spend more. (b) Price increase drove away price-sensitive users.
general Free Text module-12

UNION vs UNION ALL — which is faster and why?

Answer & Edit
UNION ALL — no deduplication step needed, no sorting/comparing.
general Free Text module-12

SE = σ/√n. If n goes from 100 to 400, what happens to SE?

Answer & Edit
Halves. SE = σ/√400 = σ/20 vs σ/√100 = σ/10. √n doubled, so SE halved.
ml Free Text module-13

You have 200 features. After PCA, the first 15 components explain 95% of variance. What do you do?

Answer & Edit
Keep 15 components, drop the rest. You've reduced from 200 to 15 dimensions while retaining 95% of the information. The remaining 185 components mostly capture noise.
general Free Text module-13

Name 2 ways Chicago DS interviews differ from FAANG.

Answer & Edit
More modeling questions (less LeetCode), domain knowledge heavily weighted, take-home assignments standard, rigid STAR behavioral.
general Free Text module-13

What is PCA? Is it feature selection?

Answer & Edit
PCA finds new axes (principal components) that capture maximum variance, then keeps the top k. It's NOT feature selection — each component is a linear combination of ALL original features.
general Free Text module-13

WHERE vs HAVING: which filters groups after aggregation?

Answer & Edit
HAVING — WHERE filters individual rows before grouping.
general Free Text module-13

What does `if __name__ == '__main__'` do?

Answer & Edit
Checks if the script is run directly (executes the code) vs imported as a module (skips the code).
general Free Text module-13

Recite the A/B test design steps from memory.

Answer & Edit
(1) Hypothesis (2) Metrics (primary/secondary/guardrail) (3) Sample size (4) Duration ≥2 weeks (5) Randomize by user (6) Analyze at end date — no peeking (7) Statistical + practical significance.
general Free Text module-13

Your SWE background — how is this an advantage for Chicago DS roles?

Answer & Edit
Most Chicago DS candidates come from analytics/academia. Your ability to build end-to-end systems (pipelines, deployment, production code) is rare and highly valued. Lean into this in every behavioral.
terminology Flashcard module-14

What does ETL stand for?

Answer & Edit
Extract, Transform, Load.
terminology Flashcard module-14

OLAP vs OLTP in one sentence each?

Answer & Edit
OLTP: fast individual transactions (app backend). OLAP: complex analytical queries across millions of rows (BI/dashboards).
terminology Flashcard module-14

What is the CAP theorem?

Answer & Edit
Distributed systems can guarantee at most 2 of 3: Consistency, Availability, Partition tolerance. Since partitions happen, you choose CP or AP.
terminology Flashcard module-14

Batch vs streaming?

Answer & Edit
Batch: process data on a schedule (reports, ETL). Streaming: process in real-time as it arrives (fraud, live dashboards).
terminology Flashcard module-14

What is a feature store?

Answer & Edit
Centralized system for storing/serving ML features, ensuring consistency between training and production.
sql Code Practice module-14

Top 3 highest-spending customers. Just write it.

Answer & Edit
SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id ORDER BY total_spend DESC LIMIT 3;
sql Code Practice module-14

Second highest salary per department.

Answer & Edit
WITH ranked AS ( SELECT department, employee, salary, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk FROM employees ) SELECT * FROM ranked WHERE rnk = 2;
sql Code Practice module-14

Month-over-month growth rate.

Answer & Edit
WITH monthly AS ( SELECT DATE_TRUNC('month', order_date) AS month, SUM(revenue) AS rev FROM orders GROUP BY 1 ) SELECT month, rev, ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY month)) / NULLIF(LAG(rev) OVER (ORDER BY month), 0), 2) AS growth_pct FROM monthly;
sql Code Practice module-14

Users who exist in signups but never made a purchase.

Answer & Edit
SELECT s.user_id FROM signups s LEFT JOIN purchases p ON s.user_id = p.user_id WHERE p.user_id IS NULL;
sql Code Practice module-14

Click-through rate: clicks / impressions per campaign.

Answer & Edit
SELECT campaign_id, ROUND(100.0 * SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) / NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0) , 2) AS ctr FROM events GROUP BY campaign_id;
stats Free Text module-14

State Bayes' theorem and work through: 1% disease rate, 95% sensitivity, 3% false positive. Person tests positive.

Answer & Edit
P(D|+) = (0.95 × 0.01) / (0.95 × 0.01 + 0.03 × 0.99) = 0.0095 / (0.0095 + 0.0297) = 0.0095 / 0.0392 ≈ **24.2%**
stats Free Text module-14

What does p = 0.04 mean?

Answer & Edit
If H₀ were true (no effect), there's only a 4% chance of observing data this extreme. At α=0.05, we reject H₀.
stats Free Text module-14

Type I vs Type II?

Answer & Edit
Type I (α) = false positive, crying wolf. Type II (β) = false negative, missing the wolf. Power = 1-β ≥ 0.80.
stats Free Text module-14

Why always switch in Monty Hall?

Answer & Edit
Initial pick = 1/3. Other two doors = 2/3. Host reveals one is wrong, remaining door gets the full 2/3. Switching = 2/3 win rate.
stats Free Text module-14

95% confidence interval: what does it ACTUALLY mean?

Answer & Edit
If you repeated the experiment 100 times, ~95 of the intervals would contain the true value. NOT "95% probability the true value is in this specific interval."
ml Free Text module-14

Bias-variance: your model has 97% train accuracy, 68% test accuracy.

Answer & Edit
Overfitting (high variance). Big gap. Fix: more data, regularize, simplify, dropout.
ml Free Text module-14

L1 vs L2: 300 features, you think ~20 matter.

Answer & Edit
L1 (Lasso) — drives irrelevant features to exactly zero. Built-in feature selection.
ml Free Text module-14

RF vs XGBoost: you have noisy data and 1 hour before deadline.

Answer & Edit
Random Forest — good defaults, resistant to noise, hard to screw up. XGBoost needs tuning.
ml Free Text module-14

You built a fraud model with 99.5% accuracy. Your manager is impressed. Should you be?

Answer & Edit
No — if fraud is 0.5% of transactions, predicting "not fraud" always = 99.5% accuracy. Check precision, recall, F1, PR-AUC.
ml Free Text module-14

Can you apply L1/L2 to Random Forest?

Answer & Edit
No. L1/L2 regularize coefficients. Trees have none. Trees regularize via max_depth, min_samples_leaf, etc.
python Code Practice module-14

What's wrong with:

Answer & Edit
Chained indexing — may modify a copy. Fix: `df.loc[df['x'] > 5, 'y'] = 10`
python Code Practice module-14

You need each employee's department average as a new column.

Answer & Edit
`transform()` — keeps the same number of rows.
python Code Practice module-14

`df.loc[0:3]` returns how many rows?

Answer & Edit
4 rows (labels 0, 1, 2, 3 — loc is inclusive on both ends).
python Code Practice module-14

Why is NumPy faster than Python lists?

Answer & Edit
Contiguous memory (cache-friendly), homogeneous types (no per-element type checking), vectorized C operations (no Python loop overhead). 10-100x faster.
python Code Practice module-14

List vs generator for 10M items?

Answer & Edit
Generator — O(1) memory (yields one at a time) vs O(n) memory for list (stores all at once).
product Free Text module-14

"How would you measure success of a search feature?" Use GAME.

Answer & Edit
- G: Reduce time to find what users want, increase actions from search - A: Type query → see results → click result → complete action - M: Primary — search-to-action conversion. Secondary — zero-result rate, avg result position clicked. Guardrail — page load time, user satisfaction - E: High clicks but low conversions = results look relevant but aren't. Track post-click behavior.
product Free Text module-14

"Revenue dropped 15% this quarter." First 3 things you do.

Answer & Edit
(1) Clarify: vs last quarter or YoY? All segments? (2) Decompose: Revenue = Users × Conversion × AOV — which dropped? (3) Check internal (deployments, bugs, logging changes) before external (seasonality, competitors).
product Free Text module-14

Estimate: How many data scientists work in Chicago?

Answer & Edit
~5,000 companies with 500+ employees in Chicago. Maybe 30% have DS teams averaging ~5 DS each. 1,500 × 5 = ~7,500. Plus smaller companies and consultancies → maybe ~10-12K. (Rough but defensible.)
behavioral Star Practice module-14

"Tell me about a time you used data to influence a decision."

Answer & Edit
*Use your prepared STAR story. Time yourself.*
behavioral Star Practice module-14

"Describe a failed project."

Answer & Edit
*Own it. Show learning. End on what you do differently now.*
behavioral Star Practice module-14

"How do you explain a model to a non-technical stakeholder?"

Answer & Edit
*Focus on: no jargon, use visualization/analogy, tie to business outcome.*