Your company stores raw JSON logs from 50 different services. Where does this go — warehouse or lake?
Answer & Edit
Data lake — it's raw, mixed-format data. After ETL/ELT processing, the cleaned structured version goes into the warehouse for analysis.
sqlCode Practicemodule-01
You INNER JOIN a table of 100 customers with a table of 80 orders. Can you get more than 100 rows?
Answer & Edit
Yes! If a customer has multiple orders, they appear once per order. INNER JOIN produces one row per match, not per left-side row.
behavioralStar Practicemodule-01
Think of ONE specific project where you used data to influence a decision. Write down the S, T, A, R in 2-3 sentences each. You'll need this story in multiple interviews.
Answer & Edit
Use the STAR framework: Situation, Task, Action, Result. Keep under 2 minutes.
generalFree Textmodule-01
What's the SQL execution order?
Answer & Edit
FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT
generalFree Textmodule-01
Why is set lookup O(1) but list lookup O(n)?
Answer & Edit
Sets use hash tables (compute hash, jump to bucket). Lists scan every element sequentially.
generalFree Textmodule-01
P(A or B) when A and B are NOT mutually exclusive?
Answer & Edit
P(A) + P(B) - P(A and B) — subtract the overlap to avoid double-counting.
generalFree Textmodule-01
What does ETL stand for?
Answer & Edit
Extract, Transform, Load.
generalFree Textmodule-01
How do you find customers with no orders using SQL?
Answer & Edit
LEFT JOIN customers to orders, then WHERE orders.id IS NULL.
generalFree Textmodule-01
What does STAR stand for?
Answer & Edit
Situation, Task, Action, Result.
sqlCode Practicemodule-02
You need the second-highest salary per department. Which ranking function and why?
Answer & Edit
`DENSE_RANK()`. If two people tie for #1, RANK would make the next person #3 (skipping #2). DENSE_RANK gives the next person #2, which is what "second highest" means.
pythonCode Practicemodule-02
You want to find customers in df1 that do NOT exist in df2. How?
Answer & Edit
terminologyFree Textmodule-02
Your e-commerce app writes 10,000 orders per minute. Your analytics team runs daily reports aggregating all orders. Same database?
Answer & Edit
No — use OLTP for the app (fast writes) and replicate to an OLAP warehouse for analytics (fast reads/aggregations). Running heavy analytical queries on the production OLTP database would slow down the app.
generalFree Textmodule-02
What's the difference between RANK() and DENSE_RANK() for values [50, 50, 40]?
In Pandas, what does `indicator=True` add to a merge?
Answer & Edit
A `_merge` column with values 'left_only', 'right_only', or 'both' — shows which rows matched.
generalFree Textmodule-02
Work through Bayes: disease rate 1/500, test sensitivity 95%, false positive rate 2%. Someone tests positive — what's the probability they're sick? (Do the math.)
OLTP vs OLAP — which uses columnar storage and why?
Answer & Edit
OLAP uses columnar storage because analytical queries typically read few columns across many rows — columnar means you only load the columns you need.
generalFree Textmodule-02
Why can't you filter a window function directly in WHERE?
Answer & Edit
WHERE executes before SELECT, and window functions are computed during SELECT. Need a CTE/subquery to filter.
sqlCode Practicemodule-03
Write a query to find users whose spending THIS month is more than double their LAST month's spending.
Answer & Edit
pythonCode Practicemodule-03
You want to add a column showing what percentage of their department's total salary each employee represents. Which function?
Answer & Edit
`transform()`:
statsFree Textmodule-03
Average 3 emails per hour (Poisson). What's the variance? What distribution describes time between emails?
Answer & Edit
Variance = 3 (Poisson: mean = variance). Time between emails is Exponential with λ=3, so mean wait = 1/3 hour = 20 minutes.
generalFree Textmodule-03
What does `LAG(revenue) OVER (ORDER BY month)` return for the first row?
Answer & Edit
NULL — there's no previous row to look back at.
generalFree Textmodule-03
`transform()` vs `agg()` — which changes the number of rows?
Answer & Edit
`agg()` reduces rows (one per group). `transform()` keeps the same number of rows.
generalFree Textmodule-03
Poisson distribution: mean = 7. What's the variance?
Answer & Edit
7 — Poisson's mean equals its variance.
generalFree Textmodule-03
What does GAME stand for?
Answer & Edit
Goal, Actions, Metrics, Evaluate.
generalFree Textmodule-03
What's a guardrail metric and why does it matter?
Answer & Edit
A metric that must NOT degrade while optimizing the primary metric. Catches unintended consequences (e.g., pushing engagement but increasing spam).
generalFree Textmodule-03
Write CASE WHEN to compute a conversion rate from events with types 'view' and 'purchase'.
Answer & Edit
`ROUND(100.0 * SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) / NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0), 2) AS conversion_rate`
statsFree Textmodule-04
An interviewer asks: "Your A/B test got p = 0.04. What does that mean?" Answer in one clear sentence.
Answer & Edit
"If the new feature truly had no effect on the metric, there's only a 4% probability we'd observe a difference this large or larger in our experiment — so we reject the null hypothesis at the 5% significance level."
statsFree Textmodule-04
Your boss says "we can only detect a 5% lift — the actual lift is probably 2%. What do we need?"
Answer & Edit
More sample size. Power depends on effect size — a smaller expected effect needs a much larger sample to detect. You'd recalculate n using the power formula with the smaller minimum detectable effect.
sqlCode Practicemodule-04
Table `flights` has columns: flight_id, origin, destination, departure_time. Write a query to find all pairs of flights where you can connect (flight 1's destination = flight 2's origin, and flight 2 departs after flight 1 arrives).
Answer & Edit
terminologyFree Textmodule-04
You're designing a data warehouse for analytics. Normalized or denormalized? Why?
Answer & Edit
Denormalized (or a star schema). Analytics queries aggregate across many dimensions — pre-joining reduces the need for expensive runtime JOINs. Read performance matters more than write efficiency in a warehouse.
generalFree Textmodule-04
State what a p-value means in one sentence.
Answer & Edit
The probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis is true.
generalFree Textmodule-04
Type I vs Type II — which is the false positive?
Answer & Edit
Type I is the false positive (rejecting H₀ when it's true — "crying wolf").
generalFree Textmodule-04
Write a self-join to find employees who report to the same manager.
Answer & Edit
`SELECT a.name, b.name FROM employees a JOIN employees b ON a.manager_id = b.manager_id AND a.employee_id < b.employee_id` (the < avoids duplicates and self-pairs)
generalFree Textmodule-04
`df.loc[0:3]` vs `df.iloc[0:3]` — how many rows each?
What's the SettingWithCopyWarning and how do you fix it?
Answer & Edit
Chained indexing (`df[...][...] = val`) may modify a copy. Fix: use `df.loc[condition, column] = val`.
generalFree Textmodule-04
Normalized vs denormalized database — which for analytics and why?
Answer & Edit
Denormalized — fewer JOINs means faster analytical queries. Read performance > write efficiency for analytics.
mlFree Textmodule-05
Your model gets 95% train accuracy and 72% test accuracy. What's the problem and what do you try first?
Answer & Edit
Overfitting (high variance) — 23% gap between train and test. Try: more training data, add regularization (L1/L2), reduce model complexity, or use dropout if neural net.
generalFree Textmodule-05
Your model has high BOTH train and test error. Diagnosis and fix?
`AVG(x) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`
generalFree Textmodule-05
Why should you switch in Monty Hall?
Answer & Edit
Initial pick = 1/3. The remaining door inherits the full 2/3 after the host reveals a goat.
generalFree Textmodule-05
Expected value of rolling a fair die?
Answer & Edit
3.5
generalFree Textmodule-05
How many people for >50% shared birthday probability?
Answer & Edit
23 people.
mlFree Textmodule-06
You have 500 features but suspect only ~20 matter. L1 or L2?
Answer & Edit
L1 (Lasso) — it will zero out the ~480 irrelevant features, effectively doing feature selection.
mlFree Textmodule-06
You're building a fraud detection system. What metric do you prioritize and why?
Answer & Edit
Recall — missing fraud (false negative) is far more costly than investigating a legitimate transaction (false positive). You'd also track precision to ensure you're not overwhelming the fraud team with false alerts.
statsFree Textmodule-06
Why randomize by user ID, not by session?
Answer & Edit
A user might have multiple sessions. If they see version A in one session and B in another, you're contaminating the experiment. Hashing user ID ensures they always see the same version.
pythonCode Practicemodule-06
What's the difference between `[x for x in range(n)]` and `(x for x in range(n))`?
Answer & Edit
Square brackets = list comprehension (stores all values, O(n) memory). Parentheses = generator expression (yields one at a time, O(1) memory).
sqlCode Practicemodule-06
Table has 100 rows. Column `bonus` has 20 NULLs. What does `AVG(bonus)` divide by?
Answer & Edit
80 — it ignores the 20 NULL rows. This can be surprising if you expect it to treat NULLs as 0. If you want NULLs as 0: `AVG(COALESCE(bonus, 0))`.
generalFree Textmodule-06
L1 vs L2: which one can zero out coefficients?
Answer & Edit
L1 (Lasso) — its absolute value penalty can drive coefficients to exactly zero.
generalFree Textmodule-06
Can you apply L1/L2 to a Random Forest? Why or why not?
Answer & Edit
No — L1/L2 regularize coefficients. Trees have no coefficients. Trees use max_depth, min_samples_leaf, etc.
generalFree Textmodule-06
Precision vs recall: which do you prioritize for a cancer screening model?
Answer & Edit
Recall — missing cancer (false negative) is far worse than a false alarm (false positive).
generalFree Textmodule-06
Why is accuracy misleading for imbalanced data?
Answer & Edit
A model predicting only the majority class gets high accuracy while catching zero minority cases. 99% accuracy on 99/1 split = useless.
generalFree Textmodule-06
In an A/B test, why not check results every day?
Answer & Edit
Peeking inflates false positive rates — a test designed for α=0.05 can have 20-30% actual error rates with repeated peeking. Pre-commit to end date or use sequential methods.
generalFree Textmodule-06
`NULL = NULL` returns what?
Answer & Edit
NULL (not TRUE). Use `IS NULL` for NULL comparison.
generalFree Textmodule-06
What's a generator and when would you use one?
Answer & Edit
A function/expression that yields values lazily one at a time. Use for large datasets that don't fit in memory.
statsFree Textmodule-07
You run an A/B test for 3 days, see p=0.02, and your PM wants to ship immediately. What do you say?
Answer & Edit
"We should wait. Three days isn't enough to capture weekly patterns (weekday vs weekend behavior differs). Also, early significance with a small sample is unreliable — the effect size estimate is noisy and may shrink. Let's run to our pre-committed duration."
generalFree Textmodule-07
Name 3 A/B testing pitfalls and the fix for each.
Answer & Edit
Peeking (pre-commit to end date), multiple testing (Bonferroni correction or single primary metric), novelty effect (run longer). Also: Simpson's paradox (check segments), network effects (cluster randomization).
generalFree Textmodule-07
What is CUPED and why does it help?
Answer & Edit
Reduces variance by adjusting for pre-experiment behavior, letting you detect smaller effects with the same sample size.
generalFree Textmodule-07
"DAU dropped 10%." What's your first question?
Answer & Edit
"Is this compared to last week or same week last year? Is it all platforms/regions, or concentrated somewhere?" — clarify before decomposing.
generalFree Textmodule-07
What does the sigmoid function do in logistic regression?
Answer & Edit
Squashes any real number to (0,1), making the output interpretable as a probability.
generalFree Textmodule-07
When should you drop NaN vs fill it?
Answer & Edit
Drop if random + few rows. Fill if data is valuable, but check whether missingness itself is informative.
generalFree Textmodule-07
Tell your "failure" STAR story in under 2 minutes. (Actually do this out loud.)
Answer & Edit
(Self-grade: Did you use "I"? Quantify? Show learning? Under 2 min?)
mlFree Textmodule-08
Your interviewer asks: "Why not just oversample the minority class by duplicating rows?"
Answer & Edit
Simple duplication doesn't add new information — the model just sees the same examples multiple times, which can lead to overfitting on those exact examples. SMOTE is better because it creates NEW synthetic points between existing minority samples, adding diversity. But even SMOTE can create unrealistic samples if the feature space is sparse.
terminologyFree Textmodule-08
You're building a real-time fraud detector. CP or AP system? Why?
Answer & Edit
CP — you need consistency. A payment system can't serve stale data about account balances or fraud flags. It's better to briefly reject a transaction during a network partition than to approve a fraudulent one based on outdated info.
generalFree Textmodule-08
Why is accuracy bad for imbalanced data? What metric do you use instead?
Answer & Edit
Predicting majority class always = high accuracy, zero usefulness. Use F1, PR-AUC, or recall depending on cost of false positives vs false negatives.
generalFree Textmodule-08
What is SMOTE and what's the critical rule for using it?
Answer & Edit
Creates synthetic minority samples by interpolating between existing ones. ONLY apply to training data — never test/validation (data leakage).
generalFree Textmodule-08
Explain the consecutive-days SQL trick in your own words.
Answer & Edit
Subtracting an incrementing row number from consecutive dates produces the same group value. Non-consecutive dates produce different values, creating separate groups you can count.
generalFree Textmodule-08
When do you use pivot_table vs melt in Pandas?
Answer & Edit
Pivot: aggregate + reshape to see a metric across two dimensions. Melt: convert column headers into row values (wide → long format).
generalFree Textmodule-08
CAP theorem: what do the three letters stand for?
Answer & Edit
Consistency, Availability, Partition tolerance.
generalFree Textmodule-08
p = 0.08 at α = 0.05. Is the null hypothesis true?
Answer & Edit
No — "fail to reject H₀" ≠ "H₀ is true." It means insufficient evidence at this sample size.
mlFree Textmodule-09
You're predicting daily stock prices. Can you use standard 5-fold CV?
Answer & Edit
No — time series data has temporal dependencies. Random folds would leak future information into training. Use walk-forward or time-series split.
mlFree Textmodule-09
Your decision tree has 100% training accuracy. Good or bad?
Answer & Edit
Bad — almost certainly overfitting. An unconstrained tree can memorize every training example by creating a leaf for each one. Regularize with max_depth, min_samples_leaf.
mlFree Textmodule-09
"Total app downloads reached 50 million!" Is this a good success metric?
Answer & Edit
No — it's a vanity metric. It only goes up and says nothing about engagement. Many downloaded apps are never opened. Better: DAU, d7 retention, or MAU with an engagement threshold.
sqlCode Practicemodule-09
Write a query to get the first day of each user's signup month.
Answer & Edit
`SELECT user_id, DATE_TRUNC('month', signup_date) AS signup_month FROM users`
generalFree Textmodule-09
Why can't you use standard k-fold CV on time series data?
Answer & Edit
Random splits leak future data into training, violating temporal order. Use walk-forward/time-series split.
generalFree Textmodule-09
What is Gini impurity? What value means maximum impurity for binary classification?
Answer & Edit
Gini = 1 - Σ(pᵢ²). For binary: max impurity = 0.5 (50/50 class split).
generalFree Textmodule-09
Name 3 North Star metrics for 3 different companies.
What does a Python decorator actually do mechanically?
Answer & Edit
It replaces the original function with a wrapper function that calls the original but adds behavior (logging, timing, caching, etc.).
generalFree Textmodule-09
`DATE_TRUNC('month', '2024-07-18')` returns what?
Answer & Edit
2024-07-01 (first day of the month).
generalFree Textmodule-09
Your decision tree has 100% train accuracy and 60% test accuracy. Diagnose.
Answer & Edit
Overfitting — 40% gap between train and test. Regularize: increase min_samples_leaf, decrease max_depth, or switch to Random Forest/boosting.
statsFree Textmodule-10
You sample 100 customer wait times. Mean = 5 min, SD = 3 min. What's the standard error of the mean?
Answer & Edit
SE = σ/√n = 3/√100 = 3/10 = **0.3 minutes.** Even though individual wait times vary widely (SD=3), the mean of 100 observations is estimated with much more precision.
statsFree Textmodule-10
Sample mean = 50, SD = 12, n = 36. What's the 95% CI?
Answer & Edit
SE = 12/√36 = 2. CI = 50 ± 1.96 × 2 = 50 ± 3.92 = **(46.08, 53.92)**
mlFree Textmodule-10
"What IS a loss function?" (Explain simply.)
Answer & Edit
A measure of how wrong the model's predictions are. Lower = better. MSE for regression (average squared error), log loss for classification (penalizes confident wrong predictions heavily). Gradient descent minimizes this function.
terminologyFree Textmodule-10
You're building a dashboard that shows the CEO daily revenue summaries. Batch or streaming?
Answer & Edit
Batch — daily summaries don't need real-time processing. Run a daily ETL job overnight. Reserve streaming for things that need sub-second response (fraud detection, live pricing).
generalFree Textmodule-10
State the CLT in one sentence.
Answer & Edit
Regardless of the underlying distribution, sample means approach a normal distribution as sample size increases, with mean = population mean and SD = σ/√n.
generalFree Textmodule-10
Sample mean = 100, SD = 20, n = 64. What's the 95% CI?
Answer & Edit
SE = 20/√64 = 2.5. CI = 100 ± 1.96 × 2.5 = 100 ± 4.9 = **(95.1, 104.9)**
generalFree Textmodule-10
What does a 95% confidence interval ACTUALLY mean?
Answer & Edit
If you repeated the experiment 100 times, ~95 of the resulting intervals would contain the true value. NOT "95% probability the true value is in this interval."
generalFree Textmodule-10
EXISTS vs IN — when do you use each?
Answer & Edit
EXISTS for large tables/correlated subqueries (stops at first match). IN for small static value lists.
generalFree Textmodule-10
What is a correlated subquery?
Answer & Edit
A subquery that references the outer query, re-executing for each outer row.
generalFree Textmodule-10
Learning rate too high → what happens? Too low?
Answer & Edit
Too high → overshoots, oscillates, may diverge. Too low → very slow convergence.
generalFree Textmodule-10
Batch vs streaming — which for a real-time fraud alert system?
A column has 2 million rows but only 5 unique string values. How do you reduce its memory?
Answer & Edit
`df['col'] = df['col'].astype('category')` — stores 5 unique values + integer codes instead of 2M full strings. Can reduce memory 95%+.
generalFree Textmodule-11
Walk through a Fermi estimate for "How many pizza deliveries happen in Chicago per day?"
Answer & Edit
~2.8M population in city proper. Maybe 20% order pizza weekly = 560K orders/week ÷ 7 = ~80K/day. Of those, maybe 60% delivery = ~48K. Sanity check: there are ~2,000 pizza places × ~25 deliveries/day ≈ 50K. ✓
generalFree Textmodule-11
Tabular data: deep learning or gradient boosting? Why?
Answer & Edit
Gradient boosting — faster, more interpretable, needs less data, usually outperforms DL on tabular data.
generalFree Textmodule-11
How does converting a string column to category type save memory?
Answer & Edit
Stores unique values once + integer codes per row, instead of full string per row. For 2M rows with 5 unique values, goes from storing 2M strings to 5 strings + 2M tiny integers.
What's a star schema? Name the two types of tables.
Answer & Edit
Central fact table (transactions/events) surrounded by dimension tables (who, what, where, when). Optimized for analytical queries with simple, predictable joins.
generalFree Textmodule-11
Bayes — a defective item from a two-machine factory. Can you set up the formula without looking?
BERT produces contextual embeddings — the same word gets different representations depending on surrounding words. Word2Vec gives each word a single fixed vector regardless of context.
terminologyFree Textmodule-12
A time series of daily ice cream sales has an upward slope and spikes every summer. Name the two components.
Answer & Edit
Trend (upward slope) and seasonality (annual summer spikes).
sqlCode Practicemodule-12
When would you actually need UNION (not UNION ALL)?
Answer & Edit
When the two tables might have overlapping rows and you want each unique row exactly once. Example: combining a "customers who bought" list with a "customers who browsed" list, and you want a unique list of all customers who did either.
generalFree Textmodule-12
BERT vs Word2Vec — key difference in one sentence?
Answer & Edit
BERT gives contextual embeddings (same word → different vector based on context); Word2Vec gives one fixed vector per word.
generalFree Textmodule-12
What is stationarity and how do you test for it?
Answer & Edit
Statistical properties don't change over time. Test with ADF (Augmented Dickey-Fuller). Fix with differencing.
generalFree Textmodule-12
What is model drift? Name the two types.
Answer & Edit
Model performance degrades as real-world data changes. Data drift (input distribution changes) and concept drift (input-output relationship changes).
generalFree Textmodule-12
"DAU down, revenue up" — give 2 possible explanations.
Answer & Edit
(a) Churned users were low-value, remaining users spend more. (b) Price increase drove away price-sensitive users.
generalFree Textmodule-12
UNION vs UNION ALL — which is faster and why?
Answer & Edit
UNION ALL — no deduplication step needed, no sorting/comparing.
generalFree Textmodule-12
SE = σ/√n. If n goes from 100 to 400, what happens to SE?
Answer & Edit
Halves. SE = σ/√400 = σ/20 vs σ/√100 = σ/10. √n doubled, so SE halved.
mlFree Textmodule-13
You have 200 features. After PCA, the first 15 components explain 95% of variance. What do you do?
Answer & Edit
Keep 15 components, drop the rest. You've reduced from 200 to 15 dimensions while retaining 95% of the information. The remaining 185 components mostly capture noise.
generalFree Textmodule-13
Name 2 ways Chicago DS interviews differ from FAANG.
Answer & Edit
More modeling questions (less LeetCode), domain knowledge heavily weighted, take-home assignments standard, rigid STAR behavioral.
generalFree Textmodule-13
What is PCA? Is it feature selection?
Answer & Edit
PCA finds new axes (principal components) that capture maximum variance, then keeps the top k. It's NOT feature selection — each component is a linear combination of ALL original features.
generalFree Textmodule-13
WHERE vs HAVING: which filters groups after aggregation?
Answer & Edit
HAVING — WHERE filters individual rows before grouping.
generalFree Textmodule-13
What does `if __name__ == '__main__'` do?
Answer & Edit
Checks if the script is run directly (executes the code) vs imported as a module (skips the code).
generalFree Textmodule-13
Recite the A/B test design steps from memory.
Answer & Edit
(1) Hypothesis (2) Metrics (primary/secondary/guardrail) (3) Sample size (4) Duration ≥2 weeks (5) Randomize by user (6) Analyze at end date — no peeking (7) Statistical + practical significance.
generalFree Textmodule-13
Your SWE background — how is this an advantage for Chicago DS roles?
Answer & Edit
Most Chicago DS candidates come from analytics/academia. Your ability to build end-to-end systems (pipelines, deployment, production code) is rare and highly valued. Lean into this in every behavioral.
terminologyFlashcardmodule-14
What does ETL stand for?
Answer & Edit
Extract, Transform, Load.
terminologyFlashcardmodule-14
OLAP vs OLTP in one sentence each?
Answer & Edit
OLTP: fast individual transactions (app backend). OLAP: complex analytical queries across millions of rows (BI/dashboards).
terminologyFlashcardmodule-14
What is the CAP theorem?
Answer & Edit
Distributed systems can guarantee at most 2 of 3: Consistency, Availability, Partition tolerance. Since partitions happen, you choose CP or AP.
terminologyFlashcardmodule-14
Batch vs streaming?
Answer & Edit
Batch: process data on a schedule (reports, ETL). Streaming: process in real-time as it arrives (fraud, live dashboards).
terminologyFlashcardmodule-14
What is a feature store?
Answer & Edit
Centralized system for storing/serving ML features, ensuring consistency between training and production.
sqlCode Practicemodule-14
Top 3 highest-spending customers. Just write it.
Answer & Edit
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 3;
sqlCode Practicemodule-14
Second highest salary per department.
Answer & Edit
WITH ranked AS (
SELECT department, employee, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employees
)
SELECT * FROM ranked WHERE rnk = 2;
sqlCode Practicemodule-14
Month-over-month growth rate.
Answer & Edit
WITH monthly AS (
SELECT DATE_TRUNC('month', order_date) AS month, SUM(revenue) AS rev
FROM orders GROUP BY 1
)
SELECT month, rev,
ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY month))
/ NULLIF(LAG(rev) OVER (ORDER BY month), 0), 2) AS growth_pct
FROM monthly;
sqlCode Practicemodule-14
Users who exist in signups but never made a purchase.
Answer & Edit
SELECT s.user_id
FROM signups s
LEFT JOIN purchases p ON s.user_id = p.user_id
WHERE p.user_id IS NULL;
sqlCode Practicemodule-14
Click-through rate: clicks / impressions per campaign.
Answer & Edit
SELECT campaign_id,
ROUND(100.0 *
SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) /
NULLIF(SUM(CASE WHEN event = 'impression' THEN 1 ELSE 0 END), 0)
, 2) AS ctr
FROM events GROUP BY campaign_id;
statsFree Textmodule-14
State Bayes' theorem and work through: 1% disease rate, 95% sensitivity, 3% false positive. Person tests positive.
If H₀ were true (no effect), there's only a 4% chance of observing data this extreme. At α=0.05, we reject H₀.
statsFree Textmodule-14
Type I vs Type II?
Answer & Edit
Type I (α) = false positive, crying wolf. Type II (β) = false negative, missing the wolf. Power = 1-β ≥ 0.80.
statsFree Textmodule-14
Why always switch in Monty Hall?
Answer & Edit
Initial pick = 1/3. Other two doors = 2/3. Host reveals one is wrong, remaining door gets the full 2/3. Switching = 2/3 win rate.
statsFree Textmodule-14
95% confidence interval: what does it ACTUALLY mean?
Answer & Edit
If you repeated the experiment 100 times, ~95 of the intervals would contain the true value. NOT "95% probability the true value is in this specific interval."
mlFree Textmodule-14
Bias-variance: your model has 97% train accuracy, 68% test accuracy.
Answer & Edit
Overfitting (high variance). Big gap. Fix: more data, regularize, simplify, dropout.
mlFree Textmodule-14
L1 vs L2: 300 features, you think ~20 matter.
Answer & Edit
L1 (Lasso) — drives irrelevant features to exactly zero. Built-in feature selection.
mlFree Textmodule-14
RF vs XGBoost: you have noisy data and 1 hour before deadline.
Answer & Edit
Random Forest — good defaults, resistant to noise, hard to screw up. XGBoost needs tuning.
mlFree Textmodule-14
You built a fraud model with 99.5% accuracy. Your manager is impressed. Should you be?
Answer & Edit
No — if fraud is 0.5% of transactions, predicting "not fraud" always = 99.5% accuracy. Check precision, recall, F1, PR-AUC.
mlFree Textmodule-14
Can you apply L1/L2 to Random Forest?
Answer & Edit
No. L1/L2 regularize coefficients. Trees have none. Trees regularize via max_depth, min_samples_leaf, etc.
pythonCode Practicemodule-14
What's wrong with:
Answer & Edit
Chained indexing — may modify a copy. Fix: `df.loc[df['x'] > 5, 'y'] = 10`
pythonCode Practicemodule-14
You need each employee's department average as a new column.
Answer & Edit
`transform()` — keeps the same number of rows.
pythonCode Practicemodule-14
`df.loc[0:3]` returns how many rows?
Answer & Edit
4 rows (labels 0, 1, 2, 3 — loc is inclusive on both ends).
pythonCode Practicemodule-14
Why is NumPy faster than Python lists?
Answer & Edit
Contiguous memory (cache-friendly), homogeneous types (no per-element type checking), vectorized C operations (no Python loop overhead). 10-100x faster.
pythonCode Practicemodule-14
List vs generator for 10M items?
Answer & Edit
Generator — O(1) memory (yields one at a time) vs O(n) memory for list (stores all at once).
productFree Textmodule-14
"How would you measure success of a search feature?" Use GAME.
Answer & Edit
- G: Reduce time to find what users want, increase actions from search
- A: Type query → see results → click result → complete action
- M: Primary — search-to-action conversion. Secondary — zero-result rate, avg result position clicked. Guardrail — page load time, user satisfaction
- E: High clicks but low conversions = results look relevant but aren't. Track post-click behavior.
productFree Textmodule-14
"Revenue dropped 15% this quarter." First 3 things you do.
Answer & Edit
(1) Clarify: vs last quarter or YoY? All segments? (2) Decompose: Revenue = Users × Conversion × AOV — which dropped? (3) Check internal (deployments, bugs, logging changes) before external (seasonality, competitors).
productFree Textmodule-14
Estimate: How many data scientists work in Chicago?
Answer & Edit
~5,000 companies with 500+ employees in Chicago. Maybe 30% have DS teams averaging ~5 DS each. 1,500 × 5 = ~7,500. Plus smaller companies and consultancies → maybe ~10-12K. (Rough but defensible.)
behavioralStar Practicemodule-14
"Tell me about a time you used data to influence a decision."
Answer & Edit
*Use your prepared STAR story. Time yourself.*
behavioralStar Practicemodule-14
"Describe a failed project."
Answer & Edit
*Own it. Show learning. End on what you do differently now.*
behavioralStar Practicemodule-14
"How do you explain a model to a non-technical stakeholder?"
Answer & Edit
*Focus on: no jargon, use visualization/analogy, tie to business outcome.*