🔶 Python: Handling Missing Data
# Detect
df.isnull().sum() # NaN count per column
df.isnull().mean() # Fraction missing per column
# Remove
df.dropna() # Drop ANY row with NaN
df.dropna(subset=['critical_col']) # Drop rows only where critical_col is NaN
# Fill
df['col'].fillna(0) # Constant
df['col'].fillna(df['col'].mean()) # Column mean
df['col'].ffill() # Forward fill (last known value); fillna(method='ffill') is deprecated since pandas 2.1
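To see what each call above actually returns, here is a minimal sketch on a toy DataFrame. The data and the variable names (`missing_counts`, `dropped_any`, and so on) are made up for illustration; only the column names `col` and `critical_col` come from the snippets above. Forward fill is written as `.ffill()`, the current pandas spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "critical_col": [1.0, np.nan, 3.0, 4.0],
    "col": [10.0, np.nan, np.nan, 40.0],
})

# Detect
missing_counts = df.isnull().sum()      # NaN count per column
missing_fraction = df.isnull().mean()   # fraction missing per column

# Remove (both return new DataFrames; df itself is unchanged)
dropped_any = df.dropna()                             # keeps only rows with no NaN at all
dropped_subset = df.dropna(subset=["critical_col"])   # keeps rows where critical_col is present

# Fill (each returns a new Series; assign it back to keep the result)
filled_zero = df["col"].fillna(0)
filled_mean = df["col"].fillna(df["col"].mean())      # mean skips NaN: (10 + 40) / 2 = 25
filled_ffill = df["col"].ffill()                      # carries 10.0 forward into both gaps
```

Note that none of these calls modify `df` in place, so a line like `df['col'].fillna(0)` on its own has no lasting effect unless the result is assigned.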
Interview question: "When would you drop NaN vs fill it?"
Answer: Drop when the missingness looks random AND few rows are affected (rule of thumb: <5%). Fill when the data is valuable, the missingness has a pattern (time series → ffill), or the column is critical to the analysis. NEVER blindly fill with the mean: first check whether the missingness is itself informative (e.g., income=NaN might mean "refused to answer," which IS information).
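One common way to keep that information is to record a missingness flag before filling. The sketch below assumes a hypothetical `income` column with made-up values; the names `income_missing` and `income_filled` are illustrative, and the mean fill is only one possible choice once the flag is saved.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for this example.
df = pd.DataFrame({"income": [50000.0, np.nan, 60000.0, np.nan, 40000.0]})

# How much is missing? Compare against whatever drop threshold you use (e.g. 5%).
missing_fraction = df["income"].isna().mean()

# Record which rows were missing BEFORE filling, so the signal survives the fill.
df["income_missing"] = df["income"].isna()

# Fill in a separate column; the mean ignores NaN: (50000 + 60000 + 40000) / 3 = 50000.
df["income_filled"] = df["income"].fillna(df["income"].mean())
```

A downstream model or analysis can then use `income_missing` as its own feature, so "refused to answer" is not silently turned into an average earner.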