A Friendly Introduction to Machine Learning: Purpose, Validation, and the Bias–Variance Tradeoff

Machine Learning, Prediction, and Economics Series 1

🏷️ Machine Learning · 📊 Cross-Validation

This post provides a friendly introduction to machine learning fundamentals, explaining how it differs from causal analysis, how to evaluate models with cross-validation, and the crucial bias-variance tradeoff.

Why Machine Learning, and How It Differs from Causal Analysis

Machine Learning (ML) is a toolbox for finding patterns in data that are useful for prediction. Given inputs $x$ and an outcome $y$, the goal is to learn a function $\hat f(x)$ that predicts $y$ for new $x$ with small error. Examples include forecasting demand, ranking search results, or classifying images.

Causal analysis answers a different question. Instead of asking "what will $y$ be for this $x$," it asks "what would $y$ be if we changed something." This is about the consequences of interventions.

Let me give you an example from my job market paper on kelp forest restoration:

  • Predictive question: How much kelp biomass will this reef have next summer given current urchin density and forecasted temperature anomalies?
  • Causal question: What is the effect of a targeted urchin removal program on kelp biomass and subsequent abalone habitat over the next year?

Both are valuable. Prediction guides decisions that rely on accurate foresight. Causal analysis guides policy interventions that involve pulling levers and understanding their effects.

Rule of thumb. If your objective is to minimize out-of-sample prediction error, you're in an ML setting. If your objective is to estimate what would happen under an alternative action, you're in a causal inference setting.

How We Evaluate ML Models: Cross-Validation in Plain Terms

A reliable ML workflow measures how well a model generalizes. Training error alone is optimistic because the model has already seen those data: if $y$ is the outcome of interest and $\mathbf{X}$ are the predictor variables, the training data have already been used to "supervise" the model in learning how to predict these known outcomes. This setup is called supervised learning, and even a simple linear regression is a form of supervised learning.
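To see this optimism in code, here is a minimal sketch (not from the original post; it assumes NumPy and scikit-learn and uses a made-up data-generating process) that fits a flexible supervised learner and compares its error on the training data against a held-out set.

```python
# Minimal sketch: training error vs. held-out error for a flexible model.
# The sine-plus-noise data-generating process is purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)  # true f(x) = sin(x) plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately flexible supervised learner: degree-10 polynomial regression.
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
# The training MSE is typically noticeably smaller than the held-out MSE.
```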

Cross-validation (CV) simulates prediction on unseen data by splitting the dataset into $K$ parts (folds), training on $K-1$ parts, and evaluating on the held-out part. We repeat for each fold and average the errors.

K-fold Cross-Validation with Mean Squared Error (MSE)

There are multiple ways to quantify prediction error—mean absolute error (MAE), mean squared error (MSE), and others. Today, let me explain the concept using MSE.

Denote $\hat y_i$ as the model's prediction for observation $i$. The squared error between the actual value $y_i$ and the prediction $\hat y_i$ is $(y_i - \hat y_i)^2$.

In $K$-fold cross-validation, for each observation $i$ we train the model without the fold containing $i$ and compute its out-of-fold prediction; call this $\tilde y_i$. The $K$-fold estimate of out-of-sample mean squared error is

$$\mathrm{CV}_K \;=\; \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \tilde y_i\bigr)^2$$

where $\tilde y_i$ is the prediction for $i$ from the model trained without using its fold.
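The formula translates almost line-for-line into code. Here is a sketch (assuming NumPy and scikit-learn, with simulated data and linear regression standing in for "the model") that builds the out-of-fold predictions $\tilde y_i$ and averages the squared errors.

```python
# K-fold CV MSE computed by hand: KFold supplies the fold indices, and we refit
# the model on the K-1 training folds, predict the held-out fold, and average.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def kfold_cv_mse(X, y, K=5, seed=0):
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    y_tilde = np.empty_like(y, dtype=float)  # out-of-fold prediction for each observation
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_tilde[test_idx] = model.predict(X[test_idx])
    return np.mean((y - y_tilde) ** 2)  # CV_K = (1/n) * sum_i (y_i - y~_i)^2

# Example on simulated data (illustrative only):
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)
print("5-fold CV MSE:", kfold_cv_mse(X, y, K=5))
```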

A Toy Numeric Example

Let me show you a numerical example with 5 data points and $K=5$. When $n=K$, this special case is called "leave-one-out cross-validation" (LOOCV). Suppose we have outcomes

$$y_i \in \{1.6,\; 3.1,\; 2.6,\; 4.3,\; 2.1\}, \quad i=1,2,\ldots,5$$

with no features for simplicity, and use the sample mean as the prediction rule. For each $i$, compute the mean of the other $n-1$ points and use it to predict the held-out value. Denote this training-set mean by

$$\bar y_{-i} \;=\; \frac{1}{n-1}\sum_{j\neq i} y_j \quad \text{(the subscript $-i$ means ``leave $i$ out'')}$$

For each fold, we compute the mean of the remaining four points as our prediction for the held-out value, then compute the squared error:

| Fold | Held-out $y_i$ | Train mean $\bar y_{-i}$ | Error $y_i - \bar y_{-i}$ | Squared error |
|------|----------------|--------------------------|---------------------------|---------------|
| 1 | 1.6 | 3.025 | -1.425 | 2.031 |
| 2 | 3.1 | 2.650 | 0.450 | 0.203 |
| 3 | 2.6 | 2.775 | -0.175 | 0.031 |
| 4 | 4.3 | 2.350 | 1.950 | 3.803 |
| 5 | 2.1 | 2.900 | -0.800 | 0.640 |
| Average CV MSE | | | | 1.341 |

Each row mimics a situation where a new point arrives that the model has not seen. We refit the model without that point, predict it, and measure the loss. Averaging across the five folds gives the cross-validated mean squared error, which is $1.341$ for this toy example.
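If you want to check the arithmetic yourself, this short script (a sketch assuming NumPy) reproduces the table row by row and the final CV MSE of 1.341.

```python
# Leave-one-out CV with the sample mean as the prediction rule.
import numpy as np

y = np.array([1.6, 3.1, 2.6, 4.3, 2.1])
n = len(y)

squared_errors = []
for i in range(n):
    y_train = np.delete(y, i)   # leave observation i out
    y_pred = y_train.mean()     # "model": mean of the remaining n-1 points
    squared_errors.append((y[i] - y_pred) ** 2)
    print(f"fold {i + 1}: held-out {y[i]:.1f}, train mean {y_pred:.3f}, "
          f"squared error {(y[i] - y_pred) ** 2:.3f}")

print("LOOCV MSE:", round(np.mean(squared_errors), 3))  # 1.341
```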

Why CV Works

By repeatedly training without the evaluation subset, CV approximates performance on new data from the same process. Choosing $K$ involves a tradeoff: larger $K$ uses more data for training (reducing bias) but may increase variance and computation. Common choices are $K=5$ or $K=10$.
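As a quick illustration (a sketch on simulated data, assuming scikit-learn and NumPy; not from the original post), the same model can be scored with $K=5$ and $K=10$. The two point estimates are usually close, while larger $K$ means more refits.

```python
# Compare 5-fold and 10-fold CV estimates for the same linear model.
# cross_val_score reports negative MSE per fold, so the sign is flipped.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=150)

for K in (5, 10):
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=K, scoring="neg_mean_squared_error")
    print(f"K={K:>2}: CV MSE = {-scores.mean():.3f}  (per-fold SD = {scores.std():.3f})")
```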

Bias–Variance Decomposition

Consider data generated by $y = f(x) + \varepsilon$ where $\mathbb{E}[\varepsilon \mid x] = 0$ and $\mathrm{Var}(\varepsilon \mid x) = \sigma^2$.

For a learned predictor $\hat f$ at fixed $x$, the expected prediction error decomposes as:

$$\underbrace{\mathbb{E}\bigl[(\hat f(x) - y)^2\bigr]}_{\text{expected prediction error}} \;=\; \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2} \;+\; \underbrace{\mathrm{Var}\bigl(\hat f(x)\bigr)}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}$$

Derivation

Starting from the prediction error and adding and subtracting $\mathbb{E}[\hat f(x)]$:

$$\begin{aligned} \mathbb{E}\bigl[(\hat f(x) - y)^2\bigr] &= \mathbb{E}\bigl[(\hat f(x) - f(x) - \varepsilon)^2\bigr]\\ &= \mathbb{E}\Bigl[\bigl(\underbrace{\hat f(x) - \mathbb{E}[\hat f(x)]}_{\text{deviation}} + \underbrace{\mathbb{E}[\hat f(x)] - f(x)}_{\text{bias}} - \varepsilon\bigr)^2\Bigr] \end{aligned}$$

Expanding the square yields three squared terms plus cross terms that turn out to be zero:

$$\begin{aligned} &= \mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[\hat f(x)] - f(x))^2\bigr] + \mathbb{E}[\varepsilon^2]\\ &\quad + \underbrace{2\,\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])\bigl(\mathbb{E}[\hat f(x)] - f(x) - \varepsilon\bigr)\bigr] \;-\; 2\,\mathbb{E}\bigl[\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)\varepsilon\bigr]}_{=\,0} \end{aligned}$$

The cross terms vanish because:

  • $\mathbb{E}[\hat f(x) - \mathbb{E}[\hat f(x)]] = 0$ by construction
  • $\hat f(x)$ (based on the training data) is independent of the future noise $\varepsilon$
  • $\mathbb{E}[\varepsilon \mid x] = 0$, so any term that is linear in $\varepsilon$ has zero expectation

Therefore, using $\mathbb{E}[\varepsilon^2] = \mathrm{Var}(\varepsilon \mid x) = \sigma^2$ (since $\mathbb{E}[\varepsilon \mid x] = 0$):

$$\mathbb{E}\bigl[(\hat f(x) - y)^2\bigr] = \underbrace{\mathrm{Var}(\hat f(x))}_{\text{variance}} + \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{irreducible error}}$$

Interpretation

This separates error into three parts:

  • Bias reflects systematic misspecification. High bias comes from models that are too simple or constrained.
  • Variance reflects sensitivity to the particular sample. High variance comes from overfitting flexible models to noise.
  • Irreducible error $\sigma^2$ comes from genuine randomness or unobserved factors. No model can eliminate it. A short simulation after this list makes the three components concrete.
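The sketch below (assuming NumPy and scikit-learn; the true function, noise level, and polynomial degrees are made up for illustration) estimates bias$^2$ and variance at a fixed point by refitting each model on many fresh training sets drawn from the same process, and compares them against the known $\sigma^2$.

```python
# Monte Carlo estimate of bias^2 and variance at a fixed point x0, for models
# of increasing flexibility fit to y = sin(x) + noise.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
f = np.sin                 # true regression function f(x)
sigma = 0.5                # noise SD, so irreducible error sigma^2 = 0.25
x0 = np.array([[1.5]])     # fixed evaluation point

def simulate(degree, n_train=30, n_sims=2000):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.uniform(-3, 3, size=(n_train, 1))
        y = f(X[:, 0]) + rng.normal(scale=sigma, size=n_train)
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        preds[s] = model.fit(X, y).predict(x0)[0]
    bias2 = (preds.mean() - f(x0[0, 0])) ** 2
    return bias2, preds.var()

for degree in (1, 3, 12):
    bias2, var = simulate(degree)
    print(f"degree {degree:>2}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"irreducible = {sigma**2:.4f}")
# Typical pattern: the degree-1 fit has high bias and low variance, the degree-12
# fit has low bias and higher variance, and sigma^2 is the floor neither can remove.
```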

What ML Tries to Do

Given finite data, we manage a bias–variance tradeoff. A more flexible model can reduce bias by capturing complex patterns, but it may increase variance by chasing noise. Regularization, model selection, and ensembling are tools to navigate this tradeoff:

  • Regularization (e.g., Lasso, Ridge) constrains complexity to reduce variance while maintaining enough flexibility to keep bias acceptable; this is the topic of the next post.
  • Model selection with CV chooses hyperparameters or selects among candidate models to minimize estimated out-of-sample error, as sketched after this list.
  • Ensembles (e.g., bagging, boosting, random forests) reduce variance through averaging or reduce bias through staged fitting.
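Here is a minimal sketch of CV-based model selection (assuming scikit-learn and NumPy, on simulated data; Ridge appears only as an example learner ahead of the next post): a grid of penalty strengths is scored by 5-fold CV and the one with the smallest estimated out-of-sample MSE is chosen.

```python
# Choose the Ridge penalty strength by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 20))
beta = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(17)])  # mostly noise features
y = X @ beta + rng.normal(size=120)

pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-2, 3, 20)},  # candidate penalties
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["ridge__alpha"])
print("estimated out-of-sample MSE:", -search.best_score_)
```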

Machine learning seeks to minimize bias$^2$ + variance, while causal inference prioritizes minimizing bias alone.

Takeaways and What Comes Next

  • ML is about accurate prediction, while causal analysis is about the effect of interventions.
  • K-fold cross-validation provides an honest estimate of generalization error. The toy example illustrates how to compute CV MSE step by step.
  • The bias–variance decomposition explains why model choice and regularization matter. Good practice balances flexibility and stability.

This post sets the stage for a hands-on showcase in the next article. I'll present a practical ML workflow that uses cross-validation for model selection, reports uncertainty clearly, and communicates performance in a way that decision-makers can trust.

© 2025 Kyumin Kim. All rights reserved.