A Friendly Introduction to Machine Learning: Purpose, Validation, and the Bias–Variance Tradeoff

Machine Learning, Prediction, and Economics Series 1

🏷️ Machine Learning · 📊 Cross-Validation

This post provides a friendly introduction to machine learning fundamentals, explaining how it differs from causal analysis, how to evaluate models with cross-validation, and the crucial bias-variance tradeoff.

Why Machine Learning, and How It Differs from Causal Analysis

Machine Learning (ML) is a toolbox for finding patterns in data that are useful for prediction. Given inputs $x$ and an outcome $y$, the goal is to learn a function $\hat f(x)$ that predicts $y$ for new $x$ with small error. Examples include forecasting demand, ranking search results, or classifying images.

Causal analysis answers a different question. Instead of asking "what will $y$ be for this $x$," it asks "what would $y$ be if we changed something." This is about the consequences of interventions.

Let me give you an example from my job market paper on kelp forest restoration:

  • Predictive question: How much kelp biomass will this reef have next summer given current urchin density and forecasted temperature anomalies?
  • Causal question: What is the effect of a targeted urchin removal program on kelp biomass and subsequent abalone habitat over the next year?

Both are valuable. Prediction guides decisions that rely on accurate foresight. Causal analysis guides policy interventions that involve pulling levers and understanding their effects.

Rule of thumb. If your objective is to minimize out-of-sample prediction error, you're in an ML setting. If your objective is to estimate what would happen under an alternative action, you're in a causal inference setting.

How We Evaluate ML Models: Cross-Validation in Plain Terms

A reliable ML workflow measures how well a model generalizes. Training error alone is optimistic because the model has already seen those data: if $y$ is the outcome of interest and $\mathbf{X}$ are the predictor variables, the training data have already been used to "supervise" the model in learning how to predict these known outcomes. This setup is called supervised learning, and even a simple linear regression is a form of supervised learning.
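To see this optimism in code, here is a minimal sketch (not from the original post; it assumes NumPy and scikit-learn and uses a made-up data-generating process) that fits a flexible supervised learner and compares its error on the training data against a held-out set.

```python
# Minimal sketch: training error vs. held-out error for a flexible model.
# The sine-plus-noise data-generating process is purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)  # true f(x) = sin(x) plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately flexible supervised learner: degree-10 polynomial regression.
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
# The training MSE is typically noticeably smaller than the held-out MSE.
```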

Cross-validation (CV) simulates prediction on unseen data by splitting the dataset into $K$ parts (folds), training on $K-1$ parts, and evaluating on the held-out part. We repeat for each fold and average the errors.

K-fold Cross-Validation with Mean Squared Error (MSE)

There are multiple ways to quantify prediction error—mean absolute error (MAE), mean squared error (MSE), and others. Today, let me explain the concept using MSE.

Denote $\hat y_i$ as the model's prediction for observation $i$. The squared error between the actual value $y_i$ and the prediction $\hat y_i$ is $(y_i - \hat y_i)^2$.

In $K$-fold cross-validation, for each observation $i$ we train the model without the fold containing $i$ and compute its out-of-fold prediction; call this $\tilde y_i$. The $K$-fold estimate of out-of-sample mean squared error is

$$\mathrm{CV}_K \;=\; \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \tilde y_i\bigr)^2$$

where $\tilde y_i$ is the prediction for $i$ from the model trained without using its fold.
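The formula translates almost line-for-line into code. Here is a sketch (assuming NumPy and scikit-learn, with simulated data and linear regression standing in for "the model") that builds the out-of-fold predictions $\tilde y_i$ and averages the squared errors.

```python
# K-fold CV MSE computed by hand: KFold supplies the fold indices, and we refit
# the model on the K-1 training folds, predict the held-out fold, and average.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def kfold_cv_mse(X, y, K=5, seed=0):
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    y_tilde = np.empty_like(y, dtype=float)  # out-of-fold prediction for each observation
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        y_tilde[test_idx] = model.predict(X[test_idx])
    return np.mean((y - y_tilde) ** 2)  # CV_K = (1/n) * sum_i (y_i - y~_i)^2

# Example on simulated data (illustrative only):
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)
print("5-fold CV MSE:", kfold_cv_mse(X, y, K=5))
```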

A Toy Numeric Example

Let me show you a numerical example with 5 data points and $K=5$. When $n=K$, this special case is called "leave-one-out cross-validation" (LOOCV). Suppose we have outcomes

$$y_i \in \{1.6,\; 3.1,\; 2.6,\; 4.3,\; 2.1\}, \quad i=1,2,\ldots,5$$

with no features for simplicity, and use the sample mean as the prediction rule. For each $i$, compute the mean of the other $n-1$ points and use it to predict the held-out value. Denote this training-set mean by

$$\bar y_{-i} \;=\; \frac{1}{n-1}\sum_{j\neq i} y_j \quad \text{(the subscript $-i$ means ``leave $i$ out'')}$$

For each fold, we compute the mean of the remaining four points as our prediction for the held-out value, then compute the squared error:

| Fold | Held-out $y_i$ | Train mean $\bar y_{-i}$ | Error $y_i - \bar y_{-i}$ | Squared error |
|------|----------------|--------------------------|---------------------------|---------------|
| 1 | 1.6 | 3.025 | -1.425 | 2.031 |
| 2 | 3.1 | 2.650 | 0.450 | 0.203 |
| 3 | 2.6 | 2.775 | -0.175 | 0.031 |
| 4 | 4.3 | 2.350 | 1.950 | 3.803 |
| 5 | 2.1 | 2.900 | -0.800 | 0.640 |
| Average CV MSE | | | | 1.341 |

Each row mimics a situation where a new point arrives that the model has not seen. We refit the model without that point, predict it, and measure the loss. Averaging across the five folds gives the cross-validated mean squared error, which is $1.341$ for this toy example.
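If you want to check the arithmetic yourself, this short script (a sketch assuming NumPy) reproduces the table row by row and the final CV MSE of 1.341.

```python
# Leave-one-out CV with the sample mean as the prediction rule.
import numpy as np

y = np.array([1.6, 3.1, 2.6, 4.3, 2.1])
n = len(y)

squared_errors = []
for i in range(n):
    y_train = np.delete(y, i)   # leave observation i out
    y_pred = y_train.mean()     # "model": mean of the remaining n-1 points
    squared_errors.append((y[i] - y_pred) ** 2)
    print(f"fold {i + 1}: held-out {y[i]:.1f}, train mean {y_pred:.3f}, "
          f"squared error {(y[i] - y_pred) ** 2:.3f}")

print("LOOCV MSE:", round(np.mean(squared_errors), 3))  # 1.341
```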

Why CV Works

By repeatedly training without the evaluation subset, CV approximates performance on new data from the same process. Choosing $K$ involves a tradeoff: larger $K$ uses more data for training (reducing bias) but may increase variance and computation. Common choices are $K=5$ or $K=10$.
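As a quick illustration (a sketch on simulated data, assuming scikit-learn and NumPy; not from the original post), the same model can be scored with $K=5$ and $K=10$. The two point estimates are usually close, while larger $K$ means more refits.

```python
# Compare 5-fold and 10-fold CV estimates for the same linear model.
# cross_val_score reports negative MSE per fold, so the sign is flipped.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=150)

for K in (5, 10):
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=K, scoring="neg_mean_squared_error")
    print(f"K={K:>2}: CV MSE = {-scores.mean():.3f}  (per-fold SD = {scores.std():.3f})")
```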

Bias–Variance Decomposition

Consider data generated by $y = f(x) + \varepsilon$ where $\mathbb{E}[\varepsilon \mid x] = 0$ and $\mathrm{Var}(\varepsilon \mid x) = \sigma^2$.

For a learned predictor $\hat f$ at fixed $x$, the expected prediction error decomposes as:

$$\underbrace{\mathbb{E}\bigl[(\hat f(x) - y)^2\bigr]}_{\text{expected prediction error}} \;=\; \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2} \;+\; \underbrace{\mathrm{Var}\bigl(\hat f(x)\bigr)}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}$$

Derivation

Starting from the prediction error and adding and subtracting $\mathbb{E}[\hat f(x)]$:

$$\begin{aligned} \mathbb{E}\bigl[(\hat f(x) - y)^2\bigr] &= \mathbb{E}\bigl[(\hat f(x) - f(x) - \varepsilon)^2\bigr]\\ &= \mathbb{E}\Bigl[\bigl(\underbrace{\hat f(x) - \mathbb{E}[\hat f(x)]}_{\text{deviation}} + \underbrace{\mathbb{E}[\hat f(x)] - f(x)}_{\text{bias}} - \varepsilon\bigr)^2\Bigr] \end{aligned}$$

Expanding the square yields three squared terms plus cross terms that turn out to be zero:

$$\begin{aligned} &= \mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[\hat f(x)] - f(x))^2\bigr] + \mathbb{E}[\varepsilon^2]\\ &\quad + \underbrace{2\,\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])\bigl(\mathbb{E}[\hat f(x)] - f(x) - \varepsilon\bigr)\bigr] \;-\; 2\,\mathbb{E}\bigl[\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)\varepsilon\bigr]}_{=\,0} \end{aligned}$$

The cross terms vanish because:

  • $\mathbb{E}[\hat f(x) - \mathbb{E}[\hat f(x)]] = 0$ by construction
  • $\hat f(x)$ (based on the training data) is independent of the future noise $\varepsilon$
  • $\mathbb{E}[\varepsilon \mid x] = 0$, so any term that is linear in $\varepsilon$ has zero expectation

Therefore, using $\mathbb{E}[\varepsilon^2] = \mathrm{Var}(\varepsilon \mid x) = \sigma^2$ (since $\mathbb{E}[\varepsilon \mid x] = 0$):

$$\mathbb{E}\bigl[(\hat f(x) - y)^2\bigr] = \underbrace{\mathrm{Var}(\hat f(x))}_{\text{variance}} + \underbrace{(\mathbb{E}[\hat f(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{irreducible error}}$$

Interpretation

This separates error into three parts:

  • Bias reflects systematic misspecification. High bias comes from models that are too simple or constrained.
  • Variance reflects sensitivity to the particular sample. High variance comes from overfitting flexible models to noise.
  • Irreducible error $\sigma^2$ comes from genuine randomness or unobserved factors. No model can eliminate it. A short simulation after this list makes the three components concrete.
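The sketch below (assuming NumPy and scikit-learn; the true function, noise level, and polynomial degrees are made up for illustration) estimates bias$^2$ and variance at a fixed point by refitting each model on many fresh training sets drawn from the same process, and compares them against the known $\sigma^2$.

```python
# Monte Carlo estimate of bias^2 and variance at a fixed point x0, for models
# of increasing flexibility fit to y = sin(x) + noise.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
f = np.sin                 # true regression function f(x)
sigma = 0.5                # noise SD, so irreducible error sigma^2 = 0.25
x0 = np.array([[1.5]])     # fixed evaluation point

def simulate(degree, n_train=30, n_sims=2000):
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.uniform(-3, 3, size=(n_train, 1))
        y = f(X[:, 0]) + rng.normal(scale=sigma, size=n_train)
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        preds[s] = model.fit(X, y).predict(x0)[0]
    bias2 = (preds.mean() - f(x0[0, 0])) ** 2
    return bias2, preds.var()

for degree in (1, 3, 12):
    bias2, var = simulate(degree)
    print(f"degree {degree:>2}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"irreducible = {sigma**2:.4f}")
# Typical pattern: the degree-1 fit has high bias and low variance, the degree-12
# fit has low bias and higher variance, and sigma^2 is the floor neither can remove.
```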

What ML Tries to Do

Given finite data, we manage a bias–variance tradeoff. A more flexible model can reduce bias by capturing complex patterns, but it may increase variance by chasing noise. Regularization, model selection, and ensembling are tools to navigate this tradeoff:

  • Regularization (e.g., Lasso, Ridge) constrains complexity to reduce variance while maintaining enough flexibility to keep bias acceptable; this is the topic of the next post.
  • Model selection with CV chooses hyperparameters or selects among candidate models to minimize estimated out-of-sample error, as sketched after this list.
  • Ensembles (e.g., bagging, boosting, random forests) reduce variance through averaging or reduce bias through staged fitting.
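Here is a minimal sketch of CV-based model selection (assuming scikit-learn and NumPy, on simulated data; Ridge appears only as an example learner ahead of the next post): a grid of penalty strengths is scored by 5-fold CV and the one with the smallest estimated out-of-sample MSE is chosen.

```python
# Choose the Ridge penalty strength by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 20))
beta = np.concatenate([np.array([2.0, -1.5, 1.0]), np.zeros(17)])  # mostly noise features
y = X @ beta + rng.normal(size=120)

pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-2, 3, 20)},  # candidate penalties
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["ridge__alpha"])
print("estimated out-of-sample MSE:", -search.best_score_)
```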

Machine learning seeks to minimize bias$^2$ + variance, while causal inference prioritizes minimizing bias alone.

Takeaways and What Comes Next

  • ML is about accurate prediction, while causal analysis is about the effect of interventions.
  • K-fold cross-validation provides an honest estimate of generalization error. The toy example illustrates how to compute CV MSE step by step.
  • The bias–variance decomposition explains why model choice and regularization matter. Good practice balances flexibility and stability.

This post sets the stage for a hands-on showcase in the next article. I'll present a practical ML workflow that uses cross-validation for model selection, reports uncertainty clearly, and communicates performance in a way that decision-makers can trust.

© 2025 Kyumin Kim. All rights reserved.