A Friendly Introduction to Machine Learning: Purpose, Validation, and the Bias–Variance Tradeoff
Machine Learning, Prediction, and Economics Series 1
This post provides a friendly introduction to machine learning fundamentals, explaining how it differs from causal analysis, how to evaluate models with cross-validation, and the crucial bias-variance tradeoff.
Why Machine Learning, and How It Differs from Causal Analysis
Machine Learning (ML) is a toolbox for finding patterns in data that are useful for prediction. Given inputs $X$ and an outcome $Y$, the goal is to learn a function $f$ such that $\hat{Y} = f(X)$ predicts $Y$ for new $X$ with small error. Examples include forecasting demand, ranking search results, or classifying images.
Causal analysis answers a different question. Instead of asking "what will $Y$ be for this $X$," it asks "what would $Y$ be if we changed something." This is about consequences of interventions.
Let me give you an example from my job market paper on kelp forest restoration:
- Predictive question: How much kelp biomass will this reef have next summer given current urchin density and forecasted temperature anomalies?
- Causal question: What is the effect of a targeted urchin removal program on kelp biomass and subsequent abalone habitat over the next year?
Both are valuable. Prediction guides decisions that rely on accurate foresight. Causal analysis guides policy interventions that involve pulling levers and understanding their effects.
Rule of thumb. If your objective is to minimize out-of-sample prediction error, you're in an ML setting. If your objective is to estimate what would happen under an alternative action, you're in a causal inference setting.
How We Evaluate ML Models: Cross-Validation in Plain Terms
A reliable ML workflow measures how well a model generalizes. Training error alone is optimistic because the model has already seen those data. In other words, let $Y$ be the outcome of interest and $X$ be the predictor variables. The training data have already been used to "supervise" the model in learning how to predict these known outcomes. This setup is called supervised learning—and even a simple linear regression is a form of supervised learning.
Cross-validation (CV) simulates prediction on unseen data by splitting the dataset into $K$ parts (folds), training on $K-1$ parts, and evaluating on the held-out part. We repeat for each fold and average the errors.
K-fold Cross-Validation with Mean Squared Error (MSE)
There are multiple ways to quantify prediction error—mean absolute error (MAE), mean squared error (MSE), and others. Today, let me explain the concept using MSE.
Denote $\hat{y}_i$ as the model's prediction for observation $i$. The squared error between the actual value $y_i$ and the prediction is $(y_i - \hat{y}_i)^2$.
In $K$-fold cross-validation, for each observation $i$ we train the model without the fold containing $i$ and compute its out-of-fold prediction; call this $\hat{y}_i^{(-k(i))}$. The $K$-fold estimate of out-of-sample mean squared error is

$$\mathrm{CV\text{-}MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{(-k(i))}\right)^2,$$

where $\hat{y}_i^{(-k(i))}$ is the prediction for $y_i$ from the model trained without using its fold $k(i)$.
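To make the formula concrete, here is a minimal Python sketch that computes the $K$-fold CV MSE exactly as defined above. The simulated data and the use of scikit-learn's `KFold` and `LinearRegression` are my own illustrative choices, not part of the definition.

```python
# A minimal sketch of K-fold CV MSE on simulated (hypothetical) data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 observations, 3 features (made up)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=100)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=0)
oof_pred = np.empty_like(y)                        # out-of-fold prediction for each observation

for train_idx, test_idx in kf.split(X):
    # Fit on K-1 folds, predict the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oof_pred[test_idx] = model.predict(X[test_idx])

cv_mse = np.mean((y - oof_pred) ** 2)              # K-fold estimate of out-of-sample MSE
print(f"{K}-fold CV MSE: {cv_mse:.3f}")
```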
A Toy Numeric Example
Let me show you a numerical example with 5 data points and $K = 5$. When $K = n$, this special case is called "leave-one-out cross-validation" (LOOCV). Suppose we have outcomes $y_1, y_2, \dots, y_5$, with no features for simplicity, and use the sample mean as the prediction rule. For each $i$, compute the mean of the other four points and use it to predict the held-out value. Denote this training-set mean by

$$\bar{y}_{-i} = \frac{1}{4}\sum_{j \neq i} y_j.$$

For each fold, we use $\bar{y}_{-i}$ as our prediction for the held-out value, then compute the squared error $(y_i - \bar{y}_{-i})^2$:
| Fold | Held-out $y_i$ | Train mean $\bar{y}_{-i}$ | Error $y_i - \bar{y}_{-i}$ | Squared error $(y_i - \bar{y}_{-i})^2$ |
|---|---|---|---|---|
| 1 | $y_1$ | $\bar{y}_{-1}$ | $y_1 - \bar{y}_{-1}$ | $(y_1 - \bar{y}_{-1})^2$ |
| 2 | $y_2$ | $\bar{y}_{-2}$ | $y_2 - \bar{y}_{-2}$ | $(y_2 - \bar{y}_{-2})^2$ |
| 3 | $y_3$ | $\bar{y}_{-3}$ | $y_3 - \bar{y}_{-3}$ | $(y_3 - \bar{y}_{-3})^2$ |
| 4 | $y_4$ | $\bar{y}_{-4}$ | $y_4 - \bar{y}_{-4}$ | $(y_4 - \bar{y}_{-4})^2$ |
| 5 | $y_5$ | $\bar{y}_{-5}$ | $y_5 - \bar{y}_{-5}$ | $(y_5 - \bar{y}_{-5})^2$ |
| Average CV MSE | | | | $\tfrac{1}{5}\sum_{i=1}^{5} (y_i - \bar{y}_{-i})^2$ |
Each row mimics a situation where a new point arrives that the model has not seen. We refit the model without that point, predict it, and measure the loss. Averaging the squared errors across the five folds gives the cross-validated mean squared error for this toy example.
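Here is the same leave-one-out computation as a short Python sketch. The five outcome values are hypothetical placeholders rather than the original numbers; the prediction rule is still the mean of the remaining four points.

```python
# LOOCV with the sample-mean prediction rule.
# The five outcomes below are hypothetical stand-ins for the toy example.
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical outcomes

squared_errors = []
for i in range(len(y)):
    train = np.delete(y, i)                 # leave out observation i
    y_hat = train.mean()                    # prediction = mean of the other four points
    squared_errors.append((y[i] - y_hat) ** 2)

cv_mse = np.mean(squared_errors)
print(f"LOOCV MSE: {cv_mse:.3f}")
```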
Why CV Works
By repeatedly training without the evaluation subset, CV approximates performance on new data from the same process. Choosing $K$ involves a tradeoff: larger $K$ uses more data for training (reducing bias) but may increase variance and computation. Common choices are $K = 5$ or $K = 10$.
Bias-Variance Decomposition
Consider data generated by $Y = f(X) + \varepsilon$, where $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.
For a learned predictor $\hat{f}(x)$ at a fixed point $x$, the expected prediction error decomposes as:

$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$
Derivation
Starting from the prediction error $Y - \hat{f}(x) = f(x) + \varepsilon - \hat{f}(x)$ and adding/subtracting $\mathbb{E}[\hat{f}(x)]$:

$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \mathbb{E}\Big[\big( \{f(x) - \mathbb{E}[\hat{f}(x)]\} + \{\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\} + \varepsilon \big)^2\Big]$$

Expanding the square yields three terms:

$$\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2 + \mathbb{E}\big[(\mathbb{E}[\hat{f}(x)] - \hat{f}(x))^2\big] + \mathbb{E}[\varepsilon^2] + \text{cross terms}$$

The cross terms vanish because:
- $\mathbb{E}\big[\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big] = 0$ by construction
- $\hat{f}(x)$ (based on training data) is independent of future noise $\varepsilon$, and $\mathbb{E}[\varepsilon] = 0$
Therefore:

$$\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$
Interpretation
This separates error into three parts:
- Bias reflects systematic misspecification. High bias comes from models that are too simple or constrained.
- Variance reflects sensitivity to the particular sample. High variance comes from overfitting flexible models to noise.
- Irreducible error comes from genuine randomness or unobserved factors. No model can eliminate it.
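To see the three pieces numerically, here is a minimal Monte Carlo sketch. The sine data-generating process, the noise level, and the two polynomial degrees are illustrative assumptions on my part; the point is only that the simple fit shows higher bias and lower variance than the flexible fit.

```python
# A minimal Monte Carlo sketch of the bias-variance decomposition at a fixed point x0.
# The sine DGP, noise level, and polynomial degrees are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                      # true regression function (assumed)
sigma = 0.3                     # noise standard deviation (assumed)
x0 = 1.0                        # evaluation point
n, n_sims = 30, 2000

def fit_and_predict(degree):
    """Fit a polynomial of the given degree on a fresh sample and predict at x0."""
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 3, size=n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        coefs = np.polyfit(x, y, deg=degree)
        preds[s] = np.polyval(coefs, x0)
    return preds

for degree in (1, 7):
    preds = fit_and_predict(degree)
    bias2 = (preds.mean() - f(x0)) ** 2      # squared bias at x0
    variance = preds.var()                    # variance across training samples
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}, "
          f"irreducible = {sigma**2:.4f}")
```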
What ML Tries to Do
Given finite data, we manage a bias–variance tradeoff. A more flexible model can reduce bias by capturing complex patterns, but it may increase variance by chasing noise. Regularization, model selection, and ensembling are tools to navigate this tradeoff:
- Regularization (e.g., Lasso, Ridge) constrains complexity to reduce variance while maintaining enough flexibility to keep bias acceptable — this is what I will discuss in the next posting.
- Model selection with CV chooses hyperparameters or selects among candidate models to minimize estimated out-of-sample error (a sketch follows below).
- Ensembles (e.g., bagging, boosting, random forests) reduce variance through averaging or reduce bias through staged fitting.
Machine learning seeks to minimize total prediction error (squared bias plus variance), while causal inference prioritizes eliminating bias in the estimated effect, even at the cost of some variance.
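To make the model-selection bullet concrete, here is a minimal sketch that picks a Ridge penalty by 5-fold CV. The simulated data, the candidate penalty grid, and the choice of Ridge regression are illustrative assumptions, not a prescription.

```python
# A minimal sketch of hyperparameter selection with cross-validation.
# Data, candidate penalties, and the Ridge model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))                       # many features relative to n (made up)
beta = np.zeros(20)
beta[:3] = [2.0, -1.0, 0.5]                          # only a few coefficients truly matter
y = X @ beta + rng.normal(scale=1.0, size=80)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_mse = {}
for alpha in candidate_alphas:
    # scikit-learn returns negative MSE, so flip the sign back.
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)             # penalty with the lowest estimated MSE
print({a: round(m, 3) for a, m in cv_mse.items()})
print("selected alpha:", best_alpha)
```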
Takeaways and What Comes Next
- ML is about accurate prediction, while causal analysis is about the effect of interventions.
- K-fold cross-validation provides an honest estimate of generalization error. The toy example illustrates how to compute CV MSE step by step.
- The bias–variance decomposition explains why model choice and regularization matter. Good practice balances flexibility and stability.
This post sets the stage for a hands-on showcase in the next article. I'll present a practical ML workflow that uses cross-validation for model selection, reports uncertainty clearly, and communicates performance in a way that decision-makers can trust.