
Structural Estimation via NFXP: A Step-by-Step Guide to the Rust (1987) Bus Engine Model

Econometrics Series 1

🏷️ Structural Econometrics · 📊 Dynamic Discrete Choice · 🔧 NFXP

This post walks through the Rust (1987) bus engine replacement problem and explains, step by step, how the Nested Fixed-Point (NFXP) algorithm estimates structural parameters. This time, I attached Julia code for solving this problem. The goal is to build intuition for why each step is necessary and how the pieces connect.

Julia code

Companion script: rust_nfxp.jl. Requires Optim, ForwardDiff, and Distributions. Run in Julia: include("rust_nfxp.jl").

Download rust_nfxp.jl

For more detailed step-by-step guidance, I also attach a PDF version here.

Download PDF (NFXP guidance)

Introduction: Structural vs. Reduced-Form

In empirical economics, there are broadly two approaches: reduced-form estimation and structural estimation.

Reduced-form econometrics usually revolves around causal inference, often with quasi-experimental designs. It aims to estimate how a treatment or policy $A$ causally affects outcome $B$. Think of the growing difference-in-differences (DiD) literature, instrumental variables (IV), regression discontinuity designs (RDD), and so on. Reduced-form work does not explicitly model agents’ optimization problems or equilibrium conditions. Instead, it focuses on cleanly identifying causal effects: “If $X$ changes, what happens to $Y$?”—with as little bias as possible, and with less reliance on heavy economic theory for behavior. In a fisheries context, a reduced-form question might be: “Did installing marine protected areas (MPAs) improve fish harvest in this region, comparing before and after?”

Structural econometrics goes further by explicitly modeling agents’ decision-making. It starts with a behavioral model grounded in economic theory and then estimates the model’s parameters from data. The question is “Why do people make these choices?” and the approach builds a theoretical structure around it.

The workflow looks like this. First, you build a model from economic theory—e.g. consumers maximize utility, or fishers maximize expected profit when choosing where and how much to harvest. Second, you derive equilibrium or first-order conditions. Third, you estimate the “deep parameters” in those conditions: preference parameters, technology parameters, cost parameters, and so on.

I’ll be honest: I didn’t really get “deep parameters” in my first year of the PhD. One of my econometrics professors, Aaron Smith—now an endowed professor at UC Berkeley—made it click in office hours. He said something like: “Hey Kyumin, think of parameters that have specific names from economic theory, like elasticities—parameters tied to theory-based behavior.” That helped a lot. Structural econometrics is often used in industrial organization (IO)–style problems with very explicit decision-making. In practice, random utility models (RUM) implemented via logit or mixed logit are common. A structural question in fisheries would be: “If we increased or decreased the size of MPAs, what would the effect of that size change be on harvest?”

There are many great guides and textbooks for reduced-form methods—e.g. Causal Inference: The Mixtape by Scott Cunningham and Mostly Harmless Econometrics by Angrist and Pischke. Practical, example-driven guides for structural methods seem rarer. So I decided to write about one of the core topics: the Rust (1987) bus engine model and the NFXP algorithm.

My First Structural Econometrics Journey: Dynamic Discrete Choice

My first deep dive into structural modeling was in a dynamic optimization course. For the midterm, I had to learn dynamic discrete choice models and, at the same time, the computational algorithms to solve them—in particular, the Nested Fixed-Point (NFXP) method.

The midterm was Harold Zurcher’s bus engine replacement problem—a classic dynamic discrete choice model in structural econometrics. The problem is whether to replace a bus engine or not: the optimal decision problem faced by Harold Zurcher, who was responsible for bus engine replacement in the real world. (John Rust later used Zurcher’s engine replacement records and applied his structural model to this data.)

I really struggled building this model in Julia at the time. Even understanding the concepts and the algorithm felt intimidating. I recently looked at the midterm code I wrote about four years ago. Now I feel like I can explain what’s going on to others. So I’m writing this as a guide for anyone encountering this class of models for the first time. The “deep structural parameters” we estimate are the replacement cost ($RC$) and the marginal engine maintenance cost ($\theta_1$). I’ll spell those out below.

The Problem: What Are We Trying to Do?

Harold Zurcher is a bus maintenance superintendent. Every period, he looks at a bus engine’s accumulated mileage and makes a binary decision:

  • $a = 0$: Keep the current engine (pay maintenance costs that increase with mileage)
  • $a = 1$: Replace the engine (pay a large one-time cost; mileage resets to zero)

We observe Zurcher’s decisions across many buses and many time periods. Our goal is to recover the cost parameters that rationalize his behavior:

  • $RC$: replacement cost (how expensive is a new engine?)
  • $\theta_1$: marginal maintenance cost (how much does mileage hurt?)

Why “structural”? A reduced-form approach would simply regress replacement on mileage. Structural estimation goes further: we model Zurcher as a rational, forward-looking agent who solves a dynamic optimization problem. That lets us ask counterfactuals like “What would happen if engine prices doubled?”

The Model

State and Actions

The state variable is mileage, discretized into bins $x \in \{0, 1, 2, \ldots, n-1\}$ (we use $n = 90$). The action is binary: $a \in \{0, 1\}$.

Flow Utility

Each period, Zurcher receives utility (really, negative cost):

$$u(x, a=0; \theta) = -\theta_1 \cdot x \quad \text{(keep: cost grows with mileage)}$$
$$u(x, a=1; \theta) = -RC \quad \text{(replace: fixed cost, mileage resets)}$$

Transition Probabilities

After the action, mileage changes stochastically. If Zurcher keeps ($a = 0$), mileage increases by $\Delta x \in \{0, 1, 2\}$ with probabilities $(\theta_{31}, \theta_{32}, 1 - \theta_{31} - \theta_{32})$. If he replaces ($a = 1$), mileage resets to 0 and then increases by the same random increment. This is captured by a transition matrix $F$ (e.g. in make_transition_matrix()).
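As a concrete sketch (not the companion script itself, whose details may differ), a make_transition_matrix() along these lines can be built as follows; the values of $\theta_{31}$ and $\theta_{32}$ are illustrative:

```julia
# Mileage transition matrix under "keep" (a = 0). Rows = current state,
# columns = next state; the top mileage bin absorbs any overflow so that
# every row still sums to one.
function make_transition_matrix(n, θ31, θ32)
    θ33 = 1 - θ31 - θ32             # probability mileage jumps by 2
    F = zeros(n, n)
    for x in 1:n
        F[x, min(x,     n)] += θ31  # Δx = 0
        F[x, min(x + 1, n)] += θ32  # Δx = 1
        F[x, min(x + 2, n)] += θ33  # Δx = 2
    end
    return F
end

F0 = make_transition_matrix(90, 0.35, 0.60)   # illustrative θ3 values
# Under replacement (a = 1) every row equals the state-0 row (index 1):
F1 = repeat(F0[1:1, :], 90)
```

Each row of $F$ is a probability distribution over next states, which is worth asserting in real code before running the inner loop.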

Two kinds of probability

(1) $p(x'|x, a; \theta_3)$: Transition probability—how the environment evolves. We estimate it in Step 1 by counting frequencies.

(2) $\mathrm{P}(a|x; \theta)$: Choice probability—how Zurcher behaves. This is what the model predicts and what we match to data in Step 2. Don’t confuse them: $p(x'|x,a)$ is an input; $\mathrm{P}(a|x)$ is an output.

The Error Term: Why Gumbel?

Zurcher’s utility includes an unobserved shock $\varepsilon$:

$$\tilde{u}(x, a) = u(x, a; \theta) + \varepsilon(a)$$

where $\varepsilon(0)$ and $\varepsilon(1)$ are i.i.d. Type I Extreme Value (Gumbel). Rust (1987) chose this distribution because of a convenient analytical result (McFadden, 1978): the expectation of the max over the shocks has a closed form. If we write choice-specific values as $W_a$, then

$$\mathbb{E}_\varepsilon\!\left[\max_a \big(W_a + \varepsilon(a)\big)\right] = \log \sum_a \exp(W_a) + \gamma$$

where $\gamma$ is Euler’s constant. So the Gumbel assumption turns the $\max$ operator into a smooth log-sum-exp, which keeps the Bellman equation differentiable and makes choice probabilities logit.
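The identity is easy to check numerically. The following standalone sketch (the values $W_a$ are made up, and this is not part of the companion script) compares a Monte Carlo average of the max against the closed form:

```julia
using Random, Statistics

W = [1.0, 2.5]                    # hypothetical choice-specific values
gumbel() = -log(-log(rand()))     # inverse-CDF draw from Gumbel(0, 1)

Random.seed!(1)
sim = mean(maximum(W .+ [gumbel(), gumbel()]) for _ in 1:200_000)
closed_form = log(sum(exp.(W))) + MathConstants.γ
# sim and closed_form should agree to roughly two decimal places
```

With 200,000 draws the Monte Carlo error is on the order of $10^{-3}$, so the agreement is easy to see.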

The Bellman Equation

Zurcher is forward-looking. Define the expected value function as the expected continuation value after the transition:

$$\mathrm{EV}(x, a) \equiv \sum_{x'} \mathbb{E}_{\varepsilon'} V(x', \varepsilon') \cdot p(x'|x, a)$$

Applying the Gumbel result, the expectation over $\varepsilon'$ is

$$\mathbb{E}_{\varepsilon'} V(x', \varepsilon') = \log\!\left[\sum_{a'} \exp\!\big(u(x', a'; \theta) + \beta \cdot \mathrm{EV}(x', a')\big)\right]$$

Substituting back gives the Bellman fixed-point equation:

$$\mathrm{EV}(x, a) = \sum_{x'} \log\!\left[\sum_{a'} \exp\!\big(u(x', a'; \theta) + \beta \cdot \mathrm{EV}(x', a')\big)\right] \cdot p(x'|x, a; \theta_3)$$

$\mathrm{EV}$ appears on both sides—solving it is the job of the inner loop.

Where did $\varepsilon$ go? The shocks were integrated out by taking the expectation. The Gumbel assumption gives a closed-form log-sum-exp instead of numerical simulation.

Choice Probabilities

Given $\mathrm{EV}$, the probability that Zurcher replaces is logit:

$$\mathrm{P}(a=1|x) = \frac{\exp(v_1(x))}{\exp(v_0(x)) + \exp(v_1(x))} = \frac{1}{1 + \exp(v_0(x) - v_1(x))}$$

The choice-specific values $v_0(x)$ and $v_1(x)$ have the explicit form (flow utility plus discounted expected continuation value):

$$v_0(x) = u(x, 0) + \beta \sum_{x'} \mathrm{EV}(x') \cdot F(x, x') \quad \text{(keep)}$$
$$v_1(x) = u(x, 1) + \beta \sum_{x'} \mathrm{EV}(x') \cdot F(1, x') \quad \text{(replace; mileage resets to 0 then transitions)}$$

Here $\mathrm{EV}(x')$ is the expected value at next state $x'$ (integrated over the Gumbel shocks), and $F(x, x')$ is the transition probability from current state $x$ to next state $x'$ given the action (so $F(1, x')$ is the transition from state 0 after replacement). These choice-specific values are what we match to the data when we form the likelihood.
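In code, the second (numerically stable) form of the logit is a one-liner. A toy sketch with made-up choice-specific values:

```julia
# Logit map from choice-specific values to replacement probabilities.
# v0 and v1 here are hypothetical; in the model they come from the solved EV.
v0 = [-1.0, -2.0, -4.0]           # keeping gets worse as mileage grows
v1 = [-3.0, -3.0, -3.0]           # replacing costs the same everywhere
P1 = @. 1 / (1 + exp(v0 - v1))    # P(a = 1 | x), rising with mileage
```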

Why Is the Bellman Equation Necessary?

In a static discrete choice model, you would just plug flow utilities into the logit formula:

$$\mathrm{P}(a=1|x) = \frac{\exp(u_1(x))}{\exp(u_0(x)) + \exp(u_1(x))}$$

No Bellman needed—get choice probabilities, form the likelihood, done. But Zurcher is forward-looking. At mileage 30, deciding whether to replace depends on what happens at 40, 50, 60, and so on. The value of keeping today depends on the value of keeping tomorrow, and so on. That infinite recursion is captured by the Bellman equation. The Bellman is a computational bottleneck between the parameters $\theta$ and the likelihood: we need to solve it to evaluate “how good is this $\theta$?”

The NFXP Algorithm

NFXP (Rust, 1987) estimates $(\theta_1, RC)$ by maximum likelihood, with the Bellman equation solved as an intermediate step. The name “Nested Fixed-Point” means: fixed-point iteration (solving the Bellman) is nested inside the optimization loop.

Overview

Step 1 (pre-estimation): Estimate $\theta_3$ from data by frequency counting.

Step 2 (NFXP): Fix $\hat{\theta}_3$ and solve:

$$\max_{RC,\, \theta_1} \;\; \log L(RC, \theta_1 \mid \text{data}, \hat{\theta}_3)$$

Each likelihood evaluation requires solving the Bellman equation.

Step 1: Transition Probability Estimation

Look at all periods where Zurcher did not replace ($a = 0$), and count how much mileage increased. The frequency estimator is

$$\hat{\theta}_{3k} = \frac{\#\{\text{obs. where } \Delta x = k \text{ and } a = 0\}}{\#\{\text{obs. where } a = 0\}}, \quad k = 0, 1$$

This works under Rust’s conditional independence assumption: the transition probabilities are independent of the unobserved $\varepsilon$. In the code this is estimate_transition().
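A minimal sketch of such a frequency estimator (not the companion script's version; the toy data below are made up, and with multiple buses you would difference mileage within each bus):

```julia
# Frequency estimator for θ3: count mileage increments in keep periods.
# x = mileage states, a = actions, consecutive rows from one bus.
function estimate_transition(x::Vector{Int}, a::Vector{Int})
    Δx   = x[2:end] .- x[1:end-1]      # mileage increments
    keep = a[1:end-1] .== 0            # periods with no replacement
    [count((Δx .== k) .& keep) / count(keep) for k in 0:2]
end

x  = [0, 1, 1, 3, 4, 0, 2]             # toy mileage path (replacement at t = 5)
a  = [0, 0, 0, 0, 1, 0, 0]
θ3 = estimate_transition(x, a)         # probabilities of Δx = 0, 1, 2
```

The three entries sum to one by construction, matching $(\theta_{31}, \theta_{32}, 1 - \theta_{31} - \theta_{32})$.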

Step 2: The Nested Loops

Outer loop (optimizer): An optimizer (e.g. Nelder-Mead) searches over $(RC, \theta_1)$ to maximize the log-likelihood. At each candidate:

Inner loop (Bellman solver): Given $(RC, \theta_1)$, solve the Bellman by contraction mapping iteration:

  1. Initialize $\mathrm{EV}^{(0)} = \mathbf{0}$ (vector of zeros, one per state).
  2. Compute $\mathrm{EV}^{(k+1)}$ from $\mathrm{EV}^{(k)}$ using the Bellman equation.
  3. Check convergence: $\|\mathrm{EV}^{(k+1)} - \mathrm{EV}^{(k)}\|_\infty < 10^{-12}$.
  4. If converged, return $\mathrm{EV}^*$; otherwise go to step 2.

This is solve_EV() in the code.
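A self-contained version of the inner loop might look like the sketch below. All parameter values, $n$, and $\beta$ are illustrative assumptions, and the helper mileage_transition() is a stand-in for the script's transition-matrix builder:

```julia
# Contraction iteration on the integrated Bellman equation (a sketch of
# what solve_EV() does; parameter values below are illustrative).
function mileage_transition(n, θ31, θ32)
    F = zeros(n, n)
    for x in 1:n, (k, p) in enumerate((θ31, θ32, 1 - θ31 - θ32))
        F[x, min(x + k - 1, n)] += p   # Δx = k - 1; top bin absorbs overflow
    end
    return F
end

function solve_EV(θ1, RC; n = 90, β = 0.95, θ31 = 0.35, θ32 = 0.60,
                  tol = 1e-12, maxit = 20_000)
    F  = mileage_transition(n, θ31, θ32)
    u0 = -θ1 .* collect(0:n-1)      # keep: maintenance cost grows with mileage
    EV = zeros(n)                   # EV[x] = expected value after keeping at x
    for _ in 1:maxit
        # log-sum-exp over actions at each next state x'; after a replacement
        # the continuation value is EV[1] (the state-0 entry)
        m   = log.(exp.(u0 .+ β .* EV) .+ exp(-RC + β * EV[1]))
        EVn = F * m
        maximum(abs.(EVn .- EV)) < tol && return EVn
        EV = EVn
    end
    error("Bellman iteration did not converge")
end

EV = solve_EV(0.05, 10.0)           # toy θ1 and RC
```

With $\beta = 0.95$ and a $10^{-12}$ tolerance this takes a few hundred iterations, each a cheap matrix-vector product.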

The inner loop always converges. The Bellman operator is a contraction (because $\beta < 1$). So we get a fixed point for whatever $(RC, \theta_1)$ we plug in. Convergence of the inner loop does not mean we found the right parameters—only “given these parameters, this is what rational behavior looks like.” Finding the right parameters is the outer loop’s job.

From $\mathrm{EV}^*$ to likelihood: Compute choice probabilities $\mathrm{P}(a|x)$ from $\mathrm{EV}^*$, then evaluate the log-likelihood:

$$\log L = \sum_{i=1}^{N_{\text{buses}}} \sum_{t=1}^{T} \log \mathrm{P}(a_{it} \mid x_{it}; \theta, \mathrm{EV}^*)$$

The optimizer receives this scalar and updates $(RC, \theta_1)$. This is neg_log_likelihood().
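Once the choice probabilities are in hand, the likelihood assembly itself is a few lines. A toy sketch (the probabilities and data vectors below are made up; the companion script's neg_log_likelihood() also solves the Bellman equation first):

```julia
# Negative log-likelihood given model replacement probabilities P1[x]
# and stacked observations (xs = 1-based mileage states, as = actions).
function neg_log_likelihood(P1, xs, as)
    -sum(log(as[i] == 1 ? P1[xs[i]] : 1 - P1[xs[i]]) for i in eachindex(xs))
end

P1  = [0.01, 0.10, 0.60]     # hypothetical P(replace | x) for 3 states
xs  = [1, 2, 3, 3]
as  = [0, 0, 1, 0]
nll = neg_log_likelihood(P1, xs, as)
```

The optimizer minimizes this scalar, which is why the sign is flipped.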

Why solve the Bellman every time? When $(RC, \theta_1)$ changes, the whole value function changes. A higher $\theta_1$ makes maintenance more expensive at every mileage, which changes the optimal replacement threshold and the future value of every state. The old $\mathrm{EV}^*$ is no longer valid—we solve from scratch.

Inference: Standard Errors

After finding $\hat{\theta} = (\widehat{RC}, \hat{\theta}_1)$, we need standard errors. Under standard conditions, the MLE is asymptotically normal:

$$\sqrt{N}\,(\hat{\theta} - \theta_0) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \mathcal{I}(\theta_0)^{-1}\right)$$

where $\mathcal{I}(\theta)$ is the information matrix. In practice we estimate the variance as the inverse Hessian of the negative log-likelihood at $\hat{\theta}$:

$$\widehat{\mathrm{Var}}(\hat{\theta}) = H^{-1}, \quad H = \left.\frac{\partial^2 (-\log L)}{\partial \theta\, \partial \theta'}\right|_{\theta = \hat{\theta}}$$

Standard errors are $\mathrm{SE}(\hat{\theta}_j) = \sqrt{(H^{-1})_{jj}}$. For testing $H_0: \theta_j = 0$ we use the $t$-statistic:

$$t = \frac{\hat{\theta}_j}{\mathrm{SE}(\hat{\theta}_j)}$$

In the code we use ForwardDiff.hessian() to compute $H$ via automatic differentiation (compute_se(), print_inference()).
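To see the mechanics without any packages, here is a stdlib-only sketch that swaps ForwardDiff for a central-difference Hessian on a toy objective whose curvature is known in closed form (everything here is illustrative, not the companion script):

```julia
using LinearAlgebra

# Toy negative log-likelihood; its exact Hessian is [4 1; 1 1].
negloglik(θ) = 2θ[1]^2 + 0.5θ[2]^2 + θ[1] * θ[2]

# Central-difference Hessian, a stand-in for ForwardDiff.hessian.
function fd_hessian(f, θ; h = 1e-5)
    k = length(θ)
    e(i) = (v = zeros(k); v[i] = h; v)
    H = [ (f(θ + e(i) + e(j)) - f(θ + e(i) - e(j)) -
           f(θ - e(i) + e(j)) + f(θ - e(i) - e(j))) / (4h^2)
          for i in 1:k, j in 1:k ]
    return Symmetric(H)
end

H  = fd_hessian(negloglik, [0.0, 0.0])
se = sqrt.(diag(inv(H)))      # SE(θ̂_j) = sqrt((H⁻¹)_jj)
```

Automatic differentiation avoids the step-size tuning that finite differences require, which is why the script prefers ForwardDiff.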

Optimal Policy

Once we have $\hat{\theta}$, the replacement probability $\mathrm{P}(a=1|x)$ at each mileage state describes Zurcher’s optimal behavior. With Gumbel errors the policy is probabilistic rather than a sharp threshold. In practice you typically see:

  • At low mileage: $\mathrm{P}(\text{replace}) \approx 0$ (maintenance is cheap)
  • Around some threshold: $\mathrm{P}(\text{replace})$ rises steeply
  • At high mileage: $\mathrm{P}(\text{replace}) \approx 1$ (maintenance is so costly that replacement is almost certain)

The threshold depends on the ratio $RC / \theta_1$: a higher $RC$ (expensive replacement) delays replacement; a higher $\theta_1$ (expensive maintenance) accelerates it.

NFXP vs. MPEC

NFXP solves the Bellman to completion at every parameter guess. An alternative is MPEC (Su and Judd, 2012), which reformulates the problem as

$$\max_{\theta,\, \mathrm{EV}} \;\; \log L(\theta, \mathrm{EV}) \quad \text{subject to} \quad \mathrm{EV} = \Gamma(\mathrm{EV}; \theta)$$

where $\Gamma$ is the Bellman operator. Instead of an inner loop that solves the Bellman, MPEC treats $\mathrm{EV}$ as a decision variable and the Bellman equation as a constraint. NFXP has a clear inner loop (full convergence each time); MPEC avoids that and can be faster for large state spaces, but the coding is more involved (constrained optimizer, gradients, sparsity). The key insight of MPEC is that intermediate iterations don’t need an exactly solved Bellman—only the final solution must satisfy both optimality of $\theta$ and the Bellman equation.

References

  • Angrist, J. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
  • Aguirregabiria, V. and Mira, P. (2010). “Dynamic Discrete Choice Structural Models: A Survey.” Journal of Econometrics, 156(1), 38–67.
  • Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
  • McFadden, D. (1978). “Modelling the Choice of Residential Location.” In Spatial Interaction Theory and Residential Location, ed. A. Karlqvist et al.
  • Rust, J. (1987). “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher.” Econometrica, 55(5), 999–1033.
  • Su, C.-L. and Judd, K. (2012). “Constrained Optimization Approaches to Estimation of Structural Models.” Econometrica, 80(5), 2213–2230.

© 2026 Kyumin Kim. All rights reserved.