The standard workflow for economic models is to specify a loss function (typically a negative likelihood, or a quadratic form g'Wg derived from a moment condition g := g(X;θ) = 0), algebraically compute its gradients, and solve the model via empirical risk minimization (ERM) using Newton-Raphson or quasi-Newton algorithms. This approach nests regression, classification, IV, static choice models, dynamic choice models, search models, etc.
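
Written out (a schematic restatement of the two canonical objectives, with f a parametric density and W a positive semi-definite weighting matrix):

\[
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\min_{\theta}\; -\frac{1}{n}\sum_{i=1}^{n}\log f(X_i;\theta),
\qquad
\hat{\theta}_{\mathrm{GMM}} \;=\; \arg\min_{\theta}\; \hat g_n(\theta)'\, W\, \hat g_n(\theta),
\quad \text{where } \hat g_n(\theta)=\frac{1}{n}\sum_{i=1}^{n} g(X_i;\theta).
\]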

This requires quite a lot of work, scales poorly to large models, and forces one to commit to simple parametric forms that we know how to differentiate and plug into quasi-Newton routines.

This doc outlines a strategy and Python blueprints for using the modern deep learning ecosystem to solve economic models instead, offloading much of the hard work to the computer.1 Torch is used to train most of the neural-network-based supervised, unsupervised, and generative ML models you’ve likely seen in the wild, and as such is an excellent DSL (Domain Specific Language) for estimating economic models with minimal pain.

A basic walkthrough

The Model: the nn.Module class

This is the base class for all models in PyTorch [doc] that you should subclass. nn stands for neural network, which here encompasses almost any supervised learning method, including standard regression, classification, and structurally motivated models. It has a common basic signature that defines the parameters, a forward pass (computation of fitted values), and (optionally, but recommended) a loss function2

 
class Model(nn.Module):
  def __init__(self, params):
      super().__init__()
      # define your parameters and starting values here
      self.A = torch.nn.Parameter()
      self.B = torch.nn.ParameterDict()
      # Use params here...
 
  def forward(self, data, other_params=None):
      """
      method that computes yhat using current parameter values,
        exog params to be toggled, and data
      """
      # this could set U = A @ B.T, for example
      # Implementation...
      return torch.softmax(U, 1) # eg choice for each user
 
  @staticmethod # nonstandard but recommended - attach loss to the model class as a staticmethod
  def loss(fitted_val, data, regularization_param=None):
    return -likelihood(data, fitted_val)

The training loop - stochastic gradient descent

This uses Adam, the most popular optimizer, as a starting point (but you should also look into related optimizers, momentum variants, etc.). A key distinction between traditional estimation and large-scale GPU-based training is that one typically doesn’t want to load all the data into memory; the optimizer works with random samples of the data to update the gradient (hence the “stochastic” in SGD). This hinges on the fact that E[g(X_sample, θ)] = E[g(X, θ)], i.e. gradient updates based on a random sample are unbiased. In real examples, you will also want to use a learning-rate scheduler.

You’ll note that we never call model.forward(); simply evaluating model(data) invokes nn.Module’s __call__ method, which in turn dispatches to the forward we defined by subclassing.

 
def train(model, data, device, batch_size=256, learning_rate=0.001, n_iter=1000):
  model.train() # toggle to trainable
  model.to(device)
  optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad],
            lr=learning_rate)
  losses = []
 
  for _ in tqdm(range(n_iter)):
    optimizer.zero_grad() # reset gradients (skip only if deliberately accumulating across batches)
    data_minibatch = data.sample(batch_size).to(device) # draw a minibatch and move it to the device
    fitted_val = model(data_minibatch) # forward pass on the minibatch
    loss = model.loss(fitted_val, data_minibatch) # compute loss on the same minibatch and store it
    losses.append(loss.item())
    loss.backward()  # backprop - populates each parameter's .grad
    optimizer.step() # update all the weights in model
 
  return model

The data.sample line draws a minibatch (a subsample of rows) in each iteration, and this might be sufficient for small models (think of this as subsampling in the variance estimation setting).

However, you should learn to use PyTorch’s DataLoader class, which does this in a scalable way: it can stream batches from datasets that don’t fit in memory, build batches in parallel worker processes (via Python’s multiprocessing) when they are slow to construct, and even read files lazily inside the loader. It also ships with performance optimizations such as prefetching and memory pinning.
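
As a rough sketch of what that looks like (X and y here are hypothetical in-memory tensors; model, device, and n_epochs are assumed from the blueprint above, and the loss is assumed to take fitted values and targets), the same loop rebuilt around a DataLoader, with a learning-rate scheduler added:

# DataLoader-based variant of the blueprint loop (a sketch under the assumptions above)
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(
    TensorDataset(X, y),   # wraps in-memory tensors; swap in a custom Dataset for on-disk data
    batch_size=256,
    shuffle=True,          # reshuffle rows every epoch
    num_workers=4,         # build batches in parallel worker processes
    pin_memory=True,       # speeds up host-to-GPU transfers
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# one simple learning-rate schedule: halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

model.train()
model.to(device)
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        loss = model.loss(model(X_batch), y_batch)  # assumes loss(fitted values, targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the learning-rate schedule once per epoch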

Running it and saving checkpoints

The following runs a fixed number of iterations; adding early stopping is also easy (a sketch follows the snippet below).

choice_model = Model(params={"rank": rank, "dtype": dtype, ...})
 
trained_model = train(choice_model, data, device, learning_rate=0.001)
save_path = "model_trained_checkpoints.pth"
torch.save(trained_model.state_dict(), save_path)
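
A minimal early-stopping sketch on top of the loop above (train_one_epoch, X_val, y_val, and max_epochs are hypothetical names here; the idea is just to track a held-out loss and stop once it stalls):

# Hypothetical early-stopping wrapper: stop once the held-out loss has not
# improved for `patience` consecutive epochs, keeping the best checkpoint.
best_val_loss, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer, device)  # hypothetical helper
    with torch.no_grad():
        val_loss = model.loss(model(X_val.to(device)), y_val.to(device)).item()
    if val_loss < best_val_loss - 1e-4:   # meaningful improvement
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "model_best_checkpoint.pth")  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs in a row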

Loading checkpoints and doing out-of-sample / counterfactual predictions

In a demand model, this may let us compute counterfactual market shares when the choice set is changed. The main difference here is that we want to fix the parameter values: setting requires_grad = False on each parameter is one way to do it; wrapping inference in torch.no_grad() (usable as a context manager or decorator) achieves the same thing.

# Load the pre-trained model
m2 = Model(params={"rank": rank, "dtype": dtype, ...})
m2.load_state_dict(torch.load(save_path))
m2.to(device)
m2.eval()  # Set to evaluation mode
 
# set weights as non-trainable
for param in m2.parameters():
    param.requires_grad = False
 
# pure out of sample
yhat_new = m2(X_new.to(device))
# out of sample, counterfactual predictions
yhat_new = m2(X_new.to(device), other_params=new_values)

Multinomial Logit

The above section featured pseudocode that was general but not runnable. The following is runnable code that implements a basic multinomial logit model, as one might use to model a rational consumer’s choice of yogurt, for example.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from tqdm import tqdm
 
class MultinomialLogisticRegression(nn.Module):
    def __init__(self, params):
        super().__init__()
        self.input_dim = params['input_dim']
        self.num_classes = params['num_classes']
 
        # Define the parameters: weights and bias (i.e. coefficients)
        self.weights = nn.Parameter(torch.randn(self.input_dim, self.num_classes) * 0.01)
        self.bias = nn.Parameter(torch.zeros(self.num_classes))
        # this could be more elaborate, e.g. A @ B.T + <additional terms> + <regularization>
 
    def forward(self, x, other_params=None):
        """
        Computes the logits and probabilities for each class
 
        Args:
            x: Input features of shape [batch_size, input_dim]
            other_params: Optional parameters for counterfactual analysis
 
        Returns:
            Probabilities for each class [batch_size, num_classes]
        """
        # Linear transformation: X * W + b
        # single layer neural net
        logits = torch.matmul(x, self.weights) + self.bias
 
        # allow modified parameters for counterfactual analysis
        if other_params is not None:
            if 'weights_scale' in other_params:
                logits = self.bias + torch.matmul(x, self.weights * other_params['weights_scale'])
        # Apply softmax to get probabilities
        probs = torch.softmax(logits, dim=1)
        return probs
 
    @staticmethod
    def loss(predicted_probs, targets):
        """
        Computes the negative log likelihood loss
 
        Args:
            predicted_probs: Predicted probabilities [batch_size, num_classes]
            targets: True class labels [batch_size]
 
        Returns:
            Negative log likelihood loss
        """
        # Use negative log likelihood loss (equivalent to cross entropy with softmax)
        return -torch.mean(
                torch.log(
                    predicted_probs[torch.arange(predicted_probs.size(0)), targets] +
                1e-10)
            )
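
As footnote 2 notes, the hand-rolled likelihood above could also be expressed with a built-in criterion. A sketch of the equivalent formulation (with model an instance of the class above, and x, targets standing in for a batch of features and labels): nn.CrossEntropyLoss expects logits rather than probabilities and applies log-softmax internally, which is the numerically safer route.

# Equivalent built-in loss: feed logits (pre-softmax) to nn.CrossEntropyLoss,
# which applies log-softmax internally and averages the negative log likelihood.
criterion = nn.CrossEntropyLoss()

logits = torch.matmul(x, model.weights) + model.bias  # same linear transformation as in forward()
loss = criterion(logits, targets)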

Training loop

 
def train(model, train_loader, device, learning_rate=0.01, n_epochs=100):
    """
    Train the multinomial logistic regression model
 
    Args:
        model: The model to train
        train_loader: DataLoader containing training data
        device: Device to train on (cpu or cuda)
        learning_rate: Learning rate for optimization
        n_epochs: Number of epochs to train for
 
    Returns:
        Trained model and list of losses
    """
    model.train()  # set to training mode
    model.to(device)
    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad],
                lr=learning_rate)
    losses = []
 
    for epoch in range(n_epochs):
        epoch_loss = 0
        batch_count = 0
 
        for X_batch, y_batch in tqdm(train_loader,
                desc=f"Epoch {epoch+1}/{n_epochs}", leave=False):
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            # Zero gradients
            optimizer.zero_grad()
            # Forward pass
            predicted_probs = model(X_batch)
            # Compute loss
            loss = model.loss(predicted_probs, y_batch)
            epoch_loss += loss.item()
            batch_count += 1
            # Backward pass and optimize
            loss.backward()
            optimizer.step()
        avg_epoch_loss = epoch_loss / batch_count
        losses.append(avg_epoch_loss)
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {avg_epoch_loss:.4f}")
 
    return model, losses

Evaluate the model on new data

def evaluate(model, test_loader, device):
    """
    Evaluate the model on test data
 
    Args:
        model: Trained model
        test_loader: DataLoader containing test data
        device: Device to evaluate on
 
    Returns:
        Accuracy and predictions
    """
    model.eval()
    correct = 0
    total = 0
    all_preds = []
 
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
 
            # Forward pass
            probs = model(X_batch)
 
            # Get predictions
            _, predicted = torch.max(probs, 1)
            all_preds.extend(predicted.cpu().numpy())
 
            # Calculate accuracy
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
 
    accuracy = correct / total
    print(f"Test Accuracy: {accuracy:.4f}")
 
    return accuracy, all_preds

Simulate and run

torch.manual_seed(42)
np.random.seed(42)
 
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
# Generate some synthetic data
n_samples = 1000
n_features = 5
n_classes = 3
 
# Create random features and coefficients
X = np.random.randn(n_samples, n_features)
true_weights = np.random.randn(n_features, n_classes)
true_bias = np.random.randn(n_classes)
 
# Generate probabilities and sample classes
logits = np.dot(X, true_weights) + true_bias
probs = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
y = np.array([np.random.choice(n_classes, p=prob) for prob in probs])
 
# Convert to PyTorch tensors
X_tensor = torch.FloatTensor(X)
y_tensor = torch.LongTensor(y)
 
# Create dataset and data loaders
dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset,[train_size, test_size])
 
# subsampler
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
 
# Initialize model
model_params = {
    'input_dim': n_features,
    'num_classes': n_classes
}
 
mnl_model = MultinomialLogisticRegression(params=model_params)
 
# Train model
trained_model, training_losses = train(
    mnl_model,
    train_loader,
    device,
    learning_rate=0.01,
    n_epochs=50
)
 
# evaluate model
accuracy, predictions = evaluate(
    trained_model,
    test_loader,
    device
)
# Print accuracy
print(accuracy)
# 0.83
 
# Save the trained model
save_path = "mnl_model_trained.pth"
torch.save(trained_model.state_dict(), save_path)
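
Tying back to the checkpoint-loading blueprint, the other_params hook in forward() lets us reload the saved coefficients and compute counterfactual choice probabilities on new data (X_new below is a hypothetical tensor of new covariates):

# Reload the trained MNL model and generate counterfactual predictions
# (X_new is a hypothetical [n_new, n_features] tensor of new covariates).
m2 = MultinomialLogisticRegression(params=model_params)
m2.load_state_dict(torch.load(save_path))
m2.to(device)
m2.eval()

with torch.no_grad():
    baseline_probs = m2(X_new.to(device))
    # counterfactual: what if every coefficient were scaled by 0.5?
    counterfactual_probs = m2(X_new.to(device), other_params={"weights_scale": 0.5})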

MNIST: Classification Neural Nets

AKA Multinomial logit, but with 3 layers.

Follows the blueprint example above.
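
The loop below assumes a model object named pytorch_model with a cross_entropy_loss method (the loss attached to the class, as in the blueprint), plus an optimizer. Here is a minimal sketch of what those could look like; the 3-layer architecture and hidden width of 128 are arbitrary choices:

import torch
import torch.nn as nn

# A minimal 3-layer classifier in the spirit of the blueprint above.
class MNISTClassifier(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                      # [batch, 1, 28, 28] -> [batch, 784]
            nn.Linear(28 * 28, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 10),         # 10 digit classes; outputs are logits
        )

    def forward(self, x):
        return self.net(x)

    @staticmethod
    def cross_entropy_loss(logits, targets):
        return nn.functional.cross_entropy(logits, targets)

pytorch_model = MNISTClassifier()
optimizer = torch.optim.Adam(pytorch_model.parameters(), lr=1e-3)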

import os
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
# ----------------
# DATA
# ----------------
# define a transformation
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
# this downloads data
mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)
 
# train (55,000 images) / validation (5,000 images) split
mnist_train, mnist_val = random_split(mnist_train, [55000, 5000])
 
# dataloaders handle shuffling, batching, etc...
mnist_train = DataLoader(mnist_train, batch_size=64)
mnist_val = DataLoader(mnist_val, batch_size=64)
mnist_test = DataLoader(mnist_test, batch_size=64)
# ----------------
# TRAINING LOOP
# ----------------
num_epochs = 5
for epoch in range(num_epochs):
    pytorch_model.train()
    # TRAINING LOOP
    for i, train_batch in enumerate(mnist_train):
        x, y = train_batch
        logits = pytorch_model(x)
        loss = pytorch_model.cross_entropy_loss(logits, y)
        print(f"{i} train loss: {loss.item():.4f}")
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # EVALUATION MODE
    pytorch_model.eval()
    # VALIDATION LOOP
    with torch.no_grad(): # alternative to the blueprint's requires_grad = False approach
        val_loss = []
        for val_batch in mnist_val:
            x, y = val_batch
            logits = pytorch_model(x)
            val_loss.append(pytorch_model.cross_entropy_loss(logits, y).item())
 
        val_loss = torch.mean(torch.tensor(val_loss))
    print("val_loss: ", val_loss.item())
    print("-----------------------------------------------------------")
    print(j)
    j += 1
# %%
# out of sample fit
def evaluate(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for x, y in dataloader:
            logits = model(x)
            _, predicted = torch.max(logits.data, 1)
            total += y.size(0)
            correct += (predicted == y).sum().item()
    return correct / total
 
print(evaluate(pytorch_model, mnist_test))
# 0.9733

Pretty, pretty good.

Footnotes

  1. For a similar exercise in R, see Ryan Giordano’s tutorial that goes over Poisson regression with the R torch package

  2. The loss function is attached to the model here, but traditional ML usage initializes it separately (e.g. criterion = nn.CrossEntropyLoss()) and calls it inside the training loop. Torch has many, many loss functions that allow one to specify quite rich models + regularization schemes.