7 Introduction to PyTorch: Core Functionalities and Advantages
Overview
Welcome to the next exciting step on your deep learning journey! Having explored the fundamental building blocks – Data, Models, Loss Functions, and Optimization Algorithms – you now have a solid conceptual understanding of how deep learning works.
Now, it’s time to bring those concepts to life! Think of the last lesson as learning the rules of the road and understanding how a car works in principle. This lesson is where we get behind the wheel and learn to drive using a specific, powerful vehicle: PyTorch 1.
PyTorch is a popular and flexible framework that makes building and training neural networks much more manageable. This session will guide you through its essential components in a simple and friendly way, showing you how the concepts we discussed map directly onto practical code.
In this session, we’ll explore:
- PyTorch Fundamentals: Understand what PyTorch is and why it’s widely used. We’ll start with Tensors, PyTorch’s core data structure, and Autograd, its magic for automatically calculating gradients.
- Handling Data: Learn how PyTorch uses Dataset and DataLoader to efficiently manage and feed data into your models.
- Building and Training: Discover how to define Models using nn.Module, select Loss Functions, choose Optimizers, and combine everything into a working Training Loop.
By the end of this lesson, you’ll grasp these key PyTorch components and understand how they implement the deep learning concepts you’ve already learned. This will equip you to start building and experimenting with your own neural networks in the hands-on sections to come!
It could be overwhelming to take in all the information at once. Don’t worry if you don’t understand everything right away. The key is to get started and practice. Open a Jupyter Notebook on Google Colab and start playing with the code snippets. As you gain more experience, the concepts will become clearer.
7.1 What is PyTorch?
Now that we understand the conceptual building blocks of deep learning, let’s meet the tool we’ll use to put them into practice: PyTorch.
PyTorch is an open-source deep learning framework developed by Meta AI (formerly Facebook’s AI Research lab) 2. It is designed to provide flexibility and efficiency in building and deploying machine learning models.
Think back to our analogy: if deep learning concepts are the principles of how a car works, PyTorch is like a specific, well-designed car model – powerful, relatively easy to learn, and equipped with features that make driving (or in our case, building neural networks) smoother.
7.1.1 Why Use a Framework Like PyTorch?
You could implement neural network operations using standard numerical libraries like NumPy, but it quickly becomes complex, especially for deep networks and when calculating gradients for training. PyTorch abstracts away much of this complexity.
Consider implementing a basic convolutional layer:
Using NumPy (Conceptual Example): Requires manual implementation of sliding windows, dot products, and bias addition. You don’t need to follow every detail, but notice the amount of manual work required compared to the PyTorch equivalent.
# Conceptual NumPy implementation - verbose and complex
import numpy as np

# Input data (Batch, Channels, Height, Width)
X = np.random.randn(10, 3, 32, 32)
# Weights (Out_channels, In_channels, Kernel_H, Kernel_W)
W = np.random.randn(20, 3, 5, 5)
# Biases (Out_channels)
b = np.random.randn(20)
# Output placeholder (careful calculation of output size needed)
output_h, output_w = 28, 28  # Assuming stride=1, padding=0
out = np.zeros((10, 20, output_h, output_w))

# Manual nested loops for convolution
for n in range(10):  # Batch
    for c_out in range(20):  # Output channels
        for h in range(output_h):  # Output height
            for w in range(output_w):  # Output width
                # Extract region, perform dot product across input channels, add bias
                h_start, w_start = h, w
                h_end, w_end = h_start + 5, w_start + 5
                region = X[n, :, h_start:h_end, w_start:w_end]
                convolution_sum = np.sum(region * W[c_out]) + b[c_out]
                out[n, c_out, h, w] = convolution_sum
# NOTE: This is simplified; correct gradient calculation would add much more complexity!
Using PyTorch: Leverages optimized, pre-built layers
import torch
import torch.nn as nn

# Input data (Batch, Channels, Height, Width)
X = torch.randn(10, 3, 32, 32)

# Define a convolutional layer (weights/biases handled internally)
conv_layer = nn.Conv2d(in_channels=3, out_channels=20, kernel_size=5)

# Apply the layer - PyTorch handles the complex operation
out = conv_layer(X)

# Print output shape (PyTorch calculates it automatically)
print(out.shape)  # Output: torch.Size([10, 20, 28, 28])
This example highlights how PyTorch drastically simplifies deep learning development by providing high-level building blocks. How does PyTorch achieve this? Through several key features:
It Feels Like Python (Pythonic Integration)
If you’re comfortable with Python, PyTorch feels remarkably natural. Its API is designed to be intuitive, closely resembling standard Python code.
It integrates seamlessly with the Python ecosystem (NumPy, SciPy, etc.). You can use standard Python control flow (if, for) and debugging tools (pdb, print) effectively. This makes learning, prototyping, and debugging faster.
Dynamic Computation Graphs (Define-by-Run)
PyTorch builds the graph representing your network’s computations on-the-fly as your Python code runs.
Think Lego: You add blocks (operations) dynamically, rather than needing a fixed blueprint upfront.
Benefits: This provides great flexibility for models with variable structures (like RNNs processing different length sentences without requiring complex padding upfront) and makes debugging more straightforward using standard Python tools.
Imagine you want a part of your neural network to behave differently depending on the length of the input sequence. Why might a dynamic graph framework (like PyTorch) make implementing this easier than a framework requiring a fixed graph defined upfront?
Hint: You can use standard Python if statements within your model’s forward pass.
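To make the hint concrete, here is a minimal sketch of a forward pass that branches on sequence length. The class name LengthAwareModel, the feature size of 8, and the length threshold of 5 are all hypothetical choices made for illustration; they do not come from the lesson.

```python
import torch
import torch.nn as nn


class LengthAwareModel(nn.Module):
    """Hypothetical model that picks a branch based on sequence length."""

    def __init__(self, feature_dim=8, length_threshold=5):
        super().__init__()
        self.length_threshold = length_threshold  # illustrative cut-off
        self.short_branch = nn.Linear(feature_dim, 2)
        self.long_branch = nn.Linear(feature_dim, 2)

    def forward(self, x):
        # x has shape (sequence_length, feature_dim)
        pooled = x.mean(dim=0)  # average over the sequence dimension
        # Ordinary Python control flow decides which layer runs
        if x.shape[0] <= self.length_threshold:
            return self.short_branch(pooled)
        return self.long_branch(pooled)


model = LengthAwareModel()
short_out = model(torch.randn(3, 8))    # sequence of length 3: short branch
long_out = model(torch.randn(12, 8))    # sequence of length 12: long branch
print(short_out.shape, long_out.shape)
```

Because the graph is rebuilt on every call, each input simply records whichever branch it actually took.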
Automatic Differentiation (Autograd)
This is tightly linked to dynamic graphs and is essential for training. PyTorch’s autograd engine automatically calculates the gradients (slopes) of your loss function with respect to all your model’s parameters (weights and biases).
You simply define the forward pass (how inputs become outputs), and PyTorch figures out the backward pass (gradient calculation) needed for optimization, saving you from complex manual calculus. We’ll explore this magic in detail soon!
Remember the “Backward Pass / Backpropagation” step in our conceptual training loop? Which PyTorch feature directly handles the complex calculations needed for this step?
Hint: It automatically figures out the gradients.
Strong GPU Acceleration
Deep learning requires immense computational power (mostly matrix math). GPUs excel at this due to their parallel processing capabilities.
PyTorch seamlessly integrates with NVIDIA GPUs (via CUDA). Moving computations from the CPU to the GPU often requires minimal code changes (.to('cuda')) but can result in massive speedups (orders of magnitude) for training and inference.
Rich Ecosystem
PyTorch isn’t just the core library. It has a vibrant ecosystem with official libraries tailored for specific domains:
- TorchVision 3: For computer vision tasks, offering common datasets, pre-built model architectures, and image transformation functions.
- TorchText 4: For natural language processing, providing tools for text processing, standard datasets, and common NLP model components.
- TorchAudio 5: For audio processing, including datasets, models, and functions for audio data manipulation.
Pre-trained Models and Community 6 7
Leveraging the concept of Transfer Learning is easy in PyTorch. A large community contributes state-of-the-art pre-trained models (especially via TorchVision and platforms like Hugging Face 8).
You can easily load these models and adapt them for your own tasks, often achieving great results with less data and training time.
In the next sections, we’ll dive into the specifics, starting with PyTorch’s fundamental data structure: the Tensor.
7.2 PyTorch Tensors: The Building Blocks of Data
In the previous section, we saw how PyTorch provides high-level tools to simplify deep learning. Now, let’s look under the hood at the most fundamental object you’ll work with: the Tensor.
What is a Tensor?
If you’ve used NumPy before, you’re already familiar with the concept of a multi-dimensional array (ndarray). A PyTorch Tensor is very similar: it’s a multi-dimensional grid of numerical values. Tensors can represent various forms of data:
- A single number (a scalar or 0-dimensional tensor).
- A list of numbers (a vector or 1-dimensional tensor).
- A table of numbers (a matrix or 2-dimensional tensor).
- Or higher-dimensional data, like a color image (which can be represented as a 3D tensor: height x width x color channels) or a batch of images (a 4D tensor: batch size x height x width x channels, although PyTorch often uses batch size x channels x height x width).
Why Tensors?
Tensors are the primary way we represent and manipulate data in PyTorch. They are optimized for:
- Numerical Computation: Performing mathematical operations efficiently.
- GPU Acceleration: Unlike NumPy arrays, Tensors can be easily moved to and processed on GPUs for massive speedups.
- Automatic Differentiation: PyTorch’s autograd system (which we’ll cover next) operates directly on Tensors to calculate gradients automatically.
7.2.1 Creating Tensors
There are several ways to create tensors in PyTorch:
You don’t need to memorize all these operations. You can always refer to the PyTorch documentation for a comprehensive list of tensor operations and functions.
Directly from data (Python lists or NumPy arrays)
import torch
import numpy as np

# From a Python list
list_data = [[1, 2], [3, 4]]
t1 = torch.tensor(list_data)
print(t1)
# tensor([[1, 2],
#         [3, 4]])

# From a NumPy array (shares memory!)
numpy_array = np.array([5, 6, 7])
t2 = torch.from_numpy(numpy_array)
print(t2)
# tensor([5, 6, 7])  # dtype is inferred from the array (typically torch.int64)
Creating tensors with specific values
# Tensor of zeros
shape = (2, 3)
zeros_tensor = torch.zeros(shape)
print(zeros_tensor)
# tensor([[0., 0., 0.],
#         [0., 0., 0.]])

# Tensor of ones
ones_tensor = torch.ones(shape)
print(ones_tensor)
# tensor([[1., 1., 1.],
#         [1., 1., 1.]])

# Tensor with random values (uniform distribution 0 to 1)
rand_tensor = torch.rand(shape)
print(rand_tensor)
# tensor([[0.1234, 0.5678, 0.9012],   # Example random values
#         [0.3456, 0.7890, 0.2345]])

# Tensor with random values (standard normal distribution)
randn_tensor = torch.randn(shape)
print(randn_tensor)
# tensor([[-0.5432,  1.2345, -0.9876],   # Example random values
#         [ 0.6543, -1.5432,  0.1234]])
Creating tensors similar to other tensors: You can create new tensors that have the same shape and dtype (data type) as an existing tensor.
x_data = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

# Create zeros with the same shape and type as x_data
x_zeros = torch.zeros_like(x_data)
print(x_zeros)
# tensor([[0., 0.],
#         [0., 0.]])

# Create random numbers with the same shape and type as x_data
x_rand = torch.rand_like(x_data)
print(x_rand)
# tensor([[0.1111, 0.2222],   # Example random values
#         [0.3333, 0.4444]])
7.2.2 Tensor Attributes
Every tensor has important attributes that describe it:
- shape (or .size()): A tuple representing the dimensions of the tensor.
- dtype: The data type of the elements within the tensor (e.g., torch.float32, torch.int64, torch.bool) 9. float32 is the most common for neural network parameters.
- device: The device where the tensor’s data is stored (e.g., cpu or cuda:0 for the first GPU).
tensor = torch.randn(3, 4)  # Shape (3, 4)
print(f"Shape of tensor: {tensor.shape}")
# Output: Shape of tensor: torch.Size([3, 4])
print(f"Datatype of tensor: {tensor.dtype}")
# Output: Datatype of tensor: torch.float32 (default float)
print(f"Device tensor is stored on: {tensor.device}")
# Output: Device tensor is stored on: cpu (default)

# Creating a tensor with specific dtype and on GPU (if available)
if torch.cuda.is_available():
    gpu_tensor = torch.ones(2, 2, dtype=torch.float64, device='cuda')
    print(f"\nGPU Tensor Device: {gpu_tensor.device}")
    print(f"GPU Tensor Dtype: {gpu_tensor.dtype}")
else:
    print("\nCUDA not available, GPU tensor not created.")
7.2.3 Common Tensor Operations
PyTorch supports hundreds of operations on tensors. Here are some basics:
Element-wise Operations: Standard math operations apply element by element.
t1 = torch.tensor([[1., 2.], [3., 4.]])
t2 = torch.ones(2, 2) * 5  # Creates a tensor [[5., 5.], [5., 5.]]

# Addition
print("Addition:\n", t1 + t2)
# tensor([[6., 7.],
#         [8., 9.]])

# Multiplication (element-wise)
print("Multiplication:\n", t1 * t2)
# tensor([[ 5., 10.],
#         [15., 20.]])

# In-place operations (modify the tensor directly, often denoted by trailing _)
t1.add_(t2)  # t1 is now modified
print("t1 after in-place add:\n", t1)
# tensor([[6., 7.],
#         [8., 9.]])
Operations often support broadcasting (similar to NumPy) where PyTorch automatically expands tensors of smaller dimensions to match larger ones under certain rules, simplifying code.
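As a small illustration of broadcasting, the sketch below adds a 1-dimensional row and a single-column tensor to a 2 x 3 matrix. The tensor names and values are made up for the example.

```python
import torch

matrix = torch.tensor([[1., 2., 3.],
                       [4., 5., 6.]])      # shape (2, 3)
row = torch.tensor([10., 20., 30.])        # shape (3,)
column = torch.tensor([[100.], [200.]])    # shape (2, 1)

# The (3,) row is treated as (1, 3) and repeated along dim 0
row_sum = matrix + row
print(row_sum)
# tensor([[11., 22., 33.],
#         [14., 25., 36.]])

# The (2, 1) column is repeated along dim 1
column_sum = matrix + column
print(column_sum)
# tensor([[101., 102., 103.],
#         [204., 205., 206.]])
```

In both cases no copy of the smaller tensor is written out by hand; PyTorch aligns the shapes from the trailing dimension and stretches any dimension of size 1.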
Indexing and Slicing: Works just like NumPy indexing.
tensor = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("First row:", tensor[0])  # tensor([1, 2, 3])
print("Second column:", tensor[:, 1])  # tensor([2, 5, 8])
print("Element at row 1, col 2:", tensor[1, 2])  # tensor(6)
print("Sub-matrix (rows 0-1, cols 1-2):\n", tensor[0:2, 1:3])
# tensor([[2, 3],
#         [5, 6]])
Reshaping Tensors: Changing the shape without changing the data.
tensor = torch.arange(6)  # tensor([0, 1, 2, 3, 4, 5])

# Reshape to 2 rows, 3 columns
reshaped = tensor.reshape(2, 3)
print("Reshaped:\n", reshaped)
# tensor([[0, 1, 2],
#         [3, 4, 5]])

# .view() is similar but requires the new shape to be compatible
# with the original's memory layout (often faster, shares memory)
viewed = tensor.view(3, 2)
print("Viewed:\n", viewed)
# tensor([[0, 1],
#         [2, 3],
#         [4, 5]])

# Add a dimension (unsqueeze)
unsqueezed = tensor.unsqueeze(dim=0)  # Add dimension at the beginning
print("Unsqueezed shape:", unsqueezed.shape)  # torch.Size([1, 6])

# Remove dimensions of size 1 (squeeze)
squeezed = unsqueezed.squeeze(dim=0)
print("Squeezed shape:", squeezed.shape)  # torch.Size([6])
Matrix Multiplication: Use the @ operator or torch.matmul().
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 4)
product = mat1 @ mat2  # or torch.matmul(mat1, mat2)
print("Matrix product shape:", product.shape)  # torch.Size([2, 4])
7.2.4 The NumPy Bridge
PyTorch tensors on the CPU can be converted to NumPy arrays and vice-versa very efficiently.
- Tensor to NumPy: .numpy()
- NumPy to Tensor: torch.from_numpy()
If the Tensor is on the CPU, the Tensor and the NumPy array share the same underlying memory location. This means changing one will change the other!
If the tensor is on the GPU, you must first move it to the CPU (.cpu()) before converting it to NumPy using .numpy().
# Tensor to NumPy
tensor_cpu = torch.ones(5)
numpy_arr = tensor_cpu.numpy()
print("NumPy array:", numpy_arr)  # [1. 1. 1. 1. 1.]
tensor_cpu.add_(1)  # Modify the tensor
print("NumPy array after tensor modified:", numpy_arr)  # [2. 2. 2. 2. 2.] <- It changed!

# NumPy to Tensor
numpy_arr = np.zeros(3)
tensor_from_numpy = torch.from_numpy(numpy_arr)
print("Tensor:", tensor_from_numpy)  # tensor([0., 0., 0.], ...)
np.add(numpy_arr, 5, out=numpy_arr)  # Modify the NumPy array
print("Tensor after NumPy array modified:", tensor_from_numpy)  # tensor([5., 5., 5.], ...) <- It changed!
Why is the shared memory feature of the NumPy bridge both powerful and potentially dangerous if you’re not careful?
Hint: Think about efficiency vs. unintended side effects.
Based on the PyTorch documentation, can you find the differences between the following functions or attributes?
- tensor.view() vs. tensor.reshape()
- torch.cat() vs. torch.stack()
- torch.unsqueeze() vs. torch.squeeze()
- torch.cuda.FloatTensor vs. torch.FloatTensor
- "cpu" vs. "cuda" vs. "cuda:0"
Hint: Try creating a tensor in Google Colab and playing around with these functions to see what they do.
Tensors are the starting point for everything else in PyTorch. Understanding how to create and manipulate them is essential before we move on to how PyTorch automatically computes gradients with them using Autograd.
7.3 Autograd: Automatic Differentiation
Remember back in our Deep Learning overview, we discussed Optimization Algorithms like Gradient Descent? These algorithms need to know the gradient (the slope or derivative) of the loss function with respect to each model parameter (weights and biases) to update them correctly and minimize the loss. Calculating these gradients manually for complex models would be incredibly tedious and error-prone.
Check out 3Blue1Brown’s video to recap the idea of gradients, backpropagation, and the chain rule.
This is where PyTorch’s magic comes in: torch.autograd, its automatic differentiation engine.
What Does Autograd Do?
Autograd automates the computation of gradients. You define the forward pass of your computation (how inputs produce outputs), and Autograd automatically figures out how to compute the gradients for the backward pass.
7.3.1 How Does Autograd Work? (The Concepts)
Tracking Operations: PyTorch keeps track of all the operations performed on tensors for which gradient tracking is enabled. It does this by building a dynamic computational graph behind the scenes. This graph represents the relationships between tensors and the operations that created them. Think of it as a recipe recording every step taken.
The requires_grad Flag: For Autograd to track operations on a tensor and compute gradients for it later, the tensor’s requires_grad attribute must be set to True.
- Tensors representing learnable parameters (like the weights and biases in nn.Linear or nn.Conv2d layers) automatically have requires_grad=True.
- Input data tensors typically don’t need gradients, so they usually have requires_grad=False (the default for newly created tensors).
- You can set it explicitly when creating a tensor: torch.randn(3, 3, requires_grad=True) or change it in-place later: my_tensor.requires_grad_(True).
Starting the Backward Pass: .backward(): Once you have performed your forward pass and computed your final loss value (which must be a scalar, i.e. a single number), you call the .backward() method on that scalar loss tensor (e.g., loss.backward()).
Gradient Calculation & Storage: Calling .backward() triggers Autograd to traverse the computational graph backward from the loss scalar. Using the chain rule of calculus, it computes the gradient of the loss with respect to every tensor in the graph that has requires_grad=True.
The .grad Attribute: The computed gradients are then accumulated (added) into the .grad attribute of the corresponding leaf tensors (the initial tensors in the graph that had requires_grad=True, typically your model’s parameters).
A Simple Example
Let’s see it in action:
import torch

# Create a tensor 'x' that requires gradients
x = torch.ones(2, 2, requires_grad=True)
print("x:\n", x)

# Perform an operation
y = x + 2
print("y:\n", y)
# y was created by an operation involving x, so it has a 'grad_fn'

# Perform more operations
z = y * y * 3
out = z.mean()  # Calculate a scalar mean value
print("out:", out)  # out = tensor(27., grad_fn=<MeanBackward0>)

# Now, compute gradients using backpropagation
out.backward()

# The gradient d(out)/dx is computed and stored in x.grad
print("Gradient of out w.r.t x (x.grad):\n", x.grad)
# tensor([[4.5000, 4.5000],
#         [4.5000, 4.5000]])

# Math check: out = (1/4) * sum(3 * (x+2)^2)
# d(out)/dx_ij = (1/4) * 3 * 2 * (x_ij+2) = 1.5 * (x_ij+2)
# Since x_ij = 1, d(out)/dx_ij = 1.5 * (1+2) = 4.5
In this example:
- We created x with requires_grad=True.
- We performed operations (+, *, mean) to get a scalar out. PyTorch built a graph tracking these.
- Calling out.backward() calculated the gradient \(\frac{\partial \text{out}}{\partial x}\) using the chain rule.
- The result was stored in x.grad.
7.3.2 Important Points about Autograd
Gradient Accumulation: As mentioned, gradients computed by .backward() are accumulated into the .grad attribute. They don’t overwrite the previous value; they add to it. This is why, before each training iteration’s backward pass, you must explicitly zero out the gradients from the previous step using optimizer.zero_grad(). Otherwise, gradients from multiple steps would mix, leading to incorrect parameter updates.
Disabling Gradient Tracking: Sometimes you don’t want PyTorch to track operations (e.g., during model evaluation/inference, or when modifying parameters outside the optimizer). Tracking consumes memory and computation. You can disable it in two main ways:
- with torch.no_grad(): A context manager that disables gradient tracking for any operation within its block. This is the standard way to run inference code.
- .detach(): Creates a new tensor that shares the same data as the original but is detached from the computation history. It won’t require gradients, even if the original did. Useful if you need to use a tensor’s value without affecting gradient calculations later.
x = torch.randn(3, requires_grad=True)
print("Requires grad:", x.requires_grad)  # True

# Using no_grad context
with torch.no_grad():
    y = x * 2
    print("y requires grad inside no_grad:", y.requires_grad)  # False

# Using detach
z = x * 3
z_detached = z.detach()
print("z requires grad:", z.requires_grad)  # True
print("z_detached requires grad:", z_detached.requires_grad)  # False
Backward on Scalars Only: You can call .backward() without arguments only on a tensor containing a single scalar value (like a loss). If you have a non-scalar tensor and need gradients, you typically provide a gradient argument to .backward() specifying how to weight the gradients for each element (this is more advanced).
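For the non-scalar case, a minimal sketch might look like the following. The input values and the all-ones weight tensor are arbitrary choices for illustration; with all-ones weights the result is the same as calling y.sum().backward().

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2          # non-scalar output, shape (3,)

# y.backward() alone would raise a RuntimeError here, because y is not a scalar.
# Passing a tensor of weights with the same shape as y computes
# the gradient of sum(weights * y) with respect to x.
weights = torch.tensor([1.0, 1.0, 1.0])
y.backward(gradient=weights)

print(x.grad)   # tensor([2., 4., 6.])  -> d(y_i)/d(x_i) = 2 * x_i
```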
During model evaluation (inference), why is it crucial to use with torch.no_grad(): or .detach() before passing data through the model? (Think about efficiency and correctness)
Hint: Do we need gradients when just making predictions? What resources does tracking gradients consume?
Autograd is the engine that enables efficient gradient-based optimization in PyTorch. By understanding requires_grad, .backward(), and .grad, along with the concept of gradient accumulation and how to disable tracking, you have the core knowledge needed to understand how models learn during the training loop.
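Gradient accumulation is easy to observe directly. The sketch below uses a made-up two-element tensor w and calls .backward() twice without resetting in between; the manual w.grad.zero_() call stands in for what an optimizer's zero_grad() would take care of in a real training loop.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(w * 3))/dw = 3 for each element
(w * 3).sum().backward()
first_grad = w.grad.clone()
print(first_grad)         # tensor([3., 3.])

# Second backward pass without zeroing: the new gradient is added on top
(w * 3).sum().backward()
accumulated_grad = w.grad.clone()
print(accumulated_grad)   # tensor([6., 6.])

# Reset by hand, then compute a fresh gradient
w.grad.zero_()
(w * 3).sum().backward()
print(w.grad)             # tensor([3., 3.])
```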
7.4 Moving Computations to the GPU
We’ve mentioned that one of PyTorch’s key strengths is its excellent GPU acceleration support. Deep learning often involves vast amounts of computation, especially large matrix multiplications. GPUs are designed for precisely this kind of parallel processing and can dramatically speed up model training and inference compared to using only the CPU.
PyTorch makes using a GPU remarkably simple using the .to() method (if you have a compatible NVIDIA GPU and have installed the correct PyTorch version with CUDA support).
Check out this video to see the difference between how CPUs and GPUs compute. Deep learning involves tons of matrix multiplications, which are easy to parallelize - that’s why GPUs are so great for deep learning.
How to Move Tensors and Models to the GPU
Checking for GPU Availability and Setting the Device
First, you should check if a GPU is available and define a device object that your code can use. This makes your code portable: it will run on the GPU if available, otherwise defaulting to the CPU.
import torch

# Check if CUDA (GPU support) is available
if torch.cuda.is_available():
    # Set device to the default CUDA device (GPU 0 unless changed)
    device = torch.device("cuda")
    print(f"CUDA is available. Using device: {device}")
else:
    # Set device to CPU
    device = torch.device("cpu")
    print(f"CUDA not available. Using device: {device}")
Note: If you have multiple GPUs, you can specify a different device like cuda:1 or cuda:0 to use a specific GPU.
Moving Tensors to the Device
You can move a tensor to the selected device using the .to() method:
# Assuming 'device' is defined as above
# Create a tensor on the CPU (default)
cpu_tensor = torch.randn(3, 3)
print(f"Original tensor device: {cpu_tensor.device}")

# Move the tensor to the determined device (GPU or CPU)
device_tensor = cpu_tensor.to(device)
print(f"Moved tensor device: {device_tensor.device}")
Note: The .to() method returns a new tensor on the target device (if it’s not already there). It doesn’t modify the original tensor in-place, so reassign the result if you want to keep the same name (cpu_tensor = cpu_tensor.to(device)).
Moving Models to the Device
Similarly, you need to move your neural network model (which is an instance of nn.Module) to the device (we’ll learn more about nn.Module later). This moves all the model’s parameters (which are themselves tensors) to the target device.
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 2)  # Example layer

    def forward(self, x):
        return self.linear(x)

# Create an instance of the model (initially on CPU)
model = SimpleModel()
print(f"Model parameter device (before move): {next(model.parameters()).device}")

# Move the entire model to the determined device
model.to(device)
print(f"Model parameter device (after move): {next(model.parameters()).device}")
Crucial Requirement: Same Device!
For any operation involving multiple tensors (e.g., passing input data through a model layer), all tensors involved must be on the same device. If you try to perform an operation between a tensor on the CPU and a tensor on the GPU, you will get a runtime error.
Therefore, a standard pattern in PyTorch training scripts is:
- Define the device.
- Move the model to the device.
- Inside the training loop, move each batch of input data and labels to the device before feeding them into the model.
# --- Inside a typical training loop ---
# Assuming model and device are already defined and model is on device
# Get a batch of data and labels from your DataLoader
# inputs, labels = data_batch # (DataLoader typically yields CPU tensors)
# Move data to the same device as the model <<< IMPORTANT STEP
# inputs = inputs.to(device)
# labels = labels.to(device)
# Now perform the forward pass (model and inputs are on the same device)
# outputs = model(inputs)
# ... rest of the loop (loss calculation, etc.) ...
Remember: Always ensure your model and the data being fed into it reside on the same device (cpu or cuda) to avoid runtime errors. Use the .to(device) pattern consistently.
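The three-step pattern can be sketched end to end as follows. This is a minimal illustration rather than a full training loop: the single nn.Linear layer and the random inputs and labels are stand-ins, and a real loop would get its batches from a DataLoader.

```python
import torch
import torch.nn as nn

# 1. Define the device (falls back to CPU when no GPU is present)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. Move the model to the device
model = nn.Linear(10, 2).to(device)

# 3. Move each batch to the device before the forward pass
#    (random stand-in data in place of a real batch)
inputs = torch.randn(4, 10)
labels = torch.randint(0, 2, (4,))
inputs, labels = inputs.to(device), labels.to(device)

outputs = model(inputs)
print(outputs.shape)                        # torch.Size([4, 2])
print(outputs.device.type == device.type)   # True
```

The same script runs unchanged on a CPU-only machine, which is the main reason for routing everything through a single device variable.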
What happens if you consistently move your model and data to and from different devices?
Hint: Think about the performance implications. Moving stuff around costs time.
This simple .to(device) mechanism is fundamental for unlocking the performance potential of PyTorch for deep learning tasks. Now, let’s move on to how PyTorch helps manage the data itself.
7.5 Data Handling: Dataset, Transforms, and DataLoader
In the previous lecture, we emphasized the critical role of Data. Preparing input data, formatting outputs, splitting into training/validation/testing sets, handling potentially massive datasets, and feeding data efficiently to the model are all essential steps. Doing this manually, especially with operations like shuffling, batching, and data preprocessing/augmentation, can be complex and inefficient.
PyTorch provides elegant tools within the torch.utils.data module to streamline this process: Dataset, DataLoader, and commonly used transforms (especially from torchvision.transforms).
7.5.1 torch.utils.data.Dataset
The Dataset class is an abstraction that represents your dataset. Think of it as a standardized way to access individual data points. PyTorch has two main types, but the most common is the map-style dataset. To create a custom map-style dataset, you typically subclass torch.utils.data.Dataset and override two key methods (we’ll see an example later):
- __len__(self): This method should return the total number of samples in your dataset.
- __getitem__(self, idx): This method is responsible for retrieving the single data sample (features and corresponding label/target) at the given index idx. This is often where you’ll implement the logic to load data from disk (e.g., read an image file, load text) and perform initial processing.
Libraries like torchvision.datasets provide convenient pre-built Dataset classes for many common public datasets (MNIST, CIFAR-10, ImageNet, etc.), handling downloading and setup automatically 10.
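Before dealing with files on disk, it may help to see the two required methods on a toy, in-memory dataset. The SquaresDataset class below is a hypothetical example invented for this sketch: sample i is simply the pair (i, i squared).

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the pair (i, i squared)."""

    def __init__(self, num_samples):
        self.inputs = torch.arange(num_samples, dtype=torch.float32)
        self.targets = self.inputs ** 2

    def __len__(self):
        # Total number of samples
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return one (features, target) pair
        return self.inputs[idx], self.targets[idx]


dataset = SquaresDataset(num_samples=5)
print(len(dataset))   # 5
print(dataset[3])     # (tensor(3.), tensor(9.))
```

Anything that supports len() and integer indexing in this way can be handed to the rest of PyTorch's data utilities.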
7.5.2 Preprocessing and Augmentation with Transforms
Raw data (like images on disk) is rarely in the exact format a neural network expects (e.g., specific size, numerical range, tensor structure). Furthermore, we often want to apply data augmentation during training to artificially increase the diversity of our dataset and make the model more robust. This is where transforms come in.
Transforms are functions/classes that perform operations on your data, usually applied within the Dataset’s __getitem__ method. For images, the torchvision.transforms module provides a wide array of useful transforms.
Common Preprocessing Transforms
- transforms.Resize((height, width)): Resizes the input image to a specific size.
- transforms.CenterCrop(size): Crops the center of the image.
- transforms.ToTensor(): Crucial! Converts a PIL Image or NumPy array (H x W x C, range [0, 255]) into a PyTorch FloatTensor (C x H x W, range [0.0, 1.0]). It handles the necessary dimension reordering and scaling.
- transforms.Normalize(mean, std): Normalizes a tensor image with a specified mean and standard deviation for each channel. This helps stabilize training, as models often perform better with input features centered around zero with unit variance. mean and std are often pre-computed on large datasets like ImageNet, since we often use models pre-trained on them for transfer learning.
ToTensor() is almost always required when working with image data from PIL or NumPy, as it performs the necessary conversion and reshaping (HWC -> CHW) that models expect.
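To see what these two transforms do numerically, here is a sketch that reproduces their described behaviour by hand with plain torch operations instead of calling torchvision. The 2 x 2 image and the mean and std of 0.5 are illustrative values, not real dataset statistics.

```python
import numpy as np
import torch

# A tiny 2 x 2 RGB "image" in H x W x C layout with uint8 values
image_hwc = np.zeros((2, 2, 3), dtype=np.uint8)
image_hwc[..., 0] = 255   # red channel fully on
image_hwc[..., 1] = 51    # green channel at 51 / 255 = 0.2

# The ToTensor() step by hand: reorder to C x H x W, then scale to [0, 1]
tensor_chw = torch.from_numpy(image_hwc).permute(2, 0, 1).float() / 255.0
print(tensor_chw.shape)   # torch.Size([3, 2, 2])

# The Normalize(mean, std) step by hand: (value - mean) / std per channel
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)   # illustrative statistics
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
normalized = (tensor_chw - mean) / std

# Channel values at pixel (0, 0): approximately 1.0, -0.6, -1.0
print(normalized[:, 0, 0])
```

With a mean and std of 0.5, the [0, 1] range is mapped to [-1, 1], which is why the fully-on red channel ends up at 1.0 and the empty blue channel at -1.0.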
Common Augmentation Transforms (Usually only applied to training data)
- transforms.RandomHorizontalFlip(p=0.5): Randomly flips the image horizontally with a given probability p.
- transforms.RandomRotation(degrees): Randomly rotates the image by a certain angle range.
- transforms.ColorJitter(...), transforms.RandomResizedCrop(...), etc.
Chaining Transforms with Compose
Typically, you want to apply multiple transforms in sequence. transforms.Compose allows you to chain them together neatly:
import torchvision.transforms as transforms

# Example transform pipeline for training
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),  # Resize
    transforms.RandomCrop(224),  # Randomly crop to 224x224
    transforms.RandomHorizontalFlip(),  # Augmentation
    transforms.ToTensor(),  # Convert to tensor (scales to [0, 1])
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats
                         std=[0.229, 0.224, 0.225])  # Normalize
])

# Example transform pipeline for validation/testing (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize directly to final size
    transforms.ToTensor(),  # Convert to tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
Conceptual Example of a Custom Dataset and Transforms
The transform pipeline is usually passed to the Dataset during initialization and applied within __getitem__.
from torch.utils.data import Dataset
# Assume necessary imports like os, pandas, PIL.Image, torch etc.
class CustomImageDataset(Dataset):
def __init__(self, annotations_file, img_dir, transform=None):
"""
Args:
annotations_file (string): Path to the csv file with annotations.
img_dir (string): Directory with all the images.
transform (callable, optional): Optional transform to be applied on a sample.
"""
self.img_labels = self._load_annotations(annotations_file) # e.g., load into pandas DataFrame
self.img_dir = img_dir
self.transform = transform
def _load_annotations(self, file_path):
# Implement logic to load image names and labels, e.g., from a CSV
# Return something like a list of tuples: [('image1.jpg', 0), ('image2.jpg', 1), ...]
pass
def __len__(self):
# Returns the total number of samples
return len(self.img_labels)
def __getitem__(self, idx):
# 1. Get image path and label based on index
= os.path.join(self.img_dir, self.img_labels[idx][0])
img_path = self.img_labels[idx][1]
label
# 2. Load image (e.g., using PIL)
= Image.open(img_path).convert("RGB") # Example loading
image
# 3. Apply transformations HERE before returning (if any) - e.g., resize, normalize, convert to tensor
if self.transform:
= self.transform(image)
image
# 4. Return the sample (image tensor, label tensor)
return image, torch.tensor(label, dtype=torch.long)
# Usage (conceptual):
# train_dataset = CustomImageDataset(annotations_file='labels.csv', img_dir='images/', transform=train_transform)
# val_dataset = CustomImageDataset(annotations_file='labels.csv', img_dir='images/', transform=val_transform)
# inputs, labels = train_dataset[0] # Get the first sample
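Because the class above needs image files and an annotations file, it cannot be run as-is. The following self-contained sketch follows the same three-method pattern, with random tensors standing in for images. The class name ToyTensorDataset and the flip_horizontal helper are hypothetical, chosen only for this illustration.

```python
import torch
from torch.utils.data import Dataset

class ToyTensorDataset(Dataset):
    """Hypothetical in-memory dataset; random tensors stand in for image files."""

    def __init__(self, num_samples=8, transform=None):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(num_samples, 3, 4, 4, generator=generator)
        self.labels = torch.arange(num_samples) % 2  # alternating 0/1 labels
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform:
            image = self.transform(image)  # applied per sample, at access time
        return image, self.labels[idx]

def flip_horizontal(image):
    # Deterministic stand-in for an augmentation: reverse the width axis
    return image.flip(-1)

plain = ToyTensorDataset()
flipped = ToyTensorDataset(transform=flip_horizontal)

image, label = plain[0]
print(len(plain), tuple(image.shape), label.dtype)
print(torch.equal(flipped[0][0], image.flip(-1)))
```

The transform is applied inside __getitem__, so the stored data stays untouched and two datasets can share the same underlying samples while using different pipelines.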
Why do we typically apply data augmentation transforms (like RandomHorizontalFlip or RandomRotation) only to the training data and not to the validation or test data?
Hint: What is the goal of augmentation? What do we want to measure during validation/testing?
7.5.3 torch.utils.data.DataLoader
Now that our Dataset (with transforms) can provide processed individual samples, we need an efficient way to iterate over these samples in batches for training. This is the job of the DataLoader.
DataLoader wraps a Dataset and provides an iterator that yields batches of data automatically. It handles the complexities of:
- Batching: Grouping individual samples fetched from the Dataset into mini-batches.
- Shuffling: Randomly shuffling the data indices at the beginning of each epoch (crucial for effective training).
- Parallel Loading: Using multiple subprocesses (num_workers) to load data in the background, preventing data loading from becoming a bottleneck during training.
num_workers > 0 means that data is loaded in that many subprocesses. It’s best to start from 0 (main process) or a small number (e.g., 2 or 4) and increase it cautiously, as too many workers can sometimes cause issues or increase overhead, depending on your system.
Creating and Using a DataLoader
from torch.utils.data import DataLoader

# Assume 'train_dataset' and 'val_dataset' are instances of a Dataset class
# (potentially using train_transform and val_transform respectively)

# Create a DataLoader for the training set
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=64,     # How many samples per batch
    shuffle=True,      # Shuffle data every epoch (IMPORTANT for training)
    num_workers=4,     # Number of subprocesses for data loading (adjust based on system)
    # pin_memory=True  # Often used with GPU for faster memory transfers
)

# Create a DataLoader for the validation set
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=128,    # Can often use larger batch size for validation
    shuffle=False,     # No need to shuffle validation data
    num_workers=4,
    # pin_memory=True
)

# How to iterate over the DataLoader in a training loop:
num_epochs = 10
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    # Training phase
    # model.train()
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        # 'inputs' is a batch of images, 'labels' is a batch of labels
        # Move inputs and labels to the correct device (e.g., GPU)
        # inputs, labels = inputs.to(device), labels.to(device)
        # --- Your training steps ---
        # ... (as shown previously) ...
        # ---------------------------
        if batch_idx % 100 == 0:  # Print progress every 100 batches
            print(f" Batch {batch_idx}/{len(train_loader)}")
    # Validation phase (using val_loader)
    # model.eval()
    # with torch.no_grad():
    #     for inputs, labels in val_loader:
    #         inputs, labels = inputs.to(device), labels.to(device)
    #         ... evaluation logic ...
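The loaders above depend on datasets that are not defined here. As a small runnable sketch of the batching behaviour, the example below wraps ten made-up samples in a TensorDataset and iterates over them. The data and variable names are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 made-up samples with 3 features each, plus integer labels
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# num_workers=0 keeps loading in the main process, the simplest setting for a demo
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batch_sizes = []
for inputs, targets in loader:
    batch_sizes.append(inputs.shape[0])
    print(tuple(inputs.shape), targets.tolist())

print(len(loader), batch_sizes)
```

With 10 samples and batch_size=4, the last batch holds only 2 samples. Passing drop_last=True to DataLoader discards such an incomplete final batch if a fixed batch size is required.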
7.5.4 Summary: Dataset, Transforms, and DataLoader
These three components form a powerful pipeline for feeding data to your models:
- Dataset: Defines access to individual raw data samples and applies necessary Transforms.
- Transforms (torchvision.transforms): Preprocess (resize, normalize, ToTensor) and optionally augment individual samples within the Dataset.
- DataLoader: Efficiently wraps the Dataset to provide shuffled batches of processed data, often using parallel workers.
Using this pipeline makes your data loading code clean, efficient, standardized, and ready for training.
7.6 Model Building: nn.Module, Layers, and Containers
In our journey through the “Building Blocks of Deep Learning,” we explored the concept of Models – the architectures composed of various layers (like Convolutional, Fully-Connected, Activation layers) that learn to map inputs to outputs. Now, we’ll see how to construct these models using PyTorch’s powerful torch.nn module.
7.6.1 The torch.nn Namespace
torch.nn is PyTorch’s dedicated library for building neural networks. It provides implementations of common layers, activation functions, loss functions, and other essential building blocks. The most fundamental component within torch.nn for creating any neural network is the nn.Module base class.
7.6.2 nn.Module: The Base for All Models
Every neural network model and every custom layer you build in PyTorch should be a class that inherits from nn.Module. This base class provides a lot of essential functionality behind the scenes, such as tracking the model’s parameters (weights and biases) and offering helpful methods (like .to(device) to move the model to a GPU, or .parameters() to get all learnable weights).
When creating your custom model class, you typically need to implement two key methods:
- __init__(self) (The Constructor): This is where you define and instantiate the layers your network will use. You should assign these layers as attributes of your class (e.g., self.conv1 = nn.Conv2d(...), self.relu1 = nn.ReLU(), self.fc1 = nn.Linear(...)). Layers defined here are automatically registered as sub-modules, allowing nn.Module to track their parameters.
- forward(self, x) (The Forward Pass): This method defines how the input data x flows through the layers you defined in __init__. You call the layers like functions, passing the output of one layer as the input to the next. The forward method specifies the actual computation of your network.
Conceptual Structure
import torch
import torch.nn as nn
import torch.nn.functional as F  # Often used for functional APIs like activation functions

class MyCustomModel(nn.Module):
    def __init__(self):
        super().__init__()  # IMPORTANT: Call parent class constructor first!
        # Define layers here - these become tracked parameters
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(in_features=..., out_features=10)  # '...' depends on conv/pool output size
        # Calculating `...` based on the output dimensions of the preceding layers is a common practical step

    def forward(self, x):
        # Define the data flow through the layers
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)

        # Flatten the output for the fully-connected layer
        # (e.g., x = torch.flatten(x, 1)  # Flatten all dimensions except batch)
        x = x.view(x.size(0), -1)  # Alternative flatten using view

        x = self.fc1(x)
        # No activation/softmax here - often applied outside or handled by the loss function
        return x

# Instantiate the model
# model = MyCustomModel()
# print(model)  # Prints the layers
# model.to(device)  # Move model to GPU/CPU
Notice that we didn’t need to define the backward pass (gradient calculation). PyTorch’s Autograd system handles it automatically, as discussed previously, so you don’t need to implement it manually when using nn.Module correctly.
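The `...` placeholder for in_features has to be filled in before the model can be instantiated. One common way to find the value, sketched below, is to push a dummy batch through the convolution and pooling layers and read off the flattened size. The 32x32 RGB input size and the class name SmallConvNet are assumptions made for this example, not values taken from the text above.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Same layer layout as the sketch above; in_features is inferred for an assumed input size."""

    def __init__(self, input_size=(3, 32, 32), num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Run a dummy batch through the conv/pool stack to measure the flattened size
        with torch.no_grad():
            dummy = torch.zeros(1, *input_size)
            flat_features = self._features(dummy).shape[1]

        self.fc1 = nn.Linear(in_features=flat_features, out_features=num_classes)

    def _features(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        return torch.flatten(x, 1)  # flatten everything except the batch dimension

    def forward(self, x):
        return self.fc1(self._features(x))

model = SmallConvNet()
logits = model(torch.randn(8, 3, 32, 32))
print(model.fc1.in_features, tuple(logits.shape))
```

For a 32x32 input, the padded 3x3 convolution keeps the spatial size, the 2x2 pooling halves it to 16x16, and 16 channels x 16 x 16 gives 4096 input features for the linear layer.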
Nesting Modules
You can easily include instances of other nn.Module classes within your model definition, promoting modularity.
# Define another model that uses MyCustomModel internally
class MyCustomModel2(nn.Module):
    def __init__(self):
        super().__init__()
        self.model1 = MyCustomModel()       # Use instance of the previous model
        self.linear_out = nn.Linear(10, 1)  # Takes output of model1 (size 10, its out_features)

    def forward(self, x):
        x = self.model1(x)      # Pass data through the first model
        x = self.linear_out(x)  # Pass through the final layer
        return x

# Create an instance
model2 = MyCustomModel2()
print("\nNested Model Architecture:\n", model2)
# Apply the nested model (input_data: a batch of 32 inputs in the shape model1 expects)
output2 = model2(input_data)
print(f"\nOutput shape from MyCustomModel2: {output2.shape}")  # Output: torch.Size([32, 1])
7.6.3 Common Layers in torch.nn
torch.nn provides a wide variety of pre-built layers. Here are some you’ll frequently encounter, linking back to concepts from the previous lecture:
Linear Layers
nn.Linear(in_features, out_features)
- Applies a linear transformation (fully-connected layer, dense layer, or dense connection).
Convolutional Layers
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
- Performs 2D convolution, common for image data.
- nn.Conv1d and nn.Conv3d also exist.
Pooling Layers
nn.MaxPool2d(kernel_size, stride=None), nn.AvgPool2d(...)
- Downsamples feature maps.
- nn.AdaptiveAvgPool2d is also useful.
Activation Functions
nn.ReLU(), nn.LeakyReLU(), nn.Sigmoid(), nn.Tanh(), nn.Softmax(dim=...)
- Introduce non-linearity.
- Can be used as modules (e.g., nn.ReLU()) or often via the torch.nn.functional API (e.g., F.relu(...)) within the forward method.
F.relu(...) is a plain function call, which is convenient for simple stateless operations such as activations inside forward. nn.ReLU() is the module form of the same operation. The module form is required when an operation carries internal state or parameters, which basic activations do not, and it is also the form that containers such as nn.Sequential expect.
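A quick check, shown here as a minimal sketch with made-up input values, confirms that the module and functional forms of ReLU compute the same result and that the module holds no learnable parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_module = nn.ReLU()      # module form: can live in __init__ or inside nn.Sequential
out_module = relu_module(x)
out_functional = F.relu(x)   # functional form: called directly inside forward

print(out_module, out_functional)
print(torch.equal(out_module, out_functional))
print(len(list(relu_module.parameters())))  # ReLU has nothing to learn
```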
Regularization Layers
- nn.Dropout(p=0.5): Randomly zeros elements during training.
- nn.BatchNorm1d(num_features), nn.BatchNorm2d(...): Normalizes activations across a batch.
- Help prevent overfitting and stabilize training.
Recurrent Layers
nn.LSTM(input_size, hidden_size, batch_first=False), nn.GRU(...)
- For sequential data.
Transformer Layers
nn.Transformer(...), nn.TransformerEncoderLayer(...), nn.TransformerDecoderLayer(...), nn.MultiheadAttention(...)
- Building blocks for Transformer models.
For a complete list of all available layers, refer to the torch.nn documentation.
7.6.4 Organizing Models: Containers
For clarity and structure, especially in complex models, PyTorch provides container modules:
1. nn.Sequential
- A container that stacks layers sequentially. Data passed to it flows through each layer in the order they were added (no skipping, branching, or complex connections).
- Convenient for simple, linear architectures.
# Define a model using Sequential
sequential_model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)
print("\nSequential Model:\n", sequential_model)
output_seq = sequential_model(input_data)
print(f"Output shape from Sequential: {output_seq.shape}")  # Output: torch.Size([32, 5])
2. nn.ModuleList
- Holds modules in a Python list-like structure. Useful when you need to iterate over layers or access them by index, perhaps applying them within a loop or complex control flow in your forward method.
- Unlike a standard Python list, modules inside ModuleList are correctly registered (parameters are tracked by PyTorch).
# Define a model using ModuleList (layers applied manually in forward)
class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 5)
        ])

    def forward(self, x):
        for layer in self.layers:  # Manually iterate and apply layers
            x = layer(x)
        return x

module_list_model = ModuleListModel()
print("\nModuleList Model:\n", module_list_model)
output_ml = module_list_model(input_data)
print(f"Output shape from ModuleListModel: {output_ml.shape}")  # Output: torch.Size([32, 5])
3. nn.ModuleDict
- Holds modules in a Python dictionary-like structure. Allows you to access layers by name (key).
- Useful for organizing named components or selecting specific layers dynamically in the forward method. Modules are correctly registered.
# Define a model using ModuleDict
class ModuleDictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({
            'input_layer': nn.Linear(10, 20),
            'activation': nn.ReLU(),
            'output_layer': nn.Linear(20, 5)
        })

    def forward(self, x):
        x = self.layers['input_layer'](x)
        x = self.layers['activation'](x)
        x = self.layers['output_layer'](x)
        return x

module_dict_model = ModuleDictModel()
print("\nModuleDict Model:\n", module_dict_model)
output_md = module_dict_model(input_data)
print(f"Output shape from ModuleDictModel: {output_md.shape}")  # Output: torch.Size([32, 5])
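The point about registration is easy to verify. The sketch below compares a model that stores its layers in a plain Python list with one that uses nn.ModuleList. Both class names and the count_params helper are hypothetical, introduced only for this comparison.

```python
import torch.nn as nn

class PlainListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: nn.Module does NOT register these layers
        self.layers = [nn.Linear(10, 20), nn.Linear(20, 5)]

class RegisteredListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer as a sub-module
        self.layers = nn.ModuleList([nn.Linear(10, 20), nn.Linear(20, 5)])

def count_params(module):
    # Total number of scalar values across all registered parameters
    return sum(p.numel() for p in module.parameters())

plain_count = count_params(PlainListModel())
registered_count = count_params(RegisteredListModel())
print(plain_count, registered_count)
```

The plain-list model reports zero parameters, so an optimizer built from its .parameters() would have nothing to update and .to(device) would not move its layers. The ModuleList version reports 325 (10*20 + 20 for the first layer, 20*5 + 5 for the second).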
When would you choose to define a model by subclassing nn.Module versus using nn.Sequential?
Hint: Think about the complexity of the data flow through the layers.
7.6.5 Accessing Model Parameters
Once you’ve defined your model (either via nn.Module or a container), PyTorch makes it easy to access all of its learnable parameters (weights and biases). Some methods to help you do this are:
- .parameters(): Returns an iterator over all parameters.
- .named_parameters(): Returns an iterator over all parameters, yielding both the name and the parameter tensor.
- .named_children(): Returns an iterator over immediate children modules, yielding both the name and the module.
- .state_dict(): Returns a dictionary containing all learnable parameters plus persistent buffers (non-learnable state such as BatchNorm running statistics). We’ll learn more about this later.
# Example of accessing parameters
for param in model.parameters():
    print(f"Size: {param.size()} | Requires Grad: {param.requires_grad}")
# Example of accessing named parameters
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Layer: {name} | Size: {param.size()} | Requires Grad: {param.requires_grad}")
# Example of accessing children modules
for name, child in model.named_children():
    print(f"Child: {name} | Module: {child}")
# Example of accessing state_dict
print("\nState Dict:\n", model.state_dict().keys())  # Keys of all parameters and buffers
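As a runnable illustration of the difference between parameters and the state dict, the sketch below builds a small made-up model that includes a BatchNorm layer, then lists which state-dict entries are not learnable parameters and counts the trainable values.

```python
import torch.nn as nn

# A small illustrative model; BatchNorm1d adds non-learnable buffers
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.BatchNorm1d(20),
    nn.ReLU(),
    nn.Linear(20, 5),
)

param_names = [name for name, _ in model.named_parameters()]
state_keys = list(model.state_dict().keys())
buffer_only = [key for key in state_keys if key not in param_names]

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(param_names)
print(buffer_only)
print(trainable)
```

The buffer-only entries (running mean, running variance, and a batch counter) are saved with the model but are never touched by the optimizer, which is why .state_dict() rather than .parameters() is used for checkpoints.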
By understanding nn.Module, common layers, and containers, you now have the tools to translate the conceptual model architectures discussed earlier into concrete PyTorch code, ready to be trained.
7.7 Leveraging Pre-trained Models & Transfer Learning
We’ve just seen how to build neural network models from scratch using nn.Module and various layers. While essential to understand, training large models (especially deep ones like ResNet or VGG) on large datasets (like ImageNet) requires significant data and computational resources (time, powerful GPUs).
Fortunately, we often don’t need to start from zero! Remember the concepts of Pre-trained Models and Transfer Learning from our “Building Blocks” lecture? The core idea is to take a model already trained on a large general dataset (like ImageNet for images) and adapt it for our specific, often smaller, dataset and task. This usually leads to:
- Faster development: Less training time needed.
- Lower data requirements: Works well even with smaller datasets.
- Better performance: Often achieves higher accuracy than training from scratch on limited data.
PyTorch makes using pre-trained models incredibly easy, primarily through the torchvision.models module for computer vision tasks (similar libraries exist for other domains, like Hugging Face’s transformers for NLP).
7.7.1 Loading Pre-trained Models with torchvision.models
The torchvision.models submodule contains definitions for many popular model architectures (ResNet, VGG, AlexNet, MobileNet, Vision Transformer, etc.) and provides easy access to weights pre-trained on ImageNet.
There are two main ways to load a pre-trained model:
import torchvision.models as models

# --- Option 1: Using the newer 'weights' API (Recommended) ---
# This provides access to different pre-trained weight sets and associated metadata
# List available weights for resnet18
# print(models.ResNet18_Weights.DEFAULT)  # Often points to IMAGENET1K_V1
# print(list(models.ResNet18_Weights))    # Enumerate the weight sets for this architecture
# Load resnet18 with the default ImageNet v1 weights
weights = models.ResNet18_Weights.DEFAULT  # Or models.ResNet18_Weights.IMAGENET1K_V1
model_v1 = models.resnet18(weights=weights)

# --- Option 2: Using the older 'pretrained=True' argument ---
# This typically loads the original ImageNet weights the model was published with
# model_v2 = models.resnet18(pretrained=True)  # Legacy way (deprecated since torchvision 0.13)
# It's generally recommended to use the 'weights' API for clarity and future options.
model = model_v1
# Set the model to evaluation mode if just doing inference/inspection
model.eval()
Inspect the Model
Once loaded, you can print the model to see its architecture, paying close attention to the final layer(s), often called the “classifier” or “fully-connected head”.
# Print the ResNet18 architecture
print(model)
# Output will show layers like conv1, bn1, layer1, layer2, ..., avgpool, fc
# Notice the final layer:
# (fc): Linear(in_features=512, out_features=1000, bias=True)
# This layer outputs 1000 scores, corresponding to the 1000 ImageNet classes.
7.7.2 Adapting the Model for Your Task
The key step in transfer learning is adapting this pre-trained model for your specific task, which likely has a different number of output classes. We typically modify the final classification layer. There are two main strategies:
1. Feature Extraction
Treat the pre-trained model (except the final layer) as a fixed feature extractor. We freeze its weights and only train the weights of the new final layer(s) we add. This is suitable when your dataset is small or very similar to the original dataset (e.g., ImageNet).
# --- Feature Extraction Example ---
# 1. Freeze all parameters in the pre-trained model
for param in model.parameters():
    param.requires_grad = False  # Freeze weights

# 2. Replace the final layer (the 'head')
# ResNet's final layer is named 'fc'. Others might be 'classifier'.
num_features = model.fc.in_features  # Get the input feature size of the original fc layer
num_my_classes = 10                  # Example: Your dataset has 10 classes

# Create a new nn.Linear layer for your task
model.fc = nn.Linear(num_features, num_my_classes)
# NOTE: Parameters of this new layer automatically have requires_grad=True
# Now, only the parameters of 'model.fc' will be updated during training
# Optimizer should be created AFTER replacing the head:
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
The optimizer should be created after freezing parameters and replacing the head, and should typically only be passed the parameters of the new head. We’ll learn more about optimizers later.
- By creating the optimizer after replacing the head, we ensure the optimizer knows about the final set of parameters in your model.
- By passing only the parameters of the new head, we make the intent of the code clearer (we only want to update the new head) and slightly more efficient, because the optimizer is told exactly which parameters need updating.
The name of the final layer (fc in the case of ResNet) can vary depending on the model architecture. You can inspect the model with print(model) to see the exact name.
2. Fine-tuning
Start with the pre-trained weights, but allow some or all of them (usually the later layers) to be updated during training on your new dataset, typically using a low learning rate. This adapts the learned features more closely to your specific task. It generally requires more data than feature extraction.
# --- Fine-tuning Preparation Example ---
# 1. (Optional) Start by freezing all layers as in feature extraction
# for param in model.parameters():
#     param.requires_grad = False

# 2. Replace the head (as before)
num_features = model.fc.in_features
num_my_classes = 10
model.fc = nn.Linear(num_features, num_my_classes)

# 3. (Later, or from the start) Unfreeze some layers for fine-tuning
# Example: Unfreeze parameters in the last two layers (layer4 and fc)
# for name, param in model.named_parameters():
#     if "layer4" in name or "fc" in name:
#         param.requires_grad = True

# 4. Create the optimizer to train ALL parameters where requires_grad=True
# Use a much smaller learning rate for the pre-trained parts than for the new head.
# optimizer = torch.optim.Adam([
#     {'params': model.conv1.parameters(), 'lr': 1e-5},  # Example: Very low LR for early layers
#     # ... potentially different LRs for different blocks ...
#     {'params': model.layer4.parameters(), 'lr': 1e-4},
#     {'params': model.fc.parameters(), 'lr': 1e-3}      # Higher LR for the new head
# ], lr=1e-5)  # Default LR if not specified in groups
# Fine-tuning requires careful setup of the optimizer and learning rates.
Fine-tuning usually involves a globally smaller learning rate than training from scratch, even for the unfrozen layers, to avoid destroying the pre-trained features too quickly.
After potential freezing/unfreezing steps, you can check which parameters require gradients by:
for name, param in model.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")
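The mechanics of both strategies can be tried without downloading any weights. The sketch below uses a tiny hypothetical network, TinyBackboneNet, as a stand-in for a pre-trained model with a ResNet-style fc head. It is not a real torchvision model and its layer sizes are arbitrary, but the freezing, head replacement, and per-group learning rates work the same way.

```python
import torch
import torch.nn as nn

class TinyBackboneNet(nn.Module):
    """Hypothetical stand-in for a pre-trained network with a ResNet-style 'fc' head."""

    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.fc = nn.Linear(64, 1000)

    def forward(self, x):
        return self.fc(self.layer2(self.layer1(x)))

model = TinyBackboneNet()

# Feature extraction: freeze everything, then swap in a new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)  # new layers default to requires_grad=True

trainable_fe = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable_fe)

# Fine-tuning: additionally unfreeze the last block and give it a smaller learning rate
for param in model.layer2.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam([
    {"params": model.layer2.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
trainable_ft = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable_ft)
print([group["lr"] for group in optimizer.param_groups])
```

After the feature-extraction step only fc.weight and fc.bias are trainable. After unfreezing, layer2 joins them, and each parameter group keeps its own learning rate.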
Input Preprocessing
Pre-trained models were trained with specific input preprocessing steps (image size, normalization mean/standard deviation). Normally, you’d need to apply these same transformations to your own data when using these models.
Luckily, the newer weights API often provides the necessary transforms:
# Get the appropriate weights object
weights = models.ResNet18_Weights.DEFAULT

# Get the preprocessing transforms recommended for these weights
preprocess = weights.transforms()
print("\nPreprocessing Transforms required by model:\n", preprocess)
# Apply these transforms to your input images in your Dataset's __getitem__
# Example usage within Dataset:
# image = Image.open(...)
# input_tensor = preprocess(image)  # Apply the transforms
Using these standard transforms ensures your input data matches what the model expects.
You want to adapt a pre-trained ResNet18 model to classify 5 different types of flowers using a small dataset you collected. Which transfer learning strategy (Feature Extraction or Fine-tuning) would likely be the better starting point, and why? What’s the most critical change you need to make to the loaded model object?
Hint: Consider dataset size and the main goal of adapting the model.
Transfer learning with pre-trained models is a cornerstone of modern deep learning practice. PyTorch and torchvision make it accessible, allowing you to leverage powerful models without the need for massive resources, accelerating your path to building effective applications.
7.8 Loss Functions in PyTorch (torch.nn)
Recall from the “Building Blocks” lecture that the Loss Function is crucial for training. It measures how far the model’s predictions are from the actual target values (the ground truth). This calculated “loss” (a scalar value) tells us how poorly the model is performing on a given sample or batch, and its gradient provides the signal needed by the optimizer to update the model’s parameters.
PyTorch provides a variety of standard loss functions within the torch.nn module. You typically instantiate a loss function object and then call it like a function, passing the model’s predictions and the true targets.
import torch
import torch.nn as nn
# General pattern:
# criterion = nn.SomeLossFunction()
# ... obtain model predictions and targets ...
# loss = criterion(predictions, targets)
By default, the loss function will compute the mean loss across the samples in a batch (controlled by the reduction='mean' argument). This results in a single scalar loss value ready for .backward().
The specific loss function you choose depends heavily on the type of task (regression or classification) and the format of your model’s output.
For a full list of loss functions, see the PyTorch Loss Functions documentation.
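The effect of the reduction argument can be seen with a few hand-picked numbers. The sketch below uses nn.MSELoss with made-up predictions and targets.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([1.0, 2.0, 3.0, 4.0])
targets = torch.tensor([0.0, 2.0, 5.0, 4.0])

per_element = nn.MSELoss(reduction="none")(predictions, targets)
total = nn.MSELoss(reduction="sum")(predictions, targets)
mean = nn.MSELoss()(predictions, targets)  # default: reduction="mean"

print(per_element)  # squared errors: 1, 0, 4, 0
print(total.item(), mean.item())
```

The squared errors sum to 5.0 and average to 1.25. Only the 'sum' and 'mean' variants return a scalar that can be passed straight to .backward(); with reduction="none" you would reduce the tensor yourself first, for example to apply per-sample weights.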
7.8.1 Common Loss Functions
1. For Regression Tasks (Predicting Continuous Values)
- nn.MSELoss(): Computes the Mean Squared Error between each element in the prediction and target.
  - Prediction: Tensor of any shape containing predicted values.
  - Target: Tensor of the same shape containing true values.
    criterion_mse = nn.MSELoss()
    predicted_values = torch.randn(10, 1, requires_grad=True)  # e.g., 10 predictions
    true_values = torch.randn(10, 1)
    loss_mse = criterion_mse(predicted_values, true_values)
    print(f"MSE Loss: {loss_mse.item()}")
- nn.L1Loss(): Computes the Mean Absolute Error (MAE). Less sensitive to outliers than MSE.
  - Prediction/Target: Same shape requirements as MSELoss.
    criterion_l1 = nn.L1Loss()
    loss_l1 = criterion_l1(predicted_values, true_values)
    print(f"L1 (MAE) Loss: {loss_l1.item()}")
- nn.SmoothL1Loss(): A combination of L1 and MSE (closely related to Huber Loss), often used in object detection bounding box regression. Less sensitive to outliers than MSE but smoother near zero than L1.
2. For Classification Tasks (Predicting Categories)
- nn.CrossEntropyLoss(): The standard choice for multi-class classification. This function is particularly convenient because it combines nn.LogSoftmax and nn.NLLLoss in one step.
  - Prediction: Expects raw, unnormalized scores (logits) directly from the model’s final linear layer. Shape: (N, C) where N is batch size and C is the number of classes.
  - Target: Expects class indices (long integers) ranging from 0 to C-1. Shape: (N). Pass class indices rather than one-hot encoded integer targets (recent PyTorch versions also accept float class probabilities of shape (N, C), but class indices are the standard format).
    criterion_ce = nn.CrossEntropyLoss()
    # Example: 4 samples, 3 classes
    logits = torch.randn(4, 3, requires_grad=True)  # Raw output from model
    # True class indices (e.g., sample 0 is class 1, sample 1 is class 0, ...)
    true_indices = torch.tensor([1, 0, 2, 1], dtype=torch.long)
    loss_ce = criterion_ce(logits, true_indices)
    print(f"\nCrossEntropy Loss: {loss_ce.item()}")
- nn.BCEWithLogitsLoss(): The standard choice for binary classification (two classes) or multi-label classification (where each sample can belong to multiple classes). It combines a Sigmoid layer with the Binary Cross Entropy loss (nn.BCELoss) for better numerical stability.
  - Prediction: Expects raw logits from the model. Shape typically (N) or (N, 1) for binary, or (N, C) for multi-label.
  - Target: Expects float values representing probabilities or target labels (usually 0.0 or 1.0). Must have the same shape as the input predictions.
    criterion_bce = nn.BCEWithLogitsLoss()
    # Example: Binary classification, 4 samples
    binary_logits = torch.randn(4, requires_grad=True)  # Raw output for positive class
    # True labels (0.0 or 1.0)
    binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
    loss_bce = criterion_bce(binary_logits, binary_targets)
    print(f"BCEWithLogits Loss: {loss_bce.item()}")
    # Example: Multi-label classification, 2 samples, 3 classes
    multilabel_logits = torch.randn(2, 3, requires_grad=True)
    # Targets: sample 0 belongs to class 0 & 2; sample 1 belongs to class 1
    multilabel_targets = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], dtype=torch.float32)  # Explicit dtype
    loss_multilabel = criterion_bce(multilabel_logits, multilabel_targets)
    print(f"Multi-label BCEWithLogits Loss: {loss_multilabel.item()}")
- nn.BCELoss(): Computes Binary Cross Entropy. Requires the input predictions to already be probabilities (i.e., passed through a Sigmoid layer). Less numerically stable than BCEWithLogitsLoss, which is generally preferred.
- nn.NLLLoss(): Negative Log Likelihood loss. Typically used after applying nn.LogSoftmax to the model’s output. nn.CrossEntropyLoss combines these two steps and is usually more convenient for multi-class classification.
Different loss functions have different requirements for their input shapes and targets. Ensure your model’s output and target tensors match the expected shapes and types for the chosen loss function.
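The "combines two steps" relationships described above can be checked numerically. This minimal sketch, using random logits with a fixed seed, compares each combined loss with its two-step equivalent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multi-class: CrossEntropyLoss on logits vs. LogSoftmax followed by NLLLoss
logits = torch.randn(4, 3)
class_indices = torch.tensor([1, 0, 2, 1])
ce = nn.CrossEntropyLoss()(logits, class_indices)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), class_indices)

# Binary: BCEWithLogitsLoss on logits vs. Sigmoid followed by BCELoss
binary_logits = torch.randn(4)
binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
bce_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
bce_probs = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

print(torch.allclose(ce, nll), torch.allclose(bce_logits, bce_probs))
```

For well-behaved logits like these the values agree. The combined versions remain the safer choice because they avoid computing log(sigmoid(x)) or log(softmax(x)) in two separate, less stable steps when logits are large in magnitude.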
7.8.2 Using the Loss Function in Training
The loss function is used within the training loop after obtaining the model’s predictions:
# --- Inside a typical training loop ---
# model = ... (Your nn.Module model)
# criterion = nn.CrossEntropyLoss() # Choose appropriate loss
# optimizer = ... (Your optimizer)
# inputs, targets = ... # Your data batch, on the correct device
# 1. Zero gradients
# optimizer.zero_grad()
# 2. Forward pass: Get model predictions (logits)
# outputs = model(inputs)
# 3. Calculate loss
# loss = criterion(outputs, targets) # <<< Use the loss function
# 4. Backward pass: Compute gradients
# loss.backward() # <<< Autograd calculates gradients based on the loss
# 5. Update weights
# optimizer.step()
# -------------------------------------
You are building a model to classify images into 10 categories (cat, dog, bird, …, truck). Your model’s final layer is nn.Linear(..., 10).
- Which loss function (nn.CrossEntropyLoss or nn.BCEWithLogitsLoss) is appropriate?
- What should the shape of the targets tensor be for a batch size of 32? What should its dtype be?
Hint: Think about multi-class vs. binary/multi-label, and what nn.CrossEntropyLoss expects.
Choosing the correct loss function based on your task and ensuring your model’s output and target data formats match its requirements are crucial steps in building a successful PyTorch model.
7.9 Optimizers in PyTorch (torch.optim)
In the “Building Blocks” lecture, we learned about Optimization Algorithms like Gradient Descent, SGD, Adam, etc. Their purpose is to take the error signal (represented by the loss) and the calculated gradients (which point in the direction of steepest ascent, so the update steps the opposite way) and use this information to adjust the model’s learnable parameters (weights and biases) in a way that minimizes the loss.
PyTorch implements various optimization algorithms in the torch.optim package.
7.9.1 Instantiating an Optimizer
To use an optimizer, you first need to create an instance of it, telling it which parameters it should manage and what learning rate to use.
The general pattern is:
import torch.optim as optim
# Assume 'model' is your nn.Module instance
# optimizer = optim.OptimizerName(params_to_optimize, learning_rate, ...)
# Example using Adam:
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
Key Arguments
- params: An iterable containing the parameters (tensors) the optimizer should update. The most common way to provide this is by passing model.parameters(), which conveniently returns an iterator over all learnable parameters within your nn.Module. For more complex scenarios, one can pass specific lists or dictionaries of parameters. See Fine-tuning in Transfer Learning for an example.
- lr (Learning Rate): Controls the step size for parameter updates. This is arguably the most important hyperparameter to tune. Different optimizers often work best with different learning rate ranges.
7.9.2 Common Optimizers
torch.optim
provides many choices, mirroring the algorithms discussed conceptually:
optim.SGD(params, lr, momentum=0, weight_decay=0, ...)
- Implements Stochastic Gradient Descent.
- Often used with the
momentum
argument (e.g.,momentum=0.9
) which implements SGD with Momentum, typically leading to faster convergence than basic SGD. - Can optionally include
weight_decay
for L2 regularization. - Usually requires careful tuning of the learning rate and potentially a learning rate schedule.
optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, ...)
- Implements the Adam algorithm.
- Combines ideas from Momentum and RMSprop, adapting learning rates for each parameter.
- Often works well with default settings (lr=0.001) across a wide range of problems, making it a popular default choice.
optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, ...)
- Adam with decoupled Weight Decay.
- Generally preferred over standard Adam when using weight decay: Adam folds weight_decay into the gradient as L2 regularization, whereas AdamW applies the decay directly to the weights, which often works better in practice.
optim.RMSprop(params, lr=0.01, alpha=0.99, ...)
- Implements the RMSprop algorithm.
- Adapts learning rates based on the magnitude of recent gradients.
For a full list of optimizers, see the PyTorch Optimizers documentation.
7.9.3 Using the Optimizer in the Training Loop
The optimizer performs its main work in two steps within the training loop:
- optimizer.zero_grad(): This method must be called at the start of each training iteration (before the backward pass). It clears the .grad attribute of all the parameters the optimizer is managing. This is crucial because, as we learned in the Autograd section, .backward() accumulates gradients into the .grad attribute. Without zero_grad(), gradients from previous batches would add up, leading to incorrect updates.
- optimizer.step(): This method is called after the gradients have been computed using loss.backward(). It updates the values of the parameters based on the computed gradients stored in their .grad attribute and the specific update rule of the chosen optimization algorithm (e.g., applying momentum, using adaptive learning rates).
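The two calls can be observed in isolation on a single parameter. The following is a minimal sketch with a made-up one-weight "model" and a toy loss of 3 * w (so the gradient is always 3); it is meant only to show accumulation and the plain SGD update, not a realistic training setup.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass: d(3*w)/dw = 3
(3 * w).backward()
grad_after_first = w.grad.item()

# Second backward pass WITHOUT zero_grad(): the new gradient is added on top
(3 * w).backward()
grad_after_second = w.grad.item()

# A proper iteration: clear, recompute, then update
optimizer.zero_grad()
(3 * w).backward()
optimizer.step()              # plain SGD: w <- w - lr * grad = 1.0 - 0.1 * 3
w_after_step = w.item()

print(grad_after_first, grad_after_second, w_after_step)
```

The second gradient is twice the true value, which is exactly the kind of silent error a missing zero_grad() produces.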
The Training Loop Revisited
Let’s look at the training loop fragment again, highlighting the optimizer’s role:
# --- Inside a typical training loop ---
# model = ...
# criterion = ...
# optimizer = optim.Adam(model.parameters(), lr=0.001) # Instantiate the optimizer
# inputs, targets = ... # Your data batch
# >>> Step 1: Reset gradients from previous iteration
# optimizer.zero_grad()
# Forward pass
# outputs = model(inputs)
# Calculate loss
# loss = criterion(outputs, targets)
# Backward pass (compute gradients for current batch)
# loss.backward()
# >>> Step 2: Update model parameters using computed gradients
# optimizer.step()
# -------------------------------------
What would likely happen during training if you forgot to call optimizer.zero_grad() at the beginning of each iteration?
Hint: Remember that gradients accumulate in the .grad attribute.
7.9.4 Learning Rate Scheduling
As discussed in the previous lecture, adjusting the learning rate during training can often improve performance and convergence. PyTorch provides tools for this in the torch.optim.lr_scheduler module. You typically create a scheduler after creating your optimizer and call scheduler.step() at the appropriate point in your training loop (often after each epoch, sometimes after each batch depending on the scheduler). Common schedulers include StepLR, MultiStepLR, ReduceLROnPlateau, and CosineAnnealingLR. Exploring schedulers is often a next step after getting a basic training loop working.
Example
# Create the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Create the scheduler: multiply the learning rate by 0.1 every 7 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
# In the training loop
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        # ... existing training loop code ...
        # ... forward pass, loss calculation, backward pass ...
        # ... update weights ...
        pass  # placeholder for the per-batch training steps
    # Step the scheduler once per epoch, after the optimizer updates
    scheduler.step()
The timing of scheduler.step() depends on the scheduler type (e.g., some step per epoch, some per batch, ReduceLROnPlateau steps based on a metric). The example shows it inside the epoch loop (once per epoch, after the batch loop), which is common for many schedulers like StepLR.
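To see what a step schedule actually does to the learning rate, the sketch below records the rate used in each epoch. The single dummy parameter, the starting rate of 0.1, and step_size=2 are assumptions picked to keep the output short; the batch loop is replaced by one stand-in update.

```python
import torch

param = torch.zeros(1, requires_grad=True)        # dummy parameter
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lr_per_epoch = []
for epoch in range(6):
    lr_per_epoch.append(optimizer.param_groups[0]["lr"])  # rate used in this epoch
    # Stand-in for the batch loop
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()
    scheduler.step()          # once per epoch, after the optimizer updates

print(lr_per_epoch)           # roughly [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]
```

The rate stays constant for step_size epochs and is then multiplied by gamma, up to floating-point rounding.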
With the optimizer managing parameter updates based on gradients derived from the loss function, we now have almost all the pieces needed to actually train a PyTorch model. The next step is to put them all together in a complete training loop.
7.10 Training a Model in PyTorch (The Training Loop)
We’ve reached the heart of the process! In the “Building Blocks” lecture, we discussed the Training phase – an iterative cycle where the model processes data, calculates errors, and adjusts its parameters to improve. Let’s translate that conceptual loop into PyTorch code.
The Goal Revisited
Remember, the objective isn’t just to minimize loss on the training data, but to achieve good generalization – performance on new, unseen data. We monitor this using a separate validation dataset.
Assembling the Pieces
We’ll use the PyTorch components we’ve learned about:
- DataLoaders: train_loader and val_loader (providing batches of inputs and targets).
- Model: An nn.Module instance (e.g., model = MyCustomModel()).
- Criterion: A loss function instance (e.g., criterion = nn.CrossEntropyLoss()).
- Optimizer: An optimizer instance linked to the model’s parameters (e.g., optimizer = optim.Adam(model.parameters(), lr=0.001)).
- Device: The device object (cuda or cpu) for hardware placement.
7.10.1 The Training Loop Structure
Training typically involves two nested loops:
- Outer Loop (Epochs): Iterates over the entire dataset multiple times. One pass over the full dataset is called an epoch.
- Inner Loop (Batches): Iterates over the mini-batches provided by the DataLoader. Parameter updates happen after processing each batch.
The Training Loop Steps (Inside the Inner Loop)
For each batch within an epoch, we perform the following crucial steps:
import torch
import torch.nn as nn
import torch.optim as optim
# Assume DataLoader, Model, Criterion, Optimizer, device are defined
# Also assume train_loader, val_loader are defined
# e.g.
# model, criterion, optimizer = ..., ..., ...
# train_loader, val_loader = ..., ...
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device) # Ensure model is on the correct device!
# scheduler = ... # Optional: define a learning rate scheduler
num_epochs = 10 # Example number of epochs
for epoch in range(num_epochs):
    # --- Training Phase ---
    # 1. Set model to training mode (enables dropout, batchnorm updates)
    model.train()
    epoch_train_loss = 0.0
    epoch_train_samples = 0
    print(f"Epoch {epoch+1}/{num_epochs} - Training...")
    # Inner loop: Iterates over batches from the DataLoader
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        # 2. Move data to the correct device (must match model's device)
        inputs = inputs.to(device)
        targets = targets.to(device)
        # 3. Clear gradients left over from the previous iteration
        optimizer.zero_grad()
        # 4. Forward pass: Get model outputs (logits)
        outputs = model(inputs)
        # 5. Calculate loss
        loss = criterion(outputs, targets)
        # 6. Backward pass: Compute gradients of the loss w.r.t. model parameters
        loss.backward()
        # 7. Update weights using the optimizer and computed gradients
        optimizer.step()
        # --- Track statistics for the epoch ---
        # Accumulate loss (weighted by batch size)
        # Use loss.item() to get the Python scalar value of the loss tensor
        epoch_train_loss += loss.item() * inputs.size(0)
        epoch_train_samples += inputs.size(0)
        # (Optional: Print progress within the epoch)
        # if batch_idx % 100 == 99: # Print every 100 batches
        #     print(f' Batch {batch_idx + 1}/{len(train_loader)} Current Avg Batch Loss: {loss.item():.4f}')
    # Calculate average training loss for the epoch
    avg_epoch_train_loss = epoch_train_loss / epoch_train_samples
    # --- Validation Phase ---
    model.eval() # 1. Set model to evaluation mode (disables dropout, uses running batchnorm stats)
    epoch_val_loss = 0.0
    epoch_val_correct = 0
    epoch_val_samples = 0
    print(f"Epoch {epoch+1}/{num_epochs} - Validation...")
    with torch.no_grad(): # 2. Disable gradient calculations for efficiency
        for inputs, targets in val_loader:
            # 3. Move data to device
            inputs = inputs.to(device)
            targets = targets.to(device)
            # 4. Forward pass
            outputs = model(inputs)
            # 5. Calculate loss
            loss = criterion(outputs, targets)
            epoch_val_loss += loss.item() * inputs.size(0) # Accumulate validation loss
            # 6. Calculate accuracy (example metric)
            _, predicted_indices = torch.max(outputs, 1) # Get class index with highest score
            epoch_val_samples += targets.size(0)
            epoch_val_correct += (predicted_indices == targets).sum().item()
    # Calculate average validation loss and metrics for the epoch
    avg_epoch_val_loss = epoch_val_loss / epoch_val_samples
    avg_epoch_val_accuracy = 100.0 * epoch_val_correct / epoch_val_samples
    # (Optional: Step the learning rate scheduler, if defined)
    # if scheduler:
    #     scheduler.step() # Or scheduler.step(avg_epoch_val_loss) for ReduceLROnPlateau
    # --- Print Epoch Summary ---
    print(f"Epoch {epoch+1} Summary:")
    print(f" Avg Training Loss: {avg_epoch_train_loss:.4f}")
    print(f" Avg Validation Loss: {avg_epoch_val_loss:.4f}")
    print(f" Validation Accuracy: {avg_epoch_val_accuracy:.2f}%")
    print("-" * 30)
print("Finished Training")
Key Differences: Training vs. Validation Mode
Notice the crucial differences when running the validation loop:
- model.train() vs. model.eval(): These methods switch the behavior of certain layers. model.train() enables dropout and makes BatchNorm use batch statistics. model.eval() disables dropout and makes BatchNorm use its learned running statistics. It’s essential to switch modes correctly.
- Gradient Calculation: We wrap the validation loop in with torch.no_grad():. This tells PyTorch not to track operations for gradient calculation, which saves significant memory and computation time, as gradients are not needed for evaluation.
- Optimizer Steps: We do not call optimizer.zero_grad() or optimizer.step() during validation because we are only evaluating the model, not updating its weights.
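The mode switch is easy to observe on a single layer. This is a small sketch using a standalone nn.Dropout layer and a throwaway nn.Linear layer (both chosen only for illustration) to show how train(), eval(), and torch.no_grad() change behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)   # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

layer.eval()
eval_out = layer(x)    # dropout is a no-op in evaluation mode

linear = nn.Linear(4, 2)
with torch.no_grad():
    y = linear(torch.randn(3, 4))
tracks_grad = y.requires_grad   # False: no graph was recorded inside no_grad()

print(train_out[:8], eval_out[:8], tracks_grad)
```

Note that the two mechanisms are independent: eval() changes layer behavior, while no_grad() only turns off gradient tracking, so evaluation code generally wants both.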
Why is it important to call model.eval() before running the validation loop? What might happen if you forget and leave the model in train() mode during validation?
Hint: Consider layers like Dropout and Batch Normalization.
Monitoring Training
The average training loss, validation loss, and validation accuracy (or other relevant metrics) calculated each epoch are exactly what you would plot to monitor your training progress, just like the conceptual loss curves discussed in the “Building Blocks” lecture. These plots help you diagnose issues like overfitting (validation loss increasing while training loss decreases) or underfitting (both losses high) and decide when to stop training (e.g., using “early stopping” when validation performance plateaus or worsens).
This complete training loop structure is the foundation for teaching your PyTorch models. While variations exist, these core steps provide a solid starting point for almost any supervised learning task.
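The early-stopping idea mentioned above can be sketched in a few lines of plain Python. The helper name early_stopping_epoch, the patience value, and the loss history are all hypothetical; in a real loop the same counter logic would sit after the validation phase of each epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss                      # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                      # no improvement for `patience` epochs
    return None

# Made-up validation losses: improving at first, then getting worse
history = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_at = early_stopping_epoch(history, patience=2)
print(stop_at)
```

In practice one would usually also save a checkpoint whenever a new best loss is seen, so the best model can be restored after stopping.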
7.11 Evaluating a Model in PyTorch (Metrics & Test Loop)
In the “Building Blocks” lecture, we discussed the Inference phase and the importance of Evaluating Performance using metrics beyond just the loss function. While the validation loss calculated during training gives us a good indicator of generalization, a more formal evaluation using task-specific metrics on unseen data (validation or test sets) is crucial.
Why Evaluate?
Evaluation helps us:
- Assess Generalization: Understand how well the model performs on data it wasn’t trained on.
- Compare Models: Objectively compare different architectures or hyperparameters.
- Make Decisions: Decide if the model meets the requirements for its intended application or if further training/tuning is needed.
- Report Performance: Provide unbiased performance metrics (especially using the final test set).
Evaluation Mode: model.eval() and torch.no_grad()
As highlighted in the Training Loop section, before performing evaluation or inference, you must remember to:
- Set the model to evaluation mode: model.eval()
  - This changes the behavior of layers like Dropout (disables it) and Batch Normalization (uses running statistics instead of batch statistics). Failing to do this can lead to inconsistent and worse results.
- Disable gradient computation: with torch.no_grad():
  - This tells PyTorch not to track gradients, saving memory and computation, as they are not needed for just making predictions.
7.11.1 The Evaluation Loop Structure
The loop structure for evaluation (on a validation or test set) is very similar to the validation phase shown in the training loop section.
import torch
# Assume model, criterion, device, and a DataLoader (e.g., val_loader or test_loader) are defined
# model.to(device) # Ensure model is on the correct device
# --- Evaluation Phase ---
model.eval() # 1. Set model to evaluation mode!
all_targets = []
all_predictions = []
eval_loss = 0.0
print("Evaluating...")
with torch.no_grad(): # 2. Disable gradient calculations!
    for inputs, targets in val_loader: # Or test_loader
        # 3. Move data to device
        inputs = inputs.to(device)
        targets = targets.to(device)
        # 4. Forward pass
        outputs = model(inputs)
        # (Optional) Calculate loss on the batch
        loss = criterion(outputs, targets)
        eval_loss += loss.item()
        # 5. Store predictions and targets for metric calculation
        # (Convert to CPU if using external libraries like scikit-learn)
        # For classification, store predicted indices or probabilities
        # For regression, store predicted values
        # Example for classification:
        # _, predicted_indices = torch.max(outputs, 1)
        # all_predictions.append(predicted_indices.cpu())
        # all_targets.append(targets.cpu())
        # Example for regression:
        # all_predictions.append(outputs.cpu())
        # all_targets.append(targets.cpu())
# Concatenate all batches
# all_predictions = torch.cat(all_predictions)
# all_targets = torch.cat(all_targets)
# 6. Calculate overall metrics after the loop (here: mean of the per-batch losses)
avg_eval_loss = eval_loss / len(val_loader)
print(f"Average Evaluation Loss: {avg_eval_loss:.4f}")
# --- Calculate Task-Specific Metrics (see below) ---
# accuracy = ...
# precision = ...
# recall = ...
# mae = ...
print("Finished Evaluation")
7.11.2 Calculating Evaluation Metrics
The core difference during evaluation is calculating meaningful performance metrics based on the collected outputs and targets. The choice of metrics depends heavily on your task:
Common Classification Metrics
Accuracy: The most straightforward metric – the proportion of correctly classified samples.
# --- Inside Evaluation (after loop, assuming classification) ---
# total_samples = len(all_targets)
# correct_predictions = (all_predictions == all_targets).sum().item()
# accuracy = 100.0 * correct_predictions / total_samples
# print(f"Accuracy: {accuracy:.2f}%")
Precision, Recall, F1-Score: Crucial for understanding model performance, especially with imbalanced datasets.
- Precision: Of the samples predicted as positive, how many actually were positive? \(\frac{TP}{TP + FP}\)
- Recall (Sensitivity): Of all the actual positive samples, how many did the model find? \(\frac{TP}{TP + FN}\)
- F1-Score: The harmonic mean of Precision and Recall, providing a single balanced score.
Confusion Matrix: A table showing counts of true vs. predicted classes, useful for identifying specific confusion patterns between classes.
AUC (Area Under the ROC Curve): Measures the ability of the model to distinguish between classes.
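As a worked example of the precision, recall, and F1 definitions above, the sketch below counts TP, FP, and FN directly with tensor comparisons. The ten labels and predictions are made up, with class 1 as the rare positive class, and the code assumes at least one predicted and one actual positive (otherwise the divisions would fail).

```python
import torch

# Made-up binary data: 1 = rare positive class (2 of 10 samples)
targets     = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
predictions = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

tp = ((predictions == 1) & (targets == 1)).sum().item()   # true positives
fp = ((predictions == 1) & (targets == 0)).sum().item()   # false positives
fn = ((predictions == 0) & (targets == 1)).sum().item()   # false negatives

accuracy = (predictions == targets).float().mean().item()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)        # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.80 while precision, recall, and F1 for the rare class are all 0.50, which illustrates how accuracy can look better than the model's performance on the class of interest.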
Common Regression Metrics
Mean Squared Error (MSE) / Root Mean Squared Error (RMSE): Average squared difference between predicted and true values. RMSE is the square root of MSE, putting the error back into the original units.
Mean Absolute Error (MAE): Average absolute difference. Less sensitive to outliers than MSE.
R-squared (R²): Coefficient of determination, indicating the proportion of variance in the target variable predictable from the input features.
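The regression metrics above reduce to a few tensor operations. This is a minimal sketch on four made-up value pairs; R² is computed as 1 - SS_res / SS_tot, the usual definition of the coefficient of determination.

```python
import torch

y_true = torch.tensor([3.0, 5.0, 2.0, 7.0])   # made-up targets
y_pred = torch.tensor([2.0, 5.0, 4.0, 7.0])   # made-up predictions

errors = y_pred - y_true                       # [-1, 0, 2, 0]
mse = (errors ** 2).mean()                     # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = torch.sqrt(mse)                         # back in the original units
mae = errors.abs().mean()                      # (1 + 0 + 2 + 0) / 4 = 0.75

ss_res = (errors ** 2).sum()                   # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum() # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse.item():.4f} RMSE={rmse.item():.4f} MAE={mae.item():.4f} R2={r2.item():.4f}")
```

The single large error (2.0) contributes 4 of the 5 units of squared error but only 2 of the 3 units of absolute error, which is why MSE is more sensitive to outliers than MAE.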
Using Libraries for Metrics
Calculating many metrics (especially precision, recall, F1, AUC) correctly can be tricky. It’s highly recommended to use established libraries:
torchmetrics 13: A PyTorch-native library designed for efficient metric calculation, handling distributed training scenarios as well.
# Example using torchmetrics (install first: pip install torchmetrics)
# See docs: https://torchmetrics.readthedocs.io/en/stable/
import torchmetrics
# --- Before the evaluation loop ---
# Define the metric object (e.g., for multi-class accuracy)
# Move the metric object to the same device as your model and data!
metric = torchmetrics.classification.Accuracy(
    task="multiclass",
    num_classes=NUM_CLASSES # Replace NUM_CLASSES with your actual number
).to(device)
# --- Inside the evaluation loop (within torch.no_grad()) ---
# After getting model 'outputs' and 'targets' on the correct device
# metric.update(outputs, targets) # Update the metric state with batch results
# --- After the evaluation loop ---
# Compute the final metric over all batches
# final_accuracy = metric.compute()
# print(f"Accuracy (torchmetrics): {final_accuracy:.4f}")
# metric.reset() # Reset metric state if you plan to reuse it
scikit-learn.metrics 14: A widely used library. Requires converting PyTorch tensors to NumPy arrays (.cpu().numpy()) first.
# Example using scikit-learn (install first: pip install scikit-learn)
# See docs: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# --- After the evaluation loop ---
# Ensure predictions and targets are numpy arrays on the CPU
# all_predictions_np = all_predictions.cpu().numpy()
# all_targets_np = all_targets.cpu().numpy()
# accuracy = accuracy_score(all_targets_np, all_predictions_np)
# Calculate precision, recall, and F1-score
# The 'average' parameter determines how scores are calculated for multi-class problems:
# - 'weighted': Calculates metrics for each class and averages them,
#   weighted by the number of true instances for each class (support).
#   Good for imbalanced datasets if you care about overall weighted performance.
# - 'macro': Calculates metrics for each class and finds their unweighted mean.
#   Treats all classes equally, regardless of size.
# - 'micro': Calculates metrics globally by counting total true positives,
#   false negatives, and false positives across all classes.
#   Often equivalent to accuracy.
# - None: Returns the scores for each class individually.
precision, recall, f1, _ = precision_recall_fscore_support(
    all_targets_np,
    all_predictions_np,
    average='weighted' # Choose average method
)
print(f"Accuracy (sklearn): {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1 Score (weighted): {f1:.4f}")
Why might accuracy alone be a misleading metric for evaluating a classifier trained on a highly imbalanced dataset (e.g., 99% of samples are class A, 1% are class B)? Which other metrics (Precision, Recall, F1) would give a better picture of performance on the rare class B?
Hint: A model predicting class A always would have high accuracy.
The Final Test Set
Remember the distinction between validation and test sets. The validation set is used during development to tune hyperparameters (like learning rate, model architecture choices) and for early stopping. The test set should be held aside and used only once at the very end of your project to get an unbiased estimate of your final model’s performance on completely unseen data.
Proper evaluation provides crucial insights into your model’s capabilities and limitations, guiding further development and deployment decisions.
7.12 Saving and Loading Models
Training a deep learning model can take a significant amount of time and computational resources. Once you have a trained model that performs well, you’ll definitely want to save it!
Why Save and Load?
- Resume Training: Save checkpoints during long training runs so you can resume later if interrupted.
- Avoid Retraining: Load a previously trained model for inference or further fine-tuning.
- Share Models: Share your trained model weights with others.
- Deployment: Deploy your model for real-world applications.
What to Save? The state_dict
PyTorch models have an internal state dictionary (state_dict) that contains all their learnable parameters (weights and biases) and potentially persistent buffers (like the running mean/variance in BatchNorm layers).
While you can save the entire model object using torch.save(model, PATH), this is generally not recommended because it binds the saved file to the specific code structure used when saving. It can easily break if you refactor your code or use it in a different project.
The recommended and most common practice is to save only the model’s state_dict. This is more lightweight, portable, and less likely to break.
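To see what a state_dict actually holds, the sketch below inspects one for a tiny, made-up model with a Linear and a BatchNorm1d layer. It separates learnable parameters from buffers, both of which end up in the dictionary.

```python
import torch.nn as nn

# Tiny illustrative model: one layer with parameters only, one with buffers too
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

state = model.state_dict()
for name, tensor in state.items():
    print(name, tuple(tensor.shape))

learnable = [name for name, _ in model.named_parameters()]  # weights and biases
buffers = [name for name, _ in model.named_buffers()]       # e.g. BatchNorm running stats
```

The keys are plain strings built from the module names, which is why loading requires a model whose layer names match those used at save time.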
7.12.1 Saving the state_dict
This saves only the model’s learned parameters.
import torch
import torch.nn as nn
# Assume 'model' is your trained nn.Module instance
# Assume 'PATH' is the desired file path, e.g., 'my_model_weights.pth' or '.pt'
# Example: Saving the state_dict
PATH = "my_trained_model.pth"
torch.save(model.state_dict(), PATH)
print(f"Model state_dict saved to {PATH}")
The common extension for PyTorch model files is .pth or .pt. Some prefer .pt, because .pth is also the extension Python itself uses for path configuration files in site-packages.
7.12.2 Loading the state_dict
To load the parameters, you must first create an instance of the same model architecture you used during training. Then, you load the saved state_dict into it.
# Assume 'YourModelClass' is the class definition for your model
# Make sure the class definition is available!
# 1. Instantiate the model structure
model_loaded = YourModelClass(*args, **kwargs) # Use same args as original model
# 2. Load the saved state_dict
PATH = "my_trained_model.pth"
state_dict = torch.load(PATH)
# 3. Load the state_dict into the model instance
model_loaded.load_state_dict(state_dict)
# By default, load_state_dict uses strict=True, meaning the keys in the
# state_dict must exactly match the keys returned by the model's state_dict() method.
# Setting strict=False can be useful in some transfer learning scenarios
# if you only want to load partial weights, but requires caution.
# 4. CRUCIAL: Set the model to evaluation mode if using for inference
model_loaded.eval()
print("Model state_dict loaded successfully.")
# Now you can use model_loaded for inference:
# with torch.no_grad():
# predictions = model_loaded(some_input_data.to(device))
Remember to call model.eval() after loading the weights if you intend to use the model for inference, to ensure layers like Dropout and BatchNorm are in the correct mode.
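A quick way to convince yourself that saving and loading worked is a round trip: save, load into a fresh instance, and compare outputs. The sketch below uses an in-memory io.BytesIO buffer as a stand-in for a file on disk, and a hypothetical make_model() helper in place of your own model class.

```python
import io
import torch
import torch.nn as nn

def make_model():
    # Hypothetical architecture; it must be identical when saving and loading
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

torch.manual_seed(0)
trained = make_model()

buffer = io.BytesIO()                 # in-memory stand-in for a .pth file
torch.save(trained.state_dict(), buffer)
buffer.seek(0)                        # rewind before reading

restored = make_model()               # fresh instance with different random weights
restored.load_state_dict(torch.load(buffer))
trained.eval()
restored.eval()

x = torch.randn(5, 4)
with torch.no_grad():
    same_outputs = torch.equal(trained(x), restored(x))
print(same_outputs)
```

With a real file, the only change is passing a path string to torch.save and torch.load instead of the buffer.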
7.12.3 Saving Checkpoints for Resuming Training
Sometimes, you need to save more than just the model weights to resume training effectively. A common practice is to save a checkpoint dictionary containing:
- The model’s state_dict.
- The optimizer’s state_dict (to resume optimization state like momentum).
- The current epoch number.
- The last recorded loss.
- Any other necessary information (e.g., lr_scheduler.state_dict()).
# --- Example: Saving a Checkpoint ---
# Assume epoch, loss, optimizer are defined
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    # Add anything else needed: 'scheduler_state_dict': scheduler.state_dict(), etc.
}
CHECKPOINT_PATH = f"model_epoch_{epoch}.pth"
torch.save(checkpoint, CHECKPOINT_PATH)
print(f"Checkpoint saved to {CHECKPOINT_PATH}")
# --- Example: Loading a Checkpoint to Resume Training ---
# model = YourModelClass(*args, **kwargs)
# optimizer = optim.Adam(model.parameters(), lr=...) # Create optimizer *before* loading state
# CHECKPOINT_PATH = "model_epoch_X.pth" # Path to the checkpoint file
# checkpoint = torch.load(CHECKPOINT_PATH)
# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# start_epoch = checkpoint['epoch'] + 1 # Resume from next epoch
# last_loss = checkpoint['loss']
# # Load scheduler state if saved: scheduler.load_state_dict(...)
# model.train() # Set model to train mode to resume training
# # Or model.eval() if loading just for evaluation
# print(f"Checkpoint loaded. Resuming from epoch {start_epoch}")
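The checkpoint pattern above can be exercised end to end in a few lines. This sketch is an assumption-laden toy: a single nn.Linear layer, one Adam step on random data, and an in-memory buffer instead of a file. It stores the loss as a plain float via loss.item(), which is one reasonable choice rather than a requirement.

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# One training step so that Adam has internal state worth saving
loss = model(torch.randn(8, 3)).pow(2).mean()
loss.backward()
optimizer.step()

buffer = io.BytesIO()                 # in-memory stand-in for a checkpoint file
torch.save(
    {
        "epoch": 0,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": loss.item(),          # plain float rather than a tensor
    },
    buffer,
)
buffer.seek(0)

# Resume: rebuild the objects first, then load the saved state into them
checkpoint = torch.load(buffer)
new_model = nn.Linear(3, 1)
new_optimizer = optim.Adam(new_model.parameters(), lr=0.01)
new_model.load_state_dict(checkpoint["model_state_dict"])
new_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
print(f"Resuming from epoch {start_epoch}")
```

Restoring the optimizer state matters for optimizers such as Adam or SGD with momentum, because their running averages would otherwise restart from zero.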
7.12.4 Handling Devices (CPU/GPU)
By default, torch.save saves tensors on the device they currently reside on. To make your saved models more portable (e.g., load a GPU-trained model on a CPU-only machine), it’s good practice to save the state_dict after moving the model to the CPU.
When loading, use the map_location argument in torch.load to specify where you want the tensors to be loaded.
# --- Saving for Portability (Recommended) ---
# Move model to CPU before getting state_dict
# Note: model.to('cpu') moves the model itself; call model.to(device) again to continue on the GPU
torch.save(model.to('cpu').state_dict(), PATH)
# --- Loading with map_location ---
# 1. Load onto CPU explicitly
state_dict_cpu = torch.load(PATH, map_location=torch.device('cpu'))
# model.load_state_dict(state_dict_cpu)
# 2. Load onto the current 'device' (GPU if available, else CPU)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# state_dict_mapped = torch.load(PATH, map_location=device)
# model = YourModelClass(...) # Instantiate model
# model.load_state_dict(state_dict_mapped) # Load state dict
# model.to(device) # Ensure model is on the correct device
For more details and advanced scenarios, refer to the official PyTorch documentation on saving and loading models.
Saving and loading models, especially using the state_dict, is a fundamental skill for any PyTorch practitioner, enabling persistence, sharing, and deployment.
7.13 Common Pitfalls and Best Practices
As you start building and training models with PyTorch, you might run into a few common challenges. Here are some of the most common pitfalls and best practices to keep in mind:
Pitfall: Tensor Shape Mismatches
Problem: Layers expect inputs of specific dimensions (e.g., nn.Linear expects (BatchSize, InFeatures), nn.Conv2d expects (BatchSize, InChannels, Height, Width)). Feeding a tensor with an incorrect shape will cause runtime errors. This often happens when flattening convolutional outputs before a linear layer or forgetting the batch dimension.
Best Practice:
- Print Shapes Frequently: Sprinkle print(tensor.shape) throughout your model’s forward method during debugging to track how dimensions change.
- Read Documentation: Carefully check the expected input/output shapes for each PyTorch layer you use.
- Use torch.flatten(x, 1) or x.view(x.size(0), -1): Be mindful when reshaping/flattening. Using view with -1 infers one dimension, which is handy, but ensure the other dimensions are correct.
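The shape-tracking advice can be tried on a small convolution-to-linear hand-off, which is where mismatches most often appear. The layer sizes and the 28x28 single-channel input below are assumptions for illustration; the in_features of the linear layer is read from the observed shape instead of being computed by hand.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
x = torch.randn(4, 1, 28, 28)        # (BatchSize, InChannels, Height, Width)

features = conv(x)                   # 3x3 kernel, no padding: 28 -> 26
flat = torch.flatten(features, 1)    # keep the batch dimension, flatten the rest
fc = nn.Linear(flat.shape[1], 10)    # in_features taken from the observed shape
logits = fc(flat)

print(features.shape)                # torch.Size([4, 8, 26, 26])
print(flat.shape)                    # torch.Size([4, 5408])
print(logits.shape)                  # torch.Size([4, 10])
```

Running a dummy batch through the layers like this is a cheap way to find the right in_features before writing the final model class.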
Pitfall: Device Mismatches (CPU vs. GPU)
Problem: Trying to perform an operation involving tensors located on different devices (e.g., input data on CPU, model on GPU) results in a runtime error.
Best Practice:
- Define device Early: Use the device = torch.device(...) pattern shown previously.
- Move Model: Move your model to the device once (model.to(device)).
- Move Data in Loop: Consistently move input data and targets to the same device inside your training/evaluation loop (inputs = inputs.to(device), targets = targets.to(device)). Unlike modules, tensors are not moved in place, so the result must be reassigned.
- Check .device: When debugging, check the .device attribute of tensors involved in the failing operation.
Pitfall: Mismatched Data Types
Problem: Some loss functions expect a specific data type for their inputs or targets (e.g., BCEWithLogitsLoss expects floating-point targets, so integer torch.long labels must first be converted with .float()). Operations on tensors of different data types can lead to unexpected results or errors.
Best Practice: Check the data types of your tensors (tensor.dtype) consistently, especially in operations you write yourself.
Pitfall: Forgetting optimizer.zero_grad()
Problem: PyTorch accumulates gradients by default (adds them to the .grad attribute on each .backward() call). If you forget optimizer.zero_grad() at the start of your training loop iteration, gradients from previous batches will interfere with the current update, leading to incorrect training.
Best Practice: Make it a habit: Always call optimizer.zero_grad() right at the beginning of your training loop iteration before the forward pass.
Pitfall: Forgetting loss.backward() or optimizer.step()
Problem: Forgetting loss.backward() means no gradients are computed. Forgetting optimizer.step() means gradients are computed but the model’s weights are never updated. In either case, the model doesn’t learn.
Best Practice: Ensure the standard training sequence is followed within the loop: zero_grad() -> forward -> calculate loss -> backward() -> step().
Pitfall: Incorrect Evaluation Mode (model.eval(), torch.no_grad())
Problem: Forgetting model.eval() during validation/testing means layers like Dropout and BatchNorm behave as they do in training, leading to inaccurate performance assessment. Forgetting with torch.no_grad(): means unnecessary computation and memory usage for tracking gradients.
Best Practice: Always call model.eval() before evaluation and wrap the evaluation loop in with torch.no_grad():. Remember to call model.train() when switching back to training.
Pitfall: Incorrect Loss Function Inputs/Targets
Problem: Feeding inputs or targets with incorrect shapes, data types, or formats to the loss function (e.g., probabilities instead of logits for BCEWithLogitsLoss, one-hot encoded integer targets for CrossEntropyLoss where class indices are expected, wrong dtype for targets).
Best Practice: Carefully read the documentation for your chosen loss function. Pay close attention to:
- Expected input format (logits vs. probabilities).
- Expected target format (class indices vs. probabilities/labels).
- Expected target dtype (torch.long for indices, torch.float for BCE targets).
- Expected input/target shapes.
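The checklist above can be made concrete with the two most common classification losses. The batch size, number of classes, and label values below are made up; the sketch only shows one input/target combination that each loss accepts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multi-class: raw logits of shape (batch, classes), integer class indices as targets
logits_mc = torch.randn(4, 3)
targets_mc = torch.tensor([0, 2, 1, 2])           # dtype torch.long, shape (batch,)
loss_mc = nn.CrossEntropyLoss()(logits_mc, targets_mc)

# Binary: raw logits (no sigmoid applied), float targets with the same shape
logits_bin = torch.randn(4)
labels_bin = torch.tensor([0, 1, 1, 0])           # integer labels, as they often arrive
loss_bin = nn.BCEWithLogitsLoss()(logits_bin, labels_bin.float())

print(targets_mc.dtype, labels_bin.float().dtype)
print(loss_mc.item(), loss_bin.item())
```

In both cases the model output is passed as raw logits; applying softmax or sigmoid beforehand would apply the normalization twice, since these losses perform it internally.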
Pitfall: Unintentionally Breaking the Computation Graph
Problem: Performing operations that prevent Autograd from tracking history correctly, often by converting a tensor that requires gradients to NumPy too early, or using non-PyTorch operations mid-graph where gradients are needed.
Best Practice: Keep computations within PyTorch tensors as long as gradients are required. Use .detach() explicitly when you need a tensor’s value without its history, or use the .item() method to get the Python scalar value from a single-element tensor after the backward pass or within a no_grad() block.
Pitfall: Memory Issues (Especially on GPU)
Problem: Running out of GPU memory (CUDA Out of Memory error). Often caused by using excessively large batch sizes, large models, or holding onto unnecessary tensors and their computation history.
Best Practice:
- Reduce batch_size.
- Use with torch.no_grad(): during evaluation.
- Use del tensor_variable if large intermediate tensors are no longer needed.
- Use .detach() on tensors where history is no longer required.
- Consider gradient accumulation or model parallelism for very large models (more advanced).
- Monitor memory usage (torch.cuda.memory_allocated(), torch.cuda.memory_summary()).
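Gradient accumulation, mentioned in the list above, trades time for memory: several small batches are processed before a single optimizer step. The sketch below, on made-up data with a toy linear model, checks that two micro-batches of 4 produce the same gradient as one batch of 8 when each loss is divided by the number of accumulation steps. This equivalence assumes a mean-reduced loss and equally sized micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
criterion = nn.MSELoss()                          # mean reduction
inputs, targets = torch.randn(8, 5), torch.randn(8, 1)

# Reference: gradient from one full batch of 8
model.zero_grad()
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4, each loss scaled by 1/accum_steps
accum_steps = 2
model.zero_grad()
for chunk_x, chunk_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = criterion(model(chunk_x), chunk_y) / accum_steps
    loss.backward()                               # gradients add up across micro-batches
accumulated_grad = model.weight.grad.clone()
# optimizer.step() would be called here, once per accumulated batch

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

This is the one situation where gradient accumulation across backward() calls is intentional, so zero_grad() is called once per accumulated batch rather than once per micro-batch.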
Best Practice: Debugging
- Don’t underestimate simple print() statements to check tensor shapes, dtypes, devices, and values at various points.
- Use Python’s standard debugger (pdb or IDE debuggers); PyTorch’s dynamic nature makes this very effective. Set breakpoints and inspect tensors.
Best Practice: Start Simple and Iterate
- When building a new model or trying a new technique, start with a very small version of your dataset and a simple model architecture to verify the code runs end-to-end without errors.
- Gradually increase complexity, checking results along the way.
Being aware of these common points can help you troubleshoot more effectively and build your PyTorch skills faster. Every developer encounters these issues, so persistence and careful debugging are key!
7.14 Conclusion: Bringing Concepts to Code
Congratulations! You’ve successfully navigated the core components of PyTorch, bridging the gap between the fundamental concepts of deep learning and their practical implementation in a powerful framework.
Let’s quickly recap the key PyTorch tools and techniques we’ve explored, seeing how they map back to the deep learning building blocks:
PyTorch Fundamentals: We learned what PyTorch is and why it’s useful, focusing on Tensors as the core data structure (representing our Data) and Autograd as the engine for automatic gradient calculation (powering Backpropagation for Optimization).
Data Handling Pipeline: We saw how Dataset, Transforms, and DataLoader work together to efficiently load, preprocess, augment, and batch our Data, preparing it for the model.
Model Definition: We explored how to define Models using nn.Module, common nn layers, and containers like nn.Sequential, translating conceptual architectures into code. We also saw how to leverage Pre-trained Models from torchvision.models for Transfer Learning.
Training Components: We learned how to instantiate Loss Functions (nn.CrossEntropyLoss, nn.MSELoss, etc.) from torch.nn to measure error, and how to use Optimizers (torch.optim) like Adam or SGD to update model parameters based on gradients.
The Workflow: We put everything together in the Training Loop, saw how to Evaluate model performance using metrics, and learned the practical necessity of Saving and Loading models. We also discussed common pitfalls and best practices to help smooth your development process.
Understanding these PyTorch components gives you the foundational toolkit needed to implement and experiment with a wide variety of neural networks. You’ve seen how the abstract concepts of data flow, error calculation, and gradient-based learning become concrete operations within this framework.
Next Steps: Hands-On Labs!
We’ve covered a lot of ground conceptually. The best way to solidify this knowledge is through practice! In the upcoming hands-on labs, you’ll apply everything we’ve discussed! Get ready to dive into the code and bring these powerful ideas to life!