Overview
Welcome to your first step into deep learning! This lesson will help you understand deep learning in a simple and friendly way. Think of it as your first journey into an exciting new world of artificial intelligence.
In this session, we’ll explore:
- Data and models: Learn why data is super important and how models work like smart brains to make sense of information.
- Loss functions and optimization algorithms: Discover how computers learn from their mistakes. We’ll explore how loss functions measure those mistakes and how optimization algorithms use them to improve the model.
By the end of this lesson, you’ll understand these key ideas. These building blocks will help you see how artificial intelligence works and how you can use these cool skills in real-life situations.
Introduction
Think of deep learning as a tool to help us achieve goals and solve problems, similar to how you drive a car to get to your destination. Just as you start driving by learning only the basics (without diving into all the complex mechanics), your journey into deep learning begins with understanding its essential components.
Key questions and building blocks of deep learning
To find the right function using deep learning, we can break down the process into four key questions and building blocks:
- What are the inputs and outputs? This relates to the data we use.
- What functions can we possibly use? This relates to the models that define how inputs connect to outputs.
- How do we evaluate the function? This is where loss functions help us.
- How do we find the best possible function? This is done through optimization algorithms.
These questions and building blocks create the core of deep learning. Two more important parts — training and inference — help connect these building blocks and make the models work in real situations.
Remember, the building blocks of deep learning are closely connected. They affect each other, and there are always trade-offs to consider.
Let’s begin our exploration into the building blocks of deep learning.
Data
Data is the starting point of deep learning, forming the inputs and outputs that define the function we want to learn.
Please see the AI-ready Data section for discussions on data for Arctic research.
Outputs
Defining outputs is just as important as preparing inputs. Outputs show how the model makes predictions and must match your project’s goals. Consider:
- What type of output do you need? (labels, numbers, detailed results)
- How should the outputs look? (probability list, single number, detailed information)
- Are there any special requirements for the outputs, e.g., a specific range?
Here are some key steps for preparing output data:
Choose the right type of output based on your specific problem.
- Use classification for tasks like sorting images or checking sentiment.
- Use regression to predict exact numbers, like sea ice concentration.
- Use structured outputs for complex tasks, like finding objects in an image.
Choose a format that works with your model and loss functions.
- Use one-hot encoding for category-based tasks.
- Normalize continuous outputs to match the scale of the model’s inputs.
- Consider using embeddings for structured outputs to capture relationships between categories.
Add labels to your data to provide a reference for training.
See the Data annotation section for more details.
Balance the detail of your outputs with the model’s complexity, available data, and computational resources. For example, predict sea ice concentration precisely (0-100%) or use broader categories like low (<15%) and high (>85%).
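As a rough illustration of the trade-off above, the sketch below prepares the same sea ice concentration values either as a normalized regression target or as a one-hot class target. The thresholds come from the text; the "medium" class for values in between, the function names, and the sample values are assumptions made for this example.

```python
def normalize_concentration(percent):
    """Scale a sea ice concentration given in percent (0-100) to the range 0-1."""
    return percent / 100.0


def to_category(percent):
    """Map a concentration to a coarse class index.

    The low (<15%) and high (>85%) thresholds follow the text; the middle
    class is a hypothetical addition so that every value gets a label.
    """
    if percent < 15:
        return 0  # low
    if percent > 85:
        return 2  # high
    return 1  # medium (assumed)


def one_hot(index, num_classes):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


samples = [5.0, 50.0, 92.5]  # made-up concentrations in percent
regression_targets = [normalize_concentration(p) for p in samples]
class_targets = [one_hot(to_category(p), 3) for p in samples]

print(regression_targets)
print(class_targets)
```

The regression targets keep the full detail of each value, while the class targets keep only which range the value falls in.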
In a classification problem, what’s the difference between one-hot encoding and label encoding?
Quantity and quality
Large datasets often improve performance, but they don’t guarantee success. Research shows that creating compute-optimal models means balancing data size, model complexity, and computational power.
For various model sizes, we choose the number of training tokens such that the final FLOPs is a constant. The cosine cycle length is set to match the target FLOP count. We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train
Quality is often more important than quantity. Common issues include:
- Incorrect or inconsistent labels
- Noise and irrelevant information
- Poorly filtered datasets
Research highlights the importance of data quality:
Our data pipeline (Section A.1.1) includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. We find that successive stages of this pipeline improve language model downstream performance (Section A.3.2), emphasising the importance of dataset quality.
- Hoffmann et al. (2022)
Nonetheless, large language models face several challenges, including their overwhelming computational requirements (the cost of training and inference increase with model size) (Rae et al., 2021; Thoppilan et al., 2022) and the need for acquiring more high-quality training data. In fact, in this work we find that larger, high quality datasets will play a key role in any further scaling of language models.
Models
Models are the foundation of deep learning. They work as function sets that transform inputs into outputs.
The exact function is created by first choosing a model architecture and then getting a specific set of parameters by training the model on data.
Layers
Deep learning models are built from layers. A layer works like a step that processes data and sends it to the next layer. Different layer types have different jobs. Here are some examples:
To get started, don’t get overwhelmed by all the different layer types. Just get a sense of their basic purposes and focus on using complete models for practical tasks. You can always learn more about individual layers later.
A fully-connected layer connects every input to every output through learnable weights, allowing the network to combine features and make predictions.
How it works:
- Each input connects to every output.
- Each connection has a weight (a number that can be adjusted).
- For each output:
- The layer multiplies each input by its connection weight.
- It adds all these multiplied values together.
Think of it like a voting system: each input “votes” for different outputs with different strengths (weights).
- Uses:
- Adjust the size of your data (dimensionality).
- Combine features to make decisions, e.g., identifying classes or predicting values.
- Advantages:
- They learn patterns across all features.
- They are simple to add to your network.
- Challenges:
- They need lots of computing power and memory due to the dense connections.
- They don’t understand spatial relationships well (unlike layers designed for images or sequences).
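The weighted-sum behaviour of a fully-connected layer described above can be sketched in a few lines of plain Python. The function name and the example numbers are made up for illustration, and the bias term that real layers usually add is left out to match the description.

```python
def dense_forward(inputs, weights):
    """Compute the outputs of a fully-connected layer without a bias term.

    `weights[j][i]` is the weight on the connection from input i to output j,
    so each row of `weights` produces one output value.
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


inputs = [1.0, 2.0, 3.0]          # three input features
weights = [
    [0.5, -1.0, 0.0],             # weights feeding output 0
    [1.0, 1.0, 1.0],              # weights feeding output 1
]
outputs = dense_forward(inputs, weights)
print(outputs)  # [-1.5, 6.0]
```

Three inputs and two outputs need 3 × 2 = 6 weights, which is why dense layers grow expensive as their inputs and outputs get larger.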
A convolutional layer applies filters to input data to detect patterns like edges, textures, and shapes, making it ideal for processing images and other spatial data.
How it works:
- Small filters (also called kernels) slide across the input data.
- Each filter looks at a small area at a time.
- For each position:
- The layer multiplies each input value by the corresponding filter value.
- It adds all these multiplied values together.
- Stride controls how many positions the kernel moves each step.
- Padding adds zeros around the input to control output dimensions.
- Output size = floor((Input size + 2 × Padding - Kernel size) / Stride) + 1
Think of it like a spotlight that moves across an image, highlighting specific patterns whenever they appear.
- Uses:
- Extract features from spatial data (like images).
- Detect patterns regardless of where they appear in the input.
- Reduce the data size while keeping important information.
- Advantages:
- They need fewer parameters than fully-connected layers.
- They preserve spatial relationships in the data.
- They can find the same pattern anywhere in the input.
- Challenges:
- They may miss global patterns that span the entire input.
- Setting the right filter size and number requires careful design.
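To make the sliding-filter idea concrete, here is a minimal one-dimensional sketch with stride and zero padding. It is a simplification (real convolutional layers work on multi-channel 2D data and learn their filters); the function names, the signal, and the hand-written edge filter are assumptions for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length of a convolution, rounding down when the stride does not fit evenly."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv1d(signal, kernel, stride=1, padding=0):
    """Slide `kernel` over `signal` and return the weighted sum at each position.

    Like most deep learning libraries, this does not flip the kernel.
    """
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    out_size = conv_output_size(len(signal), len(kernel), stride, padding)
    output = []
    for step in range(out_size):
        start = step * stride
        window = padded[start:start + len(kernel)]
        output.append(sum(v * k for v, k in zip(window, kernel)))
    return output


signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
edge_kernel = [-1.0, 1.0]  # responds where the signal rises or falls

print(conv1d(signal, edge_kernel))                       # [0.0, 1.0, 0.0, 0.0, -1.0]
print(conv1d(signal, edge_kernel, stride=2, padding=1))  # [0.0, 1.0, 0.0, 0.0]
```

The same two-value filter is reused at every position, which is why a convolutional layer needs far fewer parameters than a fully-connected one.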
A pooling layer reduces the size of data by keeping only the most important information, making processing faster and helping the network focus on key features.
How it works:
- The layer divides input data into small regions.
- For each region, it keeps only one value (e.g., the maximum or average value).
- This creates a smaller output with fewer details.
Think of it like summarizing a detailed picture by keeping only the brightest point in each region.
- Uses:
- Reduce data size to save memory and computation.
- Make the network less sensitive to small input changes.
- Focus on the most important features.
- Advantages:
- They significantly reduce data size.
- They make the network more resistant to small input changes.
- They help extract key features regardless of exact position.
- Challenges:
- They permanently lose some information.
- They might discard details that are important for the task.
Types of pooling layers:
- Max pooling: Keeps the maximum value in a region.
- Average pooling: Calculates the average value in a region.
- Global pooling: Summarizes the entire feature map (typically by averaging or taking the maximum), reducing each feature map to a single value.
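The pooling variants can be sketched on a one-dimensional feature map as follows. This is an illustrative simplification with non-overlapping regions; the function names and the values are made up.

```python
def pool1d(values, size, mode="max"):
    """Split `values` into non-overlapping regions of `size` and keep one value per region."""
    regions = [values[i:i + size] for i in range(0, len(values) - size + 1, size)]
    if mode == "max":
        return [max(region) for region in regions]
    if mode == "average":
        return [sum(region) / len(region) for region in regions]
    raise ValueError("mode must be 'max' or 'average'")


def global_average_pool(values):
    """Reduce a whole feature map to a single average value."""
    return sum(values) / len(values)


feature_map = [1.0, 3.0, 2.0, 8.0, 4.0, 4.0]

print(pool1d(feature_map, 2, mode="max"))      # [3.0, 8.0, 4.0]
print(pool1d(feature_map, 2, mode="average"))  # [2.0, 5.0, 4.0]
print(global_average_pool(feature_map))
```

Both pooled outputs are half the length of the input, and the original values cannot be recovered from them, which is the information loss mentioned under challenges.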
An activation layer adds non-linearity to the network, allowing it to learn complex patterns that go beyond simple calculations.
How it works:
- The layer takes each input value individually.
- It applies a mathematical function (like ReLU, sigmoid, or tanh).
- This transforms values in a non-linear way.
Think of it like adding decision points in the network: “If the value is below 0, ignore it. If above, keep it” (for ReLU).
- Uses:
- Enable the network to learn complex, non-linear relationships.
- Control the range of output values.
- Advantages:
- They allow networks to learn complicated patterns.
- They control how information flows through the network.
- Different activations work well for different problems.
- Challenges:
- Some activations can cause training problems (like vanishing gradients).
- Choosing the right activation requires understanding the problem.
Some examples:
- ReLU: \(f(x) = \max(0, x)\).
- Sigmoid: \(f(x) = \frac{1}{1 + e^{-x}}\).
- Tanh: \(f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}\).
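The three formulas above translate directly into code. This is a plain transcription for illustration; in practice a library implementation such as `math.tanh` is preferable, because the literal tanh formula overflows for large inputs.

```python
import math


def relu(x):
    """ReLU: pass positive values through, replace negative values with 0."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Tanh written exactly as in the formula; overflows for very large x."""
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), round(sigmoid(value), 3), round(tanh(value), 3))
```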
A recurrent layer processes sequences by maintaining a memory of previous inputs, making it suitable for text, speech, and time-series data.
How it works:
- It maintains an internal state (memory) between processing steps.
- For each item in a sequence:
- It combines the current input with its internal state.
- It updates its state based on this combination.
- This allows information to persist across the sequence.
Think of it like reading a book while keeping track of the story so far, using previous context to understand each new sentence.
- Uses:
- Process sequential data like text or time series.
- Remember information from earlier in a sequence.
- Generate sequential outputs based on context.
- Advantages:
- They can capture dependencies across sequence elements.
- They can process sequences of variable length.
- Variants like LSTM and GRU can remember information for long periods.
- Challenges:
- They can be slow to train due to sequential processing.
- They may suffer from vanishing or exploding gradients.
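A minimal sketch of the recurrent idea, using a single number as the internal state. The tanh update is one common choice rather than the only one, and the weights `w_in` and `w_state` are arbitrary values picked for illustration instead of learned parameters.

```python
import math


def rnn_step(x, state, w_in, w_state):
    """Combine the current input with the previous state to produce the new state."""
    return math.tanh(w_in * x + w_state * state)


def run_rnn(sequence, w_in=0.5, w_state=0.8, initial_state=0.0):
    """Process a sequence one item at a time, carrying the state forward."""
    state = initial_state
    states = []
    for x in sequence:
        state = rnn_step(x, state, w_in, w_state)
        states.append(state)
    return states


# Only the first item is non-zero, yet its effect persists in later states.
states = run_rnn([1.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

Because each step depends on the previous state, the loop cannot be parallelized across the sequence, which is the source of the slow training noted above.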
The key components of an LSTM cell are:
- Three gates (shown as X symbols in circles) from left to right:
- Forget gate: This decides what information to throw away or keep from memory.
- Input gate: This decides what new information to add to memory.
- Output gate: This decides what information to share with the next cell.
- Inputs:
- Current input \(x_t\)
- Previous hidden state \(h_{t-1}\)
- Previous cell state \(C_{t-1}\)
- Outputs:
- Current hidden state \(h_t\)
- Current cell state \(C_t\)
- The blue box represents the sigmoid function, which outputs a value between 0 and 1. It controls how much information passes through each gate, like a filter that can be partially open or closed.
- The purple box represents the tanh function, which outputs a value between -1 and 1. It scales the input values.
The LSTM cell works as follows:
- \(h_{t-1}\) and \(x_t\) are combined together and passed through sigmoid functions as gate control signals.
- The forget gate determines how much of the previous cell state \(C_{t-1}\) (the previous memory) is passed to the next cell. Think of this like deciding which old memories to keep or discard.
- The input gate determines how much of the current input \(x_t\) is added to the cell state. This is like deciding which new information is worth remembering.
- The output gate determines how much of the current cell state \(C_t\) is passed to the next hidden state \(h_t\). This is like deciding which parts of your memory to actively think about right now.
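The steps above can be sketched for a cell whose states are single numbers. This follows the standard LSTM equations, which is an assumption about the exact variant shown in the figure; the `params` layout and all weight values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar states.

    `params` maps each of "forget", "input", "output", and "candidate" to a
    tuple (w_x, w_h, b) applied to the current input and previous hidden state.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("forget"))          # how much old memory to keep
    i_t = sigmoid(pre_activation("input"))           # how much new information to add
    o_t = sigmoid(pre_activation("output"))          # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))  # proposed new information

    c_t = f_t * c_prev + i_t * candidate  # updated cell state
    h_t = o_t * math.tanh(c_t)            # updated hidden state
    return h_t, c_t


# Arbitrary illustrative weights, not trained values.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 3), round(c, 3))
```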
An attention layer helps a network focus on relevant parts of the input data, similar to how humans concentrate on important details rather than everything at once.
How it works:
- It calculates how important each input element is for the current task.
- It assigns attention weights to each element based on this importance.
- It creates a weighted combination of the inputs according to these weights.
Think of it like reading with a highlighter: marking and focusing on key phrases rather than every word equally.
- Uses:
- Find relationships between different parts of the input.
- Focus on relevant information for a specific task.
- Handle long-range dependencies in sequences.
- Advantages:
- They significantly improve performance on complex tasks.
- They create interpretable weightings that show what the network focuses on.
- They enable processing of very long sequences effectively.
- Challenges:
- They can be computationally expensive, especially for long sequences.
- Self-attention specifically scales quadratically with sequence length.
- Designing the right attention mechanism requires careful consideration.
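One widely used way to compute the importance scores is scaled dot-product attention, sketched below for a single query. The text does not commit to a specific mechanism, so treat this as one possible instance; the vectors are made-up values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return the attention output and weights for one query vector."""
    scale = math.sqrt(len(query))
    # Importance of each element: similarity between the query and its key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted combination of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

output, weights = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

In self-attention every element acts as a query against every other element, so the number of scores grows with the square of the sequence length.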
The attention mechanism was originally proposed for natural language processing tasks. The following visualization shows how it works for images. To see how it works for text, see How LLMs work and Attention in transformers.
Models
Deep learning models are architectures composed of layers. Each model architecture has unique characteristics and is suited for particular tasks. Here are some examples:
Best for image-related tasks.
- Convolutional Neural Networks (CNNs) are designed to process and analyze images. They are characterized by:
- Convolutional Layers: Detecting patterns like edges, textures, and shapes.
- Pooling Layers: Reducing spatial dimensions to prevent overfitting.
- Fully-Connected Layers: Combining features for predictions.
- Applications of CNNs:
- Image classification, object detection, and segmentation.
- Medical imaging analysis.
- Remote sensing and satellite image processing.
Designed for sequential data like time series or text.
- Long Short-Term Memory Networks (LSTMs) are recurrent neural networks that maintain a hidden state to capture temporal dependencies. They are characterized by:
- Memory Cells: Capturing long-term dependencies in sequential data.
- Gates: Regulating the flow of information to prevent vanishing gradients.
- Hidden State: Maintaining a memory of past inputs to inform future predictions.
- Applications of LSTMs:
- Time series forecasting.
- Natural language processing tasks like language modeling and machine translation.
- Speech recognition and synthesis.
The backbone of modern natural language processing and vision models.
- Transformers are models that process sequences of data using self-attention mechanisms. They are characterized by:
- Self-Attention: Computing relationships between elements in the data.
- Multi-Head Attention: Capturing different types of relationships in the data.
- Positional Encoding: Incorporating positional information into the model.
- Applications of Transformers:
- Natural language processing tasks like machine translation, text generation, and sentiment analysis.
- Image analysis and computer vision tasks like object detection and image captioning.
Designed for graph-structured data like social networks, molecular structures, and knowledge graphs.
- Graph Neural Networks (GNNs) are specialized models for processing graph-structured data. They are characterized by:
- Graph Convolutional Layers: Propagating information between nodes in the graph.
- Node Embeddings: Learning representations for nodes in the graph.
- Graph Pooling: Aggregating information from subgraphs.
- Applications of GNNs:
- Social network analysis and link prediction.
- Drug discovery and molecular property prediction.
- Knowledge graph completion and recommendation systems.
Used for unsupervised learning and dimensionality reduction.
- Autoencoders are neural networks that learn to encode and decode data, enabling tasks like:
- Dimensionality Reduction: Learning compact representations of data.
- Anomaly Detection: Identifying outliers or unusual patterns in the data.
- Generative Modeling: Generating new data samples similar to the input.
- Variants of autoencoders include:
- Variational Autoencoders (VAEs): Learn probabilistic encodings for generative modeling.
- Denoising Autoencoders: Train on noisy data to learn robust representations.
- Sparse Autoencoders: Encourage sparsity in the learned representations.
- Applications of autoencoders:
- Image denoising and reconstruction.
- Anomaly detection in cybersecurity and fraud detection.
- Generative modeling for data augmentation and synthesis.
Pre-trained models and transfer learning
In deep learning, building models from scratch can be both time-consuming and resource-intensive. Fortunately, pre-trained models and transfer learning offer a practical solution to these challenges. They enable scientists to leverage existing models and achieve better performance with minimal effort.
Pre-trained models are deep learning models that have been previously trained on extensive datasets. These models can serve as a solid foundation for solving similar tasks in different domains. By utilizing the knowledge captured in pre-trained models, you can achieve faster training times and often better performance.
Transfer learning is a technique where a model developed for a particular task is reused as the starting point for a model on a second task. This approach is particularly beneficial when the second task has limited data. Instead of training a new model from scratch, you can adapt an existing model that has already learned useful features from a large dataset.
- Faster Training: Pre-trained models provide a head start by leveraging knowledge from previous tasks, reducing the time and resources needed for training.
- Improved Performance: Transfer learning allows you to benefit from the generalization capabilities of pre-trained models, often leading to better performance on new tasks.
- Domain Adaptation: Pre-trained models can be fine-tuned on domain-specific data to adapt to new environments or tasks.
- Select a Pre-trained Model: Choose a pre-trained model that is well-suited for your task. It depends on the nature of the data and the target task.
- Customize the Model: Adapt the pre-trained model to your specific task. The customization may occur at various stages:
- Input adaptation: Adjust the model to handle different types of input data. This might involve changing the input layer to accommodate data with more channels, such as multispectral images, or adapting it for temporal data like time series.
- Output adaptation: Modify the output layers to match your task requirements. This could mean changing the number of output classes for classification tasks. You can also use the pre-trained model as a backbone and build additional task-specific modules on top of it, such as detection heads for object detection tasks or segmentation heads for image segmentation tasks.
- Fine-tune the Model: Train the adapted model on your dataset. You can choose to freeze some of the earlier layers to preserve the learned features, while tuning the later layers to adapt to your task.
- Image Classification: Use pre-trained models like ResNet or Swin Transformer for classifying images into different categories.
- Object Detection: Utilize pre-trained models like Faster R-CNN, YOLO, or RetinaNet for object detection tasks.
- Natural Language Processing: Apply pre-trained models like BERT, GPT, or RoBERTa for text classification, sentiment analysis, or question answering.
Model customization
In deep learning, models can be thought of as consisting of three core components: input adaptation, a feature extractor, and output adaptation. Understanding and customizing these components is crucial for effectively applying pre-trained models to new tasks.
The feature extractor is the heart of the model, transforming data into informative representations that highlight essential patterns relevant to the task. Pre-trained models often excel in this role, as they have already learned rich feature sets from large datasets. By using a pre-trained model as a feature extractor, you can leverage existing knowledge and focus on adapting it to your specific needs.
Input adaptation involves transforming your data into a format that the feature extractor can process. This might mean:
- Adjusting the input layer to accommodate different data types, such as adding channels for multispectral images or handling temporal sequences for time series data.
- Preprocessing data to match the scale or format expected by the pre-trained model.
Output adaptation transforms the extracted features into usable outputs for your specific task. This often involves:
- Modifying the output layer to match the number of classes in your classification task.
- Adding specialized layers, such as segmentation heads for image segmentation tasks, or regression layers for predicting continuous values.
- Task Similarity: The extent of adaptation needed depends on how closely the pre-trained model’s original task aligns with your target task. More divergent tasks may require extensive customization and additional data for fine-tuning. Therefore, selecting a pre-trained model that closely resembles your task can simplify the adaptation process.
Loss functions
A loss function quantifies the difference between the predicted outputs and the actual target values, providing essential feedback for optimization.
For example, consider a classification task using a softmax output layer:
- Predicted Output:
[0.6, 0.2, 0.2]
- Target Output:
[1, 0, 0]
Using the Cross-Entropy loss function, the loss value is calculated as:
\[
\text{Cross-Entropy Loss} = -\sum_{i} y_i \log(p_i) = -\log(0.6) \approx 0.51
\]
where \(y_i\) is the target output and \(p_i\) is the predicted output.
Mean Absolute Error (MAE) can also be used to evaluate the prediction:
\[
\text{MAE} = \frac{1}{n} \sum_{i} |y_i - p_i| = \frac{|1 - 0.6| + |0 - 0.2| + |0 - 0.2|}{3} = \frac{0.4 + 0.2 + 0.2}{3} \approx 0.27
\]
In the example above, both Cross-Entropy and MAE can evaluate the prediction’s accuracy. Consider these questions:
- How do the values of the two loss functions change when predictions are closer to or further from the target?
- What is the impact of each loss function on the model training process?
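To experiment with the first question, the two losses from the worked example can be computed in plain Python using the natural logarithm. The function names and the two extra predictions are added here for illustration.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(target, predicted) if y > 0)


def mean_absolute_error(target, predicted):
    """Average absolute difference between target and prediction."""
    return sum(abs(y - p) for y, p in zip(target, predicted)) / len(target)


target = [1, 0, 0]
predictions = [
    [0.6, 0.2, 0.2],    # the example from the text
    [0.9, 0.05, 0.05],  # closer to the target
    [0.1, 0.45, 0.45],  # further from the target
]

for predicted in predictions:
    ce = cross_entropy(target, predicted)
    mae = mean_absolute_error(target, predicted)
    print(predicted, round(ce, 2), round(mae, 2))
```

The first row reproduces the values 0.51 and 0.27 from the formulas; comparing the other rows shows how differently the two losses react as the prediction moves away from the target.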
Types of Loss Functions
Selecting the right loss function is essential for optimizing model performance across various tasks. Here are some common types of loss functions:
- Regression: Measure error for continuous outputs.
- Mean Squared Error (MSE): Computes the average squared difference between predictions and targets.
- Mean Absolute Error (MAE): Calculates the average absolute difference between predictions and targets.
- Huber Loss: Combines MSE and MAE, less sensitive to outliers than MSE.
- Classification: Evaluate probability distributions.
- Cross-Entropy Loss: Measures the difference between predicted and target distributions, used with softmax outputs.
- Binary Cross-Entropy: Specifically for binary classification tasks.
- Hinge Loss: Used for “maximum-margin” classification, mainly with support vector machines.
- Sequence Prediction: Handle variable-length outputs.
- Connectionist Temporal Classification (CTC): Aligns input and output sequences, used in tasks like speech recognition.
- Sequence-to-Sequence Loss: Typically cross-entropy computed at each output position and averaged over the sequence, used with encoder-decoder models (often with attention).
- Imbalanced Data:
- Focal Loss: Mitigates class imbalance by focusing on hard-to-classify examples.
- Weighted Cross-Entropy: Assigns different weights to classes to balance their impact.
- Multi-Objective Tasks:
- Multi-Task Loss: Combines multiple loss functions with weighting factors to optimize for several objectives simultaneously.
- Robustness to Outliers:
- Log-Cosh Loss: Uses the logarithm of the hyperbolic cosine of prediction errors. It behaves like MSE for small errors but is less sensitive to outliers.
- Image Processing:
- Dice Loss: Used for image segmentation tasks to measure overlap between predicted and target areas.
- IoU Loss (Intersection over Union): Measures the overlap between predicted and actual bounding boxes, often used in object detection.
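As a small illustration of the regression losses listed above, the sketch below compares MSE, MAE, and Huber loss on data with one outlier. The Huber threshold `delta=1.0` and the sample values are assumptions chosen for the example.

```python
def mse(targets, predictions):
    """Mean squared error: large errors are penalized quadratically."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)


def mae(targets, predictions):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(y - p) for y, p in zip(targets, predictions)) / len(targets)


def huber(targets, predictions, delta=1.0):
    """Huber loss: quadratic for errors up to `delta`, linear beyond it."""
    total = 0.0
    for y, p in zip(targets, predictions):
        error = abs(y - p)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(targets)


targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 13.0]  # the last prediction is off by 10

print(round(mse(targets, predictions), 2))
print(round(mae(targets, predictions), 2))
print(round(huber(targets, predictions), 2))
```

The single outlier dominates MSE, while MAE and Huber loss grow only linearly with it.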
Training and validation loss
Training and validation loss are metrics used to evaluate and fine-tune the performance of machine learning models. They provide insights into how well a model is learning and can indicate potential issues like overfitting or underfitting.
Training Loss: This is the error calculated on the training dataset after each iteration. It reflects how well the model is learning the training data.
Validation Loss: This is the error calculated on a separate validation dataset that the model has not seen during training. It provides an indication of how well the model generalizes to unseen data.
The relationship between training and validation loss can reveal important information about the model’s performance:
- Both losses decrease: If both training and validation losses decrease and stabilize at a low value, it suggests that the model is learning well and generalizing effectively to the validation set.
- Training loss decreases, validation loss increases: This pattern indicates overfitting. The model is learning the training data too well, including its noise, and is not generalizing effectively to new data. Regularization techniques or a simpler model might be needed.
- Both losses are high: If both losses remain high, it may indicate underfitting. The model is not complex enough to capture the underlying patterns in the data. Consider increasing model capacity or improving feature engineering.
- Training loss stable, validation loss fluctuates: Fluctuating validation loss with stable training loss may suggest that the model is sensitive to the specific validation data. This could be due to a small validation set size or data noise.
To address common issues related to training and validation loss, consider the following strategies:
- Regularization: Techniques like L1/L2 regularization, dropout, and early stopping can help mitigate overfitting.
- Data Augmentation: Increasing the diversity of the training data can improve model generalization.
- Cross-Validation: Using k-fold cross-validation provides a more reliable estimate of model performance on unseen data.
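Early stopping, mentioned above, can be sketched as a check on the recorded validation losses: stop once the loss has failed to improve for a set number of epochs. The function name, the `patience` value, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop, or None if it never stops.

    Training stops after `patience` consecutive epochs without a new best
    validation loss.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation loss improves, then starts rising: a typical overfitting pattern.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
print(early_stopping_epoch(val_losses, patience=2))  # 4
```

In practice the parameters from the best epoch (index 2 here) are usually the ones kept.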
Optimization Algorithms
Optimization algorithms adjust model parameters to minimize the loss function, guiding the model towards better performance.
Introduction to gradient descent
Gradient Descent is the foundational algorithm used in deep learning for optimization:
- Objective: The aim is to find the minimum of a function by iteratively adjusting parameters.
- Gradient: Represents the direction of the steepest ascent. In optimization, we move in the opposite direction to find the minimum.
- Learning rate: A crucial hyperparameter that controls the size of the steps taken towards the minimum.
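A minimal sketch of these three ideas on a one-parameter function. The example function f(x) = (x - 3)^2, its hand-derived gradient, and the default settings are assumptions chosen so the result is easy to check.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has its minimum at x = 3 and gradient 2 * (x - 3).
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
print(round(minimum, 4))  # 3.0
```

With a learning rate above 1.0 the same loop overshoots further on every step and diverges for this function, which is why the learning rate needs care.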
Variants of gradient descent
To enhance the efficiency and performance of gradient descent, several variants have been developed:
Stochastic Gradient Descent (SGD): Updates parameters using a single training example per iteration, leading to faster but noisier convergence.
Mini-batch Gradient Descent: Strikes a balance between batch and stochastic gradient descent by updating parameters using a small subset (mini-batch) of the training data, improving convergence stability and speed.
Momentum: Accelerates convergence by considering past gradients, helping the algorithm navigate through ravines and avoid oscillations.
Adam combines the benefits of momentum and RMSprop, adjusting learning rates for each parameter based on historical gradients, making it one of the most popular optimization methods.
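The Adam update can be sketched for a single parameter as follows. The hyperparameter defaults are commonly used values (apart from the larger learning rate chosen for this toy problem), and the test function f(x) = (x - 3)^2 is an assumption for illustration.

```python
import math


def adam_minimize(gradient, start, learning_rate=0.1, beta1=0.9,
                  beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    x = start
    m = 0.0  # moving average of gradients (the momentum part)
    v = 0.0  # moving average of squared gradients (the RMSprop part)
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
result = adam_minimize(lambda x: 2.0 * (x - 3.0), start=0.0)
print(round(result, 3))
```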
Key Hyperparameters
Optimization algorithms rely on hyperparameters that need to be carefully tuned for optimal performance:
The learning rate determines how quickly or slowly the model learns. It needs to be carefully selected to balance convergence speed and stability.
The batch size refers to the number of training samples used in one iteration. Smaller batch sizes can lead to faster convergence but noisier updates, while larger batch sizes provide smoother updates but require more computational resources.
The momentum rate determines the influence of past gradients on the current update, helping to smooth the optimization path.
The regularization strength (for example, a weight decay factor) penalizes complex models to prevent overfitting, favoring simpler and more generalizable solutions.
Learning rate scheduling
Adjusting the learning rate over time can impact model performance:
Constant: Maintains a constant learning rate throughout training, simplifying the optimization process.
Step decay: Reduces the learning rate at regular intervals, allowing the model to refine its parameters as it approaches convergence.
Exponential decay: Gradually decreases the learning rate exponentially, enabling fine-tuning of the model as training progresses.
Cyclical: Varies the learning rate cyclically, encouraging exploration of different regions of the loss landscape for potentially better minima.
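The four scheduling strategies above can each be written as a function of the epoch number. The function names, decay factors, and cycle length below are illustrative assumptions, and the cyclical schedule uses a simple triangular shape as one possible form.

```python
def constant_lr(base_lr, epoch):
    """Same learning rate at every epoch."""
    return base_lr


def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)


def exponential_decay(base_lr, epoch, gamma=0.95):
    """Multiply the learning rate by `gamma` after every epoch."""
    return base_lr * gamma ** epoch


def cyclical_lr(min_lr, max_lr, epoch, half_cycle=5):
    """Rise linearly from `min_lr` to `max_lr` and back, repeating every 2 * half_cycle epochs."""
    position = epoch % (2 * half_cycle)
    distance = position if position <= half_cycle else 2 * half_cycle - position
    return min_lr + (max_lr - min_lr) * distance / half_cycle


for epoch in (0, 5, 10, 20):
    print(
        epoch,
        constant_lr(0.1, epoch),
        round(step_decay(0.1, epoch), 4),
        round(exponential_decay(0.1, epoch), 4),
        round(cyclical_lr(0.01, 0.1, epoch), 4),
    )
```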
Adaptive learning rates
These methods automatically adjust the learning rate during training:
Adam: Adaptive moment estimation that combines momentum and RMSprop, providing an efficient and effective optimization approach.
RMSprop: Divides the learning rate by the square root of a moving average of the squared gradients, adapting the learning rate for each parameter dynamically.
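A one-parameter sketch of the RMSprop-style update, in which each step is scaled by the square root of a moving average of squared gradients. The decay factor, learning rate, and test function are assumptions for illustration.

```python
import math


def rmsprop_minimize(gradient, start, learning_rate=0.01, decay=0.9,
                     eps=1e-8, steps=1000):
    """Minimize a one-parameter function with an RMSprop-style update."""
    x = start
    avg_sq = 0.0  # moving average of squared gradients
    for _ in range(steps):
        g = gradient(x)
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g
        # Large recent gradients shrink the effective step; small ones enlarge it.
        x -= learning_rate * g / (math.sqrt(avg_sq) + eps)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
result = rmsprop_minimize(lambda x: 2.0 * (x - 3.0), start=0.0)
print(round(result, 2))
```

Because the step size stays close to the learning rate even when the gradient is tiny, the result hovers near the minimum instead of settling exactly on it; a decaying learning rate is a common remedy.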
Training and Inference
Training and inference are the key processes that integrate the essential components for deep learning applications: data, models, loss functions, and optimization algorithms.
Training
Training is the phase where the model learns from the data by optimizing its parameters to minimize the loss function.
- Data preparation: Gather and preprocess data into a suitable format for the model.
- Forward pass: The model processes the input data to generate predictions.
- Loss calculation: The predictions are compared against the target outputs using a loss function to quantify the error.
- Backward pass: Compute gradients of the loss with respect to the model parameters using backpropagation.
- Parameter update: Utilize optimization algorithms (e.g., Gradient Descent, Adam) to update the model’s weights based on the computed gradients, iteratively improving the model’s performance.
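The training steps above can be traced in a tiny example that fits a line to four points. The model (y = w * x + b), the hand-derived MSE gradients, and the data are assumptions chosen to keep the sketch self-contained; deep learning frameworks compute the gradients automatically through backpropagation.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss = float("inf")
    for _ in range(epochs):
        # Forward pass: generate predictions with the current parameters.
        predictions = [w * x + b for x in xs]
        # Loss calculation: mean squared error against the targets.
        errors = [p - y for p, y in zip(predictions, ys)]
        loss = sum(e * e for e in errors) / n
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Parameter update: one gradient descent step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Data preparation: points generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, final_loss = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```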
Inference
Inference is the phase where the trained model is used to make predictions on new, unseen data.
- Data preparation: Prepare new data in the same way as the training data for consistency.
- Forward pass: The model processes the input data to generate predictions, leveraging the learned parameters.
- Output generation: Convert raw model outputs into interpretable results, such as class labels or continuous values.
- Post-processing: Apply additional processing steps like thresholding for binary classification or filtering to refine results.
- Result interpretation: Analyze the model’s outputs to make informed decisions, often integrating domain-specific knowledge or business logic.
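The output generation and post-processing steps can be sketched as follows: raw scores are converted to probabilities, then to a class label or a thresholded yes/no decision. The class names, raw scores, and the 0.5 threshold are hypothetical values for illustration.

```python
import math


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Output generation: return the most probable class and its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]


def predict_binary(probability, threshold=0.5):
    """Post-processing: turn a probability into a yes/no decision."""
    return probability >= threshold


class_names = ["ice", "water", "land"]  # hypothetical labels
label, confidence = predict_class([2.0, 0.5, -1.0], class_names)
print(label, round(confidence, 3))
print(predict_binary(0.72), predict_binary(0.31))
```

Raising the threshold makes positive predictions rarer but more reliable, a choice that typically depends on domain knowledge about the cost of each kind of error.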