Auto Deep Learning
Basic Training Process
Suppose the goal is to understand the types of cars present in your city by conducting a survey that gathers information such as each car's brand and certain characteristics (e.g., color, type, size). This can be addressed by building two types of deep learning models:
- Detection model: To detect the presence and location of cars in the images/videos.
- Classification model: To classify the brand and other characteristics of each detected car.
My work has primarily involved video datasets, with a focus on building detection and classification models. However, this general deep learning development pipeline can be applied across various systems beyond just videos.
The typical process involves the following stages:
1. Data Collection
- Gather raw video footage from different sources across the city.
- Since the raw video contains a mixture of useful and irrelevant footage, it is essential to carefully select relevant frames.
- This often requires manual effort: cherry-picking frames where cars are clearly visible and ignoring noisy or irrelevant data.
2. Annotation
- After selecting frames, the next step is manual annotation:
- Drawing bounding boxes around cars (for detection).
- Labeling brand names and characteristics (for classification).
- This step is time-consuming and crucial because model performance heavily depends on the quality of the annotations.
3. Model Training
- Train deep learning models on the annotated dataset.
- The training process usually involves:
- Tuning hyperparameters,
- Performing multiple iterations of training and evaluation,
- Conducting stress testing under various conditions (e.g., different lighting, angles, occlusions) to ensure robustness.
Data Cleaning and Annotation in Self-Driving Car Datasets
In the context of self-driving cars, not all data collected is relevant for training deep learning models. It is essential to clean the data and discard irrelevant or redundant information. Below, we outline some common scenarios and techniques for data cleaning and annotation.
1. Data Cleaning
In a self-driving car dataset, certain situations may involve irrelevant data that should be discarded. Some examples include:
Example Scenarios
- Standing at Traffic Signals: If the car is stationary at a traffic light for an extended period, the frames captured during this time may not contain significant changes, rendering them of little use for model training.
- Driving on Empty Streets: Similarly, when the car is driving on an empty street, there may be little to no interaction with other vehicles or pedestrians. These frames provide minimal information and should be discarded.
Techniques for Data Cleaning
Several computer vision techniques can assist in cleaning the dataset by identifying and removing redundant or irrelevant frames:
- Optical Flow: Optical flow tracks the motion between consecutive frames. If there is little to no motion between frames, they contain minimal change and can be discarded. This helps retain only the "important" frames that capture meaningful visual changes (a minimal filtering sketch follows this list).
- Embedding Space Analysis:
- Image Embeddings: Models like ResNet or CLIP can be used to create embeddings for each frame. By comparing the embeddings of consecutive frames, we can identify near-duplicates; if two frames are too similar in the embedding space, one of them can be discarded.
- Text Embeddings via VQA Models: Visual Question Answering (VQA) models can extract textual descriptions from images. A semantic comparison of the resulting text embeddings then reveals frames with repetitive or irrelevant content, based on the questions and answers extracted from them.
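As a rough illustration of the optical-flow idea, here is a minimal sketch using OpenCV's Farneback dense optical flow to keep only frames that show enough motion relative to the previously kept frame; the motion threshold, stride, and file handling are illustrative assumptions rather than tuned values.

```python
import cv2
import numpy as np

def select_informative_frames(video_path, motion_threshold=0.5, stride=1):
    """Keep only frames whose mean optical-flow magnitude, relative to the
    previously kept frame, exceeds a threshold (threshold is illustrative)."""
    cap = cv2.VideoCapture(video_path)
    kept_indices, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is None:
                kept_indices.append(idx)  # always keep the first frame
                prev_gray = gray
            else:
                # Dense optical flow between the last kept frame and this one.
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                motion = np.linalg.norm(flow, axis=2).mean()
                if motion > motion_threshold:  # enough motion -> keep the frame
                    kept_indices.append(idx)
                    prev_gray = gray
        idx += 1
    cap.release()
    return kept_indices
```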
2. Annotation
Annotation is a critical step in training deep learning models for computer vision tasks, especially when working with image or video datasets. Accurate and consistent annotations are essential for model performance.
Techniques for Annotation
To automate or assist in the annotation process, we can use several advanced models and tools:
- Grounding Models for Annotation:
- GDINO (Grounding DINO) and GSAM (Grounded SAM) are grounding models that can automatically annotate images by associating objects or regions of interest within the image with textual descriptions or labels (see the sketch after this list).
- VQA Models for Confirmation:
- Visual Question Answering (VQA) models can be leveraged to confirm or enhance annotations. By asking the model specific questions related to the content of an image (e.g., “What objects are present?” or “Is there a car in the image?”), we can confirm or refine the accuracy of manual annotations.
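As an illustration of prompt-driven auto-annotation, here is a minimal sketch assuming the Hugging Face transformers zero-shot object detection pipeline; an OWL-ViT checkpoint is used here as a stand-in for grounding models such as Grounding DINO (which ship with their own APIs), and the prompts, frame path, and score threshold are illustrative.

```python
from transformers import pipeline
from PIL import Image

# Open-vocabulary detector; OWL-ViT is an illustrative stand-in for grounding
# models such as Grounding DINO (assumption, not the only possible choice).
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("frame_000123.jpg")                    # hypothetical frame path
candidate_labels = ["car", "pedestrian", "traffic sign"]  # text prompts to ground

detections = detector(image, candidate_labels=candidate_labels)
for det in detections:
    if det["score"] >= 0.3:  # illustrative confidence threshold
        print(det["label"], round(det["score"], 3), det["box"])  # box: xmin/ymin/xmax/ymax
```

The resulting boxes can then be stored as draft annotations and reviewed, or cross-checked with a VQA model as described above.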
1. Data Pruning
In the realm of self-driving cars, training deep learning models on massive datasets of images and videos can be highly resource-intensive. Not only does it require significant computational power, but it also consumes large amounts of time and money. This raises an important question: Can we reduce the number of training images without compromising model performance? The answer lies in data pruning.
What is Data Pruning?
Data pruning involves selecting a subset of the dataset that is both representative and diverse while discarding irrelevant or redundant images. The goal is to retain a dataset that captures all the key variations and nuances of the real-world environment without the unnecessary bulk. This helps optimize the training process by reducing the computational cost, speeding up training, and, in some cases, preventing overfitting.
How Data Pruning Works
The most common approach to pruning data is by utilizing embedding space techniques. Here’s how this can be applied in practice:
- Embedding Generation: Embeddings of images can be generated using VQA (Visual Question Answering) models or standard vision models such as ResNet or CLIP, mapping each image or frame into a high-dimensional space. These embeddings capture the essential features of each image, enabling a comparison of how similar or diverse the images are with respect to each other.
- Cluster Sampling: Once the embeddings are generated, the next step is to cluster them. Clustering algorithms (e.g., k-means, DBSCAN) group similar images together. By analyzing these clusters, we can select samples from the most representative clusters, ensuring diversity while minimizing redundancy (a clustering sketch follows this list). This is crucial because self-driving cars encounter a wide variety of scenarios, from driving on urban streets and navigating intersections to detecting pedestrians or cyclists. The diversity of these scenarios must still be captured, but redundant frames from similar situations (such as multiple frames from empty streets) can be pruned.
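A minimal sketch of the cluster-sampling step, assuming frame embeddings are already available as a NumPy array; k-means from scikit-learn is used, and the cluster count and per-cluster sample size are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_clustering(embeddings, n_clusters=50, samples_per_cluster=20, seed=0):
    """Select a diverse subset: cluster frame embeddings, then keep the frames
    closest to each cluster centre (cluster count / sample size are illustrative)."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Rank members by distance to the cluster centre and keep the closest ones.
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        keep.extend(members[np.argsort(dists)[:samples_per_cluster]].tolist())
    return sorted(keep)

# Usage (hypothetical): indices_to_keep = prune_by_clustering(frame_embeddings)
```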
Benefits of Data Pruning in Self-Driving Car Models
Data pruning in the context of self-driving car models can provide several key advantages:
- Reduced Training Time: By selecting only the most relevant and diverse images from the dataset, we drastically reduce the size of the training data, which in turn speeds up the training process.
- Cost Efficiency: With reduced data requirements, the computational cost associated with training decreases, which is essential for maintaining cost-effective operations.
- Maintained Accuracy and Diversity: Pruning does not mean simply removing images at random; it ensures that the data selected maintains high representational accuracy. As a result, the model continues to perform well in recognizing and predicting scenarios it may encounter in real-world driving environments (e.g., recognizing cars at various angles, handling occlusions, understanding traffic signs).
- Avoiding Overfitting: Using a large but redundant dataset can lead to overfitting, where the model memorizes the training data rather than generalizing to new, unseen data. Data pruning helps combat this by ensuring that the model is trained on a diverse yet compact set of images that are not too repetitive, allowing the model to generalize better.
By adopting data pruning techniques, we can create an efficient, scalable, and cost-effective training pipeline for self-driving car models, ensuring the model is both accurate and practical for deployment.
2. Training the Model
Training a deep learning model for self-driving car systems follows the traditional deep learning pipeline, but with additional considerations unique to the challenges in autonomous driving. Below is an overview of the typical steps involved:
1. Data Augmentation
Data augmentation is a crucial step in improving the model’s ability to generalize. This involves artificially increasing the size of the training dataset by applying transformations to the existing data. For self-driving car models, the following augmentations are commonly used:
- Motion Blur: Simulating the effect of blurred images due to fast-moving objects or shaky cameras. This helps the model learn to handle real-world conditions where blur might occur due to vehicle movement or external factors.
- Night-Time Images: Adding variations to simulate driving in low-light conditions, such as night-time or poorly lit environments. This helps the model perform well under different lighting conditions.
- Pixelation: Simulating lower-resolution or pixelated frames that may occur in real-world capture (e.g., due to heavy video compression or low-resolution cameras). This ensures the model can still detect important objects even in degraded image quality.
- Weather Conditions: Augmenting the data with images affected by different weather conditions such as rain, fog, snow, or sunlight glare. This helps the model learn how to detect objects and navigate the environment under challenging weather, which is a common real-world scenario for self-driving cars (a small augmentation sketch follows this list).
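A minimal sketch of such an augmentation pipeline using torchvision transforms as approximate stand-ins (Gaussian blur for motion blur, brightness jitter for low light, downscale-then-upscale for pixelation); the parameters are illustrative, and dedicated libraries such as albumentations provide more faithful weather effects.

```python
import random
from torchvision import transforms
from PIL import Image

def pixelate(img, factor=4):
    """Approximate pixelation by downscaling then upscaling with nearest neighbour."""
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.NEAREST)
    return small.resize((w, h), Image.NEAREST)

augment = transforms.Compose([
    # Rough stand-in for motion blur caused by vehicle movement.
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=7, sigma=(0.5, 3.0))], p=0.3),
    # Rough stand-in for night-time / poorly lit scenes.
    transforms.RandomApply([transforms.ColorJitter(brightness=(0.2, 0.6))], p=0.3),
    # Rough stand-in for pixelated, low-quality frames.
    transforms.RandomApply([transforms.Lambda(lambda im: pixelate(im, factor=random.choice([2, 4, 8])))], p=0.2),
    transforms.ToTensor(),
])

# Usage (hypothetical path): tensor = augment(Image.open("frame_000123.jpg").convert("RGB"))
```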
2. Model Hyperparameter Tuning
Once the model architecture is chosen, the next step is to tune the hyperparameters. Hyperparameter tuning is a critical process in achieving optimal performance. Common hyperparameters include:
- Learning Rate: Controls the step size during optimization. A learning rate that is too high may lead to unstable training, while a rate that is too low can slow down the learning process.
- Batch Size: Determines how many training examples are used in each forward/backward pass. Larger batches may speed up training but require more memory, while smaller batches may lead to better generalization.
- Epochs: The number of times the entire training dataset is passed through the model. More epochs may increase accuracy but could also risk overfitting.
- Regularization Parameters: Techniques such as Dropout or L2 regularization help prevent overfitting, which is especially important in self-driving car models where generalization to various environments and conditions is critical.
Hyperparameter tuning often involves techniques like grid search, random search, or more advanced methods like Bayesian optimization to find the most optimal set of hyperparameters for the given task.
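A minimal random-search sketch over such hyperparameters; build_model, train_one_config, and validate are hypothetical placeholders for the project's own training and evaluation code, and the search-space values and trial count are illustrative.

```python
import random

# Illustrative search space (example values, not recommendations).
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "weight_decay": [0.0, 1e-5, 1e-4],
    "epochs": [10, 20],
}

def sample_config():
    return {name: random.choice(values) for name, values in search_space.items()}

best_score, best_config = float("-inf"), None
for trial in range(20):                        # number of trials is illustrative
    config = sample_config()
    model = build_model()                      # hypothetical: constructs the detector/classifier
    train_one_config(model, config)            # hypothetical: trains with these hyperparameters
    score = validate(model)                    # hypothetical: e.g., validation mAP or accuracy
    if score > best_score:
        best_score, best_config = score, config

print("best config:", best_config, "score:", best_score)
```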
3. Model Testing
After training, the model is evaluated using testing and validation datasets. This step assesses the model’s performance on unseen data and ensures that it generalizes well to real-world scenarios. Testing for self-driving car models may involve:
- Accuracy: Evaluating how well the model detects and classifies objects (e.g., other vehicles, pedestrians, traffic signs) in various conditions (a small evaluation sketch follows this list).
- Performance under Edge Cases: Testing the model under challenging edge cases, such as low visibility (fog, heavy rain), high-speed driving, or complex urban environments.
- Real-World Simulations: Using simulation tools (e.g., CARLA, NVIDIA DriveSim) to test the model’s decision-making abilities in virtual environments that simulate various real-world driving scenarios.
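As an illustration of the accuracy evaluation, here is a minimal sketch computing mean average precision with torchmetrics (assuming a recent version that exposes MeanAveragePrecision); the boxes, scores, and labels are toy values standing in for real predictions and ground truth.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision  # assumes a recent torchmetrics

metric = MeanAveragePrecision(box_format="xyxy")

# One image: predicted boxes vs. ground-truth boxes (toy numbers for illustration).
preds = [{
    "boxes": torch.tensor([[100.0, 50.0, 220.0, 160.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([0]),   # 0 = "car" in this toy label map
}]
targets = [{
    "boxes": torch.tensor([[105.0, 55.0, 225.0, 158.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
results = metric.compute()
print({k: float(v) for k, v in results.items() if k in ("map", "map_50", "map_75")})
```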
4. Pushing to Server for Production
Once the model performs well on the test set, the final step is to push it to the production server for deployment. This involves:
- Deployment: The model is integrated into the self-driving car’s software stack, where it is tested in real-world environments. Continuous monitoring of the model’s performance is essential to ensure safety and accuracy. Any issues identified may lead to further iterations of training and testing.
- Model Versioning: It’s important to keep track of model versions and updates to ensure that any improvements or bug fixes are properly deployed, and the vehicle is running the most up-to-date and safe model.
By following this process, self-driving car systems can be trained, tested, and deployed effectively, enabling them to handle a wide range of driving scenarios and improve over time.
3. Model Pruning
In real-world applications, especially in resource-constrained environments such as edge devices or embedded systems (e.g., self-driving cars, mobile devices), model size and computation power are significant constraints. To address this, model pruning is employed to reduce the complexity of the model without sacrificing much of its accuracy.
The goal is to shrink the model’s computational footprint, allowing it to run faster and on lower-power devices while maintaining acceptable performance.
Techniques for Model Pruning
Here are several common techniques used for model pruning and optimization:
1. Model Quantization
Model quantization involves reducing the precision of the numbers used in the model (typically from 32-bit floating point to 16-bit floats or 8-bit integers). This technique has the following benefits (a short quantization sketch follows the list):
- Reduced Model Size: Quantizing the weights and activations decreases the model size significantly, allowing it to fit into memory-limited devices.
- Faster Inference: Lower precision operations are computationally faster, leading to a speed-up in inference time, which is critical in real-time systems like self-driving cars.
- Power Efficiency: Reduced computation load also leads to lower power consumption, which is crucial for devices that are battery-powered or have limited computational resources.
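A minimal sketch of post-training dynamic quantization in PyTorch, applied to a small stand-in network; dynamic quantization mainly targets Linear (and recurrent) layers, so convolution-heavy perception models would typically use static quantization or quantization-aware training instead.

```python
import torch
import torch.nn as nn

# Small stand-in network (the real perception model would be used instead).
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller weights / faster int8 matmuls
```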
2. Model Pruning and Sparsity
Model pruning is the process of removing weights from the network that are deemed unnecessary for the model’s performance. This reduces the number of parameters in the model and leads to a sparser network. Pruning can be done in various ways:
- Weight Pruning: This involves removing weights that have little effect on the output (often weights with small magnitudes are pruned); a minimal sketch appears at the end of this subsection.
- Neuron Pruning: Involves removing entire neurons or layers from the network that contribute little to the model’s predictions. For example, neurons that are inactive during training or have low activations can be pruned.
- Structured Pruning: Rather than pruning individual weights or neurons, structured pruning removes entire filters, channels, or blocks. This can lead to significant reductions in computation and memory usage.
Benefits:
- Model Size Reduction: Pruning significantly reduces the model’s size, making it more deployable on devices with limited storage.
- Speedup: The reduced number of parameters means fewer computations, leading to faster inference times.
- Memory Efficiency: Sparse models can be stored more efficiently, using specialized data structures like sparse matrices to save memory.
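A minimal sketch of magnitude-based pruning with PyTorch’s torch.nn.utils.prune utilities on toy layers; the pruning amounts are illustrative, and turning the resulting sparsity into real speedups generally requires structured pruning or sparse-aware kernels.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)  # toy stand-in for a layer of the real network

# L1 (magnitude) unstructured pruning: zero out the 30% smallest-magnitude weights.
prune.l1_unstructured(layer, name="weight", amount=0.3)
sparsity = float((layer.weight == 0).float().mean())
print(f"weight sparsity after pruning: {sparsity:.2%}")

# Make the pruning permanent (removes the mask and the re-parametrisation).
prune.remove(layer, "weight")

# Structured variant (illustrative): prune 25% of a conv layer's output channels.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
```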
3. Neural Network Architecture Search (NAS)
Neural Architecture Search (NAS) is an automated technique used to find the most efficient model architecture for a given task. NAS searches for optimal architectures by exploring different configurations of layers, neurons, and connections, often within predefined constraints like model size or computation.
Benefits:
- Optimized Model Architecture: NAS helps discover more efficient architectures that require fewer resources while maintaining or improving performance.
- Automated Search: It automates the design of models, making it easier to find architectures that perform well under computational constraints (a toy random-search sketch follows this list).
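A toy sketch of NAS as a random search over a small (depth, width) space under a parameter budget; quick_proxy_eval is a hypothetical placeholder for a short training run or a zero-cost proxy, and all values are illustrative.

```python
import random
import torch.nn as nn

def build_cnn(depth, width):
    """Build a small CNN from a sampled (depth, width) configuration."""
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.BatchNorm2d(width), nn.ReLU()]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 10)]
    return nn.Sequential(*layers)

def num_params(model):
    return sum(p.numel() for p in model.parameters())

search_space = {"depth": [2, 4, 6, 8], "width": [16, 32, 64, 128]}
budget = 500_000  # illustrative parameter budget for an embedded target

best = None
for _ in range(30):  # number of sampled architectures is illustrative
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    model = build_cnn(**cfg)
    if num_params(model) > budget:
        continue  # skip architectures that exceed the resource constraint
    score = quick_proxy_eval(model)  # hypothetical: short training run or zero-cost proxy
    if best is None or score > best[0]:
        best = (score, cfg)

print("best architecture under budget:", best)
```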
4. Model Distillation
Model distillation involves training a smaller, more efficient model (called the “student”) to mimic the behavior of a larger, more complex model (called the “teacher”). The smaller model learns from the outputs of the larger model, effectively inheriting the knowledge of the teacher while being much lighter.
Benefits:
- Smaller Models: The student model is typically much smaller than the teacher model, making it suitable for deployment on resource-constrained devices.
- Retaining Accuracy: Distillation can often produce a smaller model that retains much of the accuracy of the larger model, making it a highly effective technique for model compression.
- Faster Inference: The distilled model is designed to be efficient, thus speeding up inference, which is especially useful in real-time systems like autonomous driving (a distillation-loss sketch follows this list).
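A minimal sketch of the classic soft-label distillation loss: a KL-divergence term on temperature-softened teacher/student logits, blended with the usual cross-entropy on ground-truth labels; the temperature and weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    """Soft-label distillation: KL between softened teacher/student distributions,
    blended with cross-entropy on the hard labels (alpha and T are illustrative)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Inside the training loop (sketch): the teacher runs in eval mode under no_grad,
# and the student is optimised with this combined loss.
```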
Combining Techniques
These pruning techniques can be combined to achieve further optimization:
- Pruning + Quantization: After pruning, quantization can further reduce the size and improve the performance of the model.
- Distillation + Pruning: A smaller distilled model can be further pruned for even more efficiency.
Conclusion
For systems like self-driving cars, where real-time inference is critical, model pruning and optimization techniques are necessary to ensure the model runs efficiently on embedded systems. By reducing the model size, improving speed, and lowering power consumption, pruning helps strike the right balance between accuracy and computational efficiency.
4. Model Dependability
Model dependability refers to the ability to trust a model’s predictions and understand when the model might fail or give unreliable results. Ensuring dependability is critical, especially in high-stakes applications like self-driving cars where safety and accuracy are paramount.
Methods to Assess Model Dependability
- Grad-CAM (Gradient-weighted Class Activation Mapping)
Grad-CAM is a popular technique that helps understand the areas in an image that a model is focusing on when making predictions. It produces a heatmap that highlights which regions of an image contribute most to the model’s output. This can help in the following ways:
- Understanding Model Attention: By visualizing which parts of an image the model is attending to, we can gain insights into its decision-making process.
- Debugging: If the model is focusing on irrelevant or unintended parts of an image (e.g., a shadow instead of the traffic sign), we can identify potential sources of error or bias.
- Validating Model Trustworthiness: Grad-CAM helps confirm that the model is relying on meaningful features (e.g., car headlights, road signs) rather than noise or unrelated elements.
Grad-CAM can be particularly useful in self-driving cars where understanding what the model sees and reacts to can be the difference between safe and unsafe driving decisions.
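A minimal Grad-CAM sketch using forward/backward hooks on a torchvision ResNet-18 (a stand-in for the deployed classifier, assuming a recent torchvision); the target layer, input, and class choice are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
target_layer = model.layer4[-1]  # last residual block; an illustrative choice

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, inp, out: activations.update(value=out.detach()))
target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.update(value=gout[0].detach()))

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed camera frame
logits = model(image)
class_idx = int(logits.argmax(dim=1))
logits[0, class_idx].backward()  # gradient of the predicted class w.r.t. the target layer

# Grad-CAM: weight each activation channel by its average gradient, ReLU, then upsample.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)              # [1, C, 1, 1]
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))  # [1, 1, h, w]
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalised heatmap in [0, 1]
```

The resulting heatmap can be overlaid on the input frame to inspect whether the model is attending to the relevant objects rather than background noise.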
- Out-of-Distribution (OOD) Testing and Morphing Images
OOD testing is an essential aspect of assessing how well a model handles extreme cases or unseen scenarios that are far from the data distribution on which it was trained. Morphing images or creating synthetic data pushes the model to deal with cases that it might not have encountered during training. The following are common areas of focus during OOD testing:
- Shape: Altering the shape of objects in the image (e.g., skewing or distorting cars, pedestrians, or traffic signs) to test if the model can generalize to new forms.
- 3D Pose: Modifying the 3D pose of objects (e.g., rotating cars or pedestrians) to ensure the model can correctly interpret objects in various orientations.
- Texture: Changing the texture of objects (e.g., altering the texture of road surfaces or vehicles) to check if the model can still recognize objects under different visual conditions.
- Context: Testing how well the model performs when the context around the object changes (e.g., a car appearing in an unusual background like a foggy environment or a snowy landscape).
- Weather: Creating scenarios with different weather conditions (e.g., rain, snow, fog) to determine how environmental factors affect the model’s performance. For instance, self-driving cars need to understand traffic signals even in poor visibility conditions.
- Occlusion: Simulating occlusion (e.g., parts of an object being hidden by another object, like a pedestrian behind a car or a traffic sign partially blocked by a tree) to see how the model handles partial information and still makes correct predictions.
OOD testing allows us to better understand where and how the model might fail, ensuring robustness and reliability in real-world scenarios, especially when the system encounters unexpected or extreme situations.
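A minimal sketch of two synthetic stress tests, occlusion (masking a random patch) and a crude fog-like haze (blending toward white), applied to a uint8 image array; real OOD suites would rely on richer corruptions or simulators, so these are illustrative stand-ins.

```python
import numpy as np

def occlude(image, size=0.3, seed=0):
    """Hide a random square patch (a crude stand-in for occlusion by other objects)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ph, pw = int(h * size), int(w * size)
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out = image.copy()
    out[y:y + ph, x:x + pw] = 127  # neutral grey patch
    return out

def add_fog(image, strength=0.5):
    """Blend the image toward white to mimic reduced visibility (illustrative)."""
    white = np.full_like(image, 255)
    return (image.astype(np.float32) * (1 - strength) + white * strength).astype(np.uint8)

# Usage sketch: run the trained model on clean vs. corrupted frames and compare
# detection/classification metrics to quantify robustness under these shifts.
```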
Why is Model Dependability Important?
For critical systems like autonomous driving, medical imaging, or aviation systems, it is essential that the model is dependable and can be relied upon in all situations. If the model cannot perform well under unusual conditions (e.g., rain, snow, night time), it could lead to catastrophic consequences.
By using techniques like Grad-CAM and OOD testing, we can improve model dependability, ensure that the model behaves as expected in diverse real-world scenarios, and minimize the risk of failures.