Building Better Deep Learning Models
In my previous blog, I discussed the AutoDL pipeline, which you can explore here: AutoDL Blog. The final module of that pipeline focuses on model dependability: understanding when a model performs well and when it fails. One key insight is that no matter how much you manipulate or augment poor-quality data, it will not improve the model's output. If a model's accuracy plateaus, it is therefore crucial to break out of this cycle and identify where the real issue lies.
Deep learning models are inherently probabilistic and heavily dependent on the data they are trained on. This makes them particularly vulnerable to out-of-distribution data, posing a significant challenge both for the model and for human interpretation.
To overcome these challenges, improving the priors and representations within the model is essential. For instance, in self-driving cars, while RGB cameras may struggle in low-light conditions, thermal cameras can provide valuable insights by detecting heat signatures.
Advancements in model architectures and algorithms can help push the boundaries of performance, enabling more robust handling of edge cases and out-of-distribution scenarios. Incorporating multimodal inputs, such as text, speech, or novel sensors (e.g., thermal cameras), can further enhance model robustness by providing better priors and improving the model’s ability to handle complex, real-world applications.
Overview of Key Technologies
Multimodal Stack
In this approach, new modalities are integrated alongside existing ones, such as text, speech, or advanced cameras, to enhance model performance. The goal is to leverage complementary information from different modalities—information that may not be present in the current modality but exists in another. The model then selects the most relevant information from both sources to make a more informed decision.
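To make this concrete, here is a minimal late-fusion sketch in PyTorch. Everything in it (the tiny encoders, the feature sizes, the gating layer) is a hypothetical stand-in for real RGB and thermal backbones; the point is only to show one modality compensating for another through a learned per-sample weighting, not a production architecture.

```python
# Minimal late-fusion sketch (hypothetical encoders and sizes): each
# modality is encoded separately, then a small gate weighs the two
# feature vectors before classification.
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    def __init__(self, num_classes: int = 10, feat_dim: int = 128):
        super().__init__()
        # Per-modality encoders (toy stand-ins for real backbones).
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        self.thermal_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        # Gate learns how much to trust each modality per sample.
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, thermal):
        f_rgb = self.rgb_encoder(rgb)
        f_th = self.thermal_encoder(thermal)
        w = self.gate(torch.cat([f_rgb, f_th], dim=-1))  # (B, 2) weights
        fused = w[:, :1] * f_rgb + w[:, 1:] * f_th       # weighted sum
        return self.classifier(fused)

model = LateFusionNet()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 10])
```

In a dark scene, the gate can learn to down-weight the uninformative RGB features and lean on the thermal ones, which is exactly the "select the most relevant information from both sources" behavior described above.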
Model Explainability
[Figure: On the left, a person who is not visible in the low-light RGB image is clearly seen in the corresponding thermal image. On the right, a SPAD camera captures high-resolution output without read noise due to its hardware, offering enhanced visibility even in low-light conditions such as night.]
Computer Vision Stack
- Lensless Imaging: Lensless imaging leverages computational methods to reconstruct images without traditional optical lenses. It captures light patterns using a sensor array and processes the data with algorithms to generate high-quality 3D images (a toy reconstruction sketch follows this list).
- Thermal Cameras: Thermal cameras detect infrared radiation emitted by objects and convert it into visible images. They are especially useful in low-light conditions and for detecting temperature anomalies, with common uses in medical imaging, surveillance, and night vision.
- SPAD Cameras (Single-Photon Avalanche Diode): SPAD cameras are highly sensitive sensors that detect single photons, enabling ultra-low-light imaging. They are used in applications such as time-of-flight (ToF) imaging, LiDAR systems, and quantum optics, providing high-resolution depth information.
- Depth Cameras (LiDAR): LiDAR uses laser pulses to measure distances, creating precise 3D maps of environments. It is a key technology in autonomous vehicles, robotics, and any application requiring accurate depth sensing, providing detailed environmental awareness.
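As a concrete illustration of the computational reconstruction behind lensless imaging, here is a toy sketch using Wiener deconvolution in NumPy. The scene, point spread function (PSF), and noise level are all made up, and real systems use calibrated PSFs and more sophisticated solvers; but the forward model (measurement = scene convolved with the PSF) and its Fourier-domain inversion are representative of the idea.

```python
# Toy lensless-imaging reconstruction via Wiener deconvolution.
# The PSF, scene, and noise level below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
scene = np.zeros((64, 64))
scene[20:40, 28:36] = 1.0                    # toy scene: a bright bar
psf = rng.random((64, 64))
psf /= psf.sum()                             # toy diffuser-like PSF

# Forward model: measurement = scene (*) psf + noise (circular conv).
H = np.fft.fft2(psf)
measurement = np.real(np.fft.ifft2(np.fft.fft2(scene) * H))
measurement += 0.01 * rng.standard_normal(measurement.shape)

# Wiener filter: X_hat = conj(H) / (|H|^2 + K) * B, where K trades off
# noise amplification against sharpness.
K = 1e-3
X_hat = np.conj(H) / (np.abs(H) ** 2 + K) * np.fft.fft2(measurement)
recon = np.real(np.fft.ifft2(X_hat))
print(float(np.abs(recon - scene).mean()))   # mean absolute error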
Algorithm Stack
Generative AI
- GANs (Generative Adversarial Networks): GANs consist of two networks, a generator and a discriminator, that work in opposition. The generator creates data while the discriminator evaluates it, and this adversarial process improves the quality of generated content over time. GANs are commonly used for image synthesis and enhancement tasks (a minimal training-step sketch follows this list).
- Diffusion Models: Diffusion models generate data by progressively denoising random noise, reversing a process of gradual degradation. They are known for producing high-quality, diverse images and are applied to tasks like image synthesis and inpainting (the forward noising step is also sketched below).
- Flow-Based Models: Flow-based models transform a simple distribution (e.g., Gaussian noise) into a complex distribution by learning invertible transformations. They are particularly suited to tasks requiring exact likelihood computation, such as density estimation and image generation.
- NeRFs (Neural Radiance Fields): NeRFs model 3D scenes by representing the interaction of light within the scene, enabling the generation of highly realistic 3D views from 2D images. They are commonly used in virtual reality, 3D rendering, and photorealistic scene generation.
- Gaussian Splatting: Gaussian splatting represents 3D scenes or objects as points with associated Gaussian distributions, providing an efficient and accurate way to synthesize 3D views and enhancing volumetric rendering quality.
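To ground the GAN description above, here is a minimal single-training-step sketch in PyTorch. The tiny MLPs and the toy 2-D data are hypothetical; what matters is the structure of the standard adversarial loop: a discriminator step on real-vs-fake labels, then a generator step that tries to fool the discriminator.

```python
# One GAN training step (toy MLPs and data; structure is the standard loop).
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, data_dim) + 3.0   # stand-in "real" data

# Discriminator step: real samples labeled 1, generated samples labeled 0.
z = torch.randn(64, latent_dim)
fake = G(z).detach()                     # detach so only D updates here
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to make D label fresh fakes as real.
z = torch.randn(64, latent_dim)
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```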
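And for diffusion models, here is a short sketch of the closed-form forward (noising) process from DDPM, which a denoising network is then trained to reverse. The linear beta schedule and tensor shapes are illustrative choices, not a specific model's settings.

```python
# DDPM-style forward process: x_t is a mix of the clean image x_0 and
# Gaussian noise, with the mix controlled by a variance schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative products

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form for timestep t."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].sqrt()
    s = (1.0 - alpha_bar[t]).sqrt()
    return a * x0 + s * eps, eps   # noisy sample and the noise target

x0 = torch.randn(1, 3, 32, 32)     # stand-in image batch
xt, eps = q_sample(x0, t=500)      # halfway through the schedule
```

Training then amounts to teaching a network to predict `eps` from `xt` and `t`; sampling runs the learned denoiser step by step from pure noise back to data.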
Image Restoration/Cleaning
- U-Net Architectures: U-Net-based architectures are widely used for image restoration and segmentation, excelling at capturing fine spatial details and contextual information (a minimal sketch follows this list). Notable U-Net variants include:
  - Restormer: Optimized for image denoising and deblurring tasks, providing enhanced restoration quality.
  - UFormer: Combines U-Net with transformers to improve feature extraction and restoration accuracy, especially for high-quality image reconstruction.
  - AutoDir: An autoencoder-based U-Net variant designed to perform image restoration in an unsupervised manner.
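To show the skip-connection idea the whole U-Net family shares, here is a deliberately tiny U-Net in PyTorch. The depth and channel counts are hypothetical and far smaller than Restormer or UFormer; the essential structure is the encoder-decoder with a skip connection carrying fine spatial detail across.

```python
# Tiny U-Net sketch for restoration (hypothetical sizes, one skip level).
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = block(3, 16)
        self.enc2 = block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)        # 32 = 16 skip + 16 upsampled
        self.out = nn.Conv2d(16, 3, 1)   # predict the restored image

    def forward(self, x):
        e1 = self.enc1(x)                # full-resolution features
        e2 = self.enc2(self.pool(e1))    # half-resolution features
        d1 = self.up(e2)                 # back to full resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection
        return self.out(d1)

net = TinyUNet()
restored = net(torch.randn(1, 3, 64, 64))
print(restored.shape)  # torch.Size([1, 3, 64, 64])
```

The concatenation of `e1` with the upsampled features is what lets the decoder recover sharp edges that pooling would otherwise blur away, which is why this shape recurs across restoration models.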