Self-Supervised Learning Approaches

Learning Rich Representations from Unlabeled Data

Authored by: Loveleen Narang

Date: May 26, 2024

Introduction: Learning Without Labels

Deep learning models have achieved remarkable success, but often rely heavily on vast amounts of labeled data, which can be expensive, time-consuming, and sometimes impossible to obtain. Unsupervised learning aims to find patterns without labels, but often focuses on tasks like clustering or dimensionality reduction. Bridging this gap is Self-Supervised Learning (SSL), a powerful paradigm that learns representations from unlabeled data by creating supervisory signals *from the data itself*.

Instead of human-provided labels, SSL employs **pretext tasks** where parts of the input data are hidden or modified, and the model learns to predict or reconstruct the missing information. By solving these pretext tasks, the model is forced to learn meaningful, transferable representations \( h = f_\theta(x) \) (Formula 1) that capture the underlying structure and semantics of the data. These learned representations can then be used for various downstream tasks (like classification or detection) with significantly less labeled data required for fine-tuning. SSL has been a driving force behind recent breakthroughs in Natural Language Processing (NLP) with models like BERT and GPT, and is rapidly transforming Computer Vision (CV).

Learning Paradigms Overview

[Diagram: three panels — Supervised Learning (input: data x and labels y; goal: learn a mapping f(x) → y), Unsupervised Learning (input: data x only; goal: find structure such as clusters or density), and Self-Supervised Learning (input: data x plus pseudo-labels created from it; goal: learn a representation f(x) via a pretext task).]

Fig 1: Comparing Supervised, Unsupervised, and Self-Supervised Learning approaches.

Pretext Tasks: Creating Supervision from Data

The core of SSL lies in designing effective **pretext tasks**. These are self-supervised tasks solved not for their own sake, but to force the model to learn useful representations for downstream applications. The model minimizes a loss function defined by the pretext task (Formula 2: \( L_{pretext} \)). Examples vary across domains:

Example Pretext Tasks (CV)

[Diagram panels: Rotation Prediction (rotated image; predict 0°, 90°, 180°, or 270°), Jigsaw Puzzle (shuffled patches; predict the original permutation), Inpainting/Masking (masked region; predict the masked pixels).]

Fig 2, 3, 4: Examples of pretext tasks used in computer vision for self-supervised learning.
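
To make a pretext task concrete, the following minimal PyTorch sketch implements rotation prediction: each unlabeled image is rotated by a random multiple of 90° and the model is trained to classify which rotation was applied. The small convolutional encoder and four-way rotation head are illustrative stand-ins, not a specific published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_rotation_batch(images):
    """Rotate each image by a random multiple of 90 degrees.

    Returns the rotated images and the rotation index (0-3) as pseudo-labels.
    """
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

# Illustrative encoder and 4-way rotation classification head.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
rotation_head = nn.Linear(64, 4)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(rotation_head.parameters()), lr=1e-3)

images = torch.randn(8, 3, 32, 32)                 # stand-in for an unlabeled batch
rotated, pseudo_labels = make_rotation_batch(images)
logits = rotation_head(encoder(rotated))
loss = F.cross_entropy(logits, pseudo_labels)      # the pretext loss L_pretext
loss.backward()
optimizer.step()
```

The representation of interest is the encoder output; the rotation head is discarded after pre-training.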

Major SSL Paradigms

SSL methods can be broadly categorized into three main paradigms:

1. Generative Approaches

These methods involve learning to generate or reconstruct parts of the input data.

  • Autoencoders (AEs): Train an encoder \( f_\theta \) to map input \( x \) to a latent representation \( z \) and a decoder \( g_\phi \) to reconstruct \( x \) from \( z \). Often used in a denoising (\( \min ||x - g(f(x + \epsilon))||^2 \), Formula 4) or masked setup.
  • Masked Language Models (MLM): Models like BERT learn by predicting masked tokens in text based on unmasked context.
  • Masked Autoencoders (MAE) for Vision: Randomly mask a large portion of image patches and train an encoder-decoder (often Transformer-based) to reconstruct the pixel values of the masked patches. Loss: \( L = \sum_{i \in \text{Masked}} ||x_i - \hat{x}_i||^2 \) (Formula 5). Learns rich representations efficiently.

Generative SSL: Masked Autoencoder (MAE) Concept

[Diagram: input image split into patches → encoder (ViT) processes only the visible patches → decoder predicts the masked patches → reconstruction loss Σ ||Original_Masked − Reconstructed_Masked||².]

Fig 5: Masked Autoencoder (MAE) learns by reconstructing randomly masked image patches.
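
Below is a minimal sketch of the MAE-style objective, assuming a simple patchify helper and linear stand-ins for the ViT encoder and lightweight decoder; the point it illustrates is that the reconstruction loss is computed only on the masked patches.

```python
import torch
import torch.nn as nn

def patchify(imgs, patch=8):
    """Split (B, C, H, W) images into (B, N, patch*patch*C) flattened patches."""
    B, C, H, W = imgs.shape
    x = imgs.unfold(2, patch, patch).unfold(3, patch, patch)      # B, C, H/p, W/p, p, p
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x

B, mask_ratio, dim = 4, 0.75, 128
patches = patchify(torch.randn(B, 3, 32, 32))                     # (B, 16, 192)
num_patches, patch_dim = patches.shape[1], patches.shape[2]

# Random mask: True = hidden from the encoder, to be reconstructed by the decoder.
mask = torch.rand(B, num_patches) < mask_ratio

# Hypothetical encoder/decoder stand-ins (MAE uses a ViT encoder on the visible patches only).
encoder = nn.Linear(patch_dim, dim)
decoder = nn.Linear(dim, patch_dim)

latent = encoder(patches * (~mask).unsqueeze(-1))   # zero out masked patches for simplicity
recon = decoder(latent)

# Mean squared error over the masked patches only (the per-patch form of Formula 5).
loss = ((recon - patches) ** 2).mean(dim=-1)[mask].mean()
loss.backward()
```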

2. Contrastive Approaches

These methods learn representations by contrasting pairs of samples. The goal is to pull representations of "positive" pairs (e.g., different augmented views of the same image) closer together in an embedding space, while pushing representations of "negative" pairs (e.g., views from different images) farther apart.

  • **Core Idea:** Define a similarity function \( sim(u, v) \) (Formula 6), typically cosine similarity \( \frac{u^T v}{||u||\,||v||} \) (Formula 7). Maximize similarity for positive pairs and minimize it for negative pairs.
  • **InfoNCE Loss:** A common contrastive loss function (used in SimCLR as NT-Xent). For an anchor \( z_i \) and its positive \( z_j \), it tries to identify \( z_j \) correctly among a set of candidates \( \{z_k\}_{k \neq i} \). Formula (8):
    $$ L_i = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k=0}^{N} \exp(sim(z_i, z_k)/\tau)} $$
    where \( \tau \) is a temperature hyperparameter (Formula 9) and the denominator sums over the positive pair (\( k = j \)) and N negative samples.
  • **SimCLR:** Uses strong data augmentation to create two positive views of each image within a large batch; all other instances in the same batch act as negatives. A projection head (an MLP) is applied after the encoder before the loss is computed.
  • **MoCo (Momentum Contrast):** Avoids SimCLR's need for very large batches by maintaining a queue \( K \) of negative keys from previous batches, encoded by a slowly evolving momentum encoder (parameters \( \theta_k \)) to keep them consistent. With query \( q \) and positive key \( k_+ \), the loss is \( L_q = -\log \frac{\exp(q^T k_+ / \tau)}{\exp(q^T k_+ / \tau) + \sum_{k_- \in K} \exp(q^T k_- / \tau)} \) (Formula 10), and the key encoder is updated as \( \theta_k \leftarrow m \theta_k + (1-m) \theta_q \) (Formula 11), where \( m \) is the momentum coefficient (Formula 12).

Contrastive Learning Concept

[Diagram: embedding space with an anchor z_i and its positive z_j pulled together (maximize similarity), while negatives z_k1 … z_kN are pushed apart (minimize similarity).]

Fig 6: Contrastive learning pulls positive pairs together and pushes negative pairs apart in the embedding space.

SimCLR Architecture (Simplified)

[Diagram: image x → two augmentations t(x) and t'(x) → shared encoder f(·) → projector g(·) → embeddings z_i and z_j → contrastive loss maximizing sim(z_i, z_j) against negatives.]

Fig 7: Simplified SimCLR: Augment image twice, encode, project, apply contrastive loss.
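
The following sketch implements the NT-Xent/InfoNCE loss for a SimCLR-style setup, assuming the two batches of embeddings z1 and z2 have already been produced by the encoder and projection head. It is a minimal illustration under those assumptions, not the reference SimCLR implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent / InfoNCE loss for two batches of projected embeddings.

    z1[i] and z2[i] are the two augmented views of image i; every other
    embedding in the combined batch acts as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, d), unit-norm rows
    sim = z @ z.t() / tau                                      # cosine similarities / temperature
    # Remove self-similarity from every row's denominator.
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # The positive for row i is row i+n (and vice versa); treat it as a classification target.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage with random stand-in projections:
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent_loss(z1, z2, tau=0.5)
```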

MoCo Architecture (Simplified)

[Diagram: image x → augmentation t(x) → query encoder f_q (θ_q) → query q; augmentation t'(x) → momentum encoder f_k (θ_k) → positive key k+; queue K of negative keys {k_1, k_2, …} with the newest keys enqueued and the oldest dequeued; contrastive loss over q vs. k+ and K; momentum update of θ_k.]

Fig 8: Simplified MoCo: Uses a query encoder, a momentum key encoder, and a queue for negative samples.
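
MoCo's two distinctive ingredients, the momentum (EMA) update of the key encoder and the FIFO queue of negative keys, can be sketched as follows. The encoders, queue size, and momentum value here are illustrative placeholders, not MoCo's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical query/key encoders with identical architectures.
encoder_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
encoder_k = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
encoder_k.load_state_dict(encoder_q.state_dict())    # start from the same weights
for p in encoder_k.parameters():
    p.requires_grad = False                           # key encoder is updated only by momentum

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q (Formula 11)."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

# FIFO queue of negative keys: enqueue the newest keys, dequeue the oldest.
queue = torch.randn(4096, 128)
def update_queue(queue, new_keys):
    return torch.cat([new_keys.detach(), queue], dim=0)[: queue.size(0)]

x = torch.randn(8, 3, 32, 32)
q = encoder_q(x)                                      # queries
with torch.no_grad():
    k_pos = encoder_k(x)                              # positive keys (from a different augmentation in MoCo)
momentum_update(encoder_q, encoder_k)
queue = update_queue(queue, k_pos)
```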

3. Non-Contrastive Approaches

These methods learn representations by maximizing similarity between positive pairs only, avoiding the need for explicit negative samples. They employ specific architectural designs or regularization to prevent representational collapse (where the model outputs the same constant vector for all inputs).

  • **BYOL (Bootstrap Your Own Latent):** Uses two networks: an online network (parameters \( \theta \)) and a target network (parameters \( \xi \)). The online network learns to predict the target network's projection of a different augmented view of the same image. Loss: \( L_{\theta, \xi} \propto || \bar{q}_\theta(z_1) - \bar{p}_{\xi}(z'_2) ||_2^2 \) (Formula 13), where \( \bar{q}, \bar{p} \) are the normalized prediction and target projection (Formula 14). Crucially, the target weights \( \xi \) are an exponential moving average (momentum update) of the online weights \( \theta \): \( \xi \leftarrow \tau \xi + (1-\tau) \theta \) (Formula 15), where \( \tau \) is the target decay rate (Formula 16). This asymmetry prevents collapse.
  • **SimSiam (Simple Siamese):** Uses a simpler Siamese architecture with two identical encoders. It maximizes the cosine similarity between the prediction \( p_1 = h(f(x_1)) \) from one view and the encoded representation \( z_2 = f(x_2) \) from the other, applying a stop-gradient to \( z_2 \) (Formula 18). Loss: \( L = D(p_1, z_2)/2 + D(p_2, z_1)/2 \), where \( D(p, z) = - \frac{p^T z}{||p||_2\, ||z||_2} \) (Formula 17). The stop-gradient makes the optimization asymmetric, which prevents collapse.
  • **Barlow Twins:** Drives the cross-correlation matrix \( C \) (Formula 20) between the embeddings of two augmented views (\( Z^A, Z^B \)) toward the identity matrix, encouraging invariance (diagonal terms close to 1) and redundancy reduction (off-diagonal terms close to 0). Loss: \( L_{BT} \propto \sum_i (1 - C_{ii})^2 + \lambda \sum_{i \neq j} C_{ij}^2 \) (Formula 19); see the sketch after this list.
  • **VICReg (Variance-Invariance-Covariance Regularization):** Pursues a similar goal to Barlow Twins but explicitly optimizes three terms: invariance (MSE between the two embeddings), variance (keeping the variance of each embedding dimension above a margin to prevent collapse), and covariance (decorrelating different dimensions). Loss (Formula 21): \( L_{VICReg} = \lambda\, I(Z^A, Z^B) + \mu\, [S(Z^A) + S(Z^B)] + \nu\, [C(Z^A) + C(Z^B)] \), where \( I \), \( S \), and \( C \) are the invariance, variance, and covariance terms (Formulas 22-24).
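
The Barlow Twins objective is compact enough to sketch directly. The code below assumes batched embeddings Z^A and Z^B from the two augmented views; the standardization epsilon and the weight lam are illustrative choices, not the paper's exact settings.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective: drive the cross-correlation matrix C toward identity.

    Diagonal terms -> 1 (invariance); off-diagonal terms -> 0 (redundancy reduction).
    """
    n, d = z_a.shape
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / n                                   # (d, d) cross-correlation matrix
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()            # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy-reduction term
    return on_diag + lam * off_diag

z_a, z_b = torch.randn(64, 128), torch.randn(64, 128)         # embeddings of two augmented views
loss = barlow_twins_loss(z_a, z_b)
```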

Non-Contrastive: BYOL Architecture

[Diagram: image x → online network θ (augmentation t(x) → encoder f_θ → projector g_θ → predictor q_θ) and target network ξ (augmentation t'(x) → encoder f_ξ → projector g_ξ, with stop-gradient sg(z')); MSE loss between prediction and target projection; ξ updated by momentum from θ.]

Fig 9 & 10: BYOL uses asymmetric online/target networks with momentum updates to avoid collapse without negative samples. SimSiam uses stop-gradient.
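
The SimSiam objective is easy to express in code because the stop-gradient is just a detach on the target branch. The encoder f and predictor h below are illustrative stand-ins for the backbone-plus-projector and prediction MLP, not the original architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_cosine(p, z):
    """D(p, z) = -cosine similarity, with a stop-gradient on z (Formulas 17-18)."""
    z = z.detach()                      # stop-gradient: z is treated as a constant target
    return -F.cosine_similarity(p, z, dim=1).mean()

# Illustrative encoder f (backbone + projector) and predictor h.
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
h = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))

x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)   # two augmented views
z1, z2 = f(x1), f(x2)
p1, p2 = h(z1), h(z2)

# Symmetrized SimSiam loss: L = D(p1, z2)/2 + D(p2, z1)/2
loss = neg_cosine(p1, z2) / 2 + neg_cosine(p2, z1) / 2
loss.backward()
```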

Comparison of SSL Paradigms
| Paradigm | Core Idea | Examples | Pros | Cons |
| --- | --- | --- | --- | --- |
| Generative | Reconstruct/predict masked or corrupted input | MAE, BERT, denoising AEs | Learns density/details; good for generation | Can focus too much on low-level details and may not learn high-level semantics as well |
| Contrastive | Pull positives together, push negatives apart | SimCLR, MoCo | Learns discriminative features; strong downstream performance | Needs careful negative sampling or large batches; sensitive to augmentations |
| Non-Contrastive | Maximize similarity of positives; use architectural tricks to avoid collapse | BYOL, SimSiam, Barlow Twins, VICReg | No negative samples needed; conceptually simpler loss (sometimes) | Collapse-prevention mechanisms can be subtle; performance sensitive to architecture/regularization |

Applications and Impact

  • Pre-training Large Models: SSL is the standard for pre-training large foundation models in NLP (BERT, GPT, RoBERTa) and increasingly in CV (ViT variants, MAE, SimCLR pre-trained models). This allows models to learn general language/visual understanding from vast unlabeled web-scale data.
  • Improved Downstream Performance: Models pre-trained with SSL often achieve state-of-the-art results when fine-tuned on downstream tasks (classification, detection, segmentation) with limited labeled data. Linear probing (training a linear classifier on frozen SSL features) is a common evaluation protocol.
  • Representation Learning: SSL learns powerful, compressed representations of data that capture essential semantic information.
  • Domain Adaptation: Can help adapt models to new domains with unlabeled data.

Challenges and Future Directions

  • Designing Pretext Tasks/Augmentations: The effectiveness of SSL heavily depends on the design of the pretext task or data augmentations. Poor choices can lead to learning irrelevant features.
  • Computational Cost: Pre-training large models on massive unlabeled datasets requires significant computational resources (GPUs/TPUs).
  • Collapse Prevention: Non-contrastive methods require careful architectural design (asymmetry, stop-gradients) or regularization (Barlow Twins, VICReg) to avoid trivial solutions. Understanding these mechanisms is ongoing research.
  • Evaluation: Assessing the quality of learned representations without downstream task evaluation is difficult. Linear probing is common but may not fully reflect representation quality.
  • Bias Amplification: Models pre-trained on large, uncurated datasets can inadvertently learn and amplify societal biases present in the data.
  • Theoretical Understanding: While empirically successful, a deeper theoretical understanding of why certain SSL methods work so well (especially non-contrastive ones) is still developing.

Future directions include developing more efficient SSL methods, creating pretext tasks that capture higher-level reasoning, combining SSL paradigms, applying SSL to more modalities (graphs, tabular data), and ensuring fairness and robustness in learned representations.

Other Pretext Tasks (Illustrative)

[Diagram panels: Colorization (grayscale input; predict the color output) and Context Prediction (patches A and B; predict their relative position, e.g. B is to the right of A).]

Fig 11 & 12: Examples of other pretext tasks like colorization and relative patch prediction.

Downstream Task Adaptation

[Diagram: new data → pre-trained SSL encoder (optionally frozen) → new classifier (fine-tuned or linear probe) → downstream prediction.]

Fig 13: Using SSL-learned representations for downstream tasks via fine-tuning or linear probing.
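
A minimal linear-probing sketch is shown below: a pre-trained encoder (here just a randomly initialized stand-in; in practice its weights would be loaded from SSL pre-training) is frozen, and only a linear classifier is trained on its features. Fine-tuning would instead leave the encoder parameters trainable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a pre-trained SSL encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False                       # frozen for linear probing

probe = nn.Linear(128, 10)                        # only the linear classifier is trained
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)

# A small labeled batch for the downstream task (random stand-ins here).
x, y = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))
with torch.no_grad():
    features = encoder(x)                         # frozen SSL features
loss = F.cross_entropy(probe(features), y)
loss.backward()
optimizer.step()
```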


Conclusion

Self-Supervised Learning has emerged as a transformative force in machine learning, offering a powerful way to learn rich data representations without relying on expensive human annotations. By cleverly designing pretext tasks that leverage the inherent structure of data, generative, contrastive, and non-contrastive SSL methods enable the pre-training of large, versatile models that excel on downstream tasks even with limited labeled data. While challenges in task design, computational cost, and evaluation remain, SSL has fundamentally changed the landscape, particularly in NLP and computer vision, making it possible to harness the vast amounts of unlabeled data available in the world. As research continues, we can expect even more efficient, robust, and versatile SSL approaches, further reducing our dependence on labeled data and pushing the frontiers of AI.



About the Author, Architect & Developer

Loveleen Narang is an accomplished leader and visionary in Data Science, Machine Learning, and Artificial Intelligence. With over 20 years of expertise in designing and architecting innovative AI-driven solutions, he specializes in harnessing advanced technologies to address critical challenges across industries. His strategic approach not only solves complex problems but also drives operational efficiency, strengthens regulatory compliance, and delivers measurable value—particularly in government and public sector initiatives.

Renowned for his commitment to excellence, Loveleen’s work centers on developing robust, scalable, and secure systems that adhere to global standards and ethical frameworks. By integrating cross-functional collaboration with forward-thinking methodologies, he ensures solutions are both future-ready and aligned with organizational objectives. His contributions continue to shape industry best practices, solidifying his reputation as a catalyst for transformative, technology-led growth.