Deep learning models have achieved remarkable success, but often rely heavily on vast amounts of labeled data, which can be expensive, time-consuming, and sometimes impossible to obtain. Unsupervised learning aims to find patterns without labels, but often focuses on tasks like clustering or dimensionality reduction. Bridging this gap is Self-Supervised Learning (SSL), a powerful paradigm that learns representations from unlabeled data by creating supervisory signals *from the data itself*.
Instead of human-provided labels, SSL employs **pretext tasks** where parts of the input data are hidden or modified, and the model learns to predict or reconstruct the missing information. By solving these pretext tasks, the model is forced to learn meaningful, transferable representations \( h = f_\theta(x) \) (Formula 1) that capture the underlying structure and semantics of the data. These learned representations can then be used for various downstream tasks (like classification or detection) with significantly less labeled data required for fine-tuning. SSL has been a driving force behind recent breakthroughs in Natural Language Processing (NLP) with models like BERT and GPT, and is rapidly transforming Computer Vision (CV).
Learning Paradigms Overview
Fig 1: Comparing Supervised, Unsupervised, and Self-Supervised Learning approaches.
Pretext Tasks: Creating Supervision from Data
The core of SSL lies in designing effective **pretext tasks**. These are self-supervised tasks solved not for their own sake, but to force the model to learn useful representations for downstream applications. The model minimizes a loss function defined by the pretext task (Formula 2: \( L_{pretext} \)). Examples vary across domains:
Computer Vision (CV):
- **Context Prediction:** Predicting the relative position of image patches.
- **Jigsaw Puzzles:** Predicting the correct permutation of shuffled image patches.
- **Rotation Prediction:** Predicting the rotation angle (e.g., 0°, 90°, 180°, 270°) applied to an image (a minimal code sketch follows this list).
- **Colorization:** Predicting the color version of a grayscale image.
- **Inpainting/Masking:** Reconstructing masked or missing parts of an image.
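To make the rotation pretext task concrete, here is a minimal PyTorch sketch, assuming a generic backbone that maps images to feature vectors; `rotate_batch`, `RotationPretextModel`, and `feat_dim` are illustrative names rather than a reference implementation.

```python
import torch
import torch.nn as nn

def rotate_batch(images):
    """Build the rotation pretext task: rotate each image by a random
    multiple of 90 degrees and use the rotation index (0-3) as the label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(-2, -1))
                           for img, k in zip(images, labels)])
    return rotated, labels

class RotationPretextModel(nn.Module):
    """Encoder f_theta followed by a 4-way linear head that predicts the rotation."""
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder               # any backbone producing (B, feat_dim) features
        self.head = nn.Linear(feat_dim, 4)   # one logit per rotation angle

    def forward(self, x):
        return self.head(self.encoder(x))

# Pretext training step (illustrative):
# rotated, labels = rotate_batch(images)
# loss = nn.functional.cross_entropy(model(rotated), labels)
```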
Natural Language Processing (NLP):
- **Masked Language Modeling (MLM):** Predicting randomly masked words in a sentence based on surrounding context (used in BERT). Objective: \( \max \sum \log P(x_{mask} | x_{unmask}) \) (Formula 3). A simplified masking sketch follows this list.
- **Next Sentence Prediction (NSP):** Predicting if two sentences are consecutive (used in original BERT, less common now).
- **Word2Vec (Skip-gram/CBOW):** Predicting context words given a center word, or vice versa (an early form of SSL).
- **Audio/Video:** Predicting future frames, determining if audio/video segments are temporally aligned, cross-modal prediction (e.g., predicting audio from video).
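As a rough sketch of how MLM inputs are corrupted, the helper below masks a fraction of token IDs and marks unmasked positions with the usual -100 ignore index; `mask_token_id` and the 15% rate are illustrative, and real BERT additionally keeps or randomly replaces a portion of the selected tokens.

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Corrupt inputs for an MLM pretext task.

    Returns the masked inputs and labels that keep the original id only at
    masked positions (-100 elsewhere, so the loss ignores unmasked tokens)."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                  # loss computed only on masked positions
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id       # replace selected tokens with [MASK]
    return corrupted, labels

# With per-token logits from a model:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```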
Example Pretext Tasks (CV)
Fig 2, 3, 4: Examples of pretext tasks used in computer vision for self-supervised learning.
Major SSL Paradigms
SSL methods can be broadly categorized into three main paradigms:
1. Generative Approaches
These methods involve learning to generate or reconstruct parts of the input data.
Autoencoders (AEs): Train an encoder \( f_\theta \) to map input \( x \) to a latent representation \( z \) and a decoder \( g_\phi \) to reconstruct \( x \) from \( z \). Often used in a denoising (\( \min ||x - g(f(x + \epsilon))||^2 \), Formula 4) or masked setup.
Masked Language Models (MLM): Models like BERT learn by predicting masked tokens in text based on unmasked context.
Masked Autoencoders (MAE) for Vision: Randomly mask a large portion of image patches and train an encoder-decoder (often Transformer-based) to reconstruct the pixel values of the masked patches. Loss: \( L = \sum_{i \in \text{Masked}} ||x_i - \hat{x}_i||^2 \) (Formula 5). Learns rich representations efficiently.
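The MAE loss in Formula 5 reduces to a mean squared error restricted to masked patches. Below is a minimal sketch assuming the model already emits per-patch reconstructions; tensor shapes and names are illustrative.

```python
import torch

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """MAE-style loss (Formula 5): MSE computed only over masked patches.

    pred_patches, target_patches: (B, N, D) flattened pixel values per patch
    mask: (B, N), 1 where the patch was hidden from the encoder, 0 otherwise
    """
    mask = mask.float()
    per_patch_mse = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch_mse * mask).sum() / mask.sum().clamp(min=1.0)
```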
2. Contrastive Approaches
These methods learn representations by contrasting pairs of samples. The goal is to pull representations of "positive" pairs (e.g., different augmented views of the same image) closer together in an embedding space, while pushing representations of "negative" pairs (e.g., views from different images) farther apart.
- **Core Idea:** Define a similarity function \( sim(u, v) \) (Formula 6), often cosine similarity \( \frac{u^T v}{||u|| ||v||} \) (Formula 7). Maximize similarity for positive pairs, minimize for negative pairs.
- **InfoNCE Loss:** A common contrastive loss function (also used in SimCLR as NT-Xent). For an anchor \( z_i \) and its positive \( z_j \), it tries to classify \( z_j \) correctly among a set of negative samples \( \{z_k\}_{k \neq i} \). Formula (8): \( L_{i,j} = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(sim(z_i, z_k)/\tau)} \), where \( \tau \) is a temperature hyperparameter (Formula 9) and the denominator sums over the positive pair (\(k=j\)) and N negative samples. A minimal implementation is sketched after this list.
- **SimCLR:** Uses strong data augmentation to create positive pairs within a large batch. Negative pairs are all other instances in the same batch. Uses a projection head (MLP) after the encoder.
- **MoCo (Momentum Contrast):** Addresses the need for large batches in SimCLR by maintaining a queue (\( K \)) of negative samples from previous batches. Uses a slowly evolving momentum encoder (\( \theta_k \)) for the keys to maintain consistency. Loss uses query \( q \) and positive key \( k_+ \). Formula (10): \( L_q = -\log \frac{\exp(q^T k_+ / \tau)}{\exp(q^T k_+ / \tau) + \sum_{k_- \in K} \exp(q^T k_- / \tau)} \). Momentum update: \( \theta_k \leftarrow m \theta_k + (1-m) \theta_q \) (Formula 11). Formula (12): Momentum \( m \). (The momentum and queue updates are sketched in code after Fig 8.)
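The following is a minimal sketch of the InfoNCE objective (Formula 8) for a batch of positive pairs, in a simplified one-directional form where each anchor's negatives are the other samples' second views; SimCLR's NT-Xent additionally symmetrizes over both views and excludes self-similarities.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE for a batch of positive pairs (z_a[i], z_b[i]).

    z_a[i] treats z_b[i] as its positive and every other z_b[k] (k != i)
    as a negative; the softmax denominator matches Formula 8."""
    z_a = F.normalize(z_a, dim=-1)            # cosine similarity via dot products
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)   # the positive sits on the diagonal

# z_a, z_b would be projection-head outputs of two augmented views of the same images.
```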
Contrastive Learning Concept
Fig 6: Contrastive learning pulls positive pairs together and pushes negative pairs apart in the embedding space.
Fig 8: Simplified MoCo: Uses a query encoder, a momentum key encoder, and a queue for negative samples.
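For MoCo, the two pieces beyond the contrastive loss are the momentum (EMA) update of Formula 11 and the FIFO queue of keys. A minimal sketch under those assumptions follows; the official implementation tracks the queue with a pointer rather than concatenation, so these helper names are illustrative.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """Formula 11: theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def update_queue(queue, new_keys):
    """FIFO queue of negative keys: drop the oldest entries, append the newest."""
    return torch.cat([queue[new_keys.size(0):], new_keys], dim=0)
```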
3. Non-Contrastive Approaches
These methods learn representations by maximizing similarity between positive pairs only, avoiding the need for explicit negative samples. They employ specific architectural designs or regularization to prevent representational collapse (where the model outputs the same constant vector for all inputs).
- **BYOL (Bootstrap Your Own Latent):** Uses two networks: an online network (\( \theta \)) and a target network (\( \xi \)). The online network predicts the target network's representation of a different augmented view of the same image. Loss: \( L_{\theta, \xi} \propto || \bar{q}_\theta(z_1) - \bar{p}_{\xi}(z'_2) ||_2^2 \) (Formula 13), where \( \bar{q}, \bar{p} \) are normalized predictions/projections (Formula 14). Crucially, the target network's weights \( \xi \) are updated as an exponential moving average (momentum update) of the online network's weights \( \theta \). Formula (15): \( \xi \leftarrow \tau \xi + (1-\tau) \theta \). This asymmetry prevents collapse. Formula (16): Target decay rate \( \tau \).
- **SimSiam (Simple Siamese):** Uses a simpler Siamese architecture (two identical encoders). It maximizes the cosine similarity between the prediction \( p_1 = h(f(x_1)) \) from one view and the encoded representation \( z_2 = f(x_2) \) from the other view, using a crucial stop-gradient operation on \( z_2 \). Loss: \( L = D(p_1, z_2)/2 + D(p_2, z_1)/2 \), where \( D(p, z) = - \frac{p^T z}{||p||_2 ||z||_2} \) (Formula 17). The stop-gradient prevents collapse by making the optimization problem asymmetric. Formula (18): stop_gradient. (This loss and the Barlow Twins loss are sketched in code after this list.)
- **Barlow Twins:** Aims to make the cross-correlation matrix \( C \) between the embeddings of two augmented views (\(Z^A, Z^B\)) as close as possible to the identity matrix. This encourages invariance (diagonal terms close to 1) and redundancy reduction (off-diagonal terms close to 0). Loss: \( L_{BT} \propto \sum_i (1 - C_{ii})^2 + \lambda \sum_{i \neq j} C_{ij}^2 \) (Formula 19). Formula (20): Cross-correlation matrix \( C \).
- **VICReg (Variance-Invariance-Covariance Regularization):** Pursues a similar goal to Barlow Twins but explicitly optimizes three terms: Invariance (MSE between embeddings), Variance (maintaining variance along each dimension to prevent collapse), and Covariance (decorrelating different dimensions). Formula (21): \( L_{VICReg} = \lambda S(Z^A, Z^B) + \mu I(Z^A, Z^B) + \nu C(Z^A, Z^B) \). Formulas (22, 23, 24): Variance \( S \), Invariance \( I \), Covariance \( C \) terms.
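To illustrate two of the non-contrastive objectives above, here is a hedged sketch of the SimSiam loss (Formula 17, with `detach()` playing the role of stop-gradient) and the Barlow Twins loss (Formula 19); the λ value and the batch-standardization details are illustrative defaults, not the papers' exact settings.

```python
import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    """D(p, z) from Formula 17; detach() implements the stop-gradient on z."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized SimSiam loss: D(p1, z2)/2 + D(p2, z1)/2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Formula 19: push the cross-correlation of the two views toward identity."""
    batch_size = z_a.size(0)
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)     # standardize each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.t() @ z_b / batch_size             # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambd * off_diag
```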
Non-Contrastive: BYOL Architecture
Fig 9 & 10: BYOL uses asymmetric online/target networks with momentum updates to avoid collapse without negative samples. SimSiam uses stop-gradient.
Comparison of SSL Paradigms
| Paradigm | Core Idea | Examples | Pros | Cons |
|---|---|---|---|---|
| Generative | Reconstruct/predict masked/corrupted input | MAE, BERT, Denoising AEs | Learns density/details, good for generation | Can focus too much on low-level details, might not learn high-level semantics as well |
| Contrastive | Pull positive pairs together, push negative pairs apart in embedding space | SimCLR, MoCo | | Needs careful negative sampling/large batches, sensitive to augmentations |
| Non-Contrastive | Maximize similarity of positives, use tricks to avoid collapse | BYOL, SimSiam, Barlow Twins, VICReg | Avoids need for negative samples, conceptually simpler loss (sometimes) | Mechanisms preventing collapse can be subtle/complex, performance sensitive to architecture/regularization |
Applications and Impact
Pre-training Large Models: SSL is the standard for pre-training large foundation models in NLP (BERT, GPT, RoBERTa) and increasingly in CV (ViT variants, MAE, SimCLR pre-trained models). This allows models to learn general language/visual understanding from vast unlabeled web-scale data.
Improved Downstream Performance: Models pre-trained with SSL often achieve state-of-the-art results when fine-tuned on downstream tasks (classification, detection, segmentation) with limited labeled data. Linear probing (training a linear classifier on frozen SSL features) is a common evaluation protocol; a minimal linear-probe setup is sketched after this list.
Representation Learning: SSL learns powerful, compressed representations of data that capture essential semantic information.
Domain Adaptation: Can help adapt models to new domains with unlabeled data.
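As a sketch of the linear-probing protocol mentioned above, the helper below freezes a pretrained encoder and trains only a linear classifier on top of its features; the optimizer choice and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder, feat_dim, num_classes):
    """Linear probing: freeze the SSL-pretrained encoder, train only a linear head."""
    for p in encoder.parameters():
        p.requires_grad = False              # keep the learned representation fixed
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    return head, optimizer

# One probing step (illustrative):
# with torch.no_grad():
#     feats = encoder(images)
# loss = nn.functional.cross_entropy(head(feats), labels)
```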
Challenges and Future Directions
Designing Pretext Tasks/Augmentations: The effectiveness of SSL heavily depends on the design of the pretext task or data augmentations. Poor choices can lead to learning irrelevant features. (A typical augmentation pipeline is sketched in code after this list.)
Computational Cost: Pre-training large models on massive unlabeled datasets requires significant computational resources (GPUs/TPUs).
Collapse Prevention: Non-contrastive methods require careful architectural design (asymmetry, stop-gradients) or regularization (Barlow Twins, VICReg) to avoid trivial solutions. Understanding these mechanisms is ongoing research.
Evaluation: Assessing the quality of learned representations without downstream task evaluation is difficult. Linear probing is common but may not fully reflect representation quality.
Bias Amplification: Models pre-trained on large, uncurated datasets can inadvertently learn and amplify societal biases present in the data.
Theoretical Understanding: While empirically successful, a deeper theoretical understanding of why certain SSL methods work so well (especially non-contrastive ones) is still developing.
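Because augmentation design is such a sensitive choice, a SimCLR-style pipeline is sketched below using torchvision; the crop scale, jitter strengths, and blur kernel are illustrative defaults rather than a prescribed recipe.

```python
import torchvision.transforms as T

# A SimCLR-style augmentation pipeline; each image is transformed twice to
# produce the two "views" used as a positive pair.
ssl_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

# view1, view2 = ssl_augment(img), ssl_augment(img)
```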
Future directions include developing more efficient SSL methods, creating pretext tasks that capture higher-level reasoning, combining SSL paradigms, applying SSL to more modalities (graphs, tabular data), and ensuring fairness and robustness in learned representations.
Other Pretext Tasks (Illustrative)
Fig 11 & 12: Examples of other pretext tasks like colorization and relative patch prediction.
Downstream Task Adaptation
Fig 13: Using SSL-learned representations for downstream tasks via fine-tuning or linear probing.
Conclusion
Self-Supervised Learning has emerged as a transformative force in machine learning, offering a powerful way to learn rich data representations without relying on expensive human annotations. By cleverly designing pretext tasks that leverage the inherent structure of data, generative, contrastive, and non-contrastive SSL methods enable the pre-training of large, versatile models that excel on downstream tasks even with limited labeled data. While challenges in task design, computational cost, and evaluation remain, SSL has fundamentally changed the landscape, particularly in NLP and computer vision, making it possible to harness the vast amounts of unlabeled data available in the world. As research continues, we can expect even more efficient, robust, and versatile SSL approaches, further reducing our dependence on labeled data and pushing the frontiers of AI.
About the Author, Architect & Developer
Loveleen Narang is an accomplished leader and visionary in Data Science, Machine Learning, and Artificial Intelligence. With over 20 years of expertise in designing and architecting innovative AI-driven solutions, he specializes in harnessing advanced technologies to address critical challenges across industries. His strategic approach not only solves complex problems but also drives operational efficiency, strengthens regulatory compliance, and delivers measurable value—particularly in government and public sector initiatives.
Renowned for his commitment to excellence, Loveleen’s work centers on developing robust, scalable, and secure systems that adhere to global standards and ethical frameworks. By integrating cross-functional collaboration with forward-thinking methodologies, he ensures solutions are both future-ready and aligned with organizational objectives. His contributions continue to shape industry best practices, solidifying his reputation as a catalyst for transformative, technology-led growth.