Gabriel Mongaras

TMLR: On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective (41:42)
Hierarchical Reasoning Models (42:04)
Energy-Based Transformers are Scalable Learners and Thinkers (39:07)
Fast and Simplex: 2-Simplicial Attention in Triton (39:20)
Hardware-Efficient Attention for Fast Decoding (40:58)
ATLAS: Learning to Optimally Memorize the Context at Test Time (59:58)
Coding Stable Diffusion 3 From Scratch (2:07:02)
Intro to Attention and Its Forms (2:13:01)
RWKV-7 "Goose" with Expressive Dynamic State Evolution (47:19)
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (29:34)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (40:08)
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models (28:26)
DeepSeek-V3 (1:21:39)
Coding Stable Diffusion 3 (Imagenet 2012) (2:21:07)
Titans: Learning to Memorize at Test Time (59:24)
MiniMax-01: Scaling Foundation Models with Lightning Attention (48:21)
Memory Layers at Scale (46:17)
Byte Latent Transformer: Patches Scale Better Than Tokens (45:05)
Scaling up Masked Diffusion Models on Text (40:03)
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (25:22)
Round and Round We Go! What makes Rotary Positional Encodings useful? (32:31)
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt (1:13:10)
Attending to Topological Spaces: The Cellular Transformer (42:25)
Learning to (Learn at Test Time): RNNs with Expressive Hidden States (35:52)
WARP: On the Benefits of Weight Averaged Rewarded Policies (52:39)
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing (28:52)
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through SSS Duality (1:14:43)
CoPE - Contextual Position Encoding: Learning to Count What's Important (38:55)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (45:48)
xLSTM: Extended Long Short-Term Memory (43:26)
KAN: Kolmogorov-Arnold Networks (37:09)
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (30:07)
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction (37:00)
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (32:49)
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (40:14)
Q* AGI Achieved (Apr Fools) (4:54)
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (1:02:30)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (37:08)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet (46:25)
DoRA: Weight-Decomposed Low-Rank Adaptation (31:15)
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers (1:02:38)
A Decoder-only Foundation Model For Time-series Forecasting (33:55)
Lumiere: A Space-Time Diffusion Model for Video Generation (37:30)
Exphormer: Sparse Transformers for Graphs (28:56)
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (25:56)
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution (40:23)
Cached Transformers: Improving Transformers with Differentiable Memory Cache (29:38)
Translatotron 3: Speech to Speech Translation with Monolingual Data (39:02)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (44:02)
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference (47:32)
Adversarial Diffusion Distillation (28:39)
Unsupervised Discovery of Semantic Latent Directions in Diffusion Models (40:51)
DALL-E 3 - Improving Image Generation with Better Captions (18:45)
LRM: Large Reconstruction Model for Single Image to 3D (38:18)
CodeFusion: A Pre-trained Diffusion Model for Code Generation (30:46)
Matryoshka Diffusion Models Explained (22:14)
UniAudio: An Audio Foundation Model Toward Universal Audio Generation (36:04)
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (57:43)
StreamingLLM - Efficient Streaming Language Models with Attention Sinks Explained (33:27)
FreeU: Free Lunch in Diffusion U-Net Explained (28:51)
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation Explained (26:26)
Llama/Wizard LM Finetuning with Huggingface on RunPod (50:20)
2x Faster Language Model Pre-training via Masked Structural Growth (50:14)
Bayesian Flow Networks (BFN) Explained (53:53)
WizardLM: Empowering Large Language Models to Follow Complex Instructions Explained (33:54)
From Sparse to Soft Mixtures of Experts Explained (43:59)
BK-SDM: Architecturally Compressed Stable Diffusion for Efficient T2I Generation Explained (42:16)
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained (36:25)
Universal and Transferable Adversarial Attacks on Aligned Language Models Explained (31:51)
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis Explained (45:45)
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations Explained (47:16)
ReLoRA: Stack More Layers Differently: High-Rank Training Through Low-Rank Updates Explained (35:57)
MiniLLM: Knowledge Distillation of Large Language Models (43:49)
RetNet: A Successor to Transformer for Large Language Models Explained (1:09:57)
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Explained (54:21)
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for LLMs Explained (39:17)
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Explained (1:00:14)
LongNet: Scaling Transformers to 1,000,000,000 Tokens Explained (37:21)
Extending Context Window of Large Language Models via Positional Interpolation Explained (29:17)
RoFormer: Enhanced Transformer with Rotary Position Embedding Explained (39:52)
RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation Explained (37:47)
MusicGen: Simple and Controllable Music Generation Explained (43:15)
Encodec: High Fidelity Neural Audio Compression Explained (52:55)
QLoRA: Efficient Finetuning of Quantized LLMs Explained (29:00)
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold Explained (35:18)
Stable/Latent Diffusion - High-Resolution Image Synthesis with Latent Diffusion Models Explained (44:05)
LoRA: Low-Rank Adaptation of LLMs Explained (27:19)
Align your Latents - High-Resolution Video Synthesis Explanation (36:16)
Attention Is All You Need Explanation (1:10:42)
ViT: An Image is Worth 16x16 Words Explained (37:19)
Talking To My AI Girlfriend (15:33)