Ever get stuck on a docstring? I did, with Flash Attention's `flash_attn_varlen_func` and its causal mask! This video is my deep dive into figuring out that causal mask's "bottom-right alignment," how KV caching in HuggingFace (`DynamicCache`) really works during `model.generate()`, and what it all means when the Q and K sequence lengths differ.
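For reference, here's roughly what that "bottom-right alignment" works out to when Q and K lengths differ. This is a minimal plain-PyTorch sketch of the mask the docstring describes, not Flash Attention's actual code; the function name and the `p_i`/`p_j` variables are my own illustration of the "unified timeline" idea:

```python
import torch

def bottom_right_causal_mask(q_len: int, k_len: int) -> torch.Tensor:
    # Put queries and keys on one "unified timeline": key j sits at
    # position p_j = j, and because the q_len queries are the *newest*
    # tokens, query i sits at p_i = i + (k_len - q_len). A query may
    # attend to a key only when p_j <= p_i, which pins the causal
    # diagonal to the bottom-right corner of the (q_len, k_len) mask.
    p_i = torch.arange(q_len).unsqueeze(1) + (k_len - q_len)  # (q_len, 1)
    p_j = torch.arange(k_len).unsqueeze(0)                    # (1, k_len)
    return p_j <= p_i  # True = allowed to attend

# q_len=2 new tokens against k_len=5 keys (3 cached + 2 new):
print(bottom_right_causal_mask(2, 5).int())
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```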
Join me as I use monkey-patching and hooks on a Llama-style model (SmolLM2-135M) to inspect Q, K, V shapes live, trace the attention calls (spoiler: `flash_attn_func` was the star!), and finally connect it all back to that tricky causal mask example. If you want to see a real-time debugging journey into the guts of attention mechanisms and the KV cache, this one's for you! We'll even get into the "unified timeline" (`p_i`, `p_j`) concept to make sense of it all.
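And here's the flavor of the monkey-patching trick: wrap `flash_attn_func` with a logging shim and watch Q/K/V shapes print while `model.generate()` runs. A rough sketch, assuming a recent transformers where `flash_attn_func` is imported into `transformers.modeling_flash_attention_utils` (the exact import site varies by version) and flash-attn is installed on a CUDA GPU:

```python
import torch
import transformers.modeling_flash_attention_utils as fa_utils
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap in a logging wrapper at the name the modeling code calls.
# NOTE: patching this internal import site is version-dependent.
_orig_flash_attn_func = fa_utils.flash_attn_func

def logged_flash_attn_func(q, k, v, *args, **kwargs):
    # Tensors arrive as (batch, seqlen, n_heads, head_dim). During cached
    # decoding, q's seqlen is 1 while k/v carry the full cache length.
    print(f"q={tuple(q.shape)} k={tuple(k.shape)} v={tuple(v.shape)}")
    return _orig_flash_attn_func(q, k, v, *args, **kwargs)

fa_utils.flash_attn_func = logged_flash_attn_func

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

inputs = tok("Hello there", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=3)  # shapes print per layer, per step
```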
Blog post: vishalbakshi.github.io/blog/posts/2025-06-03-flash…