Ever get stuck on a docstring? I did, with Flash Attention's `flash_attn_varlen_func` and its causal mask! This video is my deep dive into figuring out that causal mask's "bottom-right alignment," how KV caching in HuggingFace (`DynamicCache`) really works during `model.generate()`, and what it all means when the Q and K sequence lengths differ.
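For reference, here's roughly what that "bottom-right alignment" works out to when Q and K lengths differ. This is a minimal plain-PyTorch sketch of the mask the docstring describes, not Flash Attention's actual code; the function name and the `p_i`/`p_j` variables are my own illustration of the "unified timeline" idea:

```python
import torch

def bottom_right_causal_mask(q_len: int, k_len: int) -> torch.Tensor:
    # Put queries and keys on one "unified timeline": key j sits at
    # position p_j = j, and because the q_len queries are the *newest*
    # tokens, query i sits at p_i = i + (k_len - q_len). A query may
    # attend to a key only when p_j <= p_i, which pins the causal
    # diagonal to the bottom-right corner of the (q_len, k_len) mask.
    p_i = torch.arange(q_len).unsqueeze(1) + (k_len - q_len)  # (q_len, 1)
    p_j = torch.arange(k_len).unsqueeze(0)                    # (1, k_len)
    return p_j <= p_i  # True = allowed to attend

# q_len=2 new tokens against k_len=5 keys (3 cached + 2 new):
print(bottom_right_causal_mask(2, 5).int())
# tensor([[1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```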
Join me as I use monkey-patching and hooks on a Llama-style model (SmolLM2-135M) to inspect Q, K, V shapes live, trace the attention calls (spoiler: `flash_attn_func` was the star!), and finally connect it all back to that tricky causal mask example. If you want to see a real-time debugging journey into the guts of attention mechanisms and the KV cache, this one's for you! We'll even get into the "unified timeline" (`p_i`, `p_j`) concept to make sense of it all.
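And here's the flavor of the monkey-patching trick: wrap `flash_attn_func` with a logging shim and watch Q/K/V shapes print while `model.generate()` runs. A rough sketch, assuming a recent transformers where `flash_attn_func` is imported into `transformers.modeling_flash_attention_utils` (the exact import site varies by version) and flash-attn is installed on a CUDA GPU:

```python
import torch
import transformers.modeling_flash_attention_utils as fa_utils
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap in a logging wrapper at the name the modeling code calls.
# NOTE: patching this internal import site is version-dependent.
_orig_flash_attn_func = fa_utils.flash_attn_func

def logged_flash_attn_func(q, k, v, *args, **kwargs):
    # Tensors arrive as (batch, seqlen, n_heads, head_dim). During cached
    # decoding, q's seqlen is 1 while k/v carry the full cache length.
    print(f"q={tuple(q.shape)} k={tuple(k.shape)} v={tuple(v.shape)}")
    return _orig_flash_attn_func(q, k, v, *args, **kwargs)

fa_utils.flash_attn_func = logged_flash_attn_func

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")

inputs = tok("Hello there", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=3)  # shapes print per layer, per step
```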
Blog post: vishalbakshi.github.io/blog/posts/2025-06-03-flash…