In this video, we discuss the fundamentals of model quantization, the technique that makes it practical to run inference on massive LLMs such as DeepSeek-R1 or Qwen.
Among other things, we'll discuss:
⚆ What quantization really means (hint: it’s more than just rounding)
⚆ Why integers are faster than floats (with a deep dive into their internal structure)
⚆ How quantization preserves model accuracy
⚆ When to quantize: after training vs. during training (PTQ vs. QAT)
⚆ A hands-on explanation of scale, zero point, clipping ranges, and fixed-point math (see the sketch below)
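
As a small taste of the scale / zero-point mechanics covered in the video, here is a minimal, self-contained Python sketch of asymmetric uint8 quantization. The function names and the simple min/max clipping range are illustrative assumptions, not code taken from the video:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Asymmetric affine quantization of a float tensor to unsigned integers.

    Assumes x.max() > x.min(); the clipping range is simply the tensor's min/max.
    """
    qmin, qmax = 0, 2**num_bits - 1
    x_min, x_max = x.min(), x.max()
    # Scale maps the float range onto the integer range;
    # the zero point is the integer that represents the float value 0.0.
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(5).astype(np.float32)
q, s, z = quantize(x)
print(x)
print(dequantize(q, s, z))  # close to x, up to the quantization error
```

Round-tripping through dequantize recovers the original values only up to the step size given by the scale, which is exactly the accuracy trade-off the video explores.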
If you enjoyed this, consider subscribing for upcoming videos on:
⚆ Post-training quantization (PTQ)
⚆ Quantization-aware training (QAT)
⚆ Training in low precision (e.g., FP4)
⚆ 1-bit LLMs
#Quantization #MachineLearning #AIOptimization #LLM #NeuralNetworks #QAT #PTQ #DeepLearning #EdgeAI #FixedPoint #BFloat16 #TensorRT #ONNX #AIAccelerators
00:00 Intro
00:50 What
02:10 Why
03:50 Integer vs floating point formats
06:45 When
09:21 How
14:40 Fixed point arithmetic
18:00 Matrix multiplications
20:07 Outro