@InsideOfMyOwnMind

Here's my two-minute-old favorite: "Never ask a computer a stupid question, because the stupid answer you get back will sound so legit that you won't ever find out that you're stupid."
I think this was the most fun episode I've ever seen from you. I hope it blows up, because it deserves to.

@myleft9397

He's wearing a jacket. He must be serious this time.

@GlenHHodges

Dave hit 1,000,000 subscribers! Thanks to everyone who is part of that group!

@pokeyhollis

Really enjoy your videos, Dave. Keep up the great work!

@Michael-i4y8y

Impressive demo on RAM usage! I used a product called AICarma that lets you compare different models (both local, like DeepSeek, and cloud, like Claude).

@TheBattleRabbit860

This was another awesome video, Dave; thanks for putting this knowledge out there. I work with generative AI software, mostly Stable Diffusion from Stability AI and some Flux.1 from Black Forest Labs. A lot of people think you need an RTX 4090/5090 to even run anything. When I tell them they can produce quality images with as little as 4 GB of VRAM, they don't believe me. One thing I've hated, though, is the monopoly Nvidia has had. Thankfully, there are versions of various programs that support AMD (like Stable Diffusion web-UI Forge-AMD), and others are offering native support for AMD, Intel, and Apple silicon now.

@marcas6876

Just my 50 cents: the card depicted in the video is a P4 (not a P40, as stated). The P40 provides 24 GB, is a two-slot card, and requires an additional 8-pin power connector. The P4 only provides 8 GB, is a single-slot card, and is powered directly by PCIe without the need for additional power.

@cpspot

Dave, I love your videos, including this one. Thank you. I think there may have been a technical oversight regarding the root cause of the poor performance mentioned between 6:09 and 6:41. You could confirm by using btop to watch CPU and RAM utilization at the same time as GPU utilization. Most likely the model didn't fit completely in VRAM, so Ollama also loaded the model into system RAM and used both the GPU and CPU for inference, giving performance significantly lower than pure-GPU inference would have. I hope this helps, and keep the great videos coming. Thank you!!
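
If you want to sanity-check that on your own box, here's a rough Python sketch (assuming the nvidia-ml-py / pynvml package is installed; the model path and the 20% headroom for KV cache and CUDA overhead are placeholder assumptions, not exact figures):

import os
import pynvml  # pip install nvidia-ml-py

MODEL_PATH = "/models/llama-3-8b-q4_k_m.gguf"  # hypothetical GGUF path, adjust for your setup
HEADROOM = 1.2  # rough allowance for KV cache and CUDA overhead (an assumption)

model_bytes = os.path.getsize(MODEL_PATH)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # .total / .free / .used, in bytes
pynvml.nvmlShutdown()

print(f"Model on disk: {model_bytes / 2**30:.1f} GiB")
print(f"Free VRAM:     {mem.free / 2**30:.1f} GiB")
if model_bytes * HEADROOM > mem.free:
    print("Likely to spill: expect partial CPU offload and much lower tokens/sec.")
else:
    print("Should fit entirely in VRAM.")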

@kenoakes6725

Dave, you just changed my life. I feel like I was just introduced to computers for the first time, like back in 1983!

@ADTinman

That million-subscriber YouTube button is just a few hours away!!!! Also, as an example: I only have a 16 GB video card, but using LM Studio it's easy to assign layers to the GPU; then, with the large cache of an X3D CPU, swapping layers on the CPU side (with 64 GB of 6000 MT/s CL30 RAM) is fairly quick. That gives me a "decent" generation speed of around 5-10 tokens per second on model sizes up to around 24-27 GB, and using quant 6 or (sometimes) quant 8 allows for models with larger parameter counts than would otherwise fit. I can use even larger models, but then it's a matter of going to refill a cup of coffee and drinking it before looking at the prompt response. Maybe useful for coding, but otherwise not acceptable.
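
For anyone who wants to do the same back-of-envelope math, here's a tiny Python sketch; the bits-per-weight figures for the GGUF quants are rough approximations, and the parameter counts are just example values:

# Rough model-size estimate: parameters * bits-per-weight / 8.
# Bits-per-weight values below are approximations, not exact GGUF numbers.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def model_size_gb(params_billions: float, quant: str) -> float:
    # Approximate on-disk / in-VRAM size in GB for a quantized model.
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for params in (14, 24, 32):                    # example parameter counts, in billions
    for quant in ("Q4_K_M", "Q6_K", "Q8_0"):
        print(f"{params}B @ {quant}: ~{model_size_gb(params, quant):.1f} GB")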

@zhouly

Nice video; appreciate you taking the time to make it. But I'm certain Ollama supports loading models across multiple GPUs. My rig is an EPYC 7413 with 3x RTX 3090 and an RTX 4070 Ti Super, for a total of 88 GB of VRAM. I've run DeepSeek-R1 70B on it, and even Llama 4 Scout with 109B parameters. nvtop does show the models nicely distributed across all the GPUs' VRAM when the model size requires it.
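
If you'd rather script the check than eyeball nvtop, a minimal Python sketch using pynvml (assuming nvidia-ml-py is installed) that prints per-GPU memory use while a model is loaded might look like:

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
total_used = 0
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # bytes
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):                    # older pynvml versions return bytes
        name = name.decode()
    print(f"GPU {i} ({name}): {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
    total_used += mem.used
print(f"Total VRAM in use across GPUs: {total_used / 2**30:.1f} GiB")
pynvml.nvmlShutdown()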

@flashwashington2735

Impressive explanation and modeling. Thank you.

@VicenteOcanaplus

IMHO it is not the speed, but the quality of the output that defines my appreciation of an LLM's capability. I'd love to see a video evaluating just that: a summary of a long text, the quality of the instructions for creating a Whisper-capable web-based app, the quality of expert advice for a healthy one-week eating program... I am convinced there is a place for general use of those 8-14B models, but I still haven't seen a proper demonstration.

@TaylorWheeler

Thanks, great video. A note for anyone who cares: yes, these figures are accurate, but the second you add a large context window, (a) the VRAM needed increases by a lot, and (b) the tokens per second drop, depending on how much context you have loaded.
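
To put rough numbers on (a): the KV cache grows linearly with context length. A minimal Python sketch, using Llama-3-8B-style dimensions as an example (32 layers, 8 KV heads, head dim 128, fp16 cache); your model's numbers will differ:

def kv_cache_gib(ctx_len: int,
                 n_layers: int = 32,      # example: Llama-3-8B-ish dimensions
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_val: int = 2   # fp16 cache entries
                 ) -> float:
    # Approximate KV-cache size in GiB for a given context length.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
    return ctx_len * per_token / 2**30

for ctx in (2_048, 8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.2f} GiB of KV cache")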

@MrCOPYPASTE

Almost 1 million subs!!!!

@alastairross9628

Very dapper, Dave.  Always enjoy your videos.

@shinorix278

Congrats on 1M subscribers :)

@wayfarerzen

Great video, helped me a lot, and congrats on 1 million!

@realMrTibby

Pretty sure you meant the Tesla P4, bud, as the Tesla P40 cards are physically large and have 24 GB of VRAM. Also, yes, Ollama can use multiple cards as one. I have a Dell server set up that way now with 2x GPUs passed through to a Debian 12 VM (Proxmox). nvidia-smi shows both cards, but Ollama uses them as one.

@danispringer

First video of yours I have watched. You, sir, must have a story to tell! But for now: thanks for a great video!

-dani