The YouTube algo identified me as a gay, blind Wolfram Alpha user. Awesome video
7:40 - 7:55 made me laugh so hard I cried. Completely unexpected from the voice tone so far, incredible work.
Hey wait, I’m that music guy! Great video
As a computer vision engineer, I develop and train AI models daily, and use SGD and Adam. This video was really good, loved the visualizations. I wish I'd seen this kind of stuff during my studies :)
This video is amazingly well presented and informative. Love your humour, mate. That line at 14:14 "We start by using the derivative law known as 'Use Wolfram Alpha'" slayed me. 🤣
This is awesome! I am a long-time "studier" of ANNs and this is the coolest representation of finding the minimum I have seen.
The content quality of this guy is incredible
I am an evolutionary algorithm researcher. As you mention in the video, non-continuous problems are the real challenge for neural networks, while evolutionary algorithms are slow when you have hundreds of thousands of parameters. Well... we are actually working on hybrid architectures. On combinatorial optimization problems it is starting to look quite nice.
I've also thought about re-investigating evolution versus gradient descent myself. A few notes that I have:
- In some ways evolutionary strategies can be more memory efficient than gradient-based optimizers; the gradients take up no small amount of space. This is notable if you're in a situation where you can't take advantage of modern optimized algorithms like flash attention or cut cross-entropy, etc. You might think that saving a full copy of each different variant of the model is too expensive, but you can also store it as a type of noise, or as a difference of weights, which isn't nearly as bad.
- ES (evolutionary strategies) struggle with dimensionality... but there are different ways to parameterize neural networks. You could imagine breaking the weights down with something like a discrete wavelet transform and doing evolution on the signs at each scale, for instance. Or you could imagine producing hierarchical "prefixes" to the weights: at the base level you have all W parameters; divide them into two chunks of W/2, then divide those in half again, for four chunks of W/4, and so on. You could apply a small perturbation at the W/1 level and keep the best performer, scaling all the weights slightly, then apply a small perturbation at the W/2 level, so each half gets a slightly different perturbation, and so on (rough sketch below). This gets really powerful if you think about augmenting each scale with, e.g., momentum, or even applying Adam to it... not gradient descent, just the first and second moments that made Adam so good, because...
- A lot of the things that made optimizers better also apply to ES. There's no reason you can't have momentum, or an Adam-like optimizer, in ES; it's just that people kind of gave up on them.
- Some design strategies that worked well for GD weren't necessarily optimal. For instance, tanh was state of the art for language models for a long time, but as soon as you have to propagate through more than about 6 layers it falls off in performance. I suspect there are a lot of little decisions like this that we made to optimize language models which would play out differently for ES.
- Not all changes need to be made to the entire parameter space. You could imagine evolving which noise functions get applied at which steps, so that you have fewer things to optimize. In general, anything that keeps your dimensionality to about 100 or so tends to get good results with ES.
There's a lot more that actual researchers have done to make ES good, but these are just some observations I've had over time.
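To make the hierarchical "prefix" idea a bit more concrete, here's the rough sketch I mentioned, in NumPy. The toy target, the scales, the population size, and the momentum coefficient are all invented for illustration; this isn't from the video or any real ES library.

```python
# Rough sketch of the hierarchical "prefix" perturbation idea from the list
# above, plus a momentum-like term built from accepted updates. The toy
# target, scales, population size, and coefficients are all invented for
# illustration; this is not from the video or any real ES library.
import numpy as np

rng = np.random.default_rng(0)

D = 1024
# toy target that happens to be constant on blocks of 128, so coarse,
# chunk-level offsets are actually enough to fit it
target = np.repeat(rng.normal(size=8), D // 8)

def loss(w):
    return np.mean((w - target) ** 2)

w = np.zeros(D)
velocity = np.zeros(D)          # no gradients, but we can still smooth accepted updates
beta, sigma, pop = 0.9, 0.1, 8

for step in range(200):
    for n_chunks in (1, 2, 4, 8):                     # coarse -> fine: W/1, W/2, W/4, W/8
        chunks = np.array_split(np.arange(D), n_chunks)
        best_w, best_loss = w, loss(w)
        for _ in range(pop):
            cand = w.copy()
            for idx in chunks:
                cand[idx] += sigma * rng.normal()     # one shared offset per chunk
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best_w, best_loss = cand, cand_loss
        # momentum step: reuse the direction of previously accepted changes
        velocity = beta * velocity + (best_w - w)
        nudged = best_w + beta * velocity
        w = nudged if loss(nudged) < best_loss else best_w
    sigma *= 0.99                                     # shrink the search radius over time

print("final loss:", loss(w))   # should end up far below the starting loss
```

The point isn't the toy problem; it's that the search at each scale only has a handful of degrees of freedom, which is exactly the regime where ES tends to behave well.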
I'm at a loss for words on how good the production quality and humor are.
Absolutely unreal production quality! Content like this deserves so much more attention than any generic LLM-hype "AGI is near" YouTuber.
Absolutely crazy video quality, awesome man
This is one of the better videos I've watched on NN - great job breaking it down into something anyone can understand.
All these networks have a fixed architecture. Evolutionary algorithms can alter the architecture and the activation function as well as the weights. Suppose the choice of activation function is part of the parameter space, and the sine function is an option. Then you might find an approximation that generalizes outside the training domain. A typical MLP is going to fail outside the training domain because it is a piecewise function with finitely many pieces, while the real domain is infinite. Most artificial networks today are feedforward without memory, but a network with feedback is more computationally powerful. Natural neural networks use feedback and can be extremely expressive with very few parameters. Look at the nematode connectome, for instance.
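To illustrate the extrapolation point with a toy example (random, untrained weights; the sizes and ranges are arbitrary and not from the video):

```python
# Toy illustration of the extrapolation point above: a ReLU MLP is a finite
# piecewise-linear function, so far outside its training range it collapses to
# a single affine piece, while a sine activation keeps oscillating. Weights
# here are random and untrained, purely to show the shape of the problem.
import numpy as np

rng = np.random.default_rng(1)

# one hidden layer, 16 units, scalar input and output
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def mlp(x, act):
    h = act(W1 @ np.atleast_2d(x) + b1[:, None])
    return (W2 @ h + b2[:, None]).ravel()

def relu(z):
    return np.maximum(z, 0.0)

x_far = np.linspace(1000.0, 1010.0, 50)   # far outside any plausible training range
relu_out = mlp(x_far, relu)
sin_out = mlp(x_far, np.sin)

# second differences near zero mean the function is locally just one affine piece
print("ReLU curvature out here:", np.abs(np.diff(relu_out, 2)).max())  # ~ 0
print("sine curvature out here:", np.abs(np.diff(sin_out, 2)).max())   # clearly nonzero
```

Past its last kink the ReLU network is literally a single line, so any periodic structure is lost, whereas a sine activation at least keeps oscillating.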
This is the most straightforward, densely packed, enjoyable explanation on NNs I've ever watched, thank you my dude
I swear you are my favorite AI channel. Widely underrated, although I'm glad you got some recognition since you started! :)
Your visualizations, and the clarity with which you describe the material, are a gift. Thank you.
Very nice :-) I remember seeing your submission to SoME3 and that was a nice watch as well. Thank you for making these! I really like how you focus on small networks to be able to visualize the parameter space. I use a similar approach in my 'understanding AI' course, where we build small neural networks by hand so we get a better understanding of how they make decisions. I used the step function in that one, and when optimizing automatically, I used a genetic approach, so hearing you'll focus on that in a future video makes me very excited. Keep up the good work :-)
The concept of a "loss landscape" is the most fascinating thing I've ever heard. I really can't put into words how interesting that idea is to me.