@jeremyseay

I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

@szymoniak75

DISCLAIMER FOR ALL BEGINNER PROGRAMMERS!
Do not attempt it at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll notice speedup in your code by using correct algorithms and data structures, not by doing micro-optimizations. And if speed is your concern - benchmark first! Do not try optimizing code that may not need it.
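
To make the "benchmark first" advice concrete, here is a minimal sketch (my own illustration, not code from the video) that times an if-style min against a multiply-style "branchless" min on the same random data. The names min_branch, min_branchless and time_sum are made up; build with optimizations (e.g. gcc -O2) and treat the numbers only as a rough guide.

```c
/* Minimal "benchmark first" sketch: measure before deciding anything is slow. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int min_branch(int a, int b)     { return (a < b) ? a : b; }
static int min_branchless(int a, int b) { return a * (a < b) + b * (b >= a); }

/* Time one of the min functions over n pairs and return elapsed seconds. */
static double time_sum(int (*f)(int, int), const int *xs, const int *ys, size_t n)
{
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; ++i)
        sum += f(xs[i], ys[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("(checksum %lld) ", sum);    /* keep the compiler from discarding the loop */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1 << 22 };
    int *xs = malloc(N * sizeof *xs);
    int *ys = malloc(N * sizeof *ys);
    for (size_t i = 0; i < N; ++i) { xs[i] = rand(); ys[i] = rand(); }

    printf("if-based min:   %.3f s\n", time_sum(min_branch,     xs, ys, N));
    printf("branchless min: %.3f s\n", time_sum(min_branchless, xs, ys, N));

    free(xs);
    free(ys);
    return 0;
}
```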

@Mackinstyle

I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version.   That's such an important point to share up front any time we talk about clever optimization.

@RyanTosh

When a normal person says something is slow: 5-10 minutes, maybe an hour or two
When a programmer says something is slow: 2 nanoseconds

@Caesim9

I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements, your code sometimes runs faster and sometimes slower (specific branches are just faster or slower), and an attacker can use that to get info on your private key. So all cryptographically sound functions must use branchless programming.

Very interesting topic
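
To make the point above concrete, here is a sketch of the constant-time selection idiom cryptographic code uses instead of branching on secret data (my own illustration; ct_select is a made-up name). Both inputs are always read and a mask picks the result, so the running time does not depend on the secret condition. Real libraries go further than this, because an optimizing compiler can sometimes turn such code back into a branch.

```c
/* Constant-time selection sketch: no branch depends on the secret `choose`. */
#include <stdint.h>
#include <stdio.h>

/* Return a if choose is 1, b if choose is 0, without branching on choose. */
static uint32_t ct_select(uint32_t choose, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - choose;   /* all ones if choose==1, all zeros if 0 */
    return (a & mask) | (b & ~mask);
}

int main(void)
{
    printf("%u\n", ct_select(1, 111, 222)); /* prints 111 */
    printf("%u\n", ct_select(0, 111, 222)); /* prints 222 */
    return 0;
}
```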

@randyscorner9434

I've designed CPUs for over 30 years, and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that what seems to avoid branches ... often doesn't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat a compiler optimizer that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off one or a small number of decisions.

In trying to avoid branches, the cost is paid in unmaintainable code, and it is very high. If you really want to improve efficiency, don't rely on multiple software packages that may actually include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, Perl, etc.) waste much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required.

In one example, we recoded a numerics-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor; on the BBB, vectorized, it used 15% of a much less powerful processor. These are the sorts of changes that matter.
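
The piano code itself isn't shown here, so the following is only a generic sketch of the kind of rewrite being described (add_partial and synth_block are made-up names): express the inner math as plain, branch-free loops over contiguous arrays so the compiler's auto-vectorizer (or hand-written intrinsics) can process several samples per instruction.

```c
/* Generic sketch of recasting a synthesis inner loop as vector-friendly code. */
#include <stdio.h>
#include <stddef.h>

/* Add one partial (amplitude * wavetable value) into a whole block of samples. */
static void add_partial(float *out, const float *wave, float amp, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] += amp * wave[i];        /* straight-line, branch-free: vectorizes well */
}

/* Synthesize a block by summing several partials, one array pass per partial. */
static void synth_block(float *out, const float **waves, const float *amps,
                        size_t num_partials, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 0.0f;
    for (size_t p = 0; p < num_partials; ++p)
        add_partial(out, waves[p], amps[p], n);
}

int main(void)
{
    const float w1[4] = { 1.0f, -1.0f, 1.0f, -1.0f };
    const float w2[4] = { 0.5f, 0.5f, -0.5f, -0.5f };
    const float *waves[2] = { w1, w2 };
    const float amps[2] = { 0.8f, 0.2f };
    float out[4];

    synth_block(out, waves, amps, 2, 4);
    for (int i = 0; i < 4; ++i)
        printf("%.2f ", out[i]);
    printf("\n");
    return 0;
}
```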

@barmetler

So what we've learnt: if you write normal code and use the normal patterns that everybody knows, it will most likely be a pattern that the developers of gcc (or other compilers) thought of. That way you have a better chance of the compiler doing good things.

@MultiMrAsd

I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU rather than the CPU. GPUs have multiple cores that share an instruction pipeline, with each core running the same instruction but on different data. That often leads to both sides of a branch being executed, with one result then discarded. I have seen performance improvements above 100x by going branchless!

@kylek.3689

Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
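
A concrete case of this: comparing secret values such as MAC tags or password hashes. A plain memcmp may return at the first mismatching byte, so its timing can leak how long the matching prefix was; the usual fix is a comparison that always looks at every byte. The sketch below is my own illustration (ct_compare is a made-up name), not code from any particular library.

```c
/* Constant-time comparison sketch: timing does not depend on where bytes differ. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 0 if the two buffers are equal, nonzero otherwise. */
static int ct_compare(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; ++i)
        diff |= a[i] ^ b[i];            /* accumulate differences, never exit early */
    return diff;
}

int main(void)
{
    const uint8_t tag1[4] = { 1, 2, 3, 4 };
    const uint8_t tag2[4] = { 1, 2, 3, 5 };
    printf("same: %d  different: %d\n",
           ct_compare(tag1, tag1, 4), ct_compare(tag1, tag2, 4) != 0);
    return 0;
}
```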

@christianbarnay2499

05:00 The lesson here: never optimize your code prematurely. Start by writing natural, simple code. Let the compiler do its tricks. Check the result, and only look for optimizations in the parts that the compiler didn't already optimize.

@rorytulip9343

Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.

@frogge6443

Your mousepad is a masterpiece.

@programaths

Shaders. Branchless techniques are mandatory in those!

@tommyshelby3384

Wow, didn't know Nicolas Cage had a programming channel

@unformedvoid2223

It's useful when you do shader programming, where branching is very expensive. I had to «invent» some of those examples myself when I did shader programming. Great video!

@MM-24

I want to buy this guy a mouse pad and maybe a second monitor...

@khhnator

I'm surprised by the amount of negative comments. He is not telling people to always code like this... there is a difference between time-critical code and everything else.
Get the damn profiler out and find where your program is spending most of its time, and THEN consider whether to do this or not.

The only negative thing about optimization is that it costs development time. The idea that "the computers will be fast enough" is just naive.
Effective optimization can be the difference between a product that reaches some of the market and one that reaches most of it, or between a program that can be extended for decades to come and one that needs to be redone in the next few years.

@RFC-3514

For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong.  And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P
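
For anyone curious, here is what those two old tricks look like written out (my own sketch; times10 and select_by_bool are made-up names). Modern compilers already rewrite constant multiplications into shifts and adds on their own, so this is purely illustrative.

```c
/* Two old-school tricks: shift-and-add multiplication and select-by-boolean. */
#include <assert.h>
#include <stdint.h>

/* x * 10 as shifts and adds: 10*x = 8*x + 2*x. */
static uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Pick a or b by multiplying with a 0/1 value instead of branching. */
static uint32_t select_by_bool(int cond, uint32_t a, uint32_t b)
{
    return (uint32_t)!!cond * a + (uint32_t)!cond * b;
}

int main(void)
{
    assert(times10(7) == 70);
    assert(select_by_bool(1, 5, 9) == 5);
    assert(select_by_bool(0, 5, 9) == 9);
    return 0;
}
```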

@krytharn

Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.
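
For reference, this is roughly the select pattern shader writers reach for instead of an if, transcribed into C (my own sketch; in GLSL you would write it with step() and mix()). Both values are computed and blended with a 0/1 weight, which mirrors what the lockstep hardware effectively does with a divergent branch anyway.

```c
/* Shader-style branchless select, written in C for illustration. */
#include <stdio.h>

static float branchless_threshold(float x, float threshold, float lo, float hi)
{
    float w = (x >= threshold) ? 1.0f : 0.0f;   /* like step(threshold, x) */
    return lo + (hi - lo) * w;                  /* like mix(lo, hi, w) */
}

int main(void)
{
    printf("%.1f\n", branchless_threshold(0.3f, 0.5f, 10.0f, 20.0f)); /* 10.0 */
    printf("%.1f\n", branchless_threshold(0.7f, 0.5f, 10.0f, 20.0f)); /* 20.0 */
    return 0;
}
```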

@josiahmanson

A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.
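
A sketch of that table approach (my own code, not the actual standard library implementation): build the 256-entry table once, and every conversion afterwards is a single unconditional table load.

```c
/* Lookup-table ToUpper sketch: one table load per character, no comparisons. */
#include <stdio.h>

static unsigned char upper_table[256];

static void init_upper_table(void)
{
    for (int c = 0; c < 256; ++c)
        upper_table[c] = (unsigned char)((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
}

static unsigned char to_upper_lut(unsigned char c)
{
    return upper_table[c];              /* no conditional at conversion time */
}

int main(void)
{
    init_upper_table();
    for (const unsigned char *p = (const unsigned char *)"hello, World!"; *p; ++p)
        putchar(to_upper_lut(*p));
    putchar('\n');                      /* prints HELLO, WORLD! */
    return 0;
}
```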