What is SIMD?
SIMD stands for Single Instruction, Multiple Data. The name says it all: we use one instruction to transform multiple pieces of data. SIMD has been around for a while now; even modern consoles support it, and we can also find SIMD support on some ARM chips.
Why should I care?
First of all, if you care about efficiency in your games, you should learn about SIMD. This technology is perfect for games and simulations that constantly iterate over the same type of data. For example, a particle system would benefit a lot from it.
How does it work?
The CPU has vector registers. Depending on the system they can be 128 or 256 bits wide, and more modern processors support 512 bits. With 512-bit registers, that means ONE instruction can operate on 16 floats.
In this article I'll demonstrate with 128-bit registers, which means one instruction for 4 floats.
Implementing it in my engine
The implementation in each engine will depend on your needs. Personally, I decided to redo my matrix and vector structures to support SIMD instructions.
For my matrix class I originally had something like this:
and my matrix addition function looked like this:
I just unrolled the usual loop because I knew this matrix was always going to be 4x4.
How does it look after the SIMD treatment? The class data looks like this:
As you can see, we now have the __m128 data type. This type belongs to the intrinsic instructions for SSE (Streaming SIMD Extensions). We have an array of four __m128 because each __m128 can hold 4 floats (or 2 doubles). This way our new mat4 class has the same size as the previous one, but with the SSE benefit. The other members mostly exist to give some abstraction to the class. They live inside a union, which means they share the same memory space, so there's no overhead or unused space.
One thing to keep in mind: if you are doing 128-bit memory accesses, it's good practice to align the data to a 16-byte boundary. That's why I've also included alignas(16) at the top of the union. You can still do an unaligned load, but it comes with a performance hit.
So you may be wondering how to get access to these new data types. You can use the compiler intrinsics, or go hardcore mode and write some inline assembly. The intrinsic functions can usually be found in "xmmintrin.h", "immintrin.h" or "intrin.h"; it depends on the compiler too.
Fortunately, replacing the matrix addition function with intrinsic functions is pretty easy. First I'll show an example using inline assembly so we can see it step by step. After that I'll show the friendlier version using the compiler intrinsics.
The benefit here comes from the xmm vector registers and the instructions that end in ps, which stands for Packed Single-Precision floating point. So now, to add two matrices, we just do this:
This can be hidden behind a layer of abstraction, like an operator overload.
Keep in mind that if you are working with VS2015, the compiler won't allow inline assembly when targeting x64. This means that unless you use an external assembler, you'll probably want to stick with intrinsics.
Let's look at the friendlier version, which is also compatible with x64 on VS2015:
This is much clearer, and maybe even faster if the compiler helps us.
Don't trust the compiler. Don't trust old-school C/C++ optimizations. In this case it's always better to go with intrinsics or plain assembly. Prefer SOA (Structure of Arrays) whenever you can.
Don't implement a vector4 as a single __m128; use SOA for this, especially if you are going to evaluate multiple vectors. For example:
You can read more about it here:
Or watch one of these talks: