The SIMD Experience: Data Parallelism in my Game Engine
What is SIMD?
SIMD stands for Single Instruction, Multiple Data, and the name says it all: we use one instruction to transform multiple pieces of data. SIMD has been around for a while now, and even modern consoles support it. We can also find SIMD support on ARM chips.
Why should I care?
First of all, if you care about efficiency in your games, you should learn about SIMD. This technology is perfect for games and simulations that constantly iterate over the same type of data. For example, a particle system would benefit a lot from it.
How does it work?
The CPU has vector registers. Depending on the system they can be 128 bits (SSE), 256 bits (AVX), or, on the most modern processors, 512 bits (AVX-512). With 512-bit registers that means ONE instruction for 16 floats.
In this article I demonstrate with 128-bit registers, which means one instruction for 4 floats.
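As a quick illustration (a minimal sketch, not engine code yet), here is one SSE instruction adding 4 floats at once:

```cpp
#include <xmmintrin.h> // SSE intrinsics

int main() {
    alignas(16) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    alignas(16) float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);     // load 4 floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb); // ONE instruction, four additions
    _mm_store_ps(r, vr);            // r now holds { 11, 22, 33, 44 }
    return 0;
}
```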
Implementing it in my engine
The implementation in each engine will depend on your needs. Personally, I decided to redo my matrix and vector structures to support SIMD instructions.
For my matrix class I originally had something like this:
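```cpp
// Representative sketch; the exact names in the engine may differ.
struct mat4 {
    float m[16]; // 4x4 matrix stored as a flat array of 16 floats
};
```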
and my matrix addition function looked like this:
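```cpp
// Plain element-wise addition with the loop unrolled by hand
// (sketch; names are illustrative).
mat4 add(const mat4& a, const mat4& b) {
    mat4 r;
    r.m[0]  = a.m[0]  + b.m[0];   r.m[1]  = a.m[1]  + b.m[1];
    r.m[2]  = a.m[2]  + b.m[2];   r.m[3]  = a.m[3]  + b.m[3];
    r.m[4]  = a.m[4]  + b.m[4];   r.m[5]  = a.m[5]  + b.m[5];
    r.m[6]  = a.m[6]  + b.m[6];   r.m[7]  = a.m[7]  + b.m[7];
    r.m[8]  = a.m[8]  + b.m[8];   r.m[9]  = a.m[9]  + b.m[9];
    r.m[10] = a.m[10] + b.m[10];  r.m[11] = a.m[11] + b.m[11];
    r.m[12] = a.m[12] + b.m[12];  r.m[13] = a.m[13] + b.m[13];
    r.m[14] = a.m[14] + b.m[14];  r.m[15] = a.m[15] + b.m[15];
    return r;
}
```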
I just unrolled the common loop because I knew this matrix was always going to be 4x4.
So how does it look now that I've given it the SIMD treatment? The class data looks like this:
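```cpp
#include <xmmintrin.h> // __m128 and the SSE intrinsics

// Representative sketch; the exact member names in the engine may differ.
struct mat4 {
    alignas(16) union {
        __m128 rows[4];  // 4 registers x 4 floats = 16 floats
        float  m[16];    // flat view of the same 64 bytes
        float  mm[4][4]; // [row][col] view of the same 64 bytes
    };
};
```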
As you can see, we have the __m128 data type. This data type belongs to the intrinsics for SSE (Streaming SIMD Extensions). We have an array of 4 __m128 because each one holds 4 floats (its double-precision counterpart, __m128d, holds 2 doubles). This way our new mat4 class has the same size as the previous one, but with the new SSE benefits. The other members you see are mostly there to give some abstraction to the class. They live inside a union, which means they share the same memory, so there's no overhead or unused space.
One thing to keep in mind: if you are doing 128-bit memory accesses, it's good practice to keep the data 16-byte aligned. That's why I've also included the alignas(16) at the top of the union. You can still do unaligned loads, but they come with a performance hit.
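For example, with the SSE load intrinsics:

```cpp
float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; // not necessarily 16-byte aligned

__m128 v = _mm_loadu_ps(data); // unaligned load: always safe, but can be slower
// _mm_load_ps(data) requires 'data' to sit on a 16-byte boundary
```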
So you may be wondering how to get access to these new data types. You can use the compiler intrinsics, or go hardcore mode and write some inline assembly. The intrinsic functions can usually be found in "xmmintrin.h", "immintrin.h" or "intrin.h"; it'll depend on the compiler too.
Fortunately, replacing the matrix add function with intrinsic functions is pretty easy. First I'll show an example using inline assembly so we can see it step by step. After that I'll show the friendly version using the compiler intrinsics.
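Here is a sketch of that version (32-bit MSVC __asm syntax; the function and parameter names are my own):

```cpp
// Adds two 4x4 float matrices row by row: out = a + b.
// Each movaps/addps pair works on 4 floats at once.
void add(const mat4* a, const mat4* b, mat4* out) {
    __asm {
        mov    eax, a            // pointer to a's data
        mov    ecx, b            // pointer to b's data
        mov    edx, out          // pointer to the destination

        movaps xmm0, [eax]       // row 0 of a (4 floats)
        addps  xmm0, [ecx]       // packed add with row 0 of b
        movaps [edx], xmm0       // store row 0 of the result

        movaps xmm1, [eax + 16]  // row 1
        addps  xmm1, [ecx + 16]
        movaps [edx + 16], xmm1

        movaps xmm2, [eax + 32]  // row 2
        addps  xmm2, [ecx + 32]
        movaps [edx + 32], xmm2

        movaps xmm3, [eax + 48]  // row 3
        addps  xmm3, [ecx + 48]
        movaps [edx + 48], xmm3
    }
}
```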
The benefit here comes from the xmm vector registers and from the instructions ending in ps, which stands for Packed Single-precision floating point. So now to add two matrices we just do this:
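```cpp
mat4 a, b, result;
// ... fill a and b ...
add(&a, &b, &result); // four addps instructions instead of 16 scalar adds
```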
This can be hidden behind a layer of abstraction, like an operator overload:
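```cpp
// Sketch: hiding the SIMD call behind operator+.
mat4 operator+(const mat4& a, const mat4& b) {
    mat4 result;
    add(&a, &b, &result);
    return result;
}
```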
Keep in mind that if you are working with VS2015, the compiler won't allow inline assembly when targeting x64. This means that unless you use an external assembler, you'll probably want to stick with intrinsics.
Let's look at the friendlier version, which is also compatible with x64 on VS2015:
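```cpp
// Intrinsics version (sketch): the same four packed adds, no inline
// assembly, so it also compiles for x64 targets.
mat4 add(const mat4& a, const mat4& b) {
    mat4 r;
    r.rows[0] = _mm_add_ps(a.rows[0], b.rows[0]);
    r.rows[1] = _mm_add_ps(a.rows[1], b.rows[1]);
    r.rows[2] = _mm_add_ps(a.rows[2], b.rows[2]);
    r.rows[3] = _mm_add_ps(a.rows[3], b.rows[3]);
    return r;
}
```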
This is much clearer, and maybe even faster if the compiler helps us.
Some Advice
Don't trust the compiler. Don't trust old-school C/C++ optimizations. In this case it's always better to go with intrinsics or plain assembly. Prefer SOA (Structure of Arrays) whenever you can.
Don't implement a vector4 as a single __m128; use SOA for this, especially if you are going to process multiple vectors at once. For example:
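```cpp
#include <xmmintrin.h>

// AoS: one __m128 holds x, y, z, w of a SINGLE vector. Operations like
// a dot product then need slow horizontal work across the lanes.
struct vec4_aos {
    __m128 xyzw;
};

// SoA: each __m128 holds the same component of FOUR vectors, so every
// lane does useful work. (Names are illustrative.)
struct vec4x4_soa {
    __m128 x; // x0, x1, x2, x3
    __m128 y; // y0, y1, y2, y3
    __m128 z; // z0, z1, z2, z3
    __m128 w; // w0, w1, w2, w3
};

// Add four pairs of vectors in just four instructions.
vec4x4_soa add4(const vec4x4_soa& a, const vec4x4_soa& b) {
    vec4x4_soa r;
    r.x = _mm_add_ps(a.x, b.x);
    r.y = _mm_add_ps(a.y, b.y);
    r.z = _mm_add_ps(a.z, b.z);
    r.w = _mm_add_ps(a.w, b.w);
    return r;
}
```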
SIMD in JavaScript
If you search for SIMD + JavaScript, you'll probably land on the Mozilla Developer Network. I think it's a big deal that they are working on an API to bring this feature to JavaScript. Currently it's only available in Nightly builds of Firefox and Chromium. This is good news for JavaScript developers, and I hope we can all take advantage of it. More info at MDN. There's also WebAssembly, which will also support SIMD instructions, but it's still a work in progress.
You can read more about it here:
Using Intel Streaming SIMD Extensions
The ARM® NEON™ general-purpose SIMD
Or watch one of these talks:
Vectorization (SIMD) and Scaling, by James Reinders, Intel Corporation