The SIMD Experience: Data Parallelism in my Game Engine
What is SIMD?
SIMD stands for Single Instruction, Multiple Data, and the name says it all: we use one instruction to transform multiple pieces of data. SIMD has been around for a while now, and even modern consoles support it. We can also find SIMD support on ARM chips.
Why should I care?
First of all, if you care about efficiency in your games, you should learn about SIMD. This technology is perfect for games and simulations that constantly iterate over the same type of data. For example, a particle system would benefit a lot from it.
How does it work?
The CPU has vector registers. Depending on the system they can be 128 bits (SSE), 256 bits (AVX), or, on the most modern processors, 512 bits (AVX-512). With 512-bit registers that means ONE instruction for 16 floats.
In this article I demonstrate with 128-bit registers, which means one instruction for 4 floats.
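As a quick illustration (a minimal sketch, not engine code yet), here is one SSE instruction adding 4 floats at once:

```cpp
#include <xmmintrin.h> // SSE intrinsics

int main() {
    alignas(16) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    alignas(16) float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    alignas(16) float r[4];

    __m128 va = _mm_load_ps(a);     // load 4 floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb); // ONE instruction, four additions
    _mm_store_ps(r, vr);            // r now holds { 11, 22, 33, 44 }
    return 0;
}
```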
Implementing it in my engine
The implementation in each engine will depend on your needs. Personally, I decided to redo my matrix and vector structures to support SIMD instructions.
For my matrix class I originally had something like this:
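```cpp
// Representative sketch; the exact names in the engine may differ.
struct mat4 {
    float m[16]; // 4x4 matrix stored as a flat array of 16 floats
};
```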
and my matrix addition function looked like this:
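```cpp
// Plain element-wise addition with the loop unrolled by hand
// (sketch; names are illustrative).
mat4 add(const mat4& a, const mat4& b) {
    mat4 r;
    r.m[0]  = a.m[0]  + b.m[0];   r.m[1]  = a.m[1]  + b.m[1];
    r.m[2]  = a.m[2]  + b.m[2];   r.m[3]  = a.m[3]  + b.m[3];
    r.m[4]  = a.m[4]  + b.m[4];   r.m[5]  = a.m[5]  + b.m[5];
    r.m[6]  = a.m[6]  + b.m[6];   r.m[7]  = a.m[7]  + b.m[7];
    r.m[8]  = a.m[8]  + b.m[8];   r.m[9]  = a.m[9]  + b.m[9];
    r.m[10] = a.m[10] + b.m[10];  r.m[11] = a.m[11] + b.m[11];
    r.m[12] = a.m[12] + b.m[12];  r.m[13] = a.m[13] + b.m[13];
    r.m[14] = a.m[14] + b.m[14];  r.m[15] = a.m[15] + b.m[15];
    return r;
}
```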
I just unrolled the common loop because I knew this matrix was always going to be 4x4.
So how does it look now that I've given it the SIMD treatment? The class data looks like this:
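```cpp
#include <xmmintrin.h> // __m128 and the SSE intrinsics

// Representative sketch; the exact member names in the engine may differ.
struct mat4 {
    alignas(16) union {
        __m128 rows[4];  // 4 registers x 4 floats = 16 floats
        float  m[16];    // flat view of the same 64 bytes
        float  mm[4][4]; // [row][col] view of the same 64 bytes
    };
};
```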
As you can see, we have the __m128 data type. This data type belongs to the intrinsics for SSE (Streaming SIMD Extensions). We have an array of 4 __m128 because each one holds 4 floats (its double-precision counterpart, __m128d, holds 2 doubles). This way our new mat4 class has the same size as the previous one, but with the new SSE benefits. The other members you see are mostly there to give some abstraction to the class. They live inside a union, which means they share the same memory, so there's no overhead or unused space.
One thing to keep in mind: if you are doing 128-bit memory accesses, it's good practice to keep the data 16-byte aligned. That's why I've also included the alignas(16) at the top of the union. You can still do unaligned loads, but they come with a performance hit.
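For example, with the SSE load intrinsics:

```cpp
float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; // not necessarily 16-byte aligned

__m128 v = _mm_loadu_ps(data); // unaligned load: always safe, but can be slower
// _mm_load_ps(data) requires 'data' to sit on a 16-byte boundary
```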
So you may be wondering how to get access to these new data types. You can use the compiler intrinsics, or go hardcore mode and write some inline assembly. The intrinsic functions can usually be found in "xmmintrin.h", "immintrin.h" or "intrin.h"; it'll depend on the compiler too.
Fortunately, replacing the matrix add function with intrinsic functions is pretty easy. First I'll show an example using inline assembly so we can see it step by step. After that I'll show the friendly version using the compiler intrinsics.
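Here is a sketch of that version (32-bit MSVC __asm syntax; the function and parameter names are my own):

```cpp
// Adds two 4x4 float matrices row by row: out = a + b.
// Each movaps/addps pair works on 4 floats at once.
void add(const mat4* a, const mat4* b, mat4* out) {
    __asm {
        mov    eax, a            // pointer to a's data
        mov    ecx, b            // pointer to b's data
        mov    edx, out          // pointer to the destination

        movaps xmm0, [eax]       // row 0 of a (4 floats)
        addps  xmm0, [ecx]       // packed add with row 0 of b
        movaps [edx], xmm0       // store row 0 of the result

        movaps xmm1, [eax + 16]  // row 1
        addps  xmm1, [ecx + 16]
        movaps [edx + 16], xmm1

        movaps xmm2, [eax + 32]  // row 2
        addps  xmm2, [ecx + 32]
        movaps [edx + 32], xmm2

        movaps xmm3, [eax + 48]  // row 3
        addps  xmm3, [ecx + 48]
        movaps [edx + 48], xmm3
    }
}
```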
The benefit here comes from the xmm vector registers and from the instructions ending in ps, which stands for Packed Single-precision floating point. So now to add two matrices we just do this:
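```cpp
mat4 a, b, result;
// ... fill a and b ...
add(&a, &b, &result); // four addps instructions instead of 16 scalar adds
```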
This can be hidden behind a layer of abstraction, like an operator overload:
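```cpp
// Sketch: hiding the SIMD call behind operator+.
mat4 operator+(const mat4& a, const mat4& b) {
    mat4 result;
    add(&a, &b, &result);
    return result;
}
```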
Keep in mind that if you are working with VS2015, the compiler won't allow inline assembly when targeting x64. This means that unless you use an external assembler, you'll probably want to stick with intrinsics.
Let's look at the friendlier version, which is also compatible with x64 on VS2015:
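```cpp
// Intrinsics version (sketch): the same four packed adds, no inline
// assembly, so it also compiles for x64 targets.
mat4 add(const mat4& a, const mat4& b) {
    mat4 r;
    r.rows[0] = _mm_add_ps(a.rows[0], b.rows[0]);
    r.rows[1] = _mm_add_ps(a.rows[1], b.rows[1]);
    r.rows[2] = _mm_add_ps(a.rows[2], b.rows[2]);
    r.rows[3] = _mm_add_ps(a.rows[3], b.rows[3]);
    return r;
}
```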
This is much clearer, and maybe even faster if the compiler helps us.
Some Advice
Don't trust the compiler. Don't trust old-school C/C++ optimizations. In this case it's always better to go with intrinsics or plain assembly. Prefer SOA (Structure of Arrays) whenever you can.
Don't implement a vector4 as a single __m128; use SOA for this, especially if you are going to process multiple vectors at once. For example:
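```cpp
#include <xmmintrin.h>

// AoS: one __m128 holds x, y, z, w of a SINGLE vector. Operations like
// a dot product then need slow horizontal work across the lanes.
struct vec4_aos {
    __m128 xyzw;
};

// SoA: each __m128 holds the same component of FOUR vectors, so every
// lane does useful work. (Names are illustrative.)
struct vec4x4_soa {
    __m128 x; // x0, x1, x2, x3
    __m128 y; // y0, y1, y2, y3
    __m128 z; // z0, z1, z2, z3
    __m128 w; // w0, w1, w2, w3
};

// Add four pairs of vectors in just four instructions.
vec4x4_soa add4(const vec4x4_soa& a, const vec4x4_soa& b) {
    vec4x4_soa r;
    r.x = _mm_add_ps(a.x, b.x);
    r.y = _mm_add_ps(a.y, b.y);
    r.z = _mm_add_ps(a.z, b.z);
    r.w = _mm_add_ps(a.w, b.w);
    return r;
}
```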
SIMD in JavaScript
If you search for SIMD + JavaScript, you'll probably land on the Mozilla Developer Network. I think it's a big deal that they are working on an API to bring this feature to JavaScript. Currently it's only available in Nightly builds of Firefox and Chromium. This is good news for JavaScript developers, and I hope we can all take advantage of it. More info at MDN. There's also WebAssembly, which will also support SIMD instructions, but it's still a work in progress.
You can read more about it here:
Using Intel Streaming SIMD Extensions
The ARM® NEON™ general-purpose SIMD
Or watch one of these talks:
Vectorization (SIMD) and Scaling, by James Reinders, Intel Corporation