ldEngine: Drawing One Million Quads with OpenGL
Experimenting with persistently mapped buffers
Introduction
The idea for building my own renderer came from the fact that I've started creating my own engine. It's very simple, and it's not even close to what popular engines like UE4 or Unity can do, but it serves the purpose I created it for: building 2D games.
I've tried to design my engine with room for future optimizations, trying not to fall into the trap described by Donald Knuth's ever-famous quote, "Premature optimization is the root of all evil...", which I feel is often used in the wrong context.
One of the first important points I wanted to tackle was having a flexible and powerful renderer. Even though it's 2D, I still wanted to be able to push vertices as fast as possible.
For this post I won't show exactly how I wrote my renderer, because of multiple engine dependencies, but I'll show some of the core features using common libraries like SDL2 and GLEW plus pure OpenGL. The source code is available at the bottom of the post.
The Naive Renderer
The first approach was one I would consider naive. Even though I used common techniques like batching, it still fell into the "naive" realm for me. It was flexible and easy to use, but I couldn't push as many vertices as I wanted.
The shaders I wrote for this demo are very simple and they won't change during the whole post. I'll just define the vertex position and color attributes (in), an orthographic view matrix, and a varying (out) variable to pass the current vertex color to the fragment shader.
Vertex Shader
Fragment Shader
In this naive approach the generation and allocation of buffers would be really simple and kind of standard. First I would declare a couple of variables.
The first 2 variables represent the preallocated memory buffer I will use to store vertex positions and vertex colors. The next 2 will represent the current address to which I would write my data. The last 2 are just GL handles for VBOs.
In my engine I would actually use a custom allocator to handle the first 4 variables. I would highly recommend the same approach, not least because you can add memory debugging and tracking capabilities to that allocator. In this case I won't bother with that and will do straightforward pointer arithmetic.
The allocation of the buffers on GPU would be done with glBufferData and I would allocate the same size of the buffer on the heap to store my vertex data before flushing it.
You can see in the first two lines that I define the size of my vertex buffers. Notice that I multiply sizeof(float) * 12 for the vertex position buffer and sizeof(float) * 24 for the vertex color buffer. This is because each sprite is formed by 2 triangles, which equals 6 vertices. For the vertex position attribute we use a vec2 and for the vertex color attribute we use a vec4; this means each quad or sprite uses 12 float elements for vertex position and 24 float elements for vertex color.
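The size arithmetic can be written out as a minimal sketch. The constant and function names here are hypothetical, and the capacity of one million quads is an assumption picked to match the goal of the post, not a value taken from the demo source.

```cpp
#include <cstddef>

// Hypothetical names; kMaxQuads is an assumed capacity, not the demo's value.
constexpr std::size_t kMaxQuads = 1000000;
constexpr std::size_t kVerticesPerQuad = 6;    // 2 triangles x 3 vertices
constexpr std::size_t kFloatsPerPosition = 2;  // vec2
constexpr std::size_t kFloatsPerColor = 4;     // vec4

constexpr std::size_t kPositionFloatsPerQuad = kVerticesPerQuad * kFloatsPerPosition;
constexpr std::size_t kColorFloatsPerQuad = kVerticesPerQuad * kFloatsPerColor;

// Bytes needed to hold the positions of `quads` sprites.
constexpr std::size_t positionBufferBytes(std::size_t quads) {
    return quads * kPositionFloatsPerQuad * sizeof(float);
}

// Bytes needed to hold the colors of `quads` sprites.
constexpr std::size_t colorBufferBytes(std::size_t quads) {
    return quads * kColorFloatsPerQuad * sizeof(float);
}

// The per-quad counts quoted in the text.
static_assert(kPositionFloatsPerQuad == 12, "6 vertices x vec2");
static_assert(kColorFloatsPerQuad == 24, "6 vertices x vec4");
```

Keeping the counts as named constants means the same values can size both the GPU buffer and the matching heap buffer.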
When I started writing my renderer I wanted to have a very simple API to deal with. I've worked a lot with Monkey-X and I really like the mojo API, so I went with a similar one. For this example I'll just write a function to draw a rectangle which I'll call drawRect, a function to set the current draw color which I'll call setColor, and a function to "flush" the data and push it to the GPU which I'll call flush.
The function drawRect only writes the vertex data to the heap allocated buffer.
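As a rough illustration of that CPU-side step, here is a minimal sketch of setColor and drawRect writing into heap buffers with plain pointer arithmetic. The NaiveBatch struct, its member names, and the triangle winding are assumptions made for this example; the demo source may lay things out differently, and the GL handles are left out so the sketch runs without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical container for the heap buffers and their write cursors.
struct NaiveBatch {
    std::size_t maxQuads;
    std::unique_ptr<float[]> positionBase;  // 12 floats per quad (6 x vec2)
    std::unique_ptr<float[]> colorBase;     // 24 floats per quad (6 x vec4)
    float* positionCursor;                  // next position write address
    float* colorCursor;                     // next color write address
    float color[4];                         // current draw color (RGBA)

    explicit NaiveBatch(std::size_t quads)
        : maxQuads(quads),
          positionBase(new float[quads * 12]),
          colorBase(new float[quads * 24]),
          positionCursor(positionBase.get()),
          colorCursor(colorBase.get()),
          color{1.0f, 1.0f, 1.0f, 1.0f} {}

    // Number of quads written since the cursors were last reset.
    std::size_t quadCount() const {
        return static_cast<std::size_t>(positionCursor - positionBase.get()) / 12;
    }
};

// Stores the color applied to every quad drawn afterwards.
void setColor(NaiveBatch& batch, float r, float g, float b, float a) {
    batch.color[0] = r;
    batch.color[1] = g;
    batch.color[2] = b;
    batch.color[3] = a;
}

// Appends one quad (two triangles) to the heap buffers. No GL calls happen here.
void drawRect(NaiveBatch& batch, float x, float y, float w, float h) {
    assert(batch.quadCount() < batch.maxQuads && "batch is full; flush first");
    const float x1 = x + w;
    const float y1 = y + h;
    const float corners[12] = {x, y, x1, y, x, y1,     // first triangle
                               x1, y, x1, y1, x, y1};  // second triangle
    for (float value : corners) {
        *batch.positionCursor++ = value;
    }
    for (int vertex = 0; vertex < 6; ++vertex) {
        for (int channel = 0; channel < 4; ++channel) {
            *batch.colorCursor++ = batch.color[channel];
        }
    }
}
```

A flush step would then upload everything between each base pointer and its cursor and reset both cursors to their base addresses.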
The implementation of flush is also very simple and straightforward. We bind our buffers and copy the data over with glBufferSubData, which works much like a memcpy into the buffer. This updates the buffer object's data store and makes it available for rendering. Finally we call glDrawArrays to render our geometry to the current framebuffer.
Running Test: glBufferSubData
With the rendering part all set up, we are finally able to draw some stuff on screen. So what I do is just create a simple structure called Particle that contains position, velocity and color information.
We initialize it and run our simulation.
What renderParticles does is basically iterate through each particle pushing the quad and color information to our heap allocated buffers. Something like this:
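A minimal sketch of such a test could look as follows. The field names, the bounce-off-the-edges update rule, and the drawQuad callback (standing in for a setColor call followed by drawRect) are all assumptions for illustration, not the demo's actual code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle layout: position, velocity and RGBA color.
struct Particle {
    float x, y;
    float vx, vy;
    float r, g, b, a;
};

// Moves every particle and reverses its velocity when it leaves the screen.
// The bounce rule is an assumption; any simulation would do for this test.
void updateParticles(std::vector<Particle>& particles, float dt,
                     float width, float height) {
    assert(dt >= 0.0f);
    for (Particle& p : particles) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        if (p.x < 0.0f || p.x > width) p.vx = -p.vx;
        if (p.y < 0.0f || p.y > height) p.vy = -p.vy;
    }
}

// Pushes one square quad per particle through the supplied callback, which is
// called as drawQuad(x, y, w, h, r, g, b, a).
template <typename DrawQuad>
void renderParticles(const std::vector<Particle>& particles, float size,
                     DrawQuad drawQuad) {
    for (const Particle& p : particles) {
        drawQuad(p.x, p.y, size, size, p.r, p.g, p.b, p.a);
    }
}
```

Passing the draw step as a callback keeps the loop independent of the renderer, so the same particle code can feed either buffer strategy.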
The results of this test can be seen in this image:
On my machine I was able to render about 160,000 individual quads (about 960,000 vertices) before getting a drop in FPS. This is still pretty good, but not what I was looking for.
I want to draw ONE MILLION sprites!
Now What? Let's Stream Geometry
In this new approach we will do Buffer Object Streaming with Persistent Mapping of Buffers. Buffer Object Streaming consists of modifying a buffer object with new data while it's being used. Once you're done writing to the buffer you issue a draw call, which starts a "use" cycle on the GPU. Persistent Mapping of Buffers consists of mapping a region of GPU memory into the client address space only once. It's persistent because we don't have to map and unmap the region every time we want to push data to the GPU. We only unmap the region when we want to delete the buffer.
So what is the benefit compared to the previous approach? Well, we can use our own custom allocator to manage that memory as we want. We don't need to call malloc to keep a copy of our vertex data on the heap. Once we are done writing our data we just call glDrawArrays. This means no more binding buffers and updating them with glBindBuffer and glBufferSubData, and no more mapping and unmapping, which is terrible for performance. With all this in mind we can implement a flush function that looks a lot cleaner: just a single glDrawArrays.
To use this approach we need a driver with the following GL extensions:
- ARB_buffer_storage: requires OpenGL 4.3+
- ARB_map_buffer_range: requires OpenGL 2.1+
If we are working with OpenGL for Embedded Systems (ES), the extensions are:
- EXT_buffer_storage: requires OpenGL ES 3.1+
- EXT_map_buffer_range: requires OpenGL ES 2.0+
So now that we are going with Buffer Object Streaming with Persistent Mapping of Buffers, we need to change the VBO allocation routine. Our previous VBO allocation looked like this:
Now it should look like this:
fMap will enable persistent mapping, which will allow us to write to the returned address space even when it's being used by the GPU.
fCreate will allow us to dynamically update the buffer's data store.
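For readers who want something concrete, here is a minimal sketch of a persistently mapped allocation. The names fMap and fCreate come from the text, but the exact flag combinations are my assumption of what they contain, and PersistentBuffer and createPersistentBuffer are hypothetical names. It needs GLEW and a current OpenGL context that exposes ARB_buffer_storage, so it will not run on its own.

```cpp
#include <GL/glew.h>

// Hypothetical pairing of a buffer handle with its persistently mapped address.
struct PersistentBuffer {
    GLuint handle;
    float* base;  // stays valid until the buffer is unmapped or deleted
};

// Assumes glewInit() has succeeded on a context with ARB_buffer_storage.
PersistentBuffer createPersistentBuffer(GLsizeiptr sizeInBytes) {
    // Assumed flags: write access, persistent mapping, and coherent writes so
    // no explicit flush call is needed before drawing.
    const GLbitfield fMap =
        GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    // Assumed flags: the mapping flags plus permission to update the store.
    const GLbitfield fCreate = fMap | GL_DYNAMIC_STORAGE_BIT;

    PersistentBuffer buffer = {0, nullptr};
    glGenBuffers(1, &buffer.handle);
    glBindBuffer(GL_ARRAY_BUFFER, buffer.handle);
    // Immutable storage: the size is fixed here and cannot be respecified later.
    glBufferStorage(GL_ARRAY_BUFFER, sizeInBytes, nullptr, fCreate);
    // Map once and keep the pointer; drawRect can write through it directly.
    buffer.base = static_cast<float*>(
        glMapBufferRange(GL_ARRAY_BUFFER, 0, sizeInBytes, fMap));
    return buffer;
}
```

With a mapping like this the write cursors can start at buffer.base instead of a malloc'd block. The post does not cover synchronization; in a real renderer you would likely guard against overwriting data the GPU is still reading, for example with glFenceSync and glClientWaitSync or by cycling through several buffer regions.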
Running Test: Buffer Object Streaming with Persistent Mapping of Buffers
We are running the same test we ran previously but with different VBO allocation and flush function.
The performance boost is undeniable. We went from 160,000 quads (960,000 vertices) to 1,000,000 quads (6,000,000 vertices). That's a bit more than 6 times as many quads. All of this is running at 60 FPS on my machine.
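To put those numbers in perspective, here is a small back-of-the-envelope sketch. It assumes 4-byte floats, that every quad is rewritten each frame (as the particle test does), and the 60 FPS figure quoted above; the constant and function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kNaiveQuads = 160000;        // glBufferSubData test
constexpr std::uint64_t kPersistentQuads = 1000000;  // persistent mapping test
constexpr std::uint64_t kFloatsPerQuad = 12 + 24;    // positions + colors
constexpr std::uint64_t kBytesPerFloat = 4;          // assumption: 32-bit float
constexpr std::uint64_t kFramesPerSecond = 60;

// Vertex data written per frame when every quad is updated.
constexpr std::uint64_t bytesPerFrame(std::uint64_t quads) {
    return quads * kFloatsPerQuad * kBytesPerFloat;
}

// Vertex data written per second at the assumed frame rate.
constexpr std::uint64_t bytesPerSecond(std::uint64_t quads) {
    return bytesPerFrame(quads) * kFramesPerSecond;
}

// Ratio between the two quad counts: 6.25.
constexpr double kSpeedup =
    static_cast<double>(kPersistentQuads) / static_cast<double>(kNaiveQuads);

// One million quads means 144,000,000 bytes per frame, or 8.64 GB per second.
static_assert(bytesPerFrame(kPersistentQuads) == 144000000ULL, "144 MB per frame");
static_assert(bytesPerSecond(kPersistentQuads) == 8640000000ULL, "8.64 GB per second");
```

Under these assumptions the CPU writes about 144 MB of vertex data per frame, which helps explain why avoiding the extra copy through glBufferSubData matters at this scale.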
Here you can see a small comparison between both approaches:
Conclusion
This was definitely a good approach for PC. If the minimum spec needed to run your game supports OpenGL 4.3+ then you'll be able to get a nice performance boost. Sadly it's still not available on popular mobile devices like Apple's. There are a couple of devices running Android/Linux/Windows that support OpenGL ES 3.1; you can see a list here. It'll all depend on your hardware and driver specs. We can always fall back to our glBufferSubData technique if it's not supported, but it's nice to be prepared anyway.
Links
- Demo Source Code: https://github.com/bitnenfer/GLSpriteRenderDemo
- Buffer Object Streaming - OpenGL Wiki: https://www.opengl.org/wiki/Buffer_Object_Streaming
- OpenGL Efficiency: AZDO: https://www.khronos.org/assets/uploads/developers/library/2014-gdc/Khronos-OpenGL-Efficiency-GDC-Mar14.pdf