Optimising a simple 2D grid simulation
Outline
This post will present a series of video showing how I took a simple 2D grid simulation and made it quicker and more efficient using a variety of techniques.
The code for the simulation can be found here and requires the NCCA Graphics Library ngl
Part 1 CPUOnly
In this first video we look at the base simulation code and the overall framework of code I will be using for the simulation. It should be noted we are only worried about using an “Embarrassingly Parallel algorithm” in this demo for simplicity and to highlight the processes of speeding things up. In a future post I will look at methods for doing more complex algorithms involving spatial data structures and acceleration structures.
Part 2 CPUSOA
In this version I re-structure the code to use a Structure of Arrays SOA style layout. This has an advantage of making the code more cache friendly. It also has the major advantage of making loading to OpenGL easier and making the rest of the optimisations I’m going to make easier.
This approach is more akin to “Data Oriented Design” approaches which do go against many of the core ideas behind classic Object Oriented programming.
Part 3 Using Texture Buffers
In this demo I convert the drawing routines to use Texture Buffer Objects (TBO)[https://www.khronos.org/opengl/wiki/Buffer_Texture] and use texelFetch to get the data.
Part 4 Adding Threading
As we saw in the previous demo we are not really exercising the CPU very well to do this we need to partition the work to use more of the CPU cores. We could use OpenMP parallel for, however this is not natively supported on the mac version of clang++ so in this demo I write my own based on the code here
Part 5 Adding Threading with TBB
In this version I use the parallel_for algorithm from Intel’s
tbb I installed this via vcpkg by using ./vcpkg install tbb:x64-windows
Part 6 Using SIMD
This demo uses SIMD instructions. I used TDD to develop a simple Vec3x8 class I recommend reading this excellent article for a more in depth look at SIMD.
Part 7 Better SIMD
The biggest problem with the previous example was the data layout and how we updated the Texture buffers. In this demo we throw away the Vec3x8 class and hard code everything for speed (This is something that is mentioned a lot in Data Oriented Design texts)
Conclusions and future work
This has been a fun exercise, I still need to complete a Computer Shader and CUDA version of the code to do more comparisons. In the next set of demos I will look at a more complex problem.