Writing a big code from scratch for GPUs
Introduction
Using GPUs is the hot thing at the moment. Well, it has been for a while now… but nowadays it is basically a requirement for a codebase to have GPU support, or a good explanation for why it doesn't.
There are two avenues for a codebase to start using GPUs: 1) port the existing codebase to use GPUs, or 2) rewrite the codebase from scratch.
Rewriting is probably the best approach, since it gives you complete control over how everything is going to be done: you can rewrite at whatever pace you like and take whichever paths are necessary. However, it is a monumental task if the codebase is large enough. In reality, even a 10k line codebase is a big task, especially if the codebase lacks tests, a Continuous Integration (CI) environment, or regression tests that catch changes in performance and in the correctness of the physics. This is usually the case for academic-driven codebases, since most of the development is done by people who are not software engineers.
Porting is the act of slowly moving routines to the GPU by carefully dissecting them and using whichever method you want to run them on the GPU. This becomes tricky quickly, because the core idea of using a GPU involves moving data from the host to the device (unless you have one of those fancy unified memory CPU-GPU systems, like a GH200 or an MI300A). This is painful because, traditionally, CPU codebases have been quite careless about how memory is allocated and where it is used. Additionally, CPU-based codes were often written to minimize FLOPs, because FLOPs were expensive. On GPUs, the tricks that made CPU code fast can end up making GPU code slow! So sometimes you need to reformulate the algorithm or the code itself. This can be problematic because it means touching code that might have been there for ages and that nobody understands anymore. Worse, this code might not be tested at all!
Both approaches can be painful. Rewriting from scratch means an insane amount of work and probably means alienating a lot of the existing developers who use the codebase. It either stops progress right in its tracks by putting a moratorium on development, or it creates a parallel codebase. Either way it is usually a painful process that might ruffle some feathers around the team.
Porting can be quite painful for the people doing the job, especially if the code is still moving forward, because it is very easy to introduce changes that work fine on the CPU but completely destroy performance or correctness on the GPU.
My opinion
After doing both and seeing how things work, I have convinced myself that a rewrite is the best idea if you are chasing performance without compromise. What do I mean here? The only things that matter are the correct answer and getting there as fast as we can. We do not care about CPU performance, we care little or not at all about portability (until we do), we care little about readability, etc. This can be very problematic, though: code that is hard to read is hard to extend and maintain. Also, if you can only run on a single platform, what do you do if your HPC centre buys different hardware? Difficult things!
This is why I believe in a pragmatic approach. Everything in moderation! So the idea is to do things as well as we can: good programming, maintainability, performance, readability, and of course, correctness!
My approach
I am still burnt out on C++, so we shall use Fortran for everything.
Memory handling
Fortran is great in that it has allocatable arrays, which are mostly memory safe. So the idea is that we have “handles” that carry a consistent state. This way the resources are owned by a process or set of processes, which is a bit object oriented. This also lets us have reusable scratch spaces: if we need an array of some size to do some work, we shouldn’t allocate it on the fly. We should try to allocate it only once at the start, put it on the GPU, and reuse it as we go along. This makes it easy to then just work on existing memory in a speedy way.
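To make the handle idea concrete, here is a minimal sketch of what such a scratch handle could look like. Everything in it is hypothetical and only for illustration: the module name `scratch_handle_mod`, the type `scratch_handle`, and the routines `scratch_reserve`, `scratch_release` and `smooth_in_place` are my own placeholders. It is plain host-side Fortran; the comments mark where device allocation and release would plausibly go (for example with OpenMP target or OpenACC data directives, which is an assumption about the programming model and not something fixed yet).

```fortran
module scratch_handle_mod
  implicit none
  private
  public :: scratch_handle, scratch_reserve, scratch_release, smooth_in_place

  ! Hypothetical handle: it owns one work array that is allocated up front
  ! and reused by every routine that needs temporary storage.
  type :: scratch_handle
    real, allocatable :: work(:)
    integer :: n_alloc = 0   ! number of real allocations, to show the reuse
  end type scratch_handle

contains

  ! Make sure the scratch array holds at least n elements.
  ! It only (re)allocates when the current array is missing or too small.
  subroutine scratch_reserve(h, n)
    type(scratch_handle), intent(inout) :: h
    integer, intent(in) :: n
    if (allocated(h%work)) then
      if (size(h%work) >= n) return   ! large enough already: reuse as is
      deallocate(h%work)              ! (a device copy would be released here)
    end if
    allocate(h%work(n))
    h%n_alloc = h%n_alloc + 1
    ! Assumption: this is where the array would be created on the device,
    ! once, so that later kernels find it already resident.
  end subroutine scratch_reserve

  ! Give the memory back; the handle can be reserved again afterwards.
  subroutine scratch_release(h)
    type(scratch_handle), intent(inout) :: h
    ! Assumption: a device copy would be released here, before the host one.
    if (allocated(h%work)) deallocate(h%work)
  end subroutine scratch_release

  ! Example consumer: in-place three-point average of the interior points.
  ! It needs a copy of the old values and borrows the scratch array for it
  ! instead of allocating a temporary on the fly.
  subroutine smooth_in_place(h, x)
    type(scratch_handle), intent(inout) :: h
    real, intent(inout) :: x(:)
    integer :: i, n
    n = size(x)
    if (.not. allocated(h%work)) error stop "scratch not reserved"
    if (size(h%work) < n) error stop "scratch too small"
    h%work(1:n) = x
    do i = 2, n - 1
      x(i) = (h%work(i-1) + h%work(i) + h%work(i+1)) / 3.0
    end do
  end subroutine smooth_in_place

end module scratch_handle_mod
```

The intended usage is to call `scratch_reserve` once at start-up with the largest size you expect, pass the handle to whatever routine needs temporary storage, and call `scratch_release` at the end. Because the handle is an ordinary derived type with an allocatable component, Fortran also deallocates the array automatically when the handle goes out of scope, which is part of what makes this pattern mostly memory safe.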