This will be an informative rant and a way for me to learn more things.

Bitwise reproducibility

This means that when you run a simulation twice and compare the output of both runs, you should see the exact same answers, bit for bit, not just agreement within roundoff error. I.e. if the simulations differ at the 1e-16 level, they are not bitwise reproducible. If you know anything about compilers, optimizations, math libraries, etc. you will correctly think “oh god, this sounds awful”, and yeah, you are kinda right.

Let’s talk first about floating point arithmetic without going too far down this rabbit hole. FP arithmetic is not associative, i.e. (a + b) + c and a + (b + c) can differ. Both results are correct to within roundoff, but they are not guaranteed to be bitwise identical.
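A quick way to convince yourself of this is a three-number sum. This is only a small sketch with values I picked for illustration; Python floats are IEEE 754 doubles, so the same thing happens in any language using doubles.

```python
# (a + b) + c versus a + (b + c) with IEEE 754 doubles (Python floats).
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False

# float.hex() prints the exact stored value, so a last-bit mismatch is visible.
print(left.hex(), right.hex())

# Both answers are fine "within roundoff", they are just not the same bits.
print(abs(left - right) < 1e-15)  # True
```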

What else can change the bits in an answer? Most of these things happen in reduction operators, where a global sum over a shared variable is performed. If you are performing an MPI_Reduce across one, two, or N ranks, the order in which the operation (a sum, say) gets performed is not guaranteed to be the same in every run. The same goes for a reduction over threads. In a few words, the order of operations in a reduction is non-deterministic. The answers will agree to within roundoff, but they won’t necessarily be bitwise identical.
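To mimic this without MPI, the sketch below pretends that three ranks each hold a partial sum and that the partials get combined in whatever order they arrive. The `reduce_in_order` helper and the partial values are made up for illustration; a real MPI_Reduce is free to use a different combination tree altogether.

```python
from functools import reduce
from itertools import permutations


def reduce_in_order(partials, order):
    """Sum per-rank partial results in the given arrival order."""
    return reduce(lambda acc, i: acc + partials[i], order, 0.0)


# Hypothetical partial sums coming from three ranks.
partials = [0.1, 0.2, 0.3]

# Try every possible arrival order and record the exact bits of each total.
results = {
    order: reduce_in_order(partials, order)
    for order in permutations(range(len(partials)))
}
for order, total in results.items():
    print(order, total.hex())

# More than one distinct bit pattern shows up across arrival orders.
distinct = set(results.values())
print(len(distinct) > 1)  # True
```

Fixing the combination order (always rank 0, then 1, then 2, for example) removes this particular source of noise, at the cost of giving up whatever freedom the library was using to go faster.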

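What else? Fused multiply-adds, i.e. a*b + c can be transformed into a single FMA operation. An FMA rounds once instead of rounding after the multiplication and again after the addition, so the result is not guaranteed to be bitwise identical to the unfused version.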
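Here is a small sketch of the effect. Since I don’t want to depend on a hardware FMA being available, `fma_emulated` is a hypothetical helper that computes a*b + c exactly with rationals and rounds once at the end, which is what an FMA does. The inputs are chosen by hand so that the two roundings of the unfused version throw away the interesting bits.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """a*b + c computed exactly, then rounded to a double once (FMA behaviour)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Hand-picked inputs: a*b is exactly 1 + 2**-26 + 2**-54, and the last term
# is lost when the product is rounded to a double on its own.
a = 1.0 + 2.0**-27
b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

unfused = a * b + c            # round after the multiply, round after the add
fused = fma_emulated(a, b, c)  # single rounding at the end

print(unfused)            # 0.0
print(fused == 2.0**-54)  # True
```

Neither answer is “wrong”, but a compiler that silently contracts a*b + c on one build and not on another will break bitwise reproducibility.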

What else? Transcendental functions such as sin, cos, and exp can be implemented differently in the underlying math library that the language exposes, so the same call can produce slightly different answers depending on the library. This can be particularly prevalent when doing GPU computations.
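As a rough illustration, the sketch below compares the library sine against a second, made-up implementation (`taylor_sin`, a plain Taylor series that is only sensible for small arguments) and reports how many representable doubles apart the two answers are. The helper names are mine, and the Taylor series is just a stand-in for “some other math library”.

```python
import math
import struct


def ordered_bits(x):
    """Map a double to an integer that sorts the same way the doubles do."""
    (u,) = struct.unpack("<q", struct.pack("<d", x))
    return u if u >= 0 else -(u & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a, b):
    """Number of representable doubles between a and b (0 means same bits)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def taylor_sin(x, terms=20):
    """Hypothetical alternative sine: truncated Taylor series, small |x| only."""
    total, term = 0.0, x
    for n in range(terms):
        total += term
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total


for x in (0.1, 0.5, 1.0, 1.5):
    lib, alt = math.sin(x), taylor_sin(x)
    # Close within roundoff, but the ULP distance is not guaranteed to be 0.
    print(x, math.isclose(lib, alt, rel_tol=1e-13), ulp_distance(lib, alt))
```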

Why do we torture ourselves?

Well, in the physical oceanography community bitwise reproducibility ensures that extremely chaotic and convoluted future predictions can be reproduced and tested, provided the same hardware and compilers are available. It is important to note that computations are to be bitwise reproducible on the same machine they are run on - cross-machine bitwise reproducibility requires much more care.

The torture also has its benefits! It is a way to guarantee that answers remain correct and that you haven’t broken any physics with your optimizations, refactors, etc. I.e. if you parallelize a routine and obtain the same answers as before, you have a guarantee that what you did is correct. There is no “oh, there’s a 1e-16 difference, I am sure that is ok” - that kind of thing can come back to bite you later if it turns out not to be “ok”.
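In practice this means comparing bits, not comparing with a tolerance. The sketch below is one way to do that, assuming the answers are available as a flat list of doubles; `fingerprint` is a hypothetical helper and the values are invented, it is not how any particular model does its checks.

```python
import hashlib
import math
import struct


def fingerprint(values):
    """SHA-256 over the raw IEEE 754 bytes of a sequence of doubles."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


baseline = [0.1 + (0.2 + 0.3), 1.0, 2.5]    # answers before a refactor
refactored = [(0.1 + 0.2) + 0.3, 1.0, 2.5]  # same math, different order

# A tolerance-based check happily passes...
print(all(math.isclose(x, y) for x, y in zip(baseline, refactored)))  # True

# ...while the bitwise check catches the change.
print(fingerprint(baseline) == fingerprint(refactored))  # False
```

Hashing the bytes rather than using `==` on the floats also catches things like 0.0 versus -0.0, which compare equal but are different bit patterns.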

This, however, brings a crucial difficulty: not only are you fighting physics and algorithms, but your next enemy is the compiler, which is either doing things it is not telling you about or not doing things you expected it to do.

Ensuring bitwise reproducibility in large codebases

My only experience so far has been with MOM6 (the Modular Ocean Model) and the team of developers porting the code to GPUs. I have been helping with this for a bit and we are making good progress. We’ve reported a lot of bugs to the NVIDIA compiler people and we have gotten successful runs on one and multiple GPUs showcasing good speedups.

When I started I asked “how do I know that what I did is correct?” That’s when I was told that answers had to be bitwise reproducible. Coming from the land of quantum chemistry, where you accept anything beyond the SCF convergence as an acceptable error in most things, this felt like an extreme handicap on what we could do. The more time I’ve spent on this task, the more I appreciate it. I didn’t have to learn the physics and maths of ocean circulation to understand what I was doing - however, I’ve spent a lot of my free time recently doing exactly that, but that is another story.

In short, MOM6 has a handful of rules to make sure you are not breaking bitwise reproducibility: use safe intrinsics, turn off FMAs, test the optimization level, and be very deliberate about parentheses and about how you raise to a power.

Bitwise reproducibility between CPU and GPU

Basically, if we turn off FMAs on the device, do not use reductions (unless hand coded), and ensure that all the arithmetic being done is plain IEEE operations, i.e. addition, multiplication, subtraction, and division, we are able to get bitwise reproducible answers between the host and the device (using the same compiler).
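To give a flavour of what a “hand coded” reduction can look like, here is one generic idea: accumulate exactly and round only once at the end, so the order of the terms cannot matter. This is just a sketch of the principle using Python rationals, with made-up values and helper names, and I am not claiming it is what MOM6 or any GPU code actually does.

```python
import random
from fractions import Fraction


def naive_sum(values):
    """Plain left-to-right floating point sum: the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total


def exact_sum(values):
    """Accumulate exactly as rationals, then round to a double once."""
    return float(sum(Fraction(v) for v in values))


values = [1e16, 1.0, -1e16, 1.0, 0.1, 0.2, 0.3]
rng = random.Random(0)

naive_bits, exact_bits = set(), set()
for _ in range(20):
    shuffled = values[:]
    rng.shuffle(shuffled)
    naive_bits.add(naive_sum(shuffled).hex())
    exact_bits.add(exact_sum(shuffled).hex())

print(len(exact_bits))  # 1: same bits for every ordering
print(len(naive_bits))  # how many distinct answers the naive sum produced
```

The price is speed: exact accumulation is far slower than a plain sum, which is why you only reach for something like this where a reduction is truly unavoidable.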

What happens with transcendentals?

Well, so far we hadn’t really had a problem because, luckily, we hadn’t needed to run them on the device. This changed when I started to port the ALE framework, which has calls to cube roots, exponentials, some square roots in weird places, and a sin/cos. This is when the fire nation attacked. You will also see these whenever tides are involved! So we better get this right.

Additionally, we got to a point where running on different architectures started to be annoying. We had someone running on AMD CPUs, I was running on Intel, and someone else was on ARM - the more cross-architecture work we did, the more things seemed to pop up.

In the end, it seems like ensuring that arithmetic is done in a reproducible way is a very complex beast to handle!