AI assisted coding in Fortran
Disclaimer
First, I will acknowledge that not everyone thinks of vibe coding as the same concept. The way I “vibe code” is actually closer to what people describe as AI assisted/driven programming. That is, I still read the code, I still audit what has been written, and I push back before the code is even compiled. My vibes are mostly in the form of directions I want the AI to follow on how to implement something, or what I want out of a routine.
Introduction
AI assisted coding is extremely powerful - probably the best showcase of how good it can be is starting a new project from scratch. I, personally, have felt that using it in a large codebase can be cumbersome, due to the initial barrier of letting the AI explore the codebase. One needs to have enough knowledge about the codebase to understand and interpret the explanations the AI can conjure from its explorations.
Therefore, I believe that an experienced user of a large codebase can use an AI tool to effectively map the codebase and make it friendlier for non-experts. A good example is my work in the MOM6 codebase: using an AI to disentangle the data dependencies would have been impossible without working in close collaboration with people with deep knowledge of the codebase. Otherwise, it felt like I was grasping at straws. This is, however, a great win/win scenario because, I believe, an AI ready codebase is also a friendlier codebase for humans. If there is documentation, code maps, data dependencies, implementation notes, etc. available for anyone to read and learn from, it will make the experience all the more fun!
But the most fun part of “vibe coding” is creating something new from scratch, watching it grow, and inevitably letting it rot. Hopefully we can avoid the latter by creating a good codebase from the start - and also by building something that someone else could find useful, or that could serve a community. In the year of our Lord 2026 it feels pointless to design any new scientific application that does not use GPUs, since the hype cannot be more monstrous at the moment! However, GPU programming is hard! Data movement is expensive unless you have one of those fancy GH/B-something-hundreds, an MI300A, or any of the cool systems on a chip where the CPU and the GPU share the same memory address space. This is not the case for most of us. So we have to keep in mind how data is moved to and from the device.
I believe that now is the time to think “how about I start this project from scratch” - if you don’t finish it you will learn a lot! If you finish it you will have learned a lot and will probably be able to write a nice paper about what you did. The crucial thing here is that you understand the physics/maths of what you are implementing and that you are familiar with the best programming practices for good software architecture/engineering. There are many, many books out there that focus on this: “A Philosophy of Software Design” by John Ousterhout, “Clean Code” by Robert “Uncle Bob” Martin, and Damian Rouson’s “Scientific Software Design” (this one is quite Fortran/C oriented, whereas the first two are more Java/object oriented). They are all heavy on design patterns, and starting from those three books one can find extremely good resources on YouTube and the internet in general to get a grasp on other books and materials useful to the new scientific programmer. It is worth noting that not all the concepts there are applicable to scientific programming, so take them with a grain of salt.
For example, I would appreciate a comment when someone writes code that solves certain equations and there are parameters used in the formula. When a reader asks “where does this CONSTANT_ALPHA come from?”, a useful comment says “This is the Laplacian of equation 15 shown in the paper with DOI: XYZ”.
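As a tiny sketch of the kind of comment I mean (the constant name, value, and DOI are made up):

module physics_constants
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   ! Laplacian coefficient from equation 15 of the paper with DOI: XYZ.
   ! Hypothetical value; see that reference for the derivation and the units.
   real(dp), parameter :: CONSTANT_ALPHA = 0.25_dp
end module physics_constants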
In very few words, the best thing to take from all of those books is to think about how to build code that is easy to extend, refactor, and add tests to.
Tests are the most important thing on the planet. If your code does not have unit tests, your code is legacy code. If your code is unit tested but you have no regression tests (testing physics/performance), you need to add them. Tests should be wired into the build system and be easy to run. If you have Continuous Integration (CI) set up, they should preferably run on push, not on pull request - on push. “But Jorge, this is too much, my tests take 18 hours to run.” Well my friend, your tests are bad. Ideally you should be able to run your unit tests in less than 10 minutes, and that is pushing it. Maximize test coverage while minimizing execution time - you want a fast feedback loop between the developer and the testing suite. The less the dev is waiting, the more code we can write.
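As a minimal sketch of what I mean by cheap, fast unit tests (the quantity being checked here is hypothetical), a plain Fortran test driver returns a nonzero exit code on failure, which is all the build system and CI need:

program test_total_mass
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   real(dp) :: field(4), expected, got

   ! Hypothetical check: the global sum of a known field has a known value.
   field = [1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp]
   expected = 10.0_dp
   got = sum(field)          ! replace with a call to the routine under test
   if (abs(got - expected) > 1.0e-12_dp) then
      print *, "FAIL: total mass, got =", got, " expected =", expected
      error stop 1
   end if
   print *, "PASS: total mass"
end program test_total_mass

A handful of such programs registered with CTest (or run via fpm test) keeps the feedback loop in seconds, not hours.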
Programming practices: productivity versus performance
I like Fortran, I don’t think I have tried to hide this at all recently. I think it is a great language that, if used correctly, looks as simple and pretty as Python or Julia - just without strings; let’s not talk about strings.
We will use Fortran for our compute heavy tasks and we will use a productivity oriented language such as Python or Julia to orchestrate workflows and run our simulations. Because I am old school, I dislike depending on mpi4py; I prefer to do my MPI in the language my compute heavy workloads are in. A lot of people have had great success with mpi4py, but I have had some bad experiences using it, especially if GPU aware MPI is required throughout the simulations. So, the rule is that we will identify the hot loops and write those in Fortran, the communication will be done in Fortran, and the orchestration will be done in Python. For this, we need to expose a C-like API that Python can consume, and we will compile our code into a shared library (if Python interoperability is requested). The best way to do this is with a handle based approach, where the Python side obtains a handle that holds the state of the Fortran program, thus avoiding global state. This way we can have multiple simulation handles at the same time and use Python threads if we wanted to.
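A minimal sketch of what I mean by a handle based API (all the names here are made up, and the kernel is a placeholder): the Fortran side hands back an opaque pointer that owns all the state, so nothing lives in module-level globals and several simulations can coexist.

module sim_api
   use, intrinsic :: iso_c_binding, only: c_ptr, c_loc, c_f_pointer, c_double, c_int
   implicit none
   private
   public :: sim_create, sim_step, sim_destroy

   ! All simulation state lives in this type; no module-level globals.
   type :: sim_state_t
      integer :: nx = 0
      real(c_double), allocatable :: field(:)
   end type sim_state_t

contains

   ! Create a simulation and return an opaque handle that Python (ctypes/cffi) can hold.
   function sim_create(nx) result(handle) bind(C, name="sim_create")
      integer(c_int), value :: nx
      type(c_ptr) :: handle
      type(sim_state_t), pointer :: state
      allocate(state)
      state%nx = nx
      allocate(state%field(nx), source=0.0_c_double)
      handle = c_loc(state)
   end function sim_create

   ! Advance the simulation owned by this handle.
   subroutine sim_step(handle, dt) bind(C, name="sim_step")
      type(c_ptr), value :: handle
      real(c_double), value :: dt
      type(sim_state_t), pointer :: state
      call c_f_pointer(handle, state)
      state%field = state%field + dt   ! placeholder for the real compute kernel
   end subroutine sim_step

   ! Release everything owned by this handle.
   subroutine sim_destroy(handle) bind(C, name="sim_destroy")
      type(c_ptr), value :: handle
      type(sim_state_t), pointer :: state
      call c_f_pointer(handle, state)
      deallocate(state%field)
      deallocate(state)
   end subroutine sim_destroy

end module sim_api

Compiled into a shared library, Python loads it with ctypes, keeps the returned void pointer, and passes it back to sim_step and sim_destroy; two handles are simply two independent simulations.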
Second - we need to understand how GPU programming works. In a few words, we want to keep the GPU busy with compute and avoid as many unnecessary memory transfers as we can. We will always need to transfer something: if we want to print to console the current mass, energy, volume, etc. that the simulation holds, we need to send it back from where it was computed (the device) to the host for printing. This is an easy gotcha; people oftentimes transfer back the whole array that holds the information and perform the global sum to compute the total mass on the host. This creates a serialization point: we have to transfer the array back (unless you are using asynchronous queues, and if you are transferring whole arrays, chances are you’re not). The best thing to do here is to compute the total sum on the device and transfer the single scalar that is the total_quantity for printing. Instead of transferring n_elem*data_size we transfer 1*data_size. What a win!
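A sketch of that pattern with OpenACC directives (an OpenMP target reduction would look much the same; the routine and array names are made up): the sum is computed on the device and only the scalar result comes back to the host.

subroutine report_total_mass(mass, n)
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   integer, intent(in) :: n
   real(dp), intent(in) :: mass(n)   ! assumed to already be present on the device
   real(dp) :: total_mass
   integer :: i

   total_mass = 0.0_dp
   ! The reduction runs on the device; only the scalar crosses back, not the array.
   !$acc parallel loop present(mass) reduction(+:total_mass)
   do i = 1, n
      total_mass = total_mass + mass(i)
   end do
   print *, "total mass =", total_mass
end subroutine report_total_mass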
Then, we want to minimize on-the-fly allocations. If computations need buffers, workspaces, etc., it is good practice to allocate that memory at the start of the simulation and carry it around in an owned object that is accessible to the device functions. This way memory gets allocated once and is reused throughout the simulation. Otherwise, on-the-fly allocations will eat your compute up.
Following the previous lesson, we want to allocate and transfer most data to the device at the start, not as it is created/needed. All arrays, variables, etc. that will be needed should be allocated at the start and preferably carried around in a “resources” object that handles all the accounting you will need. So basically, a program should look like:
program main
   use app_state, only: state_t, initialize_gpu, run_simulation, finalize_gpu
   implicit none
   type(state_t) :: state

   call initialize_gpu(state)
   !! initialize all necessary variables, arrays
   call run_simulation(state)
   call finalize_gpu(state)
end program main
whereas an older style code might allocate new memory throughout the run_simulation routine. Inside run_simulation we are only doing compute: no allocations, no transfers (unless needed for I/O), etc. This maximizes the utilization of the GPU.
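To make the skeleton above concrete, here is a sketch of what app_state could look like, assuming OpenACC for the device data management (the field names, sizes, and the kernel are placeholders): everything is allocated and moved to the device once in initialize_gpu, reused inside run_simulation, and released once in finalize_gpu.

module app_state
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   private
   public :: state_t, initialize_gpu, run_simulation, finalize_gpu

   ! Everything the simulation needs: fields plus scratch space, allocated once.
   type :: state_t
      integer :: n = 0
      real(dp), allocatable :: u(:)      ! prognostic field (hypothetical)
      real(dp), allocatable :: work(:)   ! scratch buffer reused by all kernels
   end type state_t

contains

   subroutine initialize_gpu(state)
      type(state_t), intent(inout) :: state
      state%n = 1024
      allocate(state%u(state%n), source=1.0_dp)
      allocate(state%work(state%n), source=0.0_dp)
      ! One-time device allocation and host-to-device transfer.
      !$acc enter data copyin(state%u) create(state%work)
   end subroutine initialize_gpu

   subroutine run_simulation(state)
      type(state_t), intent(inout) :: state
      integer :: step, i
      do step = 1, 100
         ! Compute only: data already resides on the device, no allocations here.
         !$acc parallel loop present(state%u, state%work)
         do i = 1, state%n
            state%work(i) = 0.5_dp * state%u(i)   ! placeholder kernel
            state%u(i) = state%u(i) + state%work(i)
         end do
      end do
   end subroutine run_simulation

   subroutine finalize_gpu(state)
      type(state_t), intent(inout) :: state
      ! Bring the result back once, then release device and host memory.
      !$acc exit data copyout(state%u) delete(state%work)
      deallocate(state%u, state%work)
   end subroutine finalize_gpu

end module app_state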