AI assisted coding in Fortran: Software Architecture and GPUs
Table of Contents
What I mean by “vibe coding”#
Not everyone uses “vibe coding” the same way. The way I do it is closer to what people call AI-assisted or AI-driven programming: I still read the code, I still audit what gets written, and I push back before anything is even compiled. My vibes are mostly the directions I give the AI on how to implement something, or what I want out of a routine.
Introduction#
AI-assisted coding is extremely powerful — probably the best showcase of how good it can be is starting a new project from scratch. Personally, I have found it cumbersome in large codebases, because of the upfront cost of letting the AI explore the codebase. You need enough knowledge of the codebase to interpret the explanations the AI conjures from its explorations.
This is why I think an experienced user of a large codebase can use AI tools to effectively map the codebase and make it friendlier to non-experts. A good example is my work in the MOM6 codebase: using an AI to disentangle the data dependencies would have been impossible without working closely with people who have deep knowledge of the codebase. Otherwise, I’d just be grasping at straws. But this is a great win/win scenario, because an AI-ready codebase is also a more human-friendly codebase. If documentation, code maps, data dependencies, and implementation notes are available for anyone to read, the experience is all too much fun. I will try to write a post on this later.
That’s the legacy-code story. The greenfield story is more fun: creating something new from scratch, watching it grow, and inevitably letting it rot. Hopefully we can avoid the latter by building a good codebase from the start — ideally something useful to someone else, or that serves a community. In the year of our Lord 2026 it feels pointless to design any new scientific application that does not use GPUs, given the monstrous hype. But GPU programming is hard! Data movement — unless you have one of those fancy GH200/GB200 chips, an MI300A, or any of the cool systems-on-a-chip where the CPU and GPU share the same memory address space — is usually a performance bottleneck. Most of us won’t have access to those, so we have to do more work.
Now is the time to start a new project from scratch. If you don’t finish it, you’ll learn a lot. If you do, you’ll have learned a lot and will probably be able to write a nice paper about it. The crucial thing is that you understand the physics and maths of what you are implementing, and that you are familiar with good software architecture. There are many books on this: A Philosophy of Software Design by John Ousterhout, Clean Code by Uncle Bob Martin, and Damian Rouson’s Scientific Software Design (the last one is very Fortran/C, the first two more Java/OO). They are all heavy on design patterns, and following any of these you will find good resources elsewhere on YouTube and the wider internet for the new scientific programmer. Worth noting: not all of these concepts apply cleanly to scientific programming — take them with a grain of salt.
For example, I would appreciate a comment when someone writes code that solves certain equations and there are parameters used in the formula. “Where does this CONSTANT_ALPHA come from?” — a useful comment would say “This is the Laplacian of equation 15 shown in the paper with DOI: XYZ”.
In a few words, the best thing to take from those books: build code that is easy to extend, refactor, and test.
Tests are the most important thing on the planet. If your code does not have unit tests, your code is legacy code. If your code has unit tests but no regression tests (testing physics and performance), you need to add them. Tests should be wired into the build system and easy to run. If you have CI set up, tests should run on every push, not only on pull requests — fast feedback per commit beats waiting until a PR opens. But Jorge, this is too much, my tests take 18 hours to run. Well my friend, your tests are bad. Ideally unit tests run in under 10 minutes, and that is pushing it. The goal is high coverage, low runtime — fast feedback or bust.
Programming practices: productivity versus performance#
I like Fortran. I don’t think I have tried to hide this recently. It is a great language that, if used correctly, looks as simple and pretty as Python or Julia — just without strings. Let’s not talk about strings.
We will use Fortran for compute-heavy tasks and a productivity-oriented language like Python or Julia to orchestrate workflows and run our simulations. Because I am old school, I dislike depending on mpi4py; I prefer to do my MPI in the language my compute-heavy workloads live in. Plenty of people have had great success with mpi4py, but I have had bad experiences with it, especially when GPU-aware MPI is required throughout the simulation.
The rule, then, is: identify the hot loops and write those in Fortran, do communication in Fortran, and orchestrate in Python. To do this, we expose a C-like API that Python can consume and we compile our code into a shared library (when Python interoperability is requested). The best way is a handle-based approach, where the Python simulation produces a handle that holds the state of the Fortran program — avoiding global state. This way we can have multiple simulation handles at the same time and use Python threads if we want to.
Second — we need to understand how GPU programming works. In a few words: keep the GPU busy with compute and avoid as many unnecessary memory transfers as you can. We will always need to transfer some things; if we want to print the current mass, energy, or volume to console, we need to send it back from the device to the host. This is an easy gotcha: people often transfer the whole array back and perform the global sum on the host. This creates a serialization point — the array has to come back (unless you are using asynchronous queues, which, if you are transferring whole arrays, you probably are not). The right thing is to compute the total sum on the device and transfer the single scalar total_quantity for printing. Instead of transferring n_elem*data_size, we transfer 1*data_size. What a win.
Next, minimize on-the-fly allocations. If computations need buffers or workspaces, allocate them once at the start of the simulation and carry them around in an owned object accessible to device functions. This way memory is allocated once and reused throughout. Otherwise, on-the-fly allocations will eat your compute up.
Following from this: allocate and transfer most data to the device at the start, not as it is created or needed. All arrays and variables that will be needed should be allocated up front and preferably carried around in a “resources” object that handles all the accounting you need. A program should look like:
program main
use app_state, only: state_t
implicit none
type(state_t) :: state
call initialize_gpu(state)
!! initialize all necessary variables, arrays
call run_simulation(state)
call finalize_gpu(state)
end program main
An older-style code might allocate new memory throughout run_simulation. Inside run_simulation we are only doing compute: no allocations, no transfers (unless needed for I/O). This maximizes utilization of the GPU.
Carrying a big state variable around has its drawbacks: you might feel tempted to fill it with more and more arrays and variables, leading to a “God module” that dictates the entirety of the program’s execution and lifetime. Not good. The fix is separation of concerns. You can carry a big module — let’s just decompose it. Better to have:
type :: state_t
type(grid_t) :: my_grid
type(buffer_t) :: my_buffer
type(resources_t) :: my_resources
contains
procedure :: init => state_init
procedure :: destroy => state_destroy
end type state_t
than:
type :: state_t
integer :: nx
integer :: ny
integer :: nz
real(dp) :: dx
real(dp) :: dy
real(dp) :: dz
integer :: n_grid_points
real(dp), allocatable :: buffer_1d(:)
real(dp), allocatable :: buffer_2d(:,:)
type(mpi_comm) :: my_comm
type(blas_handle) :: my_handle
end type state_t
The composed version is cleaner, the concerns are clearly separated, and it is simpler to handle and extend. This also lends itself well to GPUs — especially if you want to stay in Fortran and use directives to move data around. At every init for every derived type you can use the right map(alloc:...) or map(to:...) clause and you will know the lifetime of the variable on both CPU and GPU. The flat design’s lack of control is what makes porting existing codebases such a challenge: keeping tabs on where the data lives and how it is used is very difficult.
For CPU/GPU transfers, it is convenient to create copy or update routines. For example:
subroutine copy_state_snapshot(state, buffer)
type(state_t), intent(inout) :: state
type(buffer_t), intent(inout) :: buffer
if (state%is_synched) return ! buffer already reflects current device state
!$omp target update from(state%buffer_1d, state%buffer_2d)
buffer%buffer_1d_cpu = state%buffer_1d
buffer%buffer_2d_cpu = state%buffer_2d
state%is_synched = .true.
end subroutine copy_state_snapshot
And when the device-side state changes:
subroutine advance_state(state)
type(state_t), intent(inout) :: state
!$omp target teams distribute parallel do
do i = 1, state%n_grid_points
! ... update state%buffer_1d, state%buffer_2d on device ...
end do
state%is_synched = .false.
end subroutine advance_state
You don’t need to think about this in CPU-only programming, but on GPUs, keeping tabs on your data is very important.
Initializing data on the GPU#
A lot of arrays are initialized to a value that is known at compile time (0.0) or that can be read from input. If you know what you are initializing your array to at compile time, you can avoid unnecessary copies. For example:
real(dp), allocatable :: a(:)
allocate(a(1024))
a = 0.0
!$omp target enter data map(to: a)
This initializes the array on the host and then transfers 1024*8 bytes to the device over PCIe. Instead, you can:
real(dp), allocatable :: a(:)
integer :: i
allocate(a(1024))
!$omp target enter data map(alloc:a)
!$omp target teams distribute parallel do
do i = 1, 1024
a(i) = 0.0
end do
!$omp target exit data map(delete:a)
This way you do not copy anything at all: the array a is allocated and initialized directly on the device. There are times when this isn’t possible — for example, reading a runtime array from disk. Those need to be initialized on the host and transferred.
Can I vibe code now?#
We’ve covered architecture, which matters a lot. You also need to understand the limitations of directive-based GPU offloading. You might be tempted to go overboard with the object-orientation features in Fortran — some are nasty to map to GPUs, or the compiler will simply not do them correctly. Test things thoroughly, write unit tests, and write minimum reproducing examples (MREs) that capture the patterns you want to try. In the next post I’ll go deeper into the Fortran side: which language features play well with AI assistants, which to avoid, and how to guide a model to write code that the compiler will actually accept on the GPU.