AI assisted coding in Fortran: Using GPUs via standard parallelism
Table of Contents
Introduction (verbatim from last post)#
I have covered some basics things and ideas, in this first post I cover the idea of Fortran being a smaller language, focused on scientific computing and thus being easy to learn and get really good at.
If you do not know anything about coding, maybe you can see this series of posts I did about getting started and about programming languages in general. Then in a third post I talk about the better practices and how to architecture a project. In my previous post I talk a bit about the architecture needed for starting a new project in Fortran if you want to target GPUs. Then in this post I talked about the basics of OpenMP data mapping to enable things running on GPUs.
Background#
The idea behind standard driven parallelism is that the language has a mechanism baked in to signal the developer, the compiler, and then the runtime that something
is safe to be run in parallel, i.e. every single operation inside a block can be executed concurrently. What does this mean? There cannot be data dependencies between
the operations, i.e. if you are doing a loop over i, you cannot depend on i-1 to calculate i+1.
To my knowledge C++ and Fortran have enabled standard driven parallelism and most of the vendors (NVIDIA, Intel, AMD) have enabled a way to extract the parallelism and translate it to their underlying tools. This is mostly done by taking the loop and using their underlying parallel library (OpenACC or OpenMP) to parallelize the loop either on the host or the device.
A good example for parallelism is your average saxpy, a vector sum that does the operation c= alpha * a + b. This is one is a bit too trivial and I usually like to
write a 3D example of it, just because it is more work and you can see things better. I like to call this JAXPY, a Jorge AXPY esque operation. This looks like:
block
integer :: i,j,k
real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
integer, parameters :: dims = 1024
real, parameter :: alpha = 2.0
allocate(a(dims,dims,dims))
allocate(b(dims,dims,dims))
allocate(c(dims,dims,dims))
a = 1.0
b = 2.0
c = 0.0
do j = 1, dims
do i = 1, dims
do k= 1, dims
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
end do
end do
If you are familiar with any thing parallel computing you can see that each operation into c(i,j,k) is independent of each other so in theory we can perform all
dims^3 operations in parallel without having any data race.
Traditionally, you would write this with OpenMP (or OpenACC) like:
!$omp parallel do collapse(3)
do j = 1, dims
do i = 1, dims
do k= 1, dims
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
end do
end do
Note that you can now skip !$omp end parallel do in Fortran, this is indeed very nice. If you are wanting to offload this to a device you can change the clause
to be !$omp target teams loop collapse(3) map(tofrom: a,b,c), but we already know how to better data transfers, so you know we should previously allocate
things, etc.
In Fortran, standard parallelism is achieved through the concept of do concurrent. Do concurrent has some very specific rules, see:
Within the body of a DO CONCURRENT loop the program must respect a long list of restrictions on its use of Fortran language features. Actions that obviously can’t be executed in parallel or that don’t allow all iterations to execute are prohibited. These include:
Control flow statements that would prevent the loop nest from executing all its iterations: RETURN, EXIT, and any GOTO or CYCLE that leaves the construct.
Image control statements: STOP, SYNC, LOCK/UNLOCK, EVENT, and ALLOCATE/DEALLOCATE of a coarray.
Calling a procedure that is not PURE.
Deallocation of any polymorphic entity, as that could cause an impure FINAL subroutine to be called.
Messing with the IEEE floating-point control and status flags.
Accepting some restrictions on data flow between iterations (i.e., none) and on liveness of modified objects after the loop. (The details are spelled out later.)
Obtained from this flang blog post. This blog post is a bit technical if you are trying to understand the nitty griddy details of the do concurrent implementation across different compilers, to the best of our abilities and for our convenience we can say that anything labelled within a do concurrent region is something that can be executed in parallel without data races and other nasty parallel things.
The JAXPY above can be then rewritten as:
do concurrent(j=1:dims, i=1:dims, k=1:dims)
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
You can see that this has a sense of elegance, I like to think. It says this loop can be evaluated in parallel, period. It is condensed so it is similar to writing in the MOM6-esque dialect of doing:
do j=1,dims ; do i=1:dims ; do k=1:dims
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do ; end do ; end do
Depending on the implementation the do concurrent can rearrange the ordering of the loop to maximize its vision of parallelism. This is compiler dependendent. So,
if you were to write a do concurrent and change the order of the j->i->k to k->i->j there is a chance that the compiler will rearrange things so that
the best data layout is ensured. In general, try to keep to Fortran notions of the fast index.
Underneath the hood, the different compilers map do concurrent to something they already know. NVIDIA maps it to OpenACC, Intel and AMD map it to OpenMP, and GNU
does its own pthreads like implementation. Support for parallelism using do concurrent is still flaky at best in GNU but Intel, AMD, and NVIDIA have quite mature
implementations using their preferred parallel backend. There are of course bugs, but we will not find them unless we use it! The flags to enable standard parallelism
in different compilers have been nicely document by Ivan here.
Personally, I have not gotten it to work with GNU to execute in parallel. However, a do concurrent loop seems to be faster than a plain loop because (I believe) the compiler is vectorizing it better.
Device side do concurrent#
It is all pointers in the end. Data is mapped to the GPU and is represented either as a host or device pointer. If you are unlucky and poor like me, you don’t have access to a unified memory machine so you have to keep track of data movement. Do concurrent works exactly as a traditional offloaded loop, if it sees that the things are on the device and you asked for do concurrent to be on the device it will execute it on the device.
A key aspect here is your GPU memory model at compile time. If you used -gpu=mem:managed assuming the nvidia compilers, the runtime will check if the arrays are
present and if not, will move them accordingly. This means that the code will probably do a lot of unecessary memcpys, which make things slow. If you are on a
unified memory architecture machine, you can use -gpu=mem:unified, since things share the address space the number of memcpys should not be there!
This means, that if you have a unified memory arch machine you can write the above JAXPY program without having to do any modifications! Otherwise, you will need to do:
block
integer :: i,j,k
real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
integer, parameters :: dims = 1024
real, parameter :: alpha = 2.0
allocate(a(dims,dims,dims))
allocate(b(dims,dims,dims))
allocate(c(dims,dims,dims))
a = 1.0
b = 2.0
c = 0.0
!$omp target enter data map(to:a,b,c)
do concurrent(j=1:dims, i=1:dims, k=1:dims)
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
!$omp target exit data map(from: a,b,c)
We already know though, that initializing things on the host when they’re only needed on the device is wasteful. Additionally, the a = 1.0 way of initializing can become
a bottleneck for very large arrays, in this case we have three arrays that are 1024^3 so we might want to do a parallel fill of the arrays. We can do that simply by:
block
integer :: i,j,k
real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
integer, parameters :: dims = 1024
real, parameter :: alpha = 2.0
allocate(a(dims,dims,dims))
allocate(b(dims,dims,dims))
allocate(c(dims,dims,dims))
!$omp target enter data map(alloc:a,b,c)
do concurrent(j=1:dims, i=1:dims, k=1:dims)
a(i,j,k) = 1.0
b(i,j,k) = 2.0
c(i,j,k) = 0.0
end do
do concurrent(j=1:dims, i=1:dims, k=1:dims)
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
!$omp target exit data map(from: c)
!$omp target exit data map(delete: a,b,c)
You can ommit the !$omp target data clauses if you are on say a Grace Hopper machine, or an MI300A. However, if you want your code to be portable, you should
always include the data moving directives.
So, why should we use standard parallelism over proven OpenMP/OpenACC clauses? Well, I’d argue that first it makes the code more readable. If I see a do concurrent I can think that everything inside the loop is thread safe and safe to parallelize. It makes reading the code simpler, instead of a triple nested loop we can just have a single expression. Additionally, since the compiler maps it to the preferred parallel runtime, it almost always gets the same performance as OpenMP or OpenACC!
A code without clauses I feel is pretty.
Asynchronous execution#
If you are familiar with GPUs you know there is such a thing as streams, where you can have a data stream and a compute stream and overlap communication and computation. As of now, there is no good way to use this without having to transform your loops to OpenMP or to OpenACC to use either the task or asynch constructs, respectively. So, hopefully we wait for a good way to do this.
Connection to Vibe Coding in Fortran#
Exploiting do concurrent does require you knowing a bit of what your code is doing, because you cannot simply label every loop as concurrent and hope for the best. Because
it is YOU telling the compiler: “HEY, this is safe to execute in parallel.” You are in charge of realizing if something can be actually executed in parallel.
So what does this mean? Well this means you cannot fully vibe code without understanding the problem you are trying to solve. However, if you understand what you are doing
and what you want to achieve, do concurrent is a great way to show parallelism while writing simple loops, instead of having to write explicit cuda kernels.
Suggestions for GPU vibe coding using do concurrent#
The main idea for writing efficient and readable code that executes on the GPUs is similar to what are considered best practices. Minimize allocations on the fly and inside
hot loops, try to be good at keeping track of your data. So basically, if you have a big: call gpu_map_arrays() at the beginning of your program and assume that any other array
that is used throughout the entire computation IS already on the GPU you can forget about eveyrthing OpenMP and OpenACC and simply write do concurrent loops.