AI assisted coding in Fortran: Using GPUs via OpenMP in Fortran
Introduction#
I have covered some basic ideas already: in the first post of this series I made the case that Fortran is a smaller language, focused on scientific computing, and therefore easy to learn and get really good at.
If you do not know anything about coding, you may want to start with my earlier posts about getting started and about programming languages in general. In a third post I talked about good practices and how to architect a project. In my previous post I talked a bit about the architecture needed to start a new Fortran project if you want to target GPUs. In this post I will go into more detail about how to set this up, how to think about scale, and how big your project can get.
Background#
So before we start, I’ll clarify that my background is in High Performance Computing (HPC), so I like to think about doing things at scale, i.e. my aim is to use 25% or more of the world’s largest supercomputers if the problem can be parallelized enough. Some problems simply cannot be run at that scale because of excessive serial bottlenecks.
Amdahl’s Law: this is the crucial barrier we will hit at some point when writing parallel code. In plain English: “you are only as fast as your slowest bit”. At some point a serial section of your code will limit all parallel progress, no matter how many processors you add. Our aim is to minimize these serial sections so that we can exploit parallelism as much as possible.
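In symbols (using s for the serial fraction of the runtime and N for the number of workers, names of my choosing), the achievable speedup is bounded by:

```
S(N) = 1 / (s + (1 - s) / N)   →   1 / s   as N → ∞
```

So with even 5% serial code (s = 0.05), the speedup can never exceed 20x, no matter how many nodes you throw at the problem.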
So from the start we will design our applications with the distributed memory in mind and, of course, with GPUs in mind.
Using GPUs from Fortran#
There are multiple ways of accessing the power of GPUs from Fortran, some of them are simple, some of them are good yet annoying. Let’s discuss them a bit.
Using OpenMP#
OpenMP is a widely used standard for parallel computing, and it is the best way to access shared-memory parallelism (within a single node) on the CPU. It is used from C, C++, and Fortran, and now even from Python. The OpenMP community is big, and bugs get reported and fixed all the time. OpenMP is very nice because it works via directives, which the compiler simply ignores if OpenMP is not requested. The key thing to keep in mind is that the available features depend on the compiler, so if your compiler has not implemented something you want to use, you will need to switch to another compiler.
For GPUs, the best compiler out there is NVIDIA’s Fortran compiler (nvfortran); it has its kinks, but it offers the best performance available. gfortran and flang are catching up, but they still have some work to do.
For CPU parallelism, one can parallelize the following code by simply adding !$omp parallel do:
block
  real, allocatable :: a(:,:,:)
  real, allocatable :: b(:,:,:)
  real, allocatable :: c(:,:,:)
  real :: alpha
  integer :: i, j, k, dims

  alpha = 2.0
  dims = 1024
  allocate(a(dims,dims,dims), b(dims,dims,dims), c(dims,dims,dims))
  a = 1.0
  b = 2.0
  c = 0.0
  !$omp parallel do collapse(3)
  do k = 1, dims
    do j = 1, dims
      do i = 1, dims   ! innermost loop over the first index: column-major friendly
        c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
      end do
    end do
  end do
end block
I cheated here because I added collapse(3), which tells the compiler to fuse the three nested loops into a single iteration space whose iterations can all be distributed among threads; this is done for performance reasons. To run this on the GPU, we can simply change the OpenMP directive to !$omp target teams loop collapse(3) map(tofrom:a,b,c). This will copy the a, b, c arrays to the GPU, perform the operation there, and copy them back for further use. It is also extremely wasteful, because there are unnecessary copies between the host and the device. We can do better:
block
  real, allocatable :: a(:,:,:)
  real, allocatable :: b(:,:,:)
  real, allocatable :: c(:,:,:)
  real :: alpha
  integer :: i, j, k, dims

  alpha = 2.0
  dims = 1024
  allocate(a(dims,dims,dims), b(dims,dims,dims), c(dims,dims,dims))
  !$omp target enter data map(alloc:a,b,c)
  !$omp target teams loop collapse(3)
  do k = 1, dims ; do j = 1, dims ; do i = 1, dims
    a(i,j,k) = 1.0
    b(i,j,k) = 2.0
    c(i,j,k) = 0.0
  end do ; end do ; end do
  !$omp target teams loop collapse(3)
  do k = 1, dims
    do j = 1, dims
      do i = 1, dims
        c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
      end do
    end do
  end do
  !$omp target update from(c)
  !$omp target exit data map(delete:a,b,c)
end block
Now we have introduced a couple of extra things. We have used the map(alloc:...) clause on the !$omp target enter data directive. This, as discussed in the previous post, allocates the arrays directly on the device. We then initialize them on the device using a !$omp target teams loop collapse(3), and perform the 3D saxpy-like operation entirely on the GPU. The only transfer that happens is copying c back to the host, which we could skip if we were going to keep using it on the device across the program; this was just an example. Finally, we delete the variables from the GPU.
One big advantage of using !$omp target teams loop is that it is portable to the CPU, so the same loop can be parallelized on both the CPU and the GPU. You can control where it runs with the environment variable OMP_TARGET_OFFLOAD=MANDATORY/DISABLED. This is a good way to test CPU/GPU correctness and to measure the speedup over the serial and shared-memory versions.
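As a sketch of how that toggling looks in practice (saxpy.f90 is a hypothetical file name standing in for the code above, and the flag shown is nvfortran's; check your compiler's documentation for the equivalent):

```shell
# Build with OpenMP GPU offload enabled (nvfortran's flag for this is -mp=gpu).
nvfortran -mp=gpu -o saxpy saxpy.f90

# Force the target regions onto the GPU; abort instead of silently falling back.
OMP_TARGET_OFFLOAD=MANDATORY ./saxpy

# Same binary, but run the target regions on the host CPU instead.
OMP_TARGET_OFFLOAD=DISABLED ./saxpy
```

Running the same binary both ways and diffing the results is a cheap correctness check before you start chasing speedups.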
Derived types (containers)#
Data is better stored in containers, such as Fortran’s derived types. Say:
type :: my_derived_type
  real, allocatable :: array_1(:,:)
  real, allocatable :: array_2(:,:)
  integer :: dims
contains
  procedure :: initialize_my_type
  procedure :: finalize_my_type
end type my_derived_type

subroutine initialize_my_type(this, dims)
  class(my_derived_type), intent(inout) :: this
  integer, intent(in) :: dims
  this%dims = dims
  allocate(this%array_1(this%dims,this%dims))
  allocate(this%array_2(this%dims,this%dims))
  !$omp target enter data map(alloc:this%array_1, this%array_2)
end subroutine initialize_my_type

subroutine finalize_my_type(this)
  class(my_derived_type), intent(inout) :: this
  !$omp target exit data map(delete:this%array_1, this%array_2)
  deallocate(this%array_1)
  deallocate(this%array_2)
end subroutine finalize_my_type
Defining an initializer and a cleanup routine is very important. I avoided the word finalizer because Fortran has that concept; it is similar to the destructor in C++. You might ask: why not use the final procedure built into the derived type? Well, there are some issues with finalizers. They are called when an object goes out of scope, which can happen when the object is passed into a routine. If that happened mid-execution, our arrays would be wiped: deallocated on the host and deleted from the GPU. So for now, until I understand finalizers better, I prefer to call the cleanup explicitly myself. This introduces the possibility of programmer error and memory leaks, but that is what code review is for.
In this aspect C++ is safer, since the destructor WILL be called but then you have to understand how scope works in C++ and that is also a rabbit hole!
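Putting this together, usage looks something like the sketch below (I am assuming the type and its two routines live in a module I will call my_types_mod; that name is my own, not from the code above):

```fortran
program demo
  use my_types_mod, only: my_derived_type   ! hypothetical module holding the type above
  implicit none
  type(my_derived_type) :: state

  call state%initialize_my_type(1024)   ! host allocation + device allocation in one place
  ! ... work on state%array_1 / state%array_2 inside target regions ...
  call state%finalize_my_type()         ! device cleanup first, then host deallocation
end program demo
```

The point of the pattern is that every path through the program goes through exactly one initialize and one finalize call, so a reviewer can audit device residency in one place.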
Calling functions/subroutines from the GPU#
We are taught that we should write functions and subroutines to encapsulate repeated logic, separate concerns, and so on. This is very important, and it can be done nicely via the !$omp declare target directive, which tells the compiler that a function is supposed to be callable from the device. This can sometimes fail if the function is not in the same compilation unit as its caller, but compiler support is getting better in this regard. One important thing about device functions: avoid printing inside them; let them be pure compute. If you need print-style debugging, write into a buffer array or something similar and print it from the host.
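A minimal sketch of how that looks (the module and function names here are my own invention):

```fortran
module device_math_mod
  implicit none
contains
  ! Marked with declare target so the compiler also emits a device-callable version.
  real function scale_add(x, y, alpha)
    !$omp declare target
    real, intent(in) :: x, y, alpha
    scale_add = alpha * x + y   ! pure compute: no printing inside device code
  end function scale_add
end module device_math_mod
```

Inside a !$omp target teams loop you can then call scale_add(a(i,j,k), b(i,j,k), alpha) like any other function.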
AI assisted good practices#
The general idea the AI agent should internalize is that data movement has to be minimized: !$omp target enter data maps are preferred over map(tofrom:...) clauses at the beginning of loops. Memory should preferably be allocated at the start; on-the-fly allocations should be minimized unless they are done on the host OUTSIDE a hot loop. A hot loop is a place where intensive compute is happening, or a function that gets called from within a hot loop. For example, you don’t want to do:
do k = 1, dims
  do j = 1, dims
    do i = 1, dims
      call my_work_routine(a, b, c)
    end do
  end do
end do

subroutine my_work_routine(a, b, c)
  ...
  allocate(d(size(a), size(a)))
end subroutine my_work_routine
The subroutine my_work_routine will be called O(N^3) times (dims^3), which means that you will allocate d that many times. This is a terrible idea and will 1000% kill your performance, either on the host or the device.
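A sketch of the repaired version of the snippet above: hoist the scratch allocation out of the loop nest and pass the workspace in (the extra argument to my_work_routine is my own change).

```fortran
allocate(d(size(a), size(a)))            ! one allocation, outside the hot loop
do k = 1, dims
  do j = 1, dims
    do i = 1, dims
      call my_work_routine(a, b, c, d)   ! d is reused as workspace on every call
    end do
  end do
end do
deallocate(d)                            ! one matching cleanup
```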
Pure subroutines and functions#
Making a subroutine or a function pure means it has no side effects, i.e. it will not write to a variable outside its scope, print
to console, or do calls to external libraries. These are great because they are thread safe by design. They are callable inside a threaded
or GPU accelerated loop. When developing code using an AI assistant, you should strive to create functions and subroutines that are pure. This
will make debugging easier and parallelization simpler.
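For instance, a pure version of the saxpy kernel's pointwise operation might look like this (the function name is mine):

```fortran
pure real function saxpy_point(x, y, alpha)
  real, intent(in) :: x, y, alpha   ! pure: all arguments are read-only
  saxpy_point = alpha * x + y       ! no I/O, no global state: thread safe by design
end function saxpy_point
```

The compiler enforces the purity constraints for you, which is exactly the kind of guardrail you want when an AI assistant is generating code inside your parallel regions.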