Introduction#

I have covered some basic things and ideas already: in the first post of this series I covered the idea of Fortran being a smaller language, focused on scientific computing, and thus easy to learn and get really good at.

If you do not know anything about coding, you may want to start with the earlier posts in this series about getting started and about programming languages in general. In the third post I talked about better practices and how to architect a project. In my previous post I talked a bit about the architecture needed to start a new Fortran project if you want to target GPUs. In this post I will go into more detail about how to set this up, and about how to think about scale and how big your project can get.

Background#

So before we start, I’ll clarify that my experience is in High Performance Computing (HPC), so I like to think about doing things at scale; my aim is to use 25% or more of the world’s largest supercomputers, if the problem can be parallelized enough. Some things just cannot be run at that scale because of excessive serial bottlenecks.

Amdahl’s Law: This is the crucial barrier we will hit at some point when writing parallel code. In plain English: “you are only as fast as your slowest bit”. At some point a serial section of your code will halt all parallel progress, so our aim is to minimize serial sections and exploit parallelism as much as possible.
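
Quantitatively, if $p$ is the fraction of the runtime that can be parallelized and $N$ is the number of processors, the achievable speedup is bounded by

$$S(N) = \frac{1}{(1 - p) + p/N}$$

which tends to $1/(1-p)$ as $N \to \infty$. A code that is 95% parallel can never run more than 20x faster, no matter how many processors you throw at it.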

So from the start we will design our applications with distributed memory in mind and, of course, with GPUs in mind.

Using GPUs from Fortran#

There are multiple ways of accessing the power of GPUs from Fortran; some of them are simple, some of them are good yet annoying. Let’s discuss them a bit.

Using OpenMP#

OpenMP is a widely used standard for parallel computing, and it is the best way to access shared memory parallelism (within a single node) on the CPU. It is used from C, C++, Fortran, and now Python. The OpenMP community is big, and bugs get reported and fixed all the time. OpenMP is very nice because it works via directives, which the compiler can ignore if OpenMP is not requested. The key thing to consider about OpenMP is that the supported features depend on the compiler, so if a compiler has not implemented something you want to use, you will need to switch to another compiler.

For GPUs, the best compiler out there is NVIDIA’s Fortran compiler (nvfortran); it does have its kinks, but it offers the best performance available. gfortran and flang are catching up, but they still have some work to do.

For CPU parallelism, one can parallelize the following code by simply adding !$omp parallel do:

block
  real, allocatable :: a(:,:,:)
  real, allocatable :: b(:,:,:)
  real, allocatable :: c(:,:,:)
  real :: alpha
  integer :: i, j, k, dims

  alpha = 2.0
  dims = 1024
  allocate(a(dims,dims,dims), b(dims,dims,dims), c(dims,dims,dims))

  a = 1.0
  b = 2.0
  c = 0.0

  ! The innermost loop runs over the first index, which is contiguous
  ! in memory (Fortran is column-major).
  !$omp parallel do collapse(3)
  do k = 1, dims
    do j = 1, dims
      do i = 1, dims
        c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
      end do
    end do
  end do

end block
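
Keep in mind that the directive is just a comment unless OpenMP is enabled at compile time, for example (saxpy.f90 is a hypothetical file name here):

gfortran -fopenmp saxpy.f90
nvfortran -mp saxpy.f90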

I cheated here because I added collapse(3), which tells the compiler to collapse the three nested loops into a single iteration space; this is done for performance reasons. For this to run on the GPU, we can simply change the OpenMP directive to !$omp target teams loop collapse(3) map(tofrom:a,b,c). This will copy the a, b, c arrays to the GPU, perform the operation there, and copy the arrays back for further use. That is extremely wasteful, because it makes unnecessary copies between the host and the device. We can do better:

block
  real, allocatable :: a(:,:,:)
  real, allocatable :: b(:,:,:)
  real, allocatable :: c(:,:,:)
  real :: alpha
  integer :: i, j, k, dims

  alpha = 2.0
  dims = 1024
  allocate(a(dims,dims,dims), b(dims,dims,dims), c(dims,dims,dims))

  ! Allocate the arrays directly on the device; no host copy is made.
  !$omp target enter data map(alloc:a,b,c)

  ! Initialize on the device, so the host values never need to be transferred.
  !$omp target teams loop collapse(3)
  do k = 1, dims ; do j = 1, dims ; do i = 1, dims
    a(i,j,k) = 1.0
    b(i,j,k) = 2.0
    c(i,j,k) = 0.0
  end do ; end do ; end do

  ! The saxpy-like operation, also entirely on the device.
  !$omp target teams loop collapse(3)
  do k = 1, dims
    do j = 1, dims
      do i = 1, dims
        c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
      end do
    end do
  end do

  ! The only host/device transfer: copy the result back.
  !$omp target update from(c)

  ! Remove the arrays from the device.
  !$omp target exit data map(delete:a,b,c)

end block

Now we have introduced a couple of extra things. We used the alloc map type in the !$omp target enter data directive. This, as discussed in the previous post, allocates the arrays directly on the device. We then initialize them on the device using another !$omp target teams loop collapse(3), and then we do the 3D saxpy-like operation, all on the GPU.

The only transfer that happens is copying c back to the host, which we could skip if we were going to keep using it on the device across the program; this was just an example. Finally, we delete the arrays from the GPU.

One big advantage of using !$omp target teams loop is that it is portable to the CPU, so this loop can be parallelized on both the CPU and the GPU. You can change where it runs with the environment variable OMP_TARGET_OFFLOAD=MANDATORY/DISABLED. This is a good way to test CPU/GPU correctness and measure the speedup over the serial/shared-memory code.
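
For example, with nvfortran (where -mp=gpu enables offload; saxpy.f90 is the hypothetical file from before):

nvfortran -mp=gpu saxpy.f90 -o saxpy
OMP_TARGET_OFFLOAD=DISABLED ./saxpy    # run the parallel loops on the CPU
OMP_TARGET_OFFLOAD=MANDATORY ./saxpy   # require the GPU, fail if unavailable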

Derived types (containers)#

Data is better stored in containers, such as Fortran’s derived types. Say:

type :: my_derived_type
  real, allocatable :: array_1(:,:)
  real, allocatable :: array_2(:,:)
  integer :: dims
  contains
  procedure :: initialize_my_type
  procedure :: finalize_my_type
end type my_derived_type

subroutine initialize_my_type(this, dims)
  class(my_derived_type), intent(inout) :: this
  integer, intent(in) :: dims

  this%dims = dims
  allocate(this%array_1(this%dims,this%dims))
  allocate(this%array_2(this%dims,this%dims))

  ! Mirror the freshly allocated arrays on the device.
  !$omp target enter data map(alloc:this%array_1, this%array_2)

end subroutine initialize_my_type

subroutine finalize_my_type(this)
  class(my_derived_type), intent(inout) :: this

  ! Remove the device copies first, then free the host arrays.
  !$omp target exit data map(delete:this%array_1, this%array_2)

  deallocate(this%array_1)
  deallocate(this%array_2)

end subroutine finalize_my_type
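
A minimal usage sketch, assuming the type and these procedures live in a module you use (grid is a hypothetical variable name):

type(my_derived_type) :: grid

call grid%initialize_my_type(1024)
! ... use grid%array_1 and grid%array_2 inside target regions ...
call grid%finalize_my_type()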

Defining an initializer and a cleanup routine is very important. I didn’t use the word finalizer because Fortran already has that concept; it is similar to a destructor in C++. You might ask: why not use the final procedures built into derived types? Well, there are some issues with finalizers. They are called when the object goes out of scope, which can happen, for example, when the object is passed into a routine. If this happened mid-execution, our arrays would be wiped: deallocated on the host and deleted from the GPU. So for now, until I understand finalizers better, I prefer to call the cleanup explicitly myself. This introduces the possibility of programmer error and memory leaks, but that is what code review is for.

In this respect C++ is safer, since the destructor WILL be called, but then you have to understand how scope works in C++, and that is also a rabbit hole!

Calling functions/subroutines from the GPU#

We are taught to write functions and subroutines to encapsulate repeated logic, separate concerns, and so on. This is very important, and it can be done nicely via the !$omp declare target directive, which tells the compiler that a function is supposed to be callable from the device.

This can sometimes fail if the function is not in the same compilation unit as the code calling it, but compiler support is getting better in this regard. An important thing about device functions is that you should avoid printing inside them; let them be fully compute. If you need print-style debugging, write into a buffer array or something similar and print it from the host.
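
A minimal sketch of what this looks like (the module and function names here are hypothetical):

module device_kernels
  implicit none
contains

  ! declare target tells the compiler to also generate device code
  ! for this function, so target regions can call it.
  real function average_flux(u_left, u_right)
    !$omp declare target
    real, intent(in) :: u_left, u_right
    average_flux = 0.5 * (u_left + u_right)
  end function average_flux

end module device_kernels

Inside a !$omp target teams loop you could then write something like f(i) = average_flux(u(i-1), u(i)).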

AI-assisted good practices#

The general idea the AI agent should have is that data movement must be minimized: target enter data maps are preferred over map(tofrom) clauses at the beginning of loops. Memory should preferably be allocated at the start; on-the-fly allocations should be minimized unless they are done on the host OUTSIDE a hot loop. A hot loop is anywhere intensive compute is happening, including inside a function that gets called from within a hot loop. For example, you don’t want to do this:

do j = 1, dims
  do i = 1, dims
    do k = 1, dims
      call my_work_routine(a, b, c)
    end do
  end do
end do

subroutine my_work_routine(a, b, c)
  real, intent(inout) :: a(:), b(:), c(:)
  real, allocatable :: d(:,:)
  ...

  ! Bad: a fresh allocation on every single call.
  allocate(d(size(a), size(a)))

end subroutine my_work_routine

The subroutine my_work_routine will be called O(N^3) times (dims^3), which means you will allocate d that many times. This is a terrible idea and will 1000% kill your performance, on either the host or the device.
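
A better pattern is to allocate the workspace once, outside the hot loop, and pass it in (a sketch reusing the hypothetical names from above):

real, allocatable :: d(:,:)
allocate(d(size(a), size(a)))  ! allocated exactly once

do j = 1, dims
  do i = 1, dims
    do k = 1, dims
      call my_work_routine(a, b, c, d)  ! d is reused, never reallocated
    end do
  end do
end do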

Pure subroutines and functions#

Making a subroutine or a function pure means it has no side effects: it will not write to a variable outside its scope, print to the console, or call impure external routines. Pure procedures are great because they are thread safe by design, and they are callable inside a threaded or GPU-accelerated loop. When developing code with an AI assistant, you should strive to create functions and subroutines that are pure. This will make debugging easier and parallelization simpler.
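
For example, a pure version of the saxpy-like kernel from earlier might look like this (the name is hypothetical):

! Pure: no I/O, no global state; the result depends only on the inputs,
! so calls from concurrent threads can never interfere with each other.
pure function saxpy_element(alpha, x, y) result(z)
  real, intent(in) :: alpha, x, y
  real :: z
  z = alpha * x + y
end function saxpy_element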