Getting started with Programming, part 3: where to start

Intro

If you’ve gotten to this point of the post series, you are truly curious. If you started here, feel free to go back (or not!).

Most of us who start writing code for scientific applications are students at some level of study, ranging from high school to any level of academia. Maybe you are instead in an industrial setting and you need to help your company write some code. The fun thing is that, at any point, anyone can be tasked with writing code.

Where are you and where are you going?

The first question I have to ask you (you have to ask yourself) is: am I supposed to write a new program from scratch, or am I hooking into an existing codebase?

This is an important question because it dictates how much design you need to do. If you are working with an existing codebase you will probably end up writing a new set of routines and functions within the program that reuse its existing architecture. For example, in the realm of quantum chemistry you might be tasked with adding a new property calculation; in weather maybe a new parametrization; in finite element methods a new discretization, etc.

This means that the code you will be working with already has a design in place: it may use classes and object orientation, functional programming, etc. You simply have to look at how the code is structured and structure your additions similarly.

This is not always the case

In academic software there is often little design in place. The architecture is dictated by the student who wrote the most code, and their design decisions are based on “paper driven design”, i.e. almost everything is secondary: I just want to write a paper and graduate/move on.

This is also not always the case. Some academic programs have very strong foundations, with a style guide and examples that they enforce for new code. Pray that the code you’re working with looks like this, since it will make your life easier.

So, what if it is like the former? Well, then comes the fun/annoying part of writing code: designing your program.

On program design

Let’s think about the city of Villahermosa, Tabasco in my native Mexico. It is built on a swamp…so it floods often. The design decision of building it on a floodplain was, in retrospect, not great.

Mexico City is built on top of the ruins of Tenochtitlan, which was built on reclaimed land on the lake of Texcoco. Naturally, some parts of the city are sinking. Expanding the metropolitan area by continuously reclaiming land from the lake is probably not the best idea on the planet. Now that Mexico City has around 25 million people, it is a bit too late and too complex to re-architect it.

Think about software design similarly to building a city or a house. If you’ve played games like Cities Skylines or Factorio you know that choosing the design of your city/factory is important. Otherwise, you will be quickly burdened by bottlenecks that will constrict the rate at which you can grow your project. Software design is extremely similar. A bad design can lead to a new feature taking a month versus a couple of days to implement.

So, if you are working in a program that has little to no design it is your duty to start building said infrastructure. Why? Because that’s what good programmers do. Also, it will pay off, believe me.

How do I start, then?

The first step is to assess the status of the program you are working with. This is by no means saying that most code is bad - you should not arrive at a new group and seek to impose your ideas without demonstrating that you are sane/correct to do so. So let’s look at some objective measures to help you reason about the state of a codebase.

If you live in Australia you might have had an experience with bad design. I recently saw a property which had in its advertisement “levelled floors”. You would think that one shouldn’t need to say that, it should be a given…alas, here we are. A similar thing can happen in large or moderately sized codebases. The things that should be a given for any codebase are: unit testing, regression testing, and automatic testing/deployment - otherwise known as Continuous Integration / Continuous Deployment (CI/CD).

A unit test tests a specific piece of code - not an entire “method”, so to say. For example, a recipe for cookies:

Measure ingredients
- Measure flour
- Measure butter
- Measure milk
- Measure sugar
- Grab eggs
Crack eggs
Melt butter
Mix ingredients
Scoop and place on oven tray
Bake for x minutes at 180 C (356 F)

A unit test would verify the correctness of each step: “Test measure flour”, “Test measure butter” - as opposed to “Test flavor of cookies”. Testing the final product is a regression test: your program does what it is supposed to do, but you are not verifying each step along the way. You assume that if the cookie tastes good, all the steps completed successfully. So why do I need to test each step?

Well, let’s say your cookie tastes terrible. What went wrong? You start tracing back the problem now (debugging the issue). You look at your flour, maybe it is contaminated with something. Is your butter rancid? Is your milk past its prime? Are the eggs bad? Was the oven at the right temperature? etc. This is a time-consuming process, and the cause can be difficult to pinpoint without running the taste test multiple times with the same ingredients. In terms of cookies, it probably is not very hard, but in terms of code it will be.

If you had unit testing set up for your cookie process, you would have picked up, before even starting the baking process, that your milk is borderline bad. “Test milk” -> Failure, tastes wonky. You don’t even try to bake the cookie and you now know exactly what is wrong. Unit tests work exactly like this for code. If all unit tests pass, regression tests should pass. Regression tests usually capture difficult settings, or known problems in the physics/maths of the problem. For example, making a bechamel sauce with gluten free flour.

If you’ve ever made bechamel, you should know that it is a bit more daunting than you’d expect. Now use gluten free flour. I don’t think I have ever made a bechamel that looks the same with GF flour. So I’d have a test for this, because the process itself is a bit more complex. Think of an unstable system that can misbehave if you haven’t dialed in your method well: lack of convergence of an iterative process, etc.
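To make the cookie analogy concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not code from a real project: `newton_step` plays the role of a single recipe step, `newton_sqrt` is the whole recipe (an iterative process that may or may not converge), and the reference value stands in for a result recorded from a trusted run.

```python
import math
from typing import Tuple


def newton_step(x: float, a: float) -> float:
    """One Newton iteration for solving x*x = a (a single 'recipe step')."""
    return 0.5 * (x + a / x)


def newton_sqrt(a: float, tol: float = 1e-12, max_iter: int = 50) -> Tuple[float, int]:
    """Repeat newton_step until converged; return (root, iterations used)."""
    if a <= 0.0:
        raise ValueError("a must be positive")
    x = a if a > 1.0 else 1.0  # deliberately naive starting guess
    for iteration in range(1, max_iter + 1):
        x_new = newton_step(x, a)
        if abs(x_new - x) <= tol * x_new:
            return x_new, iteration
        x = x_new
    raise RuntimeError("newton_sqrt did not converge")


# Unit tests: check one step in isolation ("test measure flour").
def test_step_from_known_input():
    assert newton_step(1.0, 4.0) == 2.5


def test_step_leaves_exact_root_unchanged():
    assert newton_step(2.0, 4.0) == 2.0


# Regression tests: check the final product ("test flavor of cookies").
def test_full_calculation_matches_reference():
    reference = 1.4142135623730951  # value recorded from a trusted run
    root, _ = newton_sqrt(2.0)
    assert math.isclose(root, reference, rel_tol=1e-12)


def test_difficult_input_still_converges():
    # The "gluten free bechamel": the starting guess is very far from the answer.
    root, iterations = newton_sqrt(1.0e12)
    assert math.isclose(root, 1.0e6, rel_tol=1e-12)
    assert iterations < 50


if __name__ == "__main__":
    for test in (
        test_step_from_known_input,
        test_step_leaves_exact_root_unchanged,
        test_full_calculation_matches_reference,
        test_difficult_input_still_converges,
    ):
        test()
    print("all tests passed")
```

If `test_full_calculation_matches_reference` fails while both step tests pass, you already know the problem is in the loop or the convergence check rather than in the step itself. That is the whole point of having both kinds of test.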

Regression tests should also measure time to completion. Have you made your code slower/faster?

If the codebase you’re working with has an automated pipeline that runs a set of unit tests and regression tests, you’re probably going to have a great time developing on this application. If not…well, saddle up!
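A timing check can be sketched with nothing but the standard library. The routine `sum_of_squares` and the budget below are placeholders I made up; in practice the budget would come from a baseline recorded on the same machine that runs the tests, since wall-clock timings are noisy and vary between machines.

```python
import time


def best_time(func, *args, repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)


def sum_of_squares(n: int) -> int:
    """Stand-in for the routine whose speed we want to guard."""
    return sum(i * i for i in range(n))


# Hypothetical, deliberately generous budget (an assumption, not a measurement).
TIME_BUDGET_SECONDS = 1.0


def test_sum_of_squares_stays_within_budget():
    elapsed = best_time(sum_of_squares, 100_000)
    assert elapsed < TIME_BUDGET_SECONDS, f"too slow: {elapsed:.4f} s"


if __name__ == "__main__":
    test_sum_of_squares_stays_within_budget()
    print("timing test passed")
```

A tight budget catches small slowdowns but fails randomly on a busy machine; a loose one only catches the big accidents. Where to draw that line is a judgment call for each project.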

CI/CD

Continuous Integration / Continuous Deployment (CI/CD) is one of the most common concepts one will come across when writing code for a large application (hopefully). CI can be boiled down to automatically running a series of tests after a certain “trigger”. For example, in a personal application of mine, the CI gets run every time any branch sees a “push”, i.e. the upstream repository is updated with changes made offline. Whenever a Pull Request is opened, I run a series of extra tests that aim to cover things that are not tested on “push”.

The idea of testing on push is that developers get very quick feedback on whether their code broke something by accident, which happens very often.
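The "trigger decides what runs" idea can be sketched as a small lookup. In practice this mapping usually lives in the CI service's own configuration file rather than in Python, and the event and suite names below are assumptions of mine, not the setup of any particular project.

```python
from typing import Dict, List

# Hypothetical suite names; a real project would map these to test commands.
SUITES_BY_EVENT: Dict[str, List[str]] = {
    "push": ["unit"],                          # fast feedback on every push
    "pull_request": ["unit", "regression"],    # extra checks before merging
}


def suites_for(event: str) -> List[str]:
    """Return the test suites that should run for a given trigger."""
    try:
        return list(SUITES_BY_EVENT[event])
    except KeyError:
        raise ValueError(f"no CI configured for event {event!r}") from None


if __name__ == "__main__":
    for event in ("push", "pull_request"):
        print(event, "->", ", ".join(suites_for(event)))
```

Keeping the push suite small is what makes the feedback quick; the slower tests wait for the Pull Request.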

Continuous Deployment, on the other hand, is the automatic generation of artifacts that can be tied to a specific state of the code at a point in time. For example, a CD pipeline that runs every week could create a “nightly-build” release that people can download with the confidence that it was tested with the nightly CI procedure (which includes a set amount of tests and coverage).
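One way to tie an artifact to a state of the code is to stamp it with the commit identifier and the build date, and to record a checksum next to it. The sketch below is only an illustration: the naming scheme, the `build_manifest` helper and the project name are all made up for the example.

```python
import datetime
import hashlib
from typing import Dict


def artifact_name(project: str, commit_sha: str, build_date: datetime.date) -> str:
    """Build a file name that encodes when and from which commit it was made."""
    return f"{project}-nightly-{build_date:%Y%m%d}-{commit_sha[:7]}.tar.gz"


def build_manifest(commit_sha: str, build_date: datetime.date, payload: bytes) -> Dict[str, str]:
    """Describe an artifact so a download can be traced back to its source."""
    return {
        "commit": commit_sha,
        "built_on": build_date.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


if __name__ == "__main__":
    sha = "0123456789abcdef0123456789abcdef01234567"  # placeholder commit id
    day = datetime.date(2024, 1, 15)
    print(artifact_name("cookies", sha, day))
    print(build_manifest(sha, day, b"pretend this is the release archive"))
```

With something like this in place, a bug report that mentions the file name is enough to know exactly which version of the code the person was running.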