GROMACS is a scientific code designed to simulate the dynamics of small boxes of stuff that usually contain a protein, water, perhaps a lipid bilayer and a range of other molecules, depending on the study. It assumes that all the atoms can be represented as points with a mass and an electrical charge and that all the bonds can be modelled using simple harmonic springs. There are some other terms that describe the bending and twisting of molecules. All of these, when combined with two non-bonded terms, which take into account van der Waals interactions and the repulsion and attraction between electrical charges, allow you to calculate the force on any atom due to the positions of all the other atoms. Once you know the force, you can calculate where the atom will be a short time later (often 2 fs), but of course the positions have then changed, so you have to recalculate the forces. And so on.
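The force-then-move loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how GROMACS is implemented: it has two atoms in one dimension joined by a single harmonic bond, uses arbitrary units, and uses a velocity Verlet update that I picked for brevity. All the function and parameter names are made up for this example.

```python
def bond_forces(x, k, r0):
    """Forces on two bonded atoms from V = 0.5 * k * (r - r0)**2, with r = x[1] - x[0]."""
    stretch = (x[1] - x[0]) - r0
    # A stretched bond pulls atom 0 forwards and atom 1 backwards.
    return [k * stretch, -k * stretch]


def total_energy(x, v, m, k, r0):
    """Kinetic plus potential energy of the two-atom system."""
    kinetic = sum(0.5 * mi * vi * vi for mi, vi in zip(m, v))
    potential = 0.5 * k * ((x[1] - x[0]) - r0) ** 2
    return kinetic + potential


def simulate(x, v, m, k, r0, dt, n_steps):
    """Advance positions and velocities by n_steps of size dt (velocity Verlet)."""
    x, v = list(x), list(v)
    f = bond_forces(x, k, r0)
    for _ in range(n_steps):
        for i in range(2):
            v[i] += 0.5 * dt * f[i] / m[i]  # half-step velocity update
            x[i] += dt * v[i]               # move the atom
        f = bond_forces(x, k, r0)           # positions changed: recompute forces
        for i in range(2):
            v[i] += 0.5 * dt * f[i] / m[i]  # second half-step velocity update
    return x, v


if __name__ == "__main__":
    m, k, r0 = [1.0, 1.0], 1.0, 1.0
    x0, v0 = [0.0, 1.2], [0.0, 0.0]         # bond starts slightly stretched
    x1, v1 = simulate(x0, v0, m, k, r0, dt=0.01, n_steps=1000)
    print("energy before:", total_energy(x0, v0, m, k, r0))
    print("energy after: ", total_energy(x1, v1, m, k, r0))
```

If the time step is small enough, the total energy should stay very nearly constant, which is a handy sanity check for any integrator of this kind.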
Anyway, I use GROMACS a lot in my research and the most recent major version, 4.6, was released in January 2013. In this post I’m going to briefly describe my experience with some of the improvements. First off, so much has changed that I think it would have been more accurate to call this GROMACS 5.0. For example, version 4.6 is a lot faster than version 4.5. I typically use three different benchmarks when measuring the performance. One is an all-atom simulation of a bacterial peptide transporter in a lipid bilayer (78,033 atoms). The other two are both coarse-grained models of a lipid bilayer using the MARTINI forcefield: one has 6,000 lipids (137,232 beads), the other 54,000 (2,107,010 beads).
OK, so how much faster is version 4.6? It is important here to bear in mind that GROMACS was already very fast, since a lot of effort had been put into optimising the loops that the code spends most of its time running. Even so, version 4.6 is between 20% and 120% faster when using either of the first two benchmarks, and in some cases even faster. How? Well, it seems the developers have completely rewritten those loops using SIMD instructions. One important consequence of this is that it is vital to use the best compiler and, since you have to specify which SIMD instruction set to use at compile time, you may need several different versions of the key binary, mdrun. For example, you may want a version compiled using the AVX SIMD instruction set for recent CPUs, but also a version compiled using an older SSE SIMD instruction set. The latter will run on newer architectures, but it will be slower. You should never run a version compiled with no SIMD instruction set as this can be 10x slower!
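If you do end up with several mdrun builds, a small helper can pick the right one for each machine. Here is a rough Python sketch, assuming a Linux x86 machine where the CPU flags are listed in /proc/cpuinfo. The binary names (mdrun_avx and so on) are hypothetical: they are just whatever you chose to call your own builds.

```python
# Hypothetical names for locally compiled mdrun builds, fastest first.
# Each entry pairs a CPU flag (as spelled in /proc/cpuinfo) with a binary name.
DEFAULT_BUILDS = [
    ("avx", "mdrun_avx"),
    ("sse4_1", "mdrun_sse4.1"),
    ("sse2", "mdrun_sse2"),
]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported by the Linux kernel."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def pick_mdrun(cpu_flags, builds=DEFAULT_BUILDS):
    """Return the first (fastest) build whose instruction set the CPU supports."""
    for flag, binary in builds:
        if flag in cpu_flags:
            return binary
    # Refuse to fall back silently to a build with no SIMD at all.
    raise RuntimeError("no SIMD-enabled mdrun build matches this CPU")


if __name__ == "__main__":
    print(pick_mdrun(read_cpu_flags()))
```

Raising an error rather than falling back to a non-SIMD build is deliberate, given how slow such a build can be.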
The other big performance improvement is that GROMACS 4.6 now uses GPUs seamlessly. The calculations are shared between any GPUs and the CPUs, and GROMACS will even shift the load to try and share it equally. Erik Lindahl, one of the GROMACS developers, gave an interesting NVIDIA webinar on this subject in April 2013. A GPU here just means a reasonable consumer graphics card, such as an NVIDIA GTX 680, that has a compute capability of 2.0 or higher. So, how much of a performance boost do we see? I typically see a boost of 2.1-2.7x for the atomistic benchmark and 1.4-2.2x for the first, smaller coarse-grained benchmark. Just for fun, you can try running a version of GROMACS compiled with no SIMD instructions, both with and without a GPU, and then you can get a performance increase of 10x.
Before I finish, here is some good advice I was given on running GROMACS benchmarks. Firstly, make sure you use the -noconfout mdrun option, since this stops mdrun writing a final .gro file, which takes some time. Secondly, set up a .tpr file that will run for a long (wallclock) time even on a large number of cores and then use the -resethway option in combination with a time limit, such as -maxh 0.25. This resets the timers after roughly 7.5 minutes, so only the steps calculated between about 7.5 and 15 minutes are recorded. From experience, a bit of time spent writing some good Bash scripts to automatically set up, run and analyse the benchmarking simulations really pays off in the long run.
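To make the recipe concrete, here is a small Python sketch that assembles such an mdrun command line and does the timing arithmetic. The flags -s, -noconfout, -resethway and -maxh are real mdrun options; the file name bench.tpr, the function names and the example step count are all made up for illustration. The windows are nominal: mdrun may stop slightly before the limit, so treat the numbers as approximate.

```python
def benchmark_command(tpr, maxh=0.25, mdrun="mdrun"):
    """Build an mdrun argument list for a timed benchmark run."""
    return [mdrun, "-s", tpr, "-noconfout", "-resethway", "-maxh", str(maxh)]


def timed_window_minutes(maxh):
    """Nominal start and end (in minutes) of the window timed after the halfway reset."""
    total = maxh * 60.0
    return total / 2.0, total


def ns_per_day(steps, dt_fs, seconds):
    """Convert steps completed in a wallclock window into simulated ns per day."""
    simulated_ns = steps * dt_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns / seconds * 86400.0


if __name__ == "__main__":
    cmd = benchmark_command("bench.tpr")  # hypothetical input file
    print(" ".join(cmd))
    start, end = timed_window_minutes(0.25)
    print("timed window: %.1f to %.1f minutes" % (start, end))
    # Made-up example: 225,000 steps of 2 fs in a 450 s window.
    print("performance: %.1f ns/day" % ns_per_day(225000, 2.0, 450.0))
```

The argument list can be handed straight to subprocess.run, or joined into a string and written into a job script. mdrun reports its own performance figure at the end of the log file, so ns_per_day is mainly useful as a cross-check.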
In future posts I’ll talk about the scaling of GROMACS 4.6 (that is where the third benchmark comes in) and also look at the GPU performance in a bit more detail.