I mentioned before that I would write something on running GROMACS on GPUs. Let’s imagine we want to simulate a solvated lipid bilayer containing 6,000 lipids for 5 µs. The total number of MARTINI coarse-grained beads is around 137,000 and the box dimensions are roughly 42x42x11 nm. Although this is smaller than the benchmark we looked at last time, it is still a challenge to run on a workstation. To see this let’s consider running it on my MacPro using GROMACS 4.6.1. The machine is an early 2008 MacPro and has 2 Intel Xeons, each with 4 cores. Using 8 MPI processes gets me 132 ns/day, so I would have to wait 38 days for 5 µs. Too slow!
So if I have a particular system I want to simulate, how many processing cores can I harness to run a single GROMACS version 4.6 job? If I only use a few then the simulation will take a long time to finish, if I use too many the cores will end up waiting for communications from other cores and so the simulation will be inefficient (and also take a long time to finish). In between is a regime where the code, in this case GROMACS, scales well. Ideally, of course, you’d like linear scaling i.e. if I run on 100 cores in parallel it is 100x faster than if I ran on just one.