Optimization tricks
Profiling
- Compile your code with pgf90 and the following flags:
FFLAGS= -Mextend -g -traceback -pg
- Run GMIN on your input. Note that gmon.out is produced.
- Run the profiler. The output is long, so put it to a file.
gprof ~/location_of_GMIN_binary/GMIN > my_output_file
The first part of the file says how long the program spent in each subroutine and how many calls were made to each subroutine.
Speeding up small loops
FOR loops should only be used when the instructions cannot be explicitly written. For example, the identity matrix should be coded as
I3(:,:) = 0.D0 I3(1,1) = 1.D0; I3(2,2) = 1.D0; I3(3,3) = 1.D0
rather than
I3(:,:) = 0.D0 FOR I = 0, 3 I3(I,I) = 1.D0 ENDDO
You would hope that the compiler would do this type of optimization for you, but I saw a nice speedup in my case, so I wrote this out myself.
Computing values only once
Usually a potential subroutine has this form:
FOR OUTER LOOP OVER MOLECULES FOR INNER LOOP OVER MOLECULES FOR OUTER LOOP OVER SITES FOR INNER LOOP OVER SITES Compute potential contributions ENDDO ENDDO ENDDO ENDDO
If there are N molecules with n sites each, then there are typically some N values associated with each molecule (e.g., rotation matrices), some Nn values associated with each site (e.g., position of site with respect to molecular origin), and some values associated with each pair of sites.
Because the code visits the "Compute contributions" area some times, it is much more efficient if the values associated with molecules or sites are calculated in a different loop:
FOR LOOP OVER MOLECULES Calculate molecule values FOR LOOP OVER SITES Calculate site values ENDDO ENDDO FOR OUTER LOOP OVER MOLECULES etc ENDDO
Putting array dimensions in the right order
For historical reasons, Fortran puts its array dimensions in the opposite order from what we would expect. For example, a 3x3 matrix A has its components stored as a 1D array in the memory in order
A[0][0] A[1][0] A[2][0] A[0][1] A[1][1] A[2][1] A[0][2] A[1][2] A[2][2]
so the more efficient way to loop is to put the second component in the outermost loop:
for j=0,2 for i=0,2 compute a[i][j] enddo enddo
The effect can apparently be 2-3x faster, but I don't think that memory access is the limiting factor in our programs. For example, I used an array of shape matrices, one 3x3 shape matrix per site per molecule. I originally coded this as A[molecule][site][row][col], and switching to A[row][col][site][mol] made my code 0.2% faster, so I didn't even bother changing it.