NECI Parallelization
It seems that there are two main aims of NECI Parallelization
- To Reduce Integral Cache storage needs (and also to therefore compute the stored integrals in parallel)
- To Sum together say 2-v sums in parallel
The two are rather intricately linked in terms of the ordering of storage and generation. Integrals <ij|ab> are stored as ordered pairs of ordered pairs, ((i,a),(j,b)) where in (x,y), x<=y. Two-vertex graphs (and indeed all excitations) are generated also in ordered pairs, but instead by excited from to excited to: (i,j)->(a,b). Single excitations are simply i->j. The integrals required for calculation of a 2-vertex graph are:
Single i->a
<ik|ak>, <ik|ka>
double (i,j)->(a,b)
<ij|ab>, <ij|ba> (and for non-lowdiag, <ia|ia>, <ia|ai> (+i->j, a->b), and <ab|ab>, <ab|ba>, although all these are stored in UMat2D)
In order to parallelize efficiently, the four-index 2-electron (rather than two-index 2-electron) integrals should be used only where they are stored.
To achieve this it is possible to divide integrals up according to their first (i) index, and store only a subset of (i,j) at each compute node. It is similarly possible to divide excitations up across compute nodes.
However, if integrals are to be computed in parallel, then the calling code should execute essentially the same logic for each excitation, such that when an integral has to be calculated rather than looked up, all compute nodes are ready to calculate it in parallel. This leads to somewhat of a communication problem:
Parallel communication is slow (slow enough not to wait to do it?), and yet to know whether we need to calculate an integral we need to communicate. It seems the solution is to batch the integrals: Each node is assigned a set of excitations, and then lists a number of the integrals that it needs for those excitations (the batch size). These are collated, and then all the integrals calculated and distributed to each node, where they are used in the calculation. This means that all the ffts etc are done in parallel with very little communication (although they themselves do communicate - not sure how much), and the NECI code doesn't need to wait significantly.
--alex 04:54, 29 March 2008 (GMT)
Yes, I am thinking along similar lines. It's clear that we need to divide up the integral storage across the processors. If we do this, then we also should split up the calculation so that the communication is kept to a minimum. Communication sucks, even on tardis---by the time you get to 8 processors, FFT communication takes longer than the actual FFT (by 16 processors it's the dominant factor). FFT communication is all-to-all though: other types perform better.
Ideally we'd have the integrals stored such that each processor only has to refer to integrals that it stores. Blocking up the integrals is probably the way forward. Each integral involves:
- 2 FFTs to form the co-densities in Fourier space (each w/ communication).
- Sum over G-vectors (which are split amongst the processors).
- Summation over the processors.
If we block up the integrals cleverly, we might be able to cut down on the # FFTs.
Then, each processor can calculate its part of the calculation and we do a global summation at the end.
Would this be specific for only 2-v graphs, or does it extend to the star etc?
--james 11:43, 29 March 2008 (GMT)