NECI Parallelization
NECI Parallelization
It seems that there are two main aims of NECI Parallelization
- To Reduce Integral Cache storage needs (and also to therefore compute the stored integrals in parallel)
- To Sum together say 2-v sums in parallel
The two are rather intricately linked in terms of the ordering of storage and generation. Integrals <ij|ab> are stored as ordered pairs of ordered pairs, ((i,a),(j,b)) where in (x,y), x<=y. Two-vertex graphs (and indeed all excitations) are generated also in ordered pairs, but instead by excited from to excited to: (i,j)->(a,b). Single excitations are simply i->j. The integrals required for calculation of a 2-vertex graph are:
Single i->a
<ik|ak>, <ik|ka>
double (i,j)->(a,b)
<ij|ab>, <ij|ba> (and for non-lowdiag, <ia|ia>, <ia|ai> (+i->j, a->b), and <ab|ab>, <ab|ba>, although all these are stored in UMat2D)
In order to parallelize efficiently, the four-index 2-electron (rather than two-index 2-electron) integrals should be used only where they are stored.
To achieve this it is possible to divide integrals up according to their first (i) index, and store only a subset of (i,j) at each compute node. It is similarly possible to divide excitations up across compute nodes.
However, if integrals are to be computed in parallel, then the calling code should execute essentially the same logic for each excitation, such that when an integral has to be calculated rather than looked up, all compute nodes are ready to calculate it in parallel. This leads to somewhat of a communication problem:
Parallel communication is slow (slow enough not to wait to do it?), and yet to know whether we need to calculate an integral we need to communicate. It seems the solution is to batch the integrals: Each node is assigned a set of excitations, and then lists a number of the integrals that it needs for those excitations (the batch size). These are collated, and then all the integrals calculated and distributed to each node, where they are used in the calculation. This means that all the ffts etc are done in parallel with very little communication (although they themselves do communicate - not sure how much), and the NECI code doesn't need to wait significantly.
--alex 04:54, 29 March 2008 (GMT)