NECI Parallelization

From CUC3
Jump to navigation Jump to search

It seems that there are two main aims of NECI Parallelization

  • To Reduce Integral Cache storage needs (and also to therefore compute the stored integrals in parallel)
  • To Sum together say 2-v sums in parallel

The two are rather intricately linked in terms of the ordering of storage and generation. Integrals <ij|ab> are stored as ordered pairs of ordered pairs, ((i,a),(j,b)) where in (x,y), x<=y. Two-vertex graphs (and indeed all excitations) are generated also in ordered pairs, but instead by excited from to excited to: (i,j)->(a,b). Single excitations are simply i->j. The integrals required for calculation of a 2-vertex graph are:

Single i->a

<ik|ak>, <ik|ka>

double (i,j)->(a,b)

<ij|ab>, <ij|ba> (and for non-lowdiag, <ia|ia>, <ia|ai> (+i->j, a->b), and <ab|ab>, <ab|ba>, although all these are stored in UMat2D)

In order to parallelize efficiently, the four-index 2-electron (rather than two-index 2-electron) integrals should be used only where they are stored.

To achieve this it is possible to divide integrals up according to their first (i) index, and store only a subset of (i,j) at each compute node. It is similarly possible to divide excitations up across compute nodes.

However, if integrals are to be computed in parallel, then the calling code should execute essentially the same logic for each excitation, such that when an integral has to be calculated rather than looked up, all compute nodes are ready to calculate it in parallel. This leads to somewhat of a communication problem:

Parallel communication is slow (slow enough not to wait to do it?), and yet to know whether we need to calculate an integral we need to communicate. It seems the solution is to batch the integrals: Each node is assigned a set of excitations, and then lists a number of the integrals that it needs for those excitations (the batch size). These are collated, and then all the integrals calculated and distributed to each node, where they are used in the calculation. This means that all the ffts etc are done in parallel with very little communication (although they themselves do communicate - not sure how much), and the NECI code doesn't need to wait significantly.

--alex 04:54, 29 March 2008 (GMT)

Yes, I am thinking along similar lines. It's clear that we need to divide up the integral storage across the processors. If we do this, then we also should split up the calculation so that the communication is kept to a minimum. Communication sucks, even on tardis---by the time you get to 8 processors, FFT communication takes longer than the actual FFT (by 16 processors it's the dominant factor). FFT communication is all-to-all though: other types perform better.

Ideally we'd have the integrals stored such that each processor only has to refer to integrals that it stores. Blocking up the integrals is probably the way forward. Each integral involves:

  • 2 FFTs to form the co-densities in Fourier space (each w/ communication).
  • Sum over G-vectors (which are split amongst the processors).
  • Summation over the processors.

If we block up the integrals cleverly, we might be able to cut down on the # FFTs.

Then, each processor can calculate its part of the calculation and we do a global summation at the end.

Would this be specific for only 2-v graphs, or does it extend to the star etc?

--james 11:43, 29 March 2008 (GMT)

It would require different routines for star and sum. Sum is easier first (and we get MP2 out of it). Star will have the advantage that the whole star calculation can be parallelized, making it a lot faster.

The blocking may well have some extra advantages - <ij|ab> and <ij|ac> use the same co-density ia, and would be sequentially requested, so probably within the same block.

Is the CPMD parallel code shipshape?

--alex 16:02, 29 March 2008 (GMT)

Arse. It would be convenient if we could parallelise all types of calculation in the same way, so any later work is automatically parallel.

CPMD parallel progress:

Integrals done. There might be some efficiencies to be gained via blocking on-the-fly integral calculation.

--james 14:25, 18 June 2008 (BST)

After discussion with James, it seems that if the integrals are to be equally distributed among processors, the <ij|ab> integrals must be stored separately from the <ki|ka> integrals. James will handle this rewrite of UMatCache.

--alex 11:25, 18 June 2008 (BST)

I think the phrase was "I can do some of it"...

--james 14:25, 18 June 2008 (BST)