Difference between revisions of "NECI Parallelization"

From CUC3
Jump to navigation Jump to search
import>Ajwt3
import>Jss43
Line 59: Line 59:
   
 
--[[User:ajwt3|alex]] 16:02, 29 March 2008 (GMT)
 
--[[User:ajwt3|alex]] 16:02, 29 March 2008 (GMT)
  +
  +
Arse. It would be convenient if we could parallelise all types of calculation in the same way, so any later work is automatically parallel.
  +
  +
CPMD parallel progress:
  +
  +
* Can run KS determinant and exchange energy calculations.
  +
* Cannot do shifted integrals in parallel, but as I've discussed, I think I know how to do these without further communication. I just need to delve into some rather dark code...
  +
* Need to merge back into trunk.
  +
  +
I'll try to look at shifted integrals this weekend/early next week. We can still start on parallelising the NECI part, so long as we realise the results are wrong...
  +
  +
--[[User:jss43|james]] 18:12, 29 March 2008 (GMT)

Revision as of 18:12, 29 March 2008

It seems that there are two main aims of NECI Parallelization

  • To Reduce Integral Cache storage needs (and also to therefore compute the stored integrals in parallel)
  • To Sum together say 2-v sums in parallel

The two are rather intricately linked in terms of the ordering of storage and generation. Integrals <ij|ab> are stored as ordered pairs of ordered pairs, ((i,a),(j,b)) where in (x,y), x<=y. Two-vertex graphs (and indeed all excitations) are generated also in ordered pairs, but instead by excited from to excited to: (i,j)->(a,b). Single excitations are simply i->j. The integrals required for calculation of a 2-vertex graph are:

Single i->a

<ik|ak>, <ik|ka>

double (i,j)->(a,b)

<ij|ab>, <ij|ba> (and for non-lowdiag, <ia|ia>, <ia|ai> (+i->j, a->b), and <ab|ab>, <ab|ba>, although all these are stored in UMat2D)

In order to parallelize efficiently, the four-index 2-electron (rather than two-index 2-electron) integrals should be used only where they are stored.

To achieve this it is possible to divide integrals up according to their first (i) index, and store only a subset of (i,j) at each compute node. It is similarly possible to divide excitations up across compute nodes.

However, if integrals are to be computed in parallel, then the calling code should execute essentially the same logic for each excitation, such that when an integral has to be calculated rather than looked up, all compute nodes are ready to calculate it in parallel. This leads to somewhat of a communication problem:

Parallel communication is slow (slow enough not to wait to do it?), and yet to know whether we need to calculate an integral we need to communicate. It seems the solution is to batch the integrals: Each node is assigned a set of excitations, and then lists a number of the integrals that it needs for those excitations (the batch size). These are collated, and then all the integrals calculated and distributed to each node, where they are used in the calculation. This means that all the ffts etc are done in parallel with very little communication (although they themselves do communicate - not sure how much), and the NECI code doesn't need to wait significantly.

--alex 04:54, 29 March 2008 (GMT)

Yes, I am thinking along similar lines. It's clear that we need to divide up the integral storage across the processors. If we do this, then we also should split up the calculation so that the communication is kept to a minimum. Communication sucks, even on tardis---by the time you get to 8 processors, FFT communication takes longer than the actual FFT (by 16 processors it's the dominant factor). FFT communication is all-to-all though: other types perform better.

Ideally we'd have the integrals stored such that each processor only has to refer to integrals that it stores. Blocking up the integrals is probably the way forward. Each integral involves:

  • 2 FFTs to form the co-densities in Fourier space (each w/ communication).
  • Sum over G-vectors (which are split amongst the processors).
  • Summation over the processors.

If we block up the integrals cleverly, we might be able to cut down on the # FFTs.

Then, each processor can calculate its part of the calculation and we do a global summation at the end.

Would this be specific for only 2-v graphs, or does it extend to the star etc?

--james 11:43, 29 March 2008 (GMT)

It would require different routines for star and sum. Sum is easier first (and we get MP2 out of it). Star will have the advantage that the whole star calculation can be parallelized, making it a lot faster.

The blocking may well have some extra advantages - <ij|ab> and <ij|ac> use the same co-density ia, and would be sequentially requested, so probably within the same block.

Is the CPMD parallel code shipshape?

--alex 16:02, 29 March 2008 (GMT)

Arse. It would be convenient if we could parallelise all types of calculation in the same way, so any later work is automatically parallel.

CPMD parallel progress:

  • Can run KS determinant and exchange energy calculations.
  • Cannot do shifted integrals in parallel, but as I've discussed, I think I know how to do these without further communication. I just need to delve into some rather dark code...
  • Need to merge back into trunk.

I'll try to look at shifted integrals this weekend/early next week. We can still start on parallelising the NECI part, so long as we realise the results are wrong...

--james 18:12, 29 March 2008 (GMT)