type="cite">Dear Rick,
On 07 Sep 2007, at 21:49, Rick Muller wrote:
I'm trying to run a 48-atom job in parallel with abinit. There is only one k-point and no spin polarization. I've been through the abinit tutorial, and have found very little effect from the band parallelization, despite having 70 bands (I only got something like a 1.4x speedup on 4 processors). I've heard rumors of the FFT parallelization: is this ready for production use, or should I just learn to love serial jobs?
These are not rumours; have a look at the release_notes of both v5.3 (B.2) and v5.4 (B.4).
I think it is worth trying the combined band/FFT parallelization with ABINIT v5.4. There have been several exchanges of emails about it on the Forum mailing list; you should search the archives for the names Francois Bottin and Gilles Zerah, who have actually submitted a paper (its abstract is reproduced below).
You might perhaps encounter difficulties in the make or link steps of the installation, and it has not yet been as extensively tested as the k-point parallelization, but it is worth giving it a try now.
Xavier
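
[For concreteness, a minimal input fragment in the spirit of this suggestion might look as follows. The variable names (paral_kgb, npkpt, npband, npfft, wfoptalg, istwfk, nband) come from the ABINIT band/FFT parallelization documentation of that era, but the process counts and band numbers below are illustrative assumptions only; check the v5.4 release notes and input-variable documentation before using them.]

# Hypothetical band/FFT-parallel fragment for a 16-process run;
# npkpt * npband * npfft must match the number of MPI processes.
paral_kgb  1     # activate combined k-point/band/FFT parallelization
npkpt      1     # a single k-point, so no k-point parallelism
npband     4     # distribute blocks of bands over 4 processes
npfft      4     # distribute plane-wave/FFT coefficients over 4 processes
wfoptalg   4     # select the LOBPCG block eigensolver
istwfk    *1     # no time-reversal storage tricks in this mode
nband     72     # keep nband a multiple of npband (here 72 = 4 x 18)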
Large scale ab initio calculations based on three levels of parallelization

We suggest and implement a parallelization scheme based on an efficient multiband eigenvalue solver, called the locally optimal block preconditioned conjugate gradient (LOBPCG) method, and using an optimized three-dimensional (3D) fast Fourier transform (FFT) in the ab initio plane-wave code ABINIT. In addition to the standard data partitioning over processors corresponding to different k-points, we introduce data partitioning with respect to blocks of bands as well as spatial partitioning in the Fourier space of coefficients over the plane-wave basis set used in ABINIT. This k-points-multiband-FFT parallelization avoids any collective communications on the whole set of processors, relying instead on one-dimensional communications only. For a single k-point, super-linear scaling is achieved for up to 100 processors due to an extensive use of hardware-optimized BLAS, LAPACK, and ScaLAPACK routines, mainly in the LOBPCG routine. We observe good performance up to 200 processors. With 10 k-points, our three-way data partitioning results in linear scaling up to 1000 processors for a practical system used for testing.
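
[To make the abstract's central ingredient concrete, here is a small, self-contained illustration of the LOBPCG block eigensolver using SciPy's scipy.sparse.linalg.lobpcg. This is only a sketch of the algorithm named in the abstract, not of ABINIT's implementation: the toy sparse matrix, block size, and tolerance are arbitrary assumptions standing in for a plane-wave Hamiltonian and a block of bands.]

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

# Toy stand-in for a plane-wave Hamiltonian: a sparse 1D Laplacian.
n = 2000
H = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")

# One block of trial "bands": LOBPCG iterates on the whole block at once,
# which is what makes distributing blocks of bands over processors natural.
rng = np.random.default_rng(0)
nblock = 8
X = rng.standard_normal((n, nblock))

# Lowest eigenpairs (largest=False), analogous to the occupied bands.
eigvals, eigvecs = lobpcg(H, X, largest=False, tol=1e-6, maxiter=400)
print(eigvals)

[The block structure is the hook for the band-level parallelism described in the abstract: in the paper, the dense linear algebra inside each block is delegated to hardware-optimized BLAS, LAPACK, and ScaLAPACK routines.]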