
forum - Re: [abinit-forum] parallelism over bands in ABINIT

forum@abinit.org

Subject: The ABINIT Users Mailing List ( CLOSED )

  • From: Francois Bottin <Francois.Bottin@cea.fr>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] parallelism over bands in ABINIT
  • Date: Fri, 24 Nov 2006 15:30:01 +0100

Dear Dr. Dumont,

Guillaume Dumont wrote:

> Dear Dr. Bottin,

> I successfully ran your gold case on my machine and was able to get superlinear scaling, with a speedup factor of 40 on 32 CPUs. Attached are the timing results for the 1-, 4-, 6-, 8-, 16- and 32-processor runs, a graph showing the speedup obtained, and a table with the timings.

Your band-FFT calculations are the first ones performed outside our lab (to our knowledge). It is good that you reproduced the superlinear scaling up to 32 processors.
If more processors are available, could you keep trying up to 100 processors, please?
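For readers comparing the numbers in this thread: speedup is usually defined as S(N) = T(1)/T(N) and parallel efficiency as E(N) = S(N)/N, so superlinear scaling means S(N) > N, i.e. E(N) > 1. The short Python sketch below simply applies these standard definitions to the 32-CPU speedups quoted in this thread; the function names are illustrative and not part of ABINIT.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S(N) = T(1) / T(N): wall time on one CPU divided by wall time on N CPUs."""
    return t_serial / t_parallel


def parallel_efficiency(s: float, ncpu: int) -> float:
    """E(N) = S(N) / N; E > 1 means superlinear scaling."""
    return s / ncpu


def is_superlinear(s: float, ncpu: int) -> bool:
    """True when the speedup exceeds the number of processors."""
    return s > ncpu


# Speedups on 32 CPUs as quoted in this thread (illustrative labels).
reported = {
    "Au test case": 40.0,
    "own case (GaAsN), 4 k-points": 20.0,
    "own case (GaAsN), 1 k-point": 27.0,
}

for label, s in reported.items():
    e = parallel_efficiency(s, 32)
    print(f"{label}: S = {s:.0f}, E = {e:.2f}, superlinear: {is_superlinear(s, 32)}")
```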


> However, I was not able to get this kind of scaling with my own case. Here are the main differences between your case and mine:

> - You have many more bands than I do: 648 vs 80

Does the lobpcg part carry the same weight in these two systems? In Au, the lobpcg part accounts for approximately the total time, and its perfect scaling gives ABINIT its superlinear behaviour.

> - Your FFT grid is symmetric, mine is not: 108 x 108 x 108 vs 36 x 36 x 512 (my unit cell is not symmetric)

Does the time spent in your FFT part (fourwf) increase strongly (more than twofold) between 1 and 32 processors? And what is its weight in the total time? Even though this FFT is strongly optimized, its scaling does not remain linear.


> - You have only one k-point: 1 vs 4

> - iscf: 7 vs 5

Even though this input variable has no influence on scaling, you should use Pulay mixing (on the potential only: iscf=7) rather than iscf=5. This mixing seems more efficient (more stable).
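Pulay mixing (also known as DIIS) builds the next input potential from a linear combination of previous inputs and their residuals, choosing coefficients that minimise the norm of the combined residual under the constraint that they sum to one. The Python sketch below is a toy illustration of that generic idea on a small fixed-point problem x = g(x), assuming plain lists as vectors. `pulay_mix`, `beta` and `history` are hypothetical names; this is not ABINIT's iscf=7 implementation and it omits the safeguards a production code needs against ill-conditioned residual histories.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical toy implementation of Pulay (DIIS) mixing; not ABINIT's iscf=7 code.


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _solve(mat: List[List[float]], rhs: Vector) -> Vector:
    """Solve mat @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def pulay_mix(g: Callable[[Vector], Vector], x0: Vector, beta: float = 0.5,
              history: int = 4, tol: float = 1e-10, maxiter: int = 200) -> Vector:
    """Iterate towards x = g(x), mixing the last `history` inputs and residuals."""
    xs: List[Vector] = []
    rs: List[Vector] = []
    x = list(x0)
    for _ in range(maxiter):
        r = [gx - xi for gx, xi in zip(g(x), x)]  # residual R = g(x) - x
        if _dot(r, r) ** 0.5 < tol:
            return x
        xs = (xs + [x])[-history:]
        rs = (rs + [r])[-history:]
        m = len(rs)
        # Minimise |sum_i c_i R_i|^2 subject to sum_i c_i = 1 via a Lagrange
        # multiplier: [[B, 1], [1^T, 0]] [c, lam] = [0, 1], with B_ij = R_i . R_j.
        mat = [[_dot(rs[i], rs[j]) for j in range(m)] + [1.0] for i in range(m)]
        mat.append([1.0] * m + [0.0])
        c = _solve(mat, [0.0] * m + [1.0])[:m]
        # Next input: optimal combination of inputs plus a damped residual step.
        x = [sum(c[i] * (xs[i][k] + beta * rs[i][k]) for i in range(m))
             for k in range(len(x))]
    raise RuntimeError("Pulay mixing did not converge")


if __name__ == "__main__":
    # Toy affine map g(x) = A x + b with a contracting A (illustrative numbers only).
    A = [[0.5, 0.1, 0.0], [0.1, 0.4, 0.1], [0.0, 0.1, 0.3]]
    b = [1.0, 2.0, 3.0]

    def g(x: Vector) -> Vector:
        return [_dot(row, x) + bi for row, bi in zip(A, b)]

    print(pulay_mix(g, [0.0, 0.0, 0.0]))
```

In this toy the history length is kept at most one more than the vector dimension; a longer history would make the residual differences linearly dependent and the small bordered system singular. In a real SCF calculation the potential has far more components than the history length, so that limit does not bite in the same way.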


> Here are my observations:

> - reducing the number of FFT grid points in the z direction did not change the scaling behavior
> - reducing the number of k-points (4 -> 1) gave better speedup factors for large numbers of CPUs, e.g. for 32 CPUs the speedup was 20 with 4 k-points and 27 with one.

That is interesting. It seems that lobpcg (called for each k-point) does not account for the total time.
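Bottin's remark can be made semi-quantitative with the Karp-Flatt metric, a standard diagnostic that is not used in the original thread. Assuming an Amdahl's-law model, in which a fraction e of the single-CPU time does not scale at all, a measured speedup S on N processors gives e = (1/S - 1/N)/(1 - 1/N). A minimal Python sketch applied to the two 32-CPU speedups reported above (20 with 4 k-points, 27 with one); the function names are illustrative, and the 128-CPU figure is only an extrapolation under the model, not a prediction of what ABINIT will achieve.

```python
def karp_flatt(speedup: float, ncpu: int) -> float:
    """Experimentally determined serial fraction e from a measured speedup."""
    if ncpu < 2:
        raise ValueError("the metric needs at least 2 processors")
    return (1.0 / speedup - 1.0 / ncpu) / (1.0 - 1.0 / ncpu)


def amdahl_speedup(serial_fraction: float, ncpu: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / ncpu)


# 32-CPU speedups reported in this thread for the poster's own case.
for label, s in [("4 k-points", 20.0), ("1 k-point", 27.0)]:
    e = karp_flatt(s, 32)
    print(f"{label}: e = {e:.4f}; Amdahl extrapolation to 128 CPUs = "
          f"{amdahl_speedup(e, 128):.1f}")
```

Under this model the 4-k-point run has a larger effective non-scaling fraction, although the single number lumps together communication, load imbalance and any non-scaling FFT time. For a superlinear run such as the Au case (S = 40 on 32 CPUs) the metric turns negative, which simply shows that Amdahl's model cannot describe superlinear behaviour.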

> - changing iscf to 7 reduced the overall time but did not improve the scaling.
> - changing the orientation of the unit cell (and hence of the FFT grid) did not change the scaling either.

> I also noticed that the -O2 flag gives better performance (shorter execution times) than the -O3 flag.

For both the Au and GaAsN systems? For gold, the code is two times faster (if I remember correctly) when compiled with the -O3 flag.

> All these calculations, and those reported in my previous posts, were done on an Altix 3700 with 128 CPUs:
> - CPUs: 1.5 GHz Itanium 2 with 6 MB of L3 cache
> - memory: 4 GB of DDR RAM per CPU
> - communication: NUMAlink 3
> - software: Intel compiler (version 9.1) and MKL library (version 8.1); the -O2 flag was used to compile both.

> If you have any suggestions to improve my scaling, let me know.

> Regards,

Thank you for your observations and good luck.
Best regards
Francois


--
##############################################################
Francois Bottin tel: 01 69 26 41 73
CEA/DIF fax: 01 69 26 70 77
BP 12 Bruyeres-le-Chatel email: Francois.Bottin@cea.fr
##############################################################



