
Re: RE : [abinit-forum] Band/FFT parallelism on large systems


  • From: TORRENT Marc <marc.torrent@cea.fr>
  • To: forum@abinit.org
  • Subject: Re: RE : [abinit-forum] Band/FFT parallelism on large systems
  • Date: Mon, 22 Jun 2009 12:01:01 +0200
  • Organization: CEA-DAM

Hi again David,

To simplify things: could you test with the latest 5.9/trunk/5.9.0-public?
The treatment of the MPI-IO WFK file has been changed for 5.9, and it is more convenient for us to debug this latest branch.
Thanks

Marc


Marc.TORRENT@cea.fr wrote:

Hi David,

1) Did you try with the latest corrections contained in the 5.8.3 bzr branch or in 5.8.4 (at least revision 507)? Muriel Delaveau and I found several improvements for the writing of the WFK file with MPI-IO; these corrections improve the portability of the code and have been merged in revision 507. We found that the code crashed on several architectures because of an incorrect treatment of buffers, and we hope to have corrected that.

2) If you want to be able to use the file with anaddb, you have to:
- use the latest 5.8.3 branch (or 5.8.4),
or
- use the --enable-mpi-io-buggy option when building the code (no longer needed after 5.8.3 rev 507); but you may run into buffer problems in that case. A build sketch is given below.
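
For the second route, here is a minimal sketch of the build step, assuming the autoconf-based build system of the 5.8 series; only the --enable-mpi-io-buggy flag comes from the discussion above, and the compilers and remaining options are placeholders to adapt to your own environment:

   # Hypothetical environment: MPI compiler wrappers and parallel support enabled.
   # Only --enable-mpi-io-buggy is taken from this thread; check ./configure --help
   # for the exact option names of your version.
   ./configure FC=mpif90 CC=mpicc --enable-mpi --enable-mpi-io-buggy
   make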


We tested the new changes with ifort and gcc 4.3, using MPICH and Open MPI.

In band-FFT parallelism, the memory is split for the wavefunctions but not for other quantities, especially if you use PAW. We plan to correct that soon.
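
To make that statement concrete, here is an illustrative sketch of the process layout implied by the variables already used in David's input below; the npkpt line and the process-count relation are my assumptions about how paral_kgb distributes work, not something stated in this thread:

   # Illustrative only: with paral_kgb 1 the number of MPI processes is
   # expected to be npkpt * npband * npfft (a single Gamma-point run here, npkpt = 1).
     paral_kgb   1
     npband      16    # bands, and hence the wavefunction memory, split 16 ways
     npfft       1     # no splitting at the FFT level
   # -> 16 processes in total; other arrays (e.g. the PAW-related ones mentioned
   #    above) are still duplicated on every process in the 5.8 series.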

Marc




-------- Original Message --------
From: David Waroquiers [mailto:david.waroquiers@uclouvain.be]
Date: Fri 19/06/2009 12:29
To: forum@abinit.org
Subject: [abinit-forum] Band/FFT parallelism on large systems
 
Hello all,

I have tried to use the band/FFT parallelism on a large supercell (a-SiO2, 72 atoms and 108 atoms). I encountered a problem when using a large number of bands (4480 bands): the run reaches convergence but crashes at the end, when it is supposed to write the WFK file (outwf call). I tried running the calculation with 16, 32 and 64 processors.

I have tried with fewer bands (640) and it works.
Do you have any idea how to overcome this problem? The WFK file is expected to be about 4 GB, and the available memory on the clusters is larger than that. By the way, in the band/FFT parallelism approach the memory for the wavefunctions is split across the different CPUs, isn't it?
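
(As a rough check on that figure, assuming the usual plane-wave count estimate: with ecut = 32 Ha the cutoff radius is sqrt(2*32) = 8 Bohr^-1, so npw ≈ V*kmax^3/(6*pi^2) ≈ 19.47^3 * 512 / 59.2 ≈ 6.4e4 plane waves per band at Gamma with istwfk 1. Storing 4480 bands as double-precision complex coefficients then takes roughly 4480 * 6.4e4 * 16 bytes ≈ 4.6 GB, consistent with the quoted file size.)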

I encountered another problem when using cut3d to analyse the WFK file generated with the band/FFT parallelism: it does not recognise the file as a valid WFK file (roughly the same message as before version 5.8, if I remember correctly, when band/FFT parallelism did not allow restarting with a different number of processors). Any idea about that too?

My input file is given hereafter, and the log messages follow it. I am using public version 5.8.3, revision 485, and the machines are the "green" clusters at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes in Dell Blade M1000e enclosures, with 16 GB (or 32 GB for some nodes) per node of 8 processors.

Thanks a lot

David Waroquiers
PhD Student
UCL - PCPM - ETSF




My input file:

# Amorphous SiO2 : Generation of the WFK file needed for the KSS (for GW
corrections)
# Dataset 1 : GS calculation (_DEN generation)
# Dataset 2 : GS calculation with many bands (_WFK generation)

  ndtset      2
  jdtset       1 2
  timopt      2

# Dataset 1 : _DEN file generation (Density)

  tolvrs1     1.0d-12
  prtden1     1
  nstep1      5 #5 for testing
  iscf1       7
  npulayit1   7
  nband1      256

# Dataset 2 : _WFK file (Wavefunction)

  tolwfr2     1.0d-12
  nband2      4480
  nbdbuf2     384
  istwfk2     1
  iscf2       7
  nstep2      5 #5 for testing
  getden2     1

# Options for Band/FFT Parallelism

  paral_kgb   1
  wfoptalg    14
  nloalg      4
  fftalg      401
  iprcch      4
  intxc       0
  istwfk      1
  fft_opt_lob 2
  npfft       1
  npband      16 #32 #64

# K-point mesh

  kptopt      0
  kpt         0.0 0.0 0.0

# System definition
# Unit cell

  acell       1.9465690950E+01  1.9465690950E+01  1.9465690950E+01
  rprim       1 0 0
              0 1 0
              0 0 1

# Atom types

  ntypat      2
  znucl       8 14

# Atoms and coordinates

  natom       72
  typat       48*1 24*2
  xcart       1.8342971905E+01  1.0013093348E+01  4.9948115472E+00
              1.8450118788E+01  5.1100335358E+00  1.1410341879E+01
              3.0243029960E+00  1.7006888337E+01  1.0689037523E+01
              6.3068666011E+00  1.4446482399E+01  7.9505060279E+00
              1.9178811503E+01  4.4712567836E-01  3.4641995090E+00
              9.5178783093E+00  1.2762912471E+01  1.4947329016E+01
              1.7402433472E+01  4.7067303120E+00  1.5833402903E+00
              1.0623164695E+01  2.7299953166E+00  8.7471659694E+00
              1.2931573871E+01  1.8128981231E+01  6.7007362518E+00
              1.8660924236E+01  1.4792395464E+01  3.1319031106E+00
              7.0217232014E+00  6.3190579071E+00  2.1266991430E+00
              4.1181163909E-01  5.0929210080E+00  5.7193503290E+00
              7.6209880479E+00  1.5443775482E+00  6.1023412080E-01
              1.7923134211E+01  9.4056919719E+00  1.3628670860E+01
              1.4710748045E+01  9.1118601940E+00  1.7566857742E+01
              1.0411995344E+01  1.0041061607E+00  1.5870123306E+01
              1.0980496920E+01  1.3629862231E+01  6.8821852197E+00
              1.2756648650E+01  9.3922889131E+00  1.2966781879E+01
              1.3710153187E+01  2.2151381385E+00  1.9176017166E+01
              5.9015247795E+00  1.8254646045E+01  1.6364133902E+01
              3.5889689987E+00  8.6729161022E+00  4.9047876611E+00
              1.4649631278E+01  1.1782133781E+01  2.4189697381E+00
              1.3094524372E+01  1.5574388332E+01  1.1017906884E+01
              1.8122798453E+00  1.5904671691E+01  1.5390374184E+01
              8.0934509994E+00  9.9606459884E+00  5.6351418737E+00
              1.0388873243E+01  1.1258002356E+01  1.9535431306E+01
              1.7801695829E+01  1.5681701759E+01  1.1954743795E+01
              2.9395289639E+00  3.6212308778E+00  1.4808160737E+00
              1.3785141980E+01  3.1146153451E+00  4.7897808777E+00
              6.6125694236E+00  3.8955369666E+00  1.1802613942E+01
              1.0543336669E+00  8.8480531151E+00  9.2302571597E+00
              1.5034376672E+01  1.3207034271E+01  1.5126390258E+01
              1.9223920516E+01  6.5595988246E-01  1.3020475817E+01
              6.6553078921E+00  5.1934209327E+00  6.9894256581E+00
              1.4918361618E+01  3.1596212425E+00  1.4324193688E+01
              8.0804273193E+00  7.9884008127E+00  1.4307619386E+01
              5.7570753518E+00  1.3551949199E+01  1.8079850277E+01
              1.2833144388E+01  6.9576781789E+00  1.8702339976E+00
              2.7890960157E+00  1.7032376017E+01  6.7568473875E-01
              8.6760457768E+00  1.0859908527E+01  1.0407253204E+01
              6.3690907257E+00  2.2769273004E-01  8.2629069843E+00
              1.4623475391E+01  1.7952319809E+01  1.5406783784E+01
              1.5775821227E+01  1.3896960139E+01  6.9539101570E+00
              1.5477566296E+01  1.7519166868E+00  9.5117606862E+00
              3.1098755647E+00  7.2414373656E+00  1.3444441571E+01
              2.9576688783E+00  1.7045648497E+01  5.6738016905E+00
              9.7659864282E+00  1.6334927247E+01  1.8709494220E+01
              5.6015780233E+00  4.6820174692E+00  1.6849684158E+01
              1.3193293623E+01  1.6296954721E+00  7.4269549058E+00
              4.0153579861E+00  1.6089810803E+01  1.7511617105E+01
              1.5080013653E+01  1.5674902127E+01  1.3378910449E+01
              1.3468917372E+01  1.2396666756E+00  1.6246453919E+01
              3.1443097800E-01  3.4518653529E+00  3.0738155414E+00
              1.6864054813E+01  1.2620177700E+01  4.3720388857E+00
              1.3252228290E+01  1.5343974821E+01  7.9284144425E+00
              8.4872425534E+00  1.9054897865E+01  1.7815010425E+01
              1.5170448087E+01  1.0186883021E+01  1.4748027393E+01
              7.5516402653E+00  3.0013700719E+00  8.9766200084E+00
              6.4090355722E+00  7.4843741588E+00  4.9295671605E+00
              4.3705611827E-01  7.6893073781E+00  1.2001555962E+01
              8.4238741309E+00  1.2232714786E+01  7.6995337657E+00
              5.8387974184E+00  5.9155119378E+00  1.4039991791E+01
              1.3107235988E+01  9.8055489044E+00  6.4400593019E-01
              8.3270647814E-01  1.7227458132E+01  1.2775664290E+01
              1.4372625432E+01  4.2560000137E+00  1.9730406948E+00
              5.7914453145E+00  4.0664955533E+00  3.9036518542E-01
              9.7815513593E+00  1.0257448955E+01  1.3164763822E+01
              1.6979663973E+01  2.5757556368E+00  1.2070399003E+01
              9.1476280310E-01  8.1192625454E+00  6.2498371664E+00
              8.8902943261E+00  1.3433615492E+01  1.7894990037E+01
              4.7238007437E+00  1.7074503731E+01  8.1487422033E+00
              1.1337419675E+00  1.7170180156E+01  3.2442179093E+00

# Energy cutoff for the planewaves

  ecut        32.0

# Parameters for the SCF cycles

  nstep       5
  diemac      4.0
  ixc         11






Here is the end of the message log for the 72-atom cell with 4480 bands, run on 16 processors:

================================================================================

 ----iterations are completed or convergence reached----

 outwf  : write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000  leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on
node node054 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt /
mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
/ error messages





Here is the end of the log file I got from the 108-atom cell (with 3200 bands), run on 32 processors:

================================================================================

 ----iterations are completed or convergence reached----

 outwf  : write wavefunction to file as108nr_004o_DS2_WFK
-P-0000  leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1)
[0x2b389ea74951]
[node090:26797] [ 2]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3)
[0x2b389ea750c3]
[node090:26797] [ 3]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8)
[0x2b389ea9d848]
[node090:26797] [ 4]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38)
[0x2b389e828780]
[node090:26797] [ 5]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a)
[0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------




