  Hi David,
  
  1) Did you try with the latest corrections contained in the last 5.8.3
bzr revision, or in 5.8.4 (at least revision 507)? Muriel Delaveau and I
found several improvements for the writing of the WFK file with MPI-IO;
these corrections improve the portability of the code and have been
merged in revision 507. We found that the code crashed on several
architectures because of a wrong treatment of buffers, and we hope to
have corrected that.
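
  The traceback at the end of your message dies inside
MPI_Type_create_indexed_block, called from wffwritecg. As a minimal,
self-contained C sketch of the kind of MPI-IO pattern involved (a derived
datatype describing each rank's blocks of the wavefunction array, used as
the file view for a collective write; the block layout and file name below
are purely illustrative assumptions, not ABINIT's actual code):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Illustrative layout: each rank owns nblocks blocks of blocklen
         doubles; displacements are in units of MPI_DOUBLE.  A displacement
         array that does not match the buffer actually handed to the write
         is the kind of "wrong treatment of buffers" that can crash here. */
      enum { nblocks = 3, blocklen = 4 };
      int displs[nblocks];
      for (int i = 0; i < nblocks; i++)
          displs[i] = (rank * nblocks + i) * blocklen;

      MPI_Datatype ftype;
      MPI_Type_create_indexed_block(nblocks, blocklen, displs,
                                    MPI_DOUBLE, &ftype);
      MPI_Type_commit(&ftype);

      double buf[nblocks * blocklen];
      for (int i = 0; i < nblocks * blocklen; i++)
          buf[i] = (double) rank;

      /* Collective write: the file view selects this rank's blocks. */
      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "demo_WFK",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_DOUBLE, ftype, "native", MPI_INFO_NULL);
      MPI_File_write_all(fh, buf, nblocks * blocklen, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
      MPI_File_close(&fh);

      MPI_Type_free(&ftype);
      MPI_Finalize();
      return 0;
  }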
  2) If you want to be able to use the file with anaddb, you have to:
  
  - use the latest 5.8.3 branch (or 5.8.4),
  
  or
  
  - use the --enable-mpi-io-buggy option when building the code (see the
configure sketch below; this option is no longer needed after 5.8.3
rev507), but you may run into buffer problems in that case.
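
  For instance (a sketch only; keep whatever other configure options you
normally use):

    ./configure --enable-mpi-io-buggy [your other usual options]
    make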
  
  We tested the new changes with ifort and gcc 4.3, using both MPICH and
Open MPI.
  
  In band-FFT parallelism the memory for the wavefunctions is split
across processes, but not the memory for other quantities, especially if
you use PAW. We plan to correct that soon.
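
  (For example, a 4 GB set of wavefunctions distributed over npband = 16
amounts to roughly 250 MB of wavefunction memory per process, while any
quantity that is not distributed is still replicated in full on each
process.)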
  Marc
  
  
  
  
  -------- Original message --------
  
  From: David Waroquiers [mailto:david.waroquiers@uclouvain.be]
  
  Date: Fri 19/06/2009 12:29
  
  To: forum@abinit.org
  
  Subject: [abinit-forum] Band/FFT parallelism on large systems
  
   
  
  Hello all,
  
  I have tried to use the band/FFT parallelism on a large supercell
(a-SiO2, 72 atoms and 108 atoms). I encountered a problem when using a
lot of bands (4480 bands): the run reaches convergence but crashes at
the end, when it is supposed to write the WFK file (outwf call). I tried
running the calculation with 16, 32 and 64 processors.
  
  I have tried with fewer bands (640) and it works.
  
  Do you have any idea how to overcome this problem? The WFK file is
supposed to be 4 GB and the available memory on the clusters is more
than that. By the way, in the band/FFT parallelism approach, the memory
for the wavefunctions is split among the different CPUs, isn't it?
  
  I encountered another problem when using cut3d to analyse the WFK file
generated with the band/FFT parallelism: it does not recognise the file
as a valid WFK file (about the same message as when band/FFT parallelism
did not allow restarting with a different number of processors, before
version 5.8 if I'm right). Any idea about this one too?
  
  My input file is below and the log messages follow it. I'm using
public version 5.8.3, revision 485, and the machines used are the
"green" clusters at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes in
Dell Blade M1000e enclosures, with 16 GB (or 32 GB for some nodes) per
node of 8 processors.
  
  Thanks a lot
  
  David Waroquiers
  
  PhD Student
  
  UCL - PCPM - ETSF
  
  
  
  
  My input file:
  
  # Amorphous SiO2 : Generation of the WFK file needed for the KSS
  # (for GW corrections)
  # Dataset 1 : GS calculation (_DEN generation)
  # Dataset 2 : GS calculation with many bands (_WFK generation)
  
    ndtset      2
    jdtset      1 2
    timopt      2
  
  # Dataset 1 : _DEN file generation (Density)
    tolvrs1     1.0d-12
    prtden1     1
    nstep1      5 #5 for testing
    iscf1       7
    npulayit1   7
    nband1      256
  
  # Dataset 2 : _WFK file (Wavefunction)
    tolwfr2     1.0d-12
    nband2      4480
    nbdbuf2     384
    istwfk2     1
    iscf2       7
    nstep2      5 #5 for testing
    getden2     1
  
  # Options for Band/FFT Parallelism
    paral_kgb   1
    wfoptalg    14
    nloalg      4
    fftalg      401
    iprcch      4
    intxc       0
    istwfk      1
    fft_opt_lob 2
    npfft       1
    npband      16 #32 #64
  
  # K-point mesh
    kptopt      0
    kpt         0.0 0.0 0.0
  
  # System definition
  # Unit cell
    acell       1.9465690950E+01  1.9465690950E+01  1.9465690950E+01
    rprim       1 0 0
                0 1 0
                0 0 1
  
  # Atom types
    ntypat      2
    znucl       8 14
  
  # Atoms and coordinates
    natom       72
    typat       48*1 24*2
  
    xcart       1.8342971905E+01  1.0013093348E+01  4.9948115472E+00
                1.8450118788E+01  5.1100335358E+00  1.1410341879E+01
                3.0243029960E+00  1.7006888337E+01  1.0689037523E+01
                6.3068666011E+00  1.4446482399E+01  7.9505060279E+00
                1.9178811503E+01  4.4712567836E-01  3.4641995090E+00
                9.5178783093E+00  1.2762912471E+01  1.4947329016E+01
                1.7402433472E+01  4.7067303120E+00  1.5833402903E+00
                1.0623164695E+01  2.7299953166E+00  8.7471659694E+00
                1.2931573871E+01  1.8128981231E+01  6.7007362518E+00
                1.8660924236E+01  1.4792395464E+01  3.1319031106E+00
                7.0217232014E+00  6.3190579071E+00  2.1266991430E+00
                4.1181163909E-01  5.0929210080E+00  5.7193503290E+00
                7.6209880479E+00  1.5443775482E+00  6.1023412080E-01
                1.7923134211E+01  9.4056919719E+00  1.3628670860E+01
                1.4710748045E+01  9.1118601940E+00  1.7566857742E+01
                1.0411995344E+01  1.0041061607E+00  1.5870123306E+01
                1.0980496920E+01  1.3629862231E+01  6.8821852197E+00
                1.2756648650E+01  9.3922889131E+00  1.2966781879E+01
                1.3710153187E+01  2.2151381385E+00  1.9176017166E+01
                5.9015247795E+00  1.8254646045E+01  1.6364133902E+01
                3.5889689987E+00  8.6729161022E+00  4.9047876611E+00
                1.4649631278E+01  1.1782133781E+01  2.4189697381E+00
                1.3094524372E+01  1.5574388332E+01  1.1017906884E+01
                1.8122798453E+00  1.5904671691E+01  1.5390374184E+01
                8.0934509994E+00  9.9606459884E+00  5.6351418737E+00
                1.0388873243E+01  1.1258002356E+01  1.9535431306E+01
                1.7801695829E+01  1.5681701759E+01  1.1954743795E+01
                2.9395289639E+00  3.6212308778E+00  1.4808160737E+00
                1.3785141980E+01  3.1146153451E+00  4.7897808777E+00
                6.6125694236E+00  3.8955369666E+00  1.1802613942E+01
                1.0543336669E+00  8.8480531151E+00  9.2302571597E+00
                1.5034376672E+01  1.3207034271E+01  1.5126390258E+01
                1.9223920516E+01  6.5595988246E-01  1.3020475817E+01
                6.6553078921E+00  5.1934209327E+00  6.9894256581E+00
                1.4918361618E+01  3.1596212425E+00  1.4324193688E+01
                8.0804273193E+00  7.9884008127E+00  1.4307619386E+01
                5.7570753518E+00  1.3551949199E+01  1.8079850277E+01
                1.2833144388E+01  6.9576781789E+00  1.8702339976E+00
                2.7890960157E+00  1.7032376017E+01  6.7568473875E-01
                8.6760457768E+00  1.0859908527E+01  1.0407253204E+01
                6.3690907257E+00  2.2769273004E-01  8.2629069843E+00
                1.4623475391E+01  1.7952319809E+01  1.5406783784E+01
                1.5775821227E+01  1.3896960139E+01  6.9539101570E+00
                1.5477566296E+01  1.7519166868E+00  9.5117606862E+00
                3.1098755647E+00  7.2414373656E+00  1.3444441571E+01
                2.9576688783E+00  1.7045648497E+01  5.6738016905E+00
                9.7659864282E+00  1.6334927247E+01  1.8709494220E+01
                5.6015780233E+00  4.6820174692E+00  1.6849684158E+01
                1.3193293623E+01  1.6296954721E+00  7.4269549058E+00
                4.0153579861E+00  1.6089810803E+01  1.7511617105E+01
                1.5080013653E+01  1.5674902127E+01  1.3378910449E+01
                1.3468917372E+01  1.2396666756E+00  1.6246453919E+01
                3.1443097800E-01  3.4518653529E+00  3.0738155414E+00
                1.6864054813E+01  1.2620177700E+01  4.3720388857E+00
                1.3252228290E+01  1.5343974821E+01  7.9284144425E+00
                8.4872425534E+00  1.9054897865E+01  1.7815010425E+01
                1.5170448087E+01  1.0186883021E+01  1.4748027393E+01
                7.5516402653E+00  3.0013700719E+00  8.9766200084E+00
                6.4090355722E+00  7.4843741588E+00  4.9295671605E+00
                4.3705611827E-01  7.6893073781E+00  1.2001555962E+01
                8.4238741309E+00  1.2232714786E+01  7.6995337657E+00
                5.8387974184E+00  5.9155119378E+00  1.4039991791E+01
                1.3107235988E+01  9.8055489044E+00  6.4400593019E-01
                8.3270647814E-01  1.7227458132E+01  1.2775664290E+01
                1.4372625432E+01  4.2560000137E+00  1.9730406948E+00
                5.7914453145E+00  4.0664955533E+00  3.9036518542E-01
                9.7815513593E+00  1.0257448955E+01  1.3164763822E+01
                1.6979663973E+01  2.5757556368E+00  1.2070399003E+01
                9.1476280310E-01  8.1192625454E+00  6.2498371664E+00
                8.8902943261E+00  1.3433615492E+01  1.7894990037E+01
                4.7238007437E+00  1.7074503731E+01  8.1487422033E+00
                1.1337419675E+00  1.7170180156E+01  3.2442179093E+00
  
  # Energy cutoff for the planewaves
    ecut        32.0
  
  # Parameters for the SCF cycles
    nstep       5
    diemac      4.0
    ixc         11
  
  
  
  
  
  
  Here is the end of the message log for the 72-atom cell with 4480
bands, run on 16 processors:
  
  ================================================================================
  
   ----iterations are completed or convergence reached----
  
   outwf  : write wavefunction to file asio2test_para16_4096o_DS2_WFK
  -P-0000  leave_test : synchronization done...
  --------------------------------------------------------------------------
  MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
  with errorcode 1.
  
  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun has exited due to process rank 1 with PID 6159 on
  node node054 exiting without calling "finalize". This may
  have caused other processes in the application to be
  terminated by signals sent by mpirun (as reported here).
  --------------------------------------------------------------------------
  [green:29817] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
  [green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  
  
  
  
  
  Here is the end of the log file I got for the 108-atom cell (with 3200
bands), run on 32 processors:
  
  ================================================================================
  
   ----iterations are completed or convergence reached----
  
   outwf  : write wavefunction to file as108nr_004o_DS2_WFK
  -P-0000  leave_test : synchronization done...
  [node090:26797] *** Process received signal ***
  [node090:26797] Signal: Segmentation fault (11)
  [node090:26797] Signal code: Address not mapped (1)
  [node090:26797] Failing at address: 0x2
  [node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
  [node090:26797] [ 1] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1) [0x2b389ea74951]
  [node090:26797] [ 2] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3) [0x2b389ea750c3]
  [node090:26797] [ 3] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8) [0x2b389ea9d848]
  [node090:26797] [ 4] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38) [0x2b389e828780]
  [node090:26797] [ 5] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
  [node090:26797] [ 6] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
  [node090:26797] [ 7] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
  [node090:26797] [ 8] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
  [node090:26797] [ 9] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
  [node090:26797] [10] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
  [node090:26797] [11] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
  [node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a) [0x43f642]
  [node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
  [node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
  [node090:26797] *** End of error message ***
  --------------------------------------------------------------------------
  mpirun noticed that process rank 8 with PID 26797 on node node090
  exited on signal 11 (Segmentation fault).
  --------------------------------------------------------------------------