Hi David,
1) Did you try with the latest corrections, contained in the latest
5.8.3 bzr revision or in 5.8.4 (at least revision 507)?
Muriel Delaveau and I found several improvements for the writing of the
WFK file with MPI-IO; these corrections improve the portability of the
code and have been merged in revision 507. We found that the code
crashed on several architectures because of a wrong treatment of
buffers... and we hope to have corrected that.
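To make the failure mode concrete, here is a minimal standalone sketch
of the kind of MPI-IO pattern involved: an indexed-block datatype
describing each process's slice of the plane-wave coefficients, used as
a file view for a collective write. This is not ABINIT's actual
wffwritecg routine, and all names (NBLOCKS, BLOCKLEN, displs, WFK_demo)
are illustrative; note that the backtrace quoted below fails inside
MPI_Type_create_indexed_block, exactly the call that a mishandled
displacement buffer would break.

/* Hedged sketch of an indexed-block MPI-IO write (illustrative names,
 * not ABINIT's real code). Compile with: mpicc -o demo demo.c */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns NBLOCKS blocks of BLOCKLEN doubles,
     * interleaved round-robin across ranks in the global file. */
    enum { NBLOCKS = 4, BLOCKLEN = 2 };
    int displs[NBLOCKS];                 /* in units of MPI_DOUBLE */
    for (int i = 0; i < NBLOCKS; i++)
        displs[i] = (i * nprocs + rank) * BLOCKLEN;

    /* The reported crashes happen inside this call when the
     * displacement buffer is mistreated; here it is sized correctly. */
    MPI_Datatype filetype;
    MPI_Type_create_indexed_block(NBLOCKS, BLOCKLEN, displs,
                                  MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double buf[NBLOCKS * BLOCKLEN];
    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++)
        buf[i] = rank + 0.1 * i;

    /* Collective write through the file view described by filetype. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "WFK_demo",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native",
                      MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCKLEN, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}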
2) If you want to be able to use the file with anaddb, you have to:
- use the latest 5.8.3 branch (or 5.8.4),
or
- use the --enable-mpi-io-buggy option when building the code
(no longer needed after 5.8.3 rev 507; see the example below), at the
risk of running into the buffer problems mentioned above.
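For reference, and assuming the usual autotools build of the 5.8.x
tarballs, this fallback is simply passed at configure time, i.e.
./configure --enable-mpi-io-buggy followed by make.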
We tested the new changes with ifort and gcc 4.3, using both MPICH and
Open MPI.
In band-FFT parallelism, the memory is split across processors for the
wavefunctions but not for other quantities, especially if you use PAW.
We plan to correct that soon.
Marc
-------- Original Message --------
From: David Waroquiers [mailto:david.waroquiers@uclouvain.be]
Date: Fri 19/06/2009 12:29
To: forum@abinit.org
Subject: [abinit-forum] Band/FFT parallelism on large systems
Hello all,
I have tried to use the band/FFT parallelism on large supercells
(a-SiO2, 72 atoms and 108 atoms). I ran into a problem when using many
bands (4480): the calculation reaches convergence but crashes at the
end of the run, when it is supposed to write the WFK file (the outwf
call). I tried running the calculation on 16, 32 and 64 processors.
I have tried with fewer bands (640) and it works. Do you have any idea
how to overcome this problem? The WFK file should be about 4 GB, and
the available memory on the cluster is larger than that. By the way, in
the band/FFT parallelism approach, the memory for the wavefunctions is
split across the different CPUs, isn't it?
I encountered another problem when using cut3d to analyse a WFK file
generated with band/FFT parallelism: it does not recognise the file as
a valid WFK file (roughly the same message as when band/FFT parallelism
did not allow restarting with a different number of processors, before
version 5.8 if I remember correctly). Any idea about this too?
My input file is given hereafter, and the log messages follow it. I am
using the public version 5.8.3, revision 485, and the machines used are
the "green" cluster at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes
in a Dell Blade M1000e enclosure, with 16 GB (or 32 GB for some nodes)
per node of 8 processors.
Thanks a lot
David Waroquiers
PhD Student
UCL - PCPM - ETSF
My input file:
# Amorphous SiO2: generation of the WFK file needed for the KSS (for GW corrections)
# Dataset 1: GS calculation (_DEN generation)
# Dataset 2: GS calculation with many bands (_WFK generation)
ndtset 2
jdtset 1 2
timopt 2
# Dataset 1 : _DEN file generation (Density)
tolvrs1 1.0d-12
prtden1 1
nstep1 5 #5 for testing
iscf1 7
npulayit1 7
nband1 256
# Dataset 2 : _WFK file (Wavefunction)
tolwfr2 1.0d-12
nband2 4480
nbdbuf2 384
istwfk2 1
iscf2 7
nstep2 5 #5 for testing
getden2 1
# Options for Band/FFT Parallelism
paral_kgb 1
wfoptalg 14
nloalg 4
fftalg 401
iprcch 4
intxc 0
istwfk 1
fft_opt_lob 2
npfft 1
npband 16 #32 #64
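# Note (added; an assumption based on the paral_kgb scheme, not part of
# the original input): the total number of MPI processes is expected to
# equal npkpt*npband*npfft, i.e. 1*16*1 = 16 for the 16-processor run
# (32 and 64 via npband), and nband2 (4480) must be divisible by npband
# (4480/16 = 280 bands per band group).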
# K-point mesh
kptopt 0
kpt 0.0 0.0 0.0
# System definition
# Unit cell
acell 1.9465690950E+01 1.9465690950E+01 1.9465690950E+01
rprim 1 0 0
0 1 0
0 0 1
# Atom types
ntypat 2
znucl 8 14
# Atoms and coordinates
natom 72
typat 48*1 24*2
xcart 1.8342971905E+01 1.0013093348E+01 4.9948115472E+00
      1.8450118788E+01 5.1100335358E+00 1.1410341879E+01
      3.0243029960E+00 1.7006888337E+01 1.0689037523E+01
      6.3068666011E+00 1.4446482399E+01 7.9505060279E+00
      1.9178811503E+01 4.4712567836E-01 3.4641995090E+00
      9.5178783093E+00 1.2762912471E+01 1.4947329016E+01
      1.7402433472E+01 4.7067303120E+00 1.5833402903E+00
      1.0623164695E+01 2.7299953166E+00 8.7471659694E+00
      1.2931573871E+01 1.8128981231E+01 6.7007362518E+00
      1.8660924236E+01 1.4792395464E+01 3.1319031106E+00
      7.0217232014E+00 6.3190579071E+00 2.1266991430E+00
      4.1181163909E-01 5.0929210080E+00 5.7193503290E+00
      7.6209880479E+00 1.5443775482E+00 6.1023412080E-01
      1.7923134211E+01 9.4056919719E+00 1.3628670860E+01
      1.4710748045E+01 9.1118601940E+00 1.7566857742E+01
      1.0411995344E+01 1.0041061607E+00 1.5870123306E+01
      1.0980496920E+01 1.3629862231E+01 6.8821852197E+00
      1.2756648650E+01 9.3922889131E+00 1.2966781879E+01
      1.3710153187E+01 2.2151381385E+00 1.9176017166E+01
      5.9015247795E+00 1.8254646045E+01 1.6364133902E+01
      3.5889689987E+00 8.6729161022E+00 4.9047876611E+00
      1.4649631278E+01 1.1782133781E+01 2.4189697381E+00
      1.3094524372E+01 1.5574388332E+01 1.1017906884E+01
      1.8122798453E+00 1.5904671691E+01 1.5390374184E+01
      8.0934509994E+00 9.9606459884E+00 5.6351418737E+00
      1.0388873243E+01 1.1258002356E+01 1.9535431306E+01
      1.7801695829E+01 1.5681701759E+01 1.1954743795E+01
      2.9395289639E+00 3.6212308778E+00 1.4808160737E+00
      1.3785141980E+01 3.1146153451E+00 4.7897808777E+00
      6.6125694236E+00 3.8955369666E+00 1.1802613942E+01
      1.0543336669E+00 8.8480531151E+00 9.2302571597E+00
      1.5034376672E+01 1.3207034271E+01 1.5126390258E+01
      1.9223920516E+01 6.5595988246E-01 1.3020475817E+01
      6.6553078921E+00 5.1934209327E+00 6.9894256581E+00
      1.4918361618E+01 3.1596212425E+00 1.4324193688E+01
      8.0804273193E+00 7.9884008127E+00 1.4307619386E+01
      5.7570753518E+00 1.3551949199E+01 1.8079850277E+01
      1.2833144388E+01 6.9576781789E+00 1.8702339976E+00
      2.7890960157E+00 1.7032376017E+01 6.7568473875E-01
      8.6760457768E+00 1.0859908527E+01 1.0407253204E+01
      6.3690907257E+00 2.2769273004E-01 8.2629069843E+00
      1.4623475391E+01 1.7952319809E+01 1.5406783784E+01
      1.5775821227E+01 1.3896960139E+01 6.9539101570E+00
      1.5477566296E+01 1.7519166868E+00 9.5117606862E+00
      3.1098755647E+00 7.2414373656E+00 1.3444441571E+01
      2.9576688783E+00 1.7045648497E+01 5.6738016905E+00
      9.7659864282E+00 1.6334927247E+01 1.8709494220E+01
      5.6015780233E+00 4.6820174692E+00 1.6849684158E+01
      1.3193293623E+01 1.6296954721E+00 7.4269549058E+00
      4.0153579861E+00 1.6089810803E+01 1.7511617105E+01
      1.5080013653E+01 1.5674902127E+01 1.3378910449E+01
      1.3468917372E+01 1.2396666756E+00 1.6246453919E+01
      3.1443097800E-01 3.4518653529E+00 3.0738155414E+00
      1.6864054813E+01 1.2620177700E+01 4.3720388857E+00
      1.3252228290E+01 1.5343974821E+01 7.9284144425E+00
      8.4872425534E+00 1.9054897865E+01 1.7815010425E+01
      1.5170448087E+01 1.0186883021E+01 1.4748027393E+01
      7.5516402653E+00 3.0013700719E+00 8.9766200084E+00
      6.4090355722E+00 7.4843741588E+00 4.9295671605E+00
      4.3705611827E-01 7.6893073781E+00 1.2001555962E+01
      8.4238741309E+00 1.2232714786E+01 7.6995337657E+00
      5.8387974184E+00 5.9155119378E+00 1.4039991791E+01
      1.3107235988E+01 9.8055489044E+00 6.4400593019E-01
      8.3270647814E-01 1.7227458132E+01 1.2775664290E+01
      1.4372625432E+01 4.2560000137E+00 1.9730406948E+00
      5.7914453145E+00 4.0664955533E+00 3.9036518542E-01
      9.7815513593E+00 1.0257448955E+01 1.3164763822E+01
      1.6979663973E+01 2.5757556368E+00 1.2070399003E+01
      9.1476280310E-01 8.1192625454E+00 6.2498371664E+00
      8.8902943261E+00 1.3433615492E+01 1.7894990037E+01
      4.7238007437E+00 1.7074503731E+01 8.1487422033E+00
      1.1337419675E+00 1.7170180156E+01 3.2442179093E+00
# Energy cutoff for the planewaves
ecut 32.0
# Parameters for the SCF cycles
nstep 5
diemac 4.0
ixc 11
Here is the end of the message log for the 72-atom cell with 4480
bands, run on 16 processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000 leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on
node node054 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Here is the end of the log file for the 108-atom cell (with 3200
bands), run on 32 processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file as108nr_004o_DS2_WFK
-P-0000 leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1) [0x2b389ea74951]
[node090:26797] [ 2] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3) [0x2b389ea750c3]
[node090:26797] [ 3] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8) [0x2b389ea9d848]
[node090:26797] [ 4] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38) [0x2b389e828780]
[node090:26797] [ 5] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a) [0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------