- From: <Marc.TORRENT@cea.fr>
- To: <forum@abinit.org>
- Subject: RE : [abinit-forum] Band/FFT parallelism on large systems
- Date: Sat, 20 Jun 2009 15:15:05 +0200
Hi David,
1) Did you try with the latest corrections contained in the latest 5.8.3 bzr
revision, or in 5.8.4 (at least revision 507)? Muriel Delaveau and I made
several improvements to the writing of the WFK file with MPI-IO; these
corrections improve the portability of the code and have been merged into
revision 507. We found that the code crashed on several architectures because
of a wrong treatment of buffers... and hope to have corrected that.
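To illustrate the kind of pattern involved, here is a minimal standalone
C + MPI-IO sketch (this is not ABINIT code; the data layout, names and sizes
are invented for the example): each process describes the coefficients it owns
with an indexed-block datatype and writes them collectively into the shared
file. If the datatype does not match the buffer actually passed to the write
call, you get crashes of the kind shown in the backtrace below
(wffwritecg -> MPI_Type_create_indexed_block).

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Hypothetical distribution: this rank owns nblk blocks of blen doubles,
       laid out round-robin over the processes in the shared file. */
    const int nblk = 4, blen = 8;
    int displs[4];
    for (int i = 0; i < nblk; i++)
        displs[i] = (rank + i * nproc) * blen;

    /* File view: an indexed-block datatype describing exactly those blocks. */
    MPI_Datatype filetype;
    MPI_Type_create_indexed_block(nblk, blen, displs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* The local buffer must hold exactly what the view describes:
       nblk*blen doubles (a mismatch here is the kind of buffer problem
       that makes the write crash or corrupt the file). */
    double *buf = malloc((size_t)(nblk * blen) * sizeof(double));
    for (int i = 0; i < nblk * blen; i++) buf[i] = rank + 0.001 * i;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "wfk_sketch.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, nblk * blen, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}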
2) If you want to be able to use the file with anaddb, you have to:
- use the latest 5.8.3 branch (or 5.8.4),
or
- use the --enable-mpi-io-buggy option when building the code (no longer needed
after 5.8.3 rev 507); but you may run into buffer problems in that case.
We tested the new changes with ifort and gcc 4.3, using mpich and open-mpi.
In band-FFT, the memory is split for the wavefunctions (WFK) but not for other
quantities, especially if you use PAW. We plan to correct that soon.
Marc
-------- Original message --------
From: David Waroquiers [mailto:david.waroquiers@uclouvain.be]
Date: Fri 19/06/2009 12:29
To: forum@abinit.org
Subject: [abinit-forum] Band/FFT parallelism on large systems
Hello all,
I have tried to use the band/FFT parallelism on large supercells (a-SiO2, 72
atoms and 108 atoms). I encountered a problem when using a large number of
bands (4480 bands): the calculation reaches convergence but crashes at the end
of the run, when it is supposed to write the WFK file (outwf call). I tried
running the calculation on 16, 32 and 64 processors.
I have tried with fewer bands (640) and it works.
Do you have any idea how to overcome this problem? The WFK file is supposed to
be about 4 GB and the available memory on the clusters is more than that. By
the way, in the band/FFT parallelism approach, the memory for the wavefunctions
is split over the different CPUs, isn't it?
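As a rough check on that figure (the plane-wave count below is only an estimate
from ecut and the cell volume, so take the numbers as order of magnitude):

  N_pw ≈ Ω (2 ecut)^{3/2} / (6 π^2) ≈ 19.47^3 × 8^3 / (6 π^2) ≈ 6.4 × 10^4
  size(WFK) ≈ nband × N_pw × 16 bytes ≈ 4480 × 6.4 × 10^4 × 16 B ≈ 4.6 GB

so if the coefficients are indeed distributed, each of the 16 band processors
should only have to hold roughly 290 MB of them.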
I encountered another problem when using cut3d to analyse the WFK generated
with the band/FFT parallelism: it does not recognise the file as a valid WFK
file (roughly the same message as when band/FFT parallelism did not allow
restarting with a different number of processors, before version 5.8 if I
remember correctly). Any idea about this as well?
My input file is given hereafter and the log messages follow it. I am using
public version 5.8.3, revision 485, and the machines used are the "green"
cluster at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes in Dell Blade
M1000e enclosures, with 16 GB (or 32 GB for some nodes) per node of 8
processors.
Thanks a lot
David Waroquiers
PhD Student
UCL - PCPM - ETSF
My input file:
# Amorphous SiO2: generation of the WFK file needed for the KSS (for GW corrections)
# Dataset 1 : GS calculation (_DEN generation)
# Dataset 2 : GS calculation with many bands (_WFK generation)
ndtset 2
jdtset 1 2
timopt 2
# Dataset 1 : _DEN file generation (Density)
tolvrs1 1.0d-12
prtden1 1
nstep1 5 #5 for testing
iscf1 7
npulayit1 7
nband1 256
# Dataset 2 : _WFK file (Wavefunction)
tolwfr2 1.0d-12
nband2 4480
nbdbuf2 384
istwfk2 1
iscf2 7
nstep2 5 #5 for testing
getden2 1
# Options for Band/FFT Parallelism
paral_kgb 1
wfoptalg 14
nloalg 4
fftalg 401
iprcch 4
intxc 0
istwfk 1
fft_opt_lob 2
npfft 1
npband 16 #32 #64
# K-point mesh
kptopt 0
kpt 0.0 0.0 0.0
# System definition
# Unit cell
acell 1.9465690950E+01 1.9465690950E+01 1.9465690950E+01
rprim 1 0 0
0 1 0
0 0 1
# Atom types
ntypat 2
znucl 8 14
# Atoms and coordinates
natom 72
typat 48*1 24*2
xcart 1.8342971905E+01 1.0013093348E+01 4.9948115472E+00
1.8450118788E+01 5.1100335358E+00 1.1410341879E+01
3.0243029960E+00 1.7006888337E+01 1.0689037523E+01
6.3068666011E+00 1.4446482399E+01 7.9505060279E+00
1.9178811503E+01 4.4712567836E-01 3.4641995090E+00
9.5178783093E+00 1.2762912471E+01 1.4947329016E+01
1.7402433472E+01 4.7067303120E+00 1.5833402903E+00
1.0623164695E+01 2.7299953166E+00 8.7471659694E+00
1.2931573871E+01 1.8128981231E+01 6.7007362518E+00
1.8660924236E+01 1.4792395464E+01 3.1319031106E+00
7.0217232014E+00 6.3190579071E+00 2.1266991430E+00
4.1181163909E-01 5.0929210080E+00 5.7193503290E+00
7.6209880479E+00 1.5443775482E+00 6.1023412080E-01
1.7923134211E+01 9.4056919719E+00 1.3628670860E+01
1.4710748045E+01 9.1118601940E+00 1.7566857742E+01
1.0411995344E+01 1.0041061607E+00 1.5870123306E+01
1.0980496920E+01 1.3629862231E+01 6.8821852197E+00
1.2756648650E+01 9.3922889131E+00 1.2966781879E+01
1.3710153187E+01 2.2151381385E+00 1.9176017166E+01
5.9015247795E+00 1.8254646045E+01 1.6364133902E+01
3.5889689987E+00 8.6729161022E+00 4.9047876611E+00
1.4649631278E+01 1.1782133781E+01 2.4189697381E+00
1.3094524372E+01 1.5574388332E+01 1.1017906884E+01
1.8122798453E+00 1.5904671691E+01 1.5390374184E+01
8.0934509994E+00 9.9606459884E+00 5.6351418737E+00
1.0388873243E+01 1.1258002356E+01 1.9535431306E+01
1.7801695829E+01 1.5681701759E+01 1.1954743795E+01
2.9395289639E+00 3.6212308778E+00 1.4808160737E+00
1.3785141980E+01 3.1146153451E+00 4.7897808777E+00
6.6125694236E+00 3.8955369666E+00 1.1802613942E+01
1.0543336669E+00 8.8480531151E+00 9.2302571597E+00
1.5034376672E+01 1.3207034271E+01 1.5126390258E+01
1.9223920516E+01 6.5595988246E-01 1.3020475817E+01
6.6553078921E+00 5.1934209327E+00 6.9894256581E+00
1.4918361618E+01 3.1596212425E+00 1.4324193688E+01
8.0804273193E+00 7.9884008127E+00 1.4307619386E+01
5.7570753518E+00 1.3551949199E+01 1.8079850277E+01
1.2833144388E+01 6.9576781789E+00 1.8702339976E+00
2.7890960157E+00 1.7032376017E+01 6.7568473875E-01
8.6760457768E+00 1.0859908527E+01 1.0407253204E+01
6.3690907257E+00 2.2769273004E-01 8.2629069843E+00
1.4623475391E+01 1.7952319809E+01 1.5406783784E+01
1.5775821227E+01 1.3896960139E+01 6.9539101570E+00
1.5477566296E+01 1.7519166868E+00 9.5117606862E+00
3.1098755647E+00 7.2414373656E+00 1.3444441571E+01
2.9576688783E+00 1.7045648497E+01 5.6738016905E+00
9.7659864282E+00 1.6334927247E+01 1.8709494220E+01
5.6015780233E+00 4.6820174692E+00 1.6849684158E+01
1.3193293623E+01 1.6296954721E+00 7.4269549058E+00
4.0153579861E+00 1.6089810803E+01 1.7511617105E+01
1.5080013653E+01 1.5674902127E+01 1.3378910449E+01
1.3468917372E+01 1.2396666756E+00 1.6246453919E+01
3.1443097800E-01 3.4518653529E+00 3.0738155414E+00
1.6864054813E+01 1.2620177700E+01 4.3720388857E+00
1.3252228290E+01 1.5343974821E+01 7.9284144425E+00
8.4872425534E+00 1.9054897865E+01 1.7815010425E+01
1.5170448087E+01 1.0186883021E+01 1.4748027393E+01
7.5516402653E+00 3.0013700719E+00 8.9766200084E+00
6.4090355722E+00 7.4843741588E+00 4.9295671605E+00
4.3705611827E-01 7.6893073781E+00 1.2001555962E+01
8.4238741309E+00 1.2232714786E+01 7.6995337657E+00
5.8387974184E+00 5.9155119378E+00 1.4039991791E+01
1.3107235988E+01 9.8055489044E+00 6.4400593019E-01
8.3270647814E-01 1.7227458132E+01 1.2775664290E+01
1.4372625432E+01 4.2560000137E+00 1.9730406948E+00
5.7914453145E+00 4.0664955533E+00 3.9036518542E-01
9.7815513593E+00 1.0257448955E+01 1.3164763822E+01
1.6979663973E+01 2.5757556368E+00 1.2070399003E+01
9.1476280310E-01 8.1192625454E+00 6.2498371664E+00
8.8902943261E+00 1.3433615492E+01 1.7894990037E+01
4.7238007437E+00 1.7074503731E+01 8.1487422033E+00
1.1337419675E+00 1.7170180156E+01 3.2442179093E+00
# Energy cutoff for the planewaves
ecut 32.0
# Parameters for the SCF cycles
nstep 5
diemac 4.0
ixc 11
Here is the end of the log for the 72-atom cell with 4480 bands, run on 16
processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000 leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on
node node054 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Here is the end of the log file I got from the 108-atom cell (with 3200 bands),
run on 32 processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file as108nr_004o_DS2_WFK
-P-0000 leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1) [0x2b389ea74951]
[node090:26797] [ 2] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3) [0x2b389ea750c3]
[node090:26797] [ 3] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8) [0x2b389ea9d848]
[node090:26797] [ 4] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38) [0x2b389e828780]
[node090:26797] [ 5] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a) [0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------