The ABINIT Users Mailing List ( CLOSED )
- From: TORRENT Marc <marc.torrent@cea.fr>
- To: forum@abinit.org
- Subject: Re: [abinit-forum] Band/FFT parallelism on large systems
- Date: Mon, 22 Jun 2009 11:58:28 +0200
- Organization: CEA-DAM
Yes, David is right:
I implemented this for 5.8;
accesswff is set to 1 when kgb parallelism is requested.
Soon, other variables will be set automatically as well (fft_opt_lob, wfoptalg,
...).
Marc
Emmanuel Arras wrote:
I thought that was not the case.
Perhaps it has been since 5.8?
But since you seem to be sure, my mistake then.
David Waroquiers wrote:
I'm pretty sure accesswff is automatically set to 1 when paral_kgb = 1,
so it is not necessary to specify it by hand.
On Fri, 2009-06-19 at 12:57 +0200, Emmanuel Arras wrote:
you should use
accesswff 1
David Waroquiers wrote:
Hello all,
I have tried to use the band/FFT parallelism on a large supercell (a-SiO2, 72
atoms and 108 atoms). I encountered a problem when using a large number of bands
(4480 bands). The calculation reaches convergence but crashes at the end of the
run, when it is supposed to write the WFK file (in the outwf call). I tried to
run the calculation with 16, 32 and 64 processors.
I have tried with fewer bands (640) and it works.
Do you have any idea how to overcome this problem? The WFK file is expected to
be 4 GB and the available memory on the cluster is more than that. By the way,
in the band/FFT parallelism approach, the memory for the wavefunctions is split
across the different CPUs, isn't it?
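[Editor's note] The 4 GB figure can be sanity-checked from the input file below (ecut 32 Ha, acell 19.466 bohr, nband 4480). The plane-wave count is estimated here with the standard free-electron sphere formula npw ≈ V(2 ecut)^(3/2)/(6π²); this estimate and the per-rank split are an illustration added by the editor, not taken from the original message.

```python
import math

# Values taken from the input file below (atomic units: Hartree, bohr).
ecut = 32.0            # plane-wave cutoff (Ha)
a = 19.465690950       # cubic cell edge (bohr)
nband = 4480           # nband2 in dataset 2

# Rough plane-wave count at the Gamma point:
# npw ~ V * (2*ecut)^(3/2) / (6*pi^2)   (free-electron sphere estimate)
volume = a ** 3
npw = volume * (2 * ecut) ** 1.5 / (6 * math.pi ** 2)

# Each coefficient is a double-precision complex number (16 bytes).
wfk_bytes = npw * nband * 16
print(f"npw ~ {npw:.0f}")
print(f"WFK size ~ {wfk_bytes / 1e9:.1f} GB")
print(f"per-rank share with npband 16 ~ {wfk_bytes / 16 / 1e9:.2f} GB")
```

This gives roughly 64,000 plane waves and about 4.6 GB for the full WFK file, consistent with the ~4 GB quoted above; split over 16 band processors, each rank holds well under 1 GB of wavefunction data.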
I encountered another problem when using cut3d to analyse the WFK file generated
with the band/FFT parallelism. It does not recognise the file as a valid WFK
file (roughly the same message as when band/FFT parallelism did not allow
restarting with a different number of processors, before version 5.8 if I'm
right). Any idea about this too?
My input file is below, and the log messages follow the input file. I'm
using public version 5.8.3, revision 485, and the machines used are the "green"
cluster at UCL: 102 dual quad-core Xeon L5420/2.5 GHz in a Dell Blade M1000e,
with 16 GB (or 32 GB for some nodes) per node of 8 processors.
Thanks a lot
David Waroquiers
PhD Student
UCL - PCPM - ETSF
My input file:
# Amorphous SiO2 : Generation of the WFK file needed for the KSS (for GW
corrections)
# Dataset 1 : GS calculation (_DEN generation)
# Dataset 2 : GS calculation with many bands (_WFK generation)
ndtset 2
jdtset 1 2
timopt 2
# Dataset 1 : _DEN file generation (Density)
tolvrs1 1.0d-12
prtden1 1
nstep1 5 #5 for testing
iscf1 7
npulayit1 7
nband1 256
# Dataset 2 : _WFK file (Wavefunction)
tolwfr2 1.0d-12
nband2 4480
nbdbuf2 384
istwfk2 1
iscf2 7
nstep2 5 #5 for testing
getden2 1
# Options for Band/FFT Parallelism
paral_kgb 1
wfoptalg 14
nloalg 4
fftalg 401
iprcch 4
intxc 0
istwfk 1
fft_opt_lob 2
npfft 1
npband 16 #32 #64
# K-point mesh
kptopt 0
kpt 0.0 0.0 0.0
# System definition
# Unit cell
acell 1.9465690950E+01 1.9465690950E+01 1.9465690950E+01
rprim 1 0 0
0 1 0
0 0 1
# Atom types
ntypat 2
znucl 8 14
# Atoms and coordinates
natom 72
typat 48*1 24*2
xcart 1.8342971905E+01 1.0013093348E+01 4.9948115472E+00
1.8450118788E+01 5.1100335358E+00 1.1410341879E+01
3.0243029960E+00 1.7006888337E+01 1.0689037523E+01
6.3068666011E+00 1.4446482399E+01 7.9505060279E+00
1.9178811503E+01 4.4712567836E-01 3.4641995090E+00
9.5178783093E+00 1.2762912471E+01 1.4947329016E+01
1.7402433472E+01 4.7067303120E+00 1.5833402903E+00
1.0623164695E+01 2.7299953166E+00 8.7471659694E+00
1.2931573871E+01 1.8128981231E+01 6.7007362518E+00
1.8660924236E+01 1.4792395464E+01 3.1319031106E+00
7.0217232014E+00 6.3190579071E+00 2.1266991430E+00
4.1181163909E-01 5.0929210080E+00 5.7193503290E+00
7.6209880479E+00 1.5443775482E+00 6.1023412080E-01
1.7923134211E+01 9.4056919719E+00 1.3628670860E+01
1.4710748045E+01 9.1118601940E+00 1.7566857742E+01
1.0411995344E+01 1.0041061607E+00 1.5870123306E+01
1.0980496920E+01 1.3629862231E+01 6.8821852197E+00
1.2756648650E+01 9.3922889131E+00 1.2966781879E+01
1.3710153187E+01 2.2151381385E+00 1.9176017166E+01
5.9015247795E+00 1.8254646045E+01 1.6364133902E+01
3.5889689987E+00 8.6729161022E+00 4.9047876611E+00
1.4649631278E+01 1.1782133781E+01 2.4189697381E+00
1.3094524372E+01 1.5574388332E+01 1.1017906884E+01
1.8122798453E+00 1.5904671691E+01 1.5390374184E+01
8.0934509994E+00 9.9606459884E+00 5.6351418737E+00
1.0388873243E+01 1.1258002356E+01 1.9535431306E+01
1.7801695829E+01 1.5681701759E+01 1.1954743795E+01
2.9395289639E+00 3.6212308778E+00 1.4808160737E+00
1.3785141980E+01 3.1146153451E+00 4.7897808777E+00
6.6125694236E+00 3.8955369666E+00 1.1802613942E+01
1.0543336669E+00 8.8480531151E+00 9.2302571597E+00
1.5034376672E+01 1.3207034271E+01 1.5126390258E+01
1.9223920516E+01 6.5595988246E-01 1.3020475817E+01
6.6553078921E+00 5.1934209327E+00 6.9894256581E+00
1.4918361618E+01 3.1596212425E+00 1.4324193688E+01
8.0804273193E+00 7.9884008127E+00 1.4307619386E+01
5.7570753518E+00 1.3551949199E+01 1.8079850277E+01
1.2833144388E+01 6.9576781789E+00 1.8702339976E+00
2.7890960157E+00 1.7032376017E+01 6.7568473875E-01
8.6760457768E+00 1.0859908527E+01 1.0407253204E+01
6.3690907257E+00 2.2769273004E-01 8.2629069843E+00
1.4623475391E+01 1.7952319809E+01 1.5406783784E+01
1.5775821227E+01 1.3896960139E+01 6.9539101570E+00
1.5477566296E+01 1.7519166868E+00 9.5117606862E+00
3.1098755647E+00 7.2414373656E+00 1.3444441571E+01
2.9576688783E+00 1.7045648497E+01 5.6738016905E+00
9.7659864282E+00 1.6334927247E+01 1.8709494220E+01
5.6015780233E+00 4.6820174692E+00 1.6849684158E+01
1.3193293623E+01 1.6296954721E+00 7.4269549058E+00
4.0153579861E+00 1.6089810803E+01 1.7511617105E+01
1.5080013653E+01 1.5674902127E+01 1.3378910449E+01
1.3468917372E+01 1.2396666756E+00 1.6246453919E+01
3.1443097800E-01 3.4518653529E+00 3.0738155414E+00
1.6864054813E+01 1.2620177700E+01 4.3720388857E+00
1.3252228290E+01 1.5343974821E+01 7.9284144425E+00
8.4872425534E+00 1.9054897865E+01 1.7815010425E+01
1.5170448087E+01 1.0186883021E+01 1.4748027393E+01
7.5516402653E+00 3.0013700719E+00 8.9766200084E+00
6.4090355722E+00 7.4843741588E+00 4.9295671605E+00
4.3705611827E-01 7.6893073781E+00 1.2001555962E+01
8.4238741309E+00 1.2232714786E+01 7.6995337657E+00
5.8387974184E+00 5.9155119378E+00 1.4039991791E+01
1.3107235988E+01 9.8055489044E+00 6.4400593019E-01
8.3270647814E-01 1.7227458132E+01 1.2775664290E+01
1.4372625432E+01 4.2560000137E+00 1.9730406948E+00
5.7914453145E+00 4.0664955533E+00 3.9036518542E-01
9.7815513593E+00 1.0257448955E+01 1.3164763822E+01
1.6979663973E+01 2.5757556368E+00 1.2070399003E+01
9.1476280310E-01 8.1192625454E+00 6.2498371664E+00
8.8902943261E+00 1.3433615492E+01 1.7894990037E+01
4.7238007437E+00 1.7074503731E+01 8.1487422033E+00
1.1337419675E+00 1.7170180156E+01 3.2442179093E+00
# Energy cutoff for the planewaves
ecut 32.0
# Parameters for the SCF cycles
nstep 5
diemac 4.0
ixc 11
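[Editor's note] A quick consistency check on the parallel settings above, added by the editor: with paral_kgb 1, the number of bands must divide evenly over the band processors, which holds for all three npband values tried in the input (16, 32, 64).

```python
# Sanity check (not part of the original input): with paral_kgb 1,
# ABINIT distributes the nband bands in equal blocks over npband processes.
nband = 4480                      # nband2 in the input above
for npband in (16, 32, 64):       # the three values tried (npband 16 #32 #64)
    assert nband % npband == 0, f"nband must be a multiple of npband={npband}"
    print(f"npband {npband:2d}: {nband // npband} bands per process")
```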
Here is the end of the message log for the 72-atom cell with 4480 bands, run on
16 processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000 leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on
node node054 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt /
mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
/ error messages
Here is the end of the log file I got from the 108-atom cell (with 3200 bands),
run on 32 processors:
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file as108nr_004o_DS2_WFK
-P-0000 leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1)
[0x2b389ea74951]
[node090:26797] [ 2]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3)
[0x2b389ea750c3]
[node090:26797] [ 3]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8)
[0x2b389ea9d848]
[node090:26797] [ 4]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38)
[0x2b389e828780]
[node090:26797] [ 5]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a)
[0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
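[Editor's note] The second backtrace dies inside MPI_Type_create_indexed_block, called from wffwritecg while writing the wavefunctions. One plausible suspect, added here by the editor and not confirmed anywhere in the thread, is a 32-bit integer overflow: MPI-2 datatype constructors take C int counts and block lengths, so any aggregate size near 2**31 - 1 is dangerous. The sketch below checks, using the same rough plane-wave estimate as earlier, which aggregate quantities of the 4480-band run would overflow a 32-bit int.

```python
import math

# Hypothetical diagnostic (not from the original thread): MPI-2 datatype
# constructors such as MPI_Type_create_indexed_block take 32-bit int
# arguments, so aggregate counts near 2**31 - 1 are suspect.
INT_MAX = 2**31 - 1

ecut, a, nband = 32.0, 19.465690950, 4480
# Rough Gamma-point plane-wave count (free-electron sphere estimate).
npw = int(a**3 * (2 * ecut)**1.5 / (6 * math.pi**2))

n_complex = npw * nband          # complex coefficients in the full WFK
n_doubles = 2 * n_complex        # real scalars
n_bytes = 8 * n_doubles          # bytes on disk

for label, n in (("complex", n_complex), ("doubles", n_doubles), ("bytes", n_bytes)):
    flag = "OVERFLOWS 32-bit int" if n > INT_MAX else "fits"
    print(f"{label:8s}: {n:>13,d}  ({flag})")
```

The element counts fit, but the total byte count (~4.6 GB) exceeds INT_MAX; with 640 bands every quantity fits comfortably, which would be consistent with the observation above that the smaller run succeeds. Whether this is the actual cause cannot be determined from the log alone.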
--
Emmanuel ARRAS
L_Sim (Laboratoire de Simulation Atomistique)
SP2M / INAC
CEA Grenoble
tel: 00 33 (0)4 387 86862