Re: [abinit-forum] Band/FFT parallelism on large systems


  • From: TORRENT Marc <marc.torrent@cea.fr>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] Band/FFT parallelism on large systems
  • Date: Mon, 22 Jun 2009 11:58:28 +0200
  • Organization: CEA-DAM

Yes, David is right:
I implemented this for version 5.8: accesswff is automatically set to 1 when
kgb parallelism is requested.
Soon, other variables will be automatically set (fft_opt_lob, wfoptalg, ...).

Marc
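For reference, a minimal input fragment along these lines (a sketch, assuming
ABINIT 5.8 or later; the parallelism values mirror the illustrative ones in the
input file further below):

  paral_kgb   1     # enable band/FFT (kgb) parallelism
  npband      16    # band-parallel processes; npband*npfft should match the MPI processes for one k-point
  npfft       1     # FFT-parallel processes
  wfoptalg    14    # still set by hand in 5.8; automatic setting announced above
  # accesswff 1     # not needed: set to 1 automatically when paral_kgb is 1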

Emmanuel Arras wrote:
I thought it was not the case.
Perhaps it has been the case since 5.8?
But since you seem to be sure, my mistake then.


David Waroquiers wrote:
I'm pretty sure accesswff is automatically set to 1 when paral_kgb = 1, so
there is no need to specify it by hand.

On Fri, 2009-06-19 at 12:57 +0200, Emmanuel Arras wrote:
  
You should use
accesswff 1



David Waroquiers wrote:
    
Hello all,

I have tried to use the band/FFT parallelism on a large supercell (a-SiO2, 72
atoms and 108 atoms). I encountered a problem while using a lot of bands (4480
bands). It reaches convergence but crashes at the end of the run when it is
supposed to write the WFK file (outwf call). I tried to run the calculation
with 16, 32 and 64 processors.

I have tried with fewer bands (640) and it works.
Do you have any idea how to overcome this problem? The WFK file is supposed to
be 4 GB and the available memory on the clusters is more than that. By the way,
in the band/FFT parallelism approach, the memory for the WFK is split among the
different CPUs, isn't it?
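(A rough back-of-the-envelope estimate, assuming a single Gamma point with
istwfk 1 and neglecting the file header: the WFK stores about nband * npw
complex double-precision coefficients, i.e.

   size ~ nband * npw * 16 bytes ~ 4480 * 6.4e4 * 16 bytes ~ 4.6 GB,

where npw ~ 6.4e4 planewaves follows from ecut 32 Ha in the 19.47 Bohr cubic
cell below; this is consistent with the ~4 GB quoted here.)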

I encountered another problem while using cut3d to analyse the WFK generated
with the band/FFT parallelism. It does not recognise the file as a valid WFK
file (about the same message as when band/FFT parallelism did not allow
restarting with a different number of processors, before version 5.8 if I'm
right). Any idea about this as well?

My input file is given hereafter, and the log messages follow the input file.
I'm using public version 5.8.3, revision 485, and the machines used are the
"green" cluster at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes in Dell
Blade M1000e enclosures, with 16 GB (or 32 GB for some nodes) per 8-processor
node.

Thanks a lot

David Waroquiers
PhD Student
UCL - PCPM - ETSF




My input file:

# Amorphous SiO2 : Generation of the WFK file needed for the KSS (for GW
corrections)
# Dataset 1 : GS calculation (_DEN generation)
# Dataset 2 : GS calculation with many bands (_WFK generation)

  ndtset      2
  jdtset       1 2
  timopt      2

# Dataset 1 : _DEN file generation (Density)

  tolvrs1     1.0d-12
  prtden1    1
  nstep1     5 #5 for testing
  iscf1        7
  npulayit1  7
  nband1    256

# Dataset 2 : _WFK file (Wavefunction)

  tolwfr2     1.0d-12
  nband2     4480
  nbdbuf2    384
  istwfk2      1
  iscf2 	7
  nstep2      5 #5 for testing
  getden2    1

# Options for Band/FFT Parallelism

  paral_kgb   1
  wfoptalg    14
  nloalg      4
  fftalg      401
  iprcch      4
  intxc       0
  istwfk      1
  fft_opt_lob 2
  npfft       1
  npband      16 #32 #64

# K-point mesh

  kptopt      0
  kpt	      0.0 0.0 0.0

# System definition
# Unit cell

  acell       1.9465690950E+01	1.9465690950E+01  1.9465690950E+01
  rprim       1 0 0
	      0 1 0
	      0 0 1

# Atom types

  ntypat      2
  znucl       8 14

# Atoms and coordinates

  natom       72
  typat       48*1 24*2
  xcart       1.8342971905E+01	1.0013093348E+01  4.9948115472E+00
	      1.8450118788E+01	5.1100335358E+00  1.1410341879E+01
	      3.0243029960E+00	1.7006888337E+01  1.0689037523E+01
	      6.3068666011E+00	1.4446482399E+01  7.9505060279E+00
	      1.9178811503E+01	4.4712567836E-01  3.4641995090E+00
	      9.5178783093E+00	1.2762912471E+01  1.4947329016E+01
	      1.7402433472E+01	4.7067303120E+00  1.5833402903E+00
	      1.0623164695E+01	2.7299953166E+00  8.7471659694E+00
	      1.2931573871E+01	1.8128981231E+01  6.7007362518E+00
	      1.8660924236E+01	1.4792395464E+01  3.1319031106E+00
	      7.0217232014E+00	6.3190579071E+00  2.1266991430E+00
	      4.1181163909E-01	5.0929210080E+00  5.7193503290E+00
	      7.6209880479E+00	1.5443775482E+00  6.1023412080E-01
	      1.7923134211E+01	9.4056919719E+00  1.3628670860E+01
	      1.4710748045E+01	9.1118601940E+00  1.7566857742E+01
	      1.0411995344E+01	1.0041061607E+00  1.5870123306E+01
	      1.0980496920E+01	1.3629862231E+01  6.8821852197E+00
	      1.2756648650E+01	9.3922889131E+00  1.2966781879E+01
	      1.3710153187E+01	2.2151381385E+00  1.9176017166E+01
	      5.9015247795E+00	1.8254646045E+01  1.6364133902E+01
	      3.5889689987E+00	8.6729161022E+00  4.9047876611E+00
	      1.4649631278E+01	1.1782133781E+01  2.4189697381E+00
	      1.3094524372E+01	1.5574388332E+01  1.1017906884E+01
	      1.8122798453E+00	1.5904671691E+01  1.5390374184E+01
	      8.0934509994E+00	9.9606459884E+00  5.6351418737E+00
	      1.0388873243E+01	1.1258002356E+01  1.9535431306E+01
	      1.7801695829E+01	1.5681701759E+01  1.1954743795E+01
	      2.9395289639E+00	3.6212308778E+00  1.4808160737E+00
	      1.3785141980E+01	3.1146153451E+00  4.7897808777E+00
	      6.6125694236E+00	3.8955369666E+00  1.1802613942E+01
	      1.0543336669E+00	8.8480531151E+00  9.2302571597E+00
	      1.5034376672E+01	1.3207034271E+01  1.5126390258E+01
	      1.9223920516E+01	6.5595988246E-01  1.3020475817E+01
	      6.6553078921E+00	5.1934209327E+00  6.9894256581E+00
	      1.4918361618E+01	3.1596212425E+00  1.4324193688E+01
	      8.0804273193E+00	7.9884008127E+00  1.4307619386E+01
	      5.7570753518E+00	1.3551949199E+01  1.8079850277E+01
	      1.2833144388E+01	6.9576781789E+00  1.8702339976E+00
	      2.7890960157E+00	1.7032376017E+01  6.7568473875E-01
	      8.6760457768E+00	1.0859908527E+01  1.0407253204E+01
	      6.3690907257E+00	2.2769273004E-01  8.2629069843E+00
	      1.4623475391E+01	1.7952319809E+01  1.5406783784E+01
	      1.5775821227E+01	1.3896960139E+01  6.9539101570E+00
	      1.5477566296E+01	1.7519166868E+00  9.5117606862E+00
	      3.1098755647E+00	7.2414373656E+00  1.3444441571E+01
	      2.9576688783E+00	1.7045648497E+01  5.6738016905E+00
	      9.7659864282E+00	1.6334927247E+01  1.8709494220E+01
	      5.6015780233E+00	4.6820174692E+00  1.6849684158E+01
	      1.3193293623E+01	1.6296954721E+00  7.4269549058E+00
	      4.0153579861E+00	1.6089810803E+01  1.7511617105E+01
	      1.5080013653E+01	1.5674902127E+01  1.3378910449E+01
	      1.3468917372E+01	1.2396666756E+00  1.6246453919E+01
	      3.1443097800E-01	3.4518653529E+00  3.0738155414E+00
	      1.6864054813E+01	1.2620177700E+01  4.3720388857E+00
	      1.3252228290E+01	1.5343974821E+01  7.9284144425E+00
	      8.4872425534E+00	1.9054897865E+01  1.7815010425E+01
	      1.5170448087E+01	1.0186883021E+01  1.4748027393E+01
	      7.5516402653E+00	3.0013700719E+00  8.9766200084E+00
	      6.4090355722E+00	7.4843741588E+00  4.9295671605E+00
	      4.3705611827E-01	7.6893073781E+00  1.2001555962E+01
	      8.4238741309E+00	1.2232714786E+01  7.6995337657E+00
	      5.8387974184E+00	5.9155119378E+00  1.4039991791E+01
	      1.3107235988E+01	9.8055489044E+00  6.4400593019E-01
	      8.3270647814E-01	1.7227458132E+01  1.2775664290E+01
	      1.4372625432E+01	4.2560000137E+00  1.9730406948E+00
	      5.7914453145E+00	4.0664955533E+00  3.9036518542E-01
	      9.7815513593E+00	1.0257448955E+01  1.3164763822E+01
	      1.6979663973E+01	2.5757556368E+00  1.2070399003E+01
	      9.1476280310E-01	8.1192625454E+00  6.2498371664E+00
	      8.8902943261E+00	1.3433615492E+01  1.7894990037E+01
	      4.7238007437E+00	1.7074503731E+01  8.1487422033E+00
	      1.1337419675E+00	1.7170180156E+01  3.2442179093E+00

# Energy cutoff for the planewaves

  ecut	      32.0

# Parameters for the SCF cycles

  nstep       5
  diemac      4.0
  ixc	      11






Here is the end of the message log for the 72-atom cell with 4480 bands, run on
16 processors:

================================================================================

 ----iterations are completed or convergence reached----

 outwf	: write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000  leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on
node node054 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt /
mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
/ error messages





Here is the end of the log file I got from the 108-atom cell (with 3200 bands),
run on 32 processors:

================================================================================

 ----iterations are completed or convergence reached----

 outwf	: write wavefunction to file as108nr_004o_DS2_WFK
-P-0000  leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1)
[0x2b389ea74951]
[node090:26797] [ 2]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3)
[0x2b389ea750c3]
[node090:26797] [ 3]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8)
[0x2b389ea9d848]
[node090:26797] [ 4]
/cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38)
[0x2b389e828780]
[node090:26797] [ 5]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11]
/home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a)
[0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

  
      


  

-- 
Emmanuel ARRAS
L_Sim (Laboratoire de Simulation Atomistique)
SP2M / INAC
CEA Grenoble
tel : 00 33 (0)4 387 86862
  



