forum@abinit.org
Subject: The ABINIT Users Mailing List (CLOSED)
List archive
- From: Emmanuel Arras <emmanuel.arras@cea.fr>
- To: forum@abinit.org
- Subject: Re: [abinit-forum] Band/FFT parallelism on large systems
- Date: Fri, 19 Jun 2009 13:31:04 +0200
I thought it was not the case. Perhaps it has been since 5.8? But since you seem to be sure, my mistake then.

David Waroquiers wrote:

I'm pretty sure it is automatically set to accesswff 1 when paral_kgb = 1, so it is not needed to specify it by hand.

On Fri, 2009-06-19 at 12:57 +0200, Emmanuel Arras wrote:

You should use accesswff 1.

David Waroquiers wrote:

Hello all,

I have tried to use the band/FFT parallelism on a large supercell (a-SiO2, 72 atoms and 108 atoms). I ran into a problem when using many bands (4480): the calculation reaches convergence but crashes at the end of the run, when it is supposed to write the WFK file (in the outwf call). I tried the calculation with 16, 32 and 64 processors. With fewer bands (640) it works. Do you have any idea how to overcome this problem? The WFK file should be about 4 GB, and the available memory on the clusters is more than that. By the way, in the band/FFT parallelism approach, the memory for the wavefunctions is split among the different CPUs, isn't it?

I ran into another problem when using cut3d to analyse the WFK file generated with the band/FFT parallelism: it does not recognise the file as a valid WFK file (roughly the same message as when band/FFT parallelism did not allow restarting with a different number of processors, before version 5.8 if I'm right). Any idea on this too?

My input file is below, and the log messages follow it. I am using public version 5.8.3, revision 485. The machines are the "green" clusters at UCL: 102 dual quad-core Xeon L5420/2.5 GHz nodes in a Dell Blade M1000e, with 16 GB (or 32 GB for some nodes) per node of 8 processors.
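[Editor's note] The ~4 GB figure can be sanity-checked with a rough free-electron estimate of the plane-wave count, using ecut and acell from the input file below. This is a back-of-the-envelope sketch, not ABINIT's exact counting, and `wfk_size_estimate` is a hypothetical helper name:

```python
import math

def wfk_size_estimate(ecut_ha, volume_bohr3, nband, nkpt=1, nsppol=1):
    """Rough WFK size in bytes: each band stores ~npw complex doubles
    (16 bytes) per k-point and spin; npw is estimated from the volume
    of the kinetic-energy sphere, npw ~ V/(6*pi^2) * (2*Ecut)^(3/2)."""
    npw = volume_bohr3 / (6.0 * math.pi ** 2) * (2.0 * ecut_ha) ** 1.5
    return nkpt * nsppol * nband * npw * 16.0

# Values from the input file below: ecut 32 Ha, acell ~19.47 bohr, nband2 4480
acell = 1.9465690950e+01
total = wfk_size_estimate(ecut_ha=32.0, volume_bohr3=acell ** 3, nband=4480)
print(f"estimated WFK size: {total / 1e9:.1f} GB")   # ~4.6 GB, consistent with the ~4 GB quoted
print(f"per process (npband 16): {total / 16 / 1e6:.0f} MB")
```

Split over 16 band processors that is under 300 MB each, so the crash is unlikely to be a simple out-of-memory condition on these 16 GB nodes.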
Thanks a lot,

David Waroquiers
PhD Student
UCL - PCPM - ETSF

My input file:

# Amorphous SiO2 : Generation of the WFK file needed for the KSS (for GW corrections)
# Dataset 1 : GS calculation (_DEN generation)
# Dataset 2 : GS calculation with many bands (_WFK generation)

ndtset 2
jdtset 1 2
timopt 2

# Dataset 1 : _DEN file generation (Density)
tolvrs1   1.0d-12
prtden1   1
nstep1    5   # 5 for testing
iscf1     7
npulayit1 7
nband1    256

# Dataset 2 : _WFK file (Wavefunction)
tolwfr2 1.0d-12
nband2  4480
nbdbuf2 384
istwfk2 1
iscf2   7
nstep2  5   # 5 for testing
getden2 1

# Options for Band/FFT parallelism
paral_kgb   1
wfoptalg    14
nloalg      4
fftalg      401
iprcch      4
intxc       0
istwfk      1
fft_opt_lob 2
npfft       1
npband      16   # 32 # 64

# K-point mesh
kptopt 0
kpt    0.0 0.0 0.0

# System definition
# Unit cell
acell 1.9465690950E+01 1.9465690950E+01 1.9465690950E+01
rprim 1 0 0
      0 1 0
      0 0 1

# Atom types
ntypat 2
znucl  8 14

# Atoms and coordinates
natom 72
typat 48*1 24*2
xcart
1.8342971905E+01 1.0013093348E+01 4.9948115472E+00
1.8450118788E+01 5.1100335358E+00 1.1410341879E+01
3.0243029960E+00 1.7006888337E+01 1.0689037523E+01
6.3068666011E+00 1.4446482399E+01 7.9505060279E+00
1.9178811503E+01 4.4712567836E-01 3.4641995090E+00
9.5178783093E+00 1.2762912471E+01 1.4947329016E+01
1.7402433472E+01 4.7067303120E+00 1.5833402903E+00
1.0623164695E+01 2.7299953166E+00 8.7471659694E+00
1.2931573871E+01 1.8128981231E+01 6.7007362518E+00
1.8660924236E+01 1.4792395464E+01 3.1319031106E+00
7.0217232014E+00 6.3190579071E+00 2.1266991430E+00
4.1181163909E-01 5.0929210080E+00 5.7193503290E+00
7.6209880479E+00 1.5443775482E+00 6.1023412080E-01
1.7923134211E+01 9.4056919719E+00 1.3628670860E+01
1.4710748045E+01 9.1118601940E+00 1.7566857742E+01
1.0411995344E+01 1.0041061607E+00 1.5870123306E+01
1.0980496920E+01 1.3629862231E+01 6.8821852197E+00
1.2756648650E+01 9.3922889131E+00 1.2966781879E+01
1.3710153187E+01 2.2151381385E+00 1.9176017166E+01
5.9015247795E+00 1.8254646045E+01 1.6364133902E+01
3.5889689987E+00 8.6729161022E+00 4.9047876611E+00
1.4649631278E+01 1.1782133781E+01 2.4189697381E+00
1.3094524372E+01 1.5574388332E+01 1.1017906884E+01
1.8122798453E+00 1.5904671691E+01 1.5390374184E+01
8.0934509994E+00 9.9606459884E+00 5.6351418737E+00
1.0388873243E+01 1.1258002356E+01 1.9535431306E+01
1.7801695829E+01 1.5681701759E+01 1.1954743795E+01
2.9395289639E+00 3.6212308778E+00 1.4808160737E+00
1.3785141980E+01 3.1146153451E+00 4.7897808777E+00
6.6125694236E+00 3.8955369666E+00 1.1802613942E+01
1.0543336669E+00 8.8480531151E+00 9.2302571597E+00
1.5034376672E+01 1.3207034271E+01 1.5126390258E+01
1.9223920516E+01 6.5595988246E-01 1.3020475817E+01
6.6553078921E+00 5.1934209327E+00 6.9894256581E+00
1.4918361618E+01 3.1596212425E+00 1.4324193688E+01
8.0804273193E+00 7.9884008127E+00 1.4307619386E+01
5.7570753518E+00 1.3551949199E+01 1.8079850277E+01
1.2833144388E+01 6.9576781789E+00 1.8702339976E+00
2.7890960157E+00 1.7032376017E+01 6.7568473875E-01
8.6760457768E+00 1.0859908527E+01 1.0407253204E+01
6.3690907257E+00 2.2769273004E-01 8.2629069843E+00
1.4623475391E+01 1.7952319809E+01 1.5406783784E+01
1.5775821227E+01 1.3896960139E+01 6.9539101570E+00
1.5477566296E+01 1.7519166868E+00 9.5117606862E+00
3.1098755647E+00 7.2414373656E+00 1.3444441571E+01
2.9576688783E+00 1.7045648497E+01 5.6738016905E+00
9.7659864282E+00 1.6334927247E+01 1.8709494220E+01
5.6015780233E+00 4.6820174692E+00 1.6849684158E+01
1.3193293623E+01 1.6296954721E+00 7.4269549058E+00
4.0153579861E+00 1.6089810803E+01 1.7511617105E+01
1.5080013653E+01 1.5674902127E+01 1.3378910449E+01
1.3468917372E+01 1.2396666756E+00 1.6246453919E+01
3.1443097800E-01 3.4518653529E+00 3.0738155414E+00
1.6864054813E+01 1.2620177700E+01 4.3720388857E+00
1.3252228290E+01 1.5343974821E+01 7.9284144425E+00
8.4872425534E+00 1.9054897865E+01 1.7815010425E+01
1.5170448087E+01 1.0186883021E+01 1.4748027393E+01
7.5516402653E+00 3.0013700719E+00 8.9766200084E+00
6.4090355722E+00 7.4843741588E+00 4.9295671605E+00
4.3705611827E-01 7.6893073781E+00 1.2001555962E+01
8.4238741309E+00 1.2232714786E+01 7.6995337657E+00
5.8387974184E+00 5.9155119378E+00 1.4039991791E+01
1.3107235988E+01 9.8055489044E+00 6.4400593019E-01
8.3270647814E-01 1.7227458132E+01 1.2775664290E+01
1.4372625432E+01 4.2560000137E+00 1.9730406948E+00
5.7914453145E+00 4.0664955533E+00 3.9036518542E-01
9.7815513593E+00 1.0257448955E+01 1.3164763822E+01
1.6979663973E+01 2.5757556368E+00 1.2070399003E+01
9.1476280310E-01 8.1192625454E+00 6.2498371664E+00
8.8902943261E+00 1.3433615492E+01 1.7894990037E+01
4.7238007437E+00 1.7074503731E+01 8.1487422033E+00
1.1337419675E+00 1.7170180156E+01 3.2442179093E+00

# Energy cutoff for the planewaves
ecut 32.0

# Parameters for the SCF cycles
nstep  5
diemac 4.0
ixc    11

Here is the end of the message log for the 72-atom cell with 4480 bands, run on 16 processors:

================================================================================
----iterations are completed or convergence reached----

outwf : write wavefunction to file asio2test_para16_4096o_DS2_WFK
-P-0000 leave_test : synchronization done...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 6159 on node node054
exiting without calling "finalize". This may have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--------------------------------------------------------------------------
[green:29817] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Here is the end of the log file I got from the 108-atom cell (with 3200 bands), run on 32 processors:

================================================================================
----iterations are completed or convergence reached----

outwf : write wavefunction to file as108nr_004o_DS2_WFK
-P-0000 leave_test : synchronization done...
[node090:26797] *** Process received signal ***
[node090:26797] Signal: Segmentation fault (11)
[node090:26797] Signal code: Address not mapped (1)
[node090:26797] Failing at address: 0x2
[node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
[node090:26797] [ 1] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1) [0x2b389ea74951]
[node090:26797] [ 2] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3) [0x2b389ea750c3]
[node090:26797] [ 3] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8) [0x2b389ea9d848]
[node090:26797] [ 4] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38) [0x2b389e828780]
[node090:26797] [ 5] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
[node090:26797] [ 6] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
[node090:26797] [ 7] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
[node090:26797] [ 8] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
[node090:26797] [ 9] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
[node090:26797] [10] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
[node090:26797] [11] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
[node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a) [0x43f642]
[node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
[node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
[node090:26797] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 26797 on node node090 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

--
Emmanuel ARRAS
L_Sim (Laboratoire de Simulation Atomistique)
SP2M / INAC
CEA Grenoble
tel : 00 33 (0)4 387 86862
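[Editor's note] On the question of whether the wavefunction memory is split across CPUs: in the band-parallel scheme each of the npband processes holds only its own subset of the nband bands. A minimal sketch of such a block distribution, purely illustrative (ABINIT's actual internal layout, which also involves bandpp blocking, may differ; `distribute_bands` is a hypothetical name):

```python
def distribute_bands(nband, npband):
    """Split band indices 0..nband-1 into npband contiguous blocks,
    spreading any remainder over the first few processes."""
    base, extra = divmod(nband, npband)
    blocks, start = [], 0
    for rank in range(npband):
        n = base + (1 if rank < extra else 0)  # first `extra` ranks take one more band
        blocks.append(range(start, start + n))
        start += n
    return blocks

# nband2 4480 over npband 16: each process holds 280 bands,
# hence roughly 1/16 of the wavefunction memory.
blocks = distribute_bands(4480, 16)
print([len(b) for b in blocks][:4])  # [280, 280, 280, 280]
```

This is why the per-process memory stays modest even for a 4 GB WFK file; the write step, however, must gather or stream these blocks to a single file, which is where both crashes above occur.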
- [abinit-forum] Band/FFT parallelism on large systems, David Waroquiers, 06/19/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, Emmanuel Arras, 06/19/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, David Waroquiers, 06/19/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, Emmanuel Arras, 06/19/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, TORRENT Marc, 06/22/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, Emmanuel Arras, 06/19/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, David Waroquiers, 06/19/2009
- RE : [abinit-forum] Band/FFT parallelism on large systems, Marc.TORRENT, 06/20/2009
- Re: RE : [abinit-forum] Band/FFT parallelism on large systems, TORRENT Marc, 06/22/2009
- Re: RE : [abinit-forum] Band/FFT parallelism on large systems, David Waroquiers, 06/22/2009
- Re: RE : [abinit-forum] Band/FFT parallelism on large systems, David Waroquiers, 06/24/2009
- Re: RE : [abinit-forum] Band/FFT parallelism on large systems, DELAVEAU Muriel, 06/24/2009
- Re: RE : [abinit-forum] Band/FFT parallelism on large systems, TORRENT Marc, 06/22/2009
- Re: [abinit-forum] Band/FFT parallelism on large systems, Emmanuel Arras, 06/19/2009
Archive powered by MHonArc 2.6.16.