Subject: The ABINIT Users Mailing List ( CLOSED )
- From: David Waroquiers <david.waroquiers@uclouvain.be>
- To: forum@abinit.org
- Subject: Re: RE : [abinit-forum] Band/FFT parallelism on large systems
- Date: Wed, 24 Jun 2009 09:22:37 +0200
Hello,
I tried the latest 5.9.0 (revision 485) and it crashed with 8 CPUs,
even with few bands (nband 256). Note that I decreased the number of
SCF steps to speed up the test, but that does not matter here.
The end of the log file is at the end of this message.
Any suggestions?
David
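
A quick consistency check of the process grid can rule out a trivial
mismatch before digging into the crash. The sketch below assumes the
usual paral_kgb constraints (the product npkpt*npband*npfft*npspinor
equals the number of MPI processes, and nband is a multiple of npband);
other constraints, such as those involving bandpp, are not checked. The
function name check_band_fft_grid is hypothetical, and the npband value
used for the 8-CPU run is an assumption, since only the 16-process
input is quoted in this thread.

```python
def check_band_fft_grid(nproc, nband, npband, npfft, npkpt=1, npspinor=1):
    """Return a list of problems; an empty list means the grid looks consistent.

    Assumed rules (to be checked against the ABINIT documentation):
      - npkpt * npband * npfft * npspinor == number of MPI processes
      - nband is a multiple of npband
    """
    problems = []
    grid = npkpt * npband * npfft * npspinor
    if grid != nproc:
        problems.append(
            f"grid product is {grid}, but {nproc} MPI processes were launched"
        )
    if nband % npband != 0:
        problems.append(f"nband={nband} is not a multiple of npband={npband}")
    return problems


# Quoted input (npband 16, npfft 1) reused unchanged on 8 processes:
print(check_band_fft_grid(nproc=8, nband=256, npband=16, npfft=1))
# Hypothetical adjusted grid for 8 processes:
print(check_band_fft_grid(nproc=8, nband=256, npband=8, npfft=1))
```

An empty list only says that the grid is self-consistent; it does not
explain a segmentation fault inside the MPI-IO write.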
scprqt: WARNING -
nstep= 5 was not enough SCF cycles to converge;
potential residual= 4.649E+00 exceeds tolvrs= 1.000E-12
ioarr: writing density data
ioarr: file name is asio2test_para8o_DS1_DEN
ioarr: data written to disk file asio2test_para8o_DS1_DEN
-P-0000 leave_test : synchronization done...
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file asio2test_para8o_DS1_WFK
-P-0000 leave_test : synchronization done...
[node008:27384] *** Process received signal ***
[node008:27385] *** Process received signal ***
[node008:27385] Signal: Segmentation fault (11)
[node008:27385] Signal code: Address not mapped (1)
[node008:27385] Failing at address: 0xc2c37e54
[node008:27389] *** Process received signal ***
[node008:27389] Signal: Segmentation fault (11)
[node008:27389] Signal code: Address not mapped (1)
[node008:27389] Failing at address: 0xb2ee4574
[node008:27390] *** Process received signal ***
[node008:27390] Signal: Segmentation fault (11)
[node008:27390] Signal code: Address not mapped (1)
[node008:27390] Failing at address: 0xb83e1914
[node008:27391] *** Process received signal ***
[node008:27391] Signal: Segmentation fault (11)
[node008:27391] Signal code: Address not mapped (1)
[node008:27391] Failing at address: 0xb83c32e4
[node008:27386] *** Process received signal ***
[node008:27386] Signal: Segmentation fault (11)
[node008:27386] Signal code: Address not mapped (1)
[node008:27386] Failing at address: 0xc9cb7dd4
[node008:27387] *** Process received signal ***
[node008:27387] Signal: Segmentation fault (11)
[node008:27387] Signal code: Address not mapped (1)
[node008:27387] Failing at address: 0xb67bd5a4
[node008:27388] *** Process received signal ***
[node008:27388] Signal: Segmentation fault (11)
[node008:27388] Signal code: Address not mapped (1)
[node008:27388] Failing at address: 0xc2f74d94
[node008:27384] Signal: Segmentation fault (11)
[node008:27384] Signal code: Address not mapped (1)
[node008:27384] Failing at address: 0xbdd70364
[node008:27389] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27389] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27389] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27389] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27389] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27389] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27389] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27389] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27389] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27389] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27389] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27389] *** End of error message ***
[node008:27388] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27388] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27388] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27388] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27388] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27388] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27388] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27388] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27388] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27388] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27388] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27388] *** End of error message ***
[node008:27391] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27391] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27391] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27391] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27391] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27391] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27391] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27391] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27391] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27391] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27391] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27391] *** End of error message ***
[node008:27390] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27390] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27390] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27390] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27390] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27390] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27390] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27390] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27390] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27390] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27390] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27390] *** End of error message ***
[node008:27386] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27386] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27386] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27386] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27386] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27386] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27386] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27386] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27386] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27386] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27386] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27386] *** End of error message ***
[node008:27385] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27385] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27385] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27385] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27385] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27385] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27385] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27385] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27385] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27385] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27385] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27385] *** End of error message ***
[node008:27387] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27384] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27384] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27384] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27384] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27384] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27384] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27384] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27387] [ 1] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939) [0x148614f]
[node008:27384] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27384] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27384] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27384] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27384] *** End of error message ***
[node008:27387] [ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_+0xe96) [0x11fe522]
[node008:27387] [ 3] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(rwwf_+0x4e36) [0x11fd682]
[node008:27387] [ 4] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(outwf_+0x2983) [0x62d93b]
[node008:27387] [ 5] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(gstate_+0x16f5e) [0x46deae]
[node008:27387] [ 6] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(driver_+0xaec4) [0x451a64]
[node008:27387] [ 7] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(MAIN__+0x2f7d) [0x44348d]
[node008:27387] [ 8] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(main+0x2a) [0x440502]
[node008:27387] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3db861d974]
[node008:27387] [10] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(mpi_bcast_+0x49) [0x440429]
[node008:27387] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 27390 on node node008 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
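
Since the traces from the eight processes are interleaved, it can help
to condense them mechanically. The following Python sketch is not part
of ABINIT or Open MPI; the names summarize_backtraces, SIGNAL_RE and
SYMBOL_RE are made up for illustration. It lists which processes
reported a signal and counts how often each symbol appears in the
frames, tolerating frames that were wrapped across lines. The sample
text is copied from the log above.

```python
import re
from collections import Counter

# "[host:pid] Signal: <description>" (does not match "Signal code: ...")
SIGNAL_RE = re.compile(r"\[(\w+):(\d+)\] Signal: (.+)")
# "(symbol+0xOFFSET)", allowing a line break between symbol and offset
SYMBOL_RE = re.compile(r"\((\w+)\s*\+0x[0-9a-fA-F]+\)")


def summarize_backtraces(log_text):
    """Return ({(host, pid): signal}, Counter of symbols seen in frames)."""
    crashed = {
        (host, int(pid)): sig.strip()
        for host, pid, sig in SIGNAL_RE.findall(log_text)
    }
    symbols = Counter(SYMBOL_RE.findall(log_text))
    return crashed, symbols


sample = """\
[node008:27389] *** Process received signal ***
[node008:27389] Signal: Segmentation fault (11)
[node008:27389] Signal code: Address not mapped (1)
[node008:27389] [ 0] /lib64/libpthread.so.0 [0x3db920e4c0]
[node008:27389]
[ 1]
/home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(xderivewrite_int2d_mpio_displ_+0x939)
[0x148614f]
[node008:27389]
[ 2] /home/pcpm/waroquiers/590_r485bis/abinit/5.9/bin/abinip(writewf_
+0xe96) [0x11fe522]
"""

crashed, symbols = summarize_backtraces(sample)
for (host, pid), sig in sorted(crashed.items()):
    print(host, pid, sig)
for name, count in symbols.most_common():
    print(count, name)
```

In the log above, every process reports a frame in
xderivewrite_int2d_mpio_displ_, called from writewf_, so a summary of
this kind mainly confirms that all ranks fail at the same place.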
On Mon, 2009-06-22 at 12:01 +0200, TORRENT Marc wrote:
> Hi again David,
>
> To simplify...
> Could you test with the latest 5.9/trunk/5.9.0-public ?
> The handling of the MPI-IO WFK file has changed in 5.9, and it is more
> convenient for us to debug that latest branch.
> Thanks
>
> Marc
>
>
> Marc.TORRENT@cea.fr wrote:
> > Hi David,
> >
> > 1) Did you try with the latest corrections contained in the latest
> > 5.8.3 bzr revision or 5.8.4 (at least revision 507)? Muriel Delaveau
> > and I made several improvements to the writing of the WFK file with
> > MPI-IO; these corrections improve the portability of the code and
> > have been merged in revision 507. We found that the code
> > crashed on several architectures because of incorrect handling
> > of buffers, and we hope to have corrected that.
> >
> > 2) If you want to be able to use the file with anaddb you have to:
> > - use the latest 5.8.3 branch (or 5.8.4)
> > or
> > - use the --enable-mpi-io-buggy option when building the code
> > (unnecessary after 5.8.3 rev507); but you could have buffer problems
> > in that case.
> >
> >
> > We tested the new changes with ifort and gcc43, using mpich and
> > open-mpi.
> >
> > In band-fft, the memory is split among processes for the wfk but not
> > for other quantities, especially if you use PAW. We plan to correct
> > that soon.
> >
> > Marc
> >
> >
> >
> >
> > -------- Original message --------
> > From: David Waroquiers [mailto:david.waroquiers@uclouvain.be]
> > Date: Fri. 19/06/2009 12:29
> > To: forum@abinit.org
> > Subject: [abinit-forum] Band/FFT parallelism on large systems
> >
> > Hello all,
> >
> > I have tried to use the band/fft parallelism on a large supercell
> > (a-SiO2, 72 atoms and 108 atoms). I encountered a problem when using
> > a lot of bands (4480 bands). The run reaches convergence but crashes
> > at the end, when it is supposed to write the WFK file (outwf call).
> > I tried to run the calculation with 16, 32 and 64 processors.
> >
> > I have tried with fewer bands (640) and it works.
> > Do you have any idea how to overcome this problem? The WFK file is
> > supposed to be 4 GB and the available memory on the clusters is more
> > than that. By the way, in the band/fft parallelism approach, the
> > memory for the wfk is split among the different cpus, isn't it?
> >
> > I encountered another problem when using cut3d to analyse the wfk
> > generated with the band/fft parallelism. It does not recognise the
> > file as a valid wfk file (about the same message as when band/fft
> > parallelism did not allow restarting with a different number of
> > processors, before version 5.8 if I'm right). Any idea about this
> > too?
> >
> > My input file is given hereafter, followed by the log messages. I'm
> > using public version 5.8.3, revision 485, and the machines used are
> > the "green" clusters at UCL: 102 Dual Quad-Core Xeon L5420/2.5GHz in
> > Dell Blade M1000e with 16 GB (or 32 GB for some nodes) per node of
> > 8 processors.
> >
> > Thanks a lot
> >
> > David Waroquiers
> > PhD Student
> > UCL - PCPM - ETSF
> >
> >
> >
> >
> > My input file :
> >
> > # Amorphous SiO2 : Generation of the WFK file needed for the KSS (for GW corrections)
> > # Dataset 1 : GS calculation (_DEN generation)
> > # Dataset 2 : GS calculation with many bands (_WFK generation)
> >
> > ndtset 2
> > jdtset 1 2
> > timopt 2
> >
> > # Dataset 1 : _DEN file generation (Density)
> >
> > tolvrs1 1.0d-12
> > prtden1 1
> > nstep1 5 #5 for testing
> > iscf1 7
> > npulayit1 7
> > nband1 256
> >
> > # Dataset 2 : _WFK file (Wavefunction)
> >
> > tolwfr2 1.0d-12
> > nband2 4480
> > nbdbuf2 384
> > istwfk2 1
> > iscf2 7
> > nstep2 5 #5 for testing
> > getden2 1
> >
> > # Options for Band/FFT Parallelism
> >
> > paral_kgb 1
> > wfoptalg 14
> > nloalg 4
> > fftalg 401
> > iprcch 4
> > intxc 0
> > istwfk 1
> > fft_opt_lob 2
> > npfft 1
> > npband 16 #32 #64
> >
> > # K-point mesh
> >
> > kptopt 0
> > kpt 0.0 0.0 0.0
> >
> > # System definition
> > # Unit cell
> >
> > acell 1.9465690950E+01 1.9465690950E+01 1.9465690950E+01
> > rprim 1 0 0
> > 0 1 0
> > 0 0 1
> >
> > # Atom types
> >
> > ntypat 2
> > znucl 8 14
> >
> > # Atoms and coordinates
> >
> > natom 72
> > typat 48*1 24*2
> > xcart 1.8342971905E+01 1.0013093348E+01 4.9948115472E+00
> > 1.8450118788E+01 5.1100335358E+00 1.1410341879E+01
> > 3.0243029960E+00 1.7006888337E+01 1.0689037523E+01
> > 6.3068666011E+00 1.4446482399E+01 7.9505060279E+00
> > 1.9178811503E+01 4.4712567836E-01 3.4641995090E+00
> > 9.5178783093E+00 1.2762912471E+01 1.4947329016E+01
> > 1.7402433472E+01 4.7067303120E+00 1.5833402903E+00
> > 1.0623164695E+01 2.7299953166E+00 8.7471659694E+00
> > 1.2931573871E+01 1.8128981231E+01 6.7007362518E+00
> > 1.8660924236E+01 1.4792395464E+01 3.1319031106E+00
> > 7.0217232014E+00 6.3190579071E+00 2.1266991430E+00
> > 4.1181163909E-01 5.0929210080E+00 5.7193503290E+00
> > 7.6209880479E+00 1.5443775482E+00 6.1023412080E-01
> > 1.7923134211E+01 9.4056919719E+00 1.3628670860E+01
> > 1.4710748045E+01 9.1118601940E+00 1.7566857742E+01
> > 1.0411995344E+01 1.0041061607E+00 1.5870123306E+01
> > 1.0980496920E+01 1.3629862231E+01 6.8821852197E+00
> > 1.2756648650E+01 9.3922889131E+00 1.2966781879E+01
> > 1.3710153187E+01 2.2151381385E+00 1.9176017166E+01
> > 5.9015247795E+00 1.8254646045E+01 1.6364133902E+01
> > 3.5889689987E+00 8.6729161022E+00 4.9047876611E+00
> > 1.4649631278E+01 1.1782133781E+01 2.4189697381E+00
> > 1.3094524372E+01 1.5574388332E+01 1.1017906884E+01
> > 1.8122798453E+00 1.5904671691E+01 1.5390374184E+01
> > 8.0934509994E+00 9.9606459884E+00 5.6351418737E+00
> > 1.0388873243E+01 1.1258002356E+01 1.9535431306E+01
> > 1.7801695829E+01 1.5681701759E+01 1.1954743795E+01
> > 2.9395289639E+00 3.6212308778E+00 1.4808160737E+00
> > 1.3785141980E+01 3.1146153451E+00 4.7897808777E+00
> > 6.6125694236E+00 3.8955369666E+00 1.1802613942E+01
> > 1.0543336669E+00 8.8480531151E+00 9.2302571597E+00
> > 1.5034376672E+01 1.3207034271E+01 1.5126390258E+01
> > 1.9223920516E+01 6.5595988246E-01 1.3020475817E+01
> > 6.6553078921E+00 5.1934209327E+00 6.9894256581E+00
> > 1.4918361618E+01 3.1596212425E+00 1.4324193688E+01
> > 8.0804273193E+00 7.9884008127E+00 1.4307619386E+01
> > 5.7570753518E+00 1.3551949199E+01 1.8079850277E+01
> > 1.2833144388E+01 6.9576781789E+00 1.8702339976E+00
> > 2.7890960157E+00 1.7032376017E+01 6.7568473875E-01
> > 8.6760457768E+00 1.0859908527E+01 1.0407253204E+01
> > 6.3690907257E+00 2.2769273004E-01 8.2629069843E+00
> > 1.4623475391E+01 1.7952319809E+01 1.5406783784E+01
> > 1.5775821227E+01 1.3896960139E+01 6.9539101570E+00
> > 1.5477566296E+01 1.7519166868E+00 9.5117606862E+00
> > 3.1098755647E+00 7.2414373656E+00 1.3444441571E+01
> > 2.9576688783E+00 1.7045648497E+01 5.6738016905E+00
> > 9.7659864282E+00 1.6334927247E+01 1.8709494220E+01
> > 5.6015780233E+00 4.6820174692E+00 1.6849684158E+01
> > 1.3193293623E+01 1.6296954721E+00 7.4269549058E+00
> > 4.0153579861E+00 1.6089810803E+01 1.7511617105E+01
> > 1.5080013653E+01 1.5674902127E+01 1.3378910449E+01
> > 1.3468917372E+01 1.2396666756E+00 1.6246453919E+01
> > 3.1443097800E-01 3.4518653529E+00 3.0738155414E+00
> > 1.6864054813E+01 1.2620177700E+01 4.3720388857E+00
> > 1.3252228290E+01 1.5343974821E+01 7.9284144425E+00
> > 8.4872425534E+00 1.9054897865E+01 1.7815010425E+01
> > 1.5170448087E+01 1.0186883021E+01 1.4748027393E+01
> > 7.5516402653E+00 3.0013700719E+00 8.9766200084E+00
> > 6.4090355722E+00 7.4843741588E+00 4.9295671605E+00
> > 4.3705611827E-01 7.6893073781E+00 1.2001555962E+01
> > 8.4238741309E+00 1.2232714786E+01 7.6995337657E+00
> > 5.8387974184E+00 5.9155119378E+00 1.4039991791E+01
> > 1.3107235988E+01 9.8055489044E+00 6.4400593019E-01
> > 8.3270647814E-01 1.7227458132E+01 1.2775664290E+01
> > 1.4372625432E+01 4.2560000137E+00 1.9730406948E+00
> > 5.7914453145E+00 4.0664955533E+00 3.9036518542E-01
> > 9.7815513593E+00 1.0257448955E+01 1.3164763822E+01
> > 1.6979663973E+01 2.5757556368E+00 1.2070399003E+01
> > 9.1476280310E-01 8.1192625454E+00 6.2498371664E+00
> > 8.8902943261E+00 1.3433615492E+01 1.7894990037E+01
> > 4.7238007437E+00 1.7074503731E+01 8.1487422033E+00
> > 1.1337419675E+00 1.7170180156E+01 3.2442179093E+00
> >
> > # Energy cutoff for the planewaves
> >
> > ecut 32.0
> >
> > # Parameters for the SCF cycles
> >
> > nstep 5
> > diemac 4.0
> > ixc 11
> >
> >
> >
> >
> >
> >
> > Here is the end of the log for the 72-atom cell with 4480 bands,
> > run on 16 processors:
> >
> > ================================================================================
> >
> >
> > ----iterations are completed or convergence reached----
> >
> > outwf : write wavefunction to file asio2test_para16_4096o_DS2_WFK
> > -P-0000 leave_test : synchronization done...
> > --------------------------------------------------------------------------
> >
> > MPI_ABORT was invoked on rank 15 in communicator MPI_COMM_WORLD
> > with errorcode 1.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> > --------------------------------------------------------------------------
> >
> > --------------------------------------------------------------------------
> >
> > mpirun has exited due to process rank 1 with PID 6159 on
> > node node054 exiting without calling "finalize". This may
> > have caused other processes in the application to be
> > terminated by signals sent by mpirun (as reported here).
> > --------------------------------------------------------------------------
> >
> > [green:29817] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
> > [green:29817] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> >
> >
> >
> >
> >
> > Here is the end of the log file I got from the 108-atom cell (with
> > 3200 bands), run on 32 processors:
> >
> > ================================================================================
> >
> >
> > ----iterations are completed or convergence reached----
> >
> > outwf : write wavefunction to file as108nr_004o_DS2_WFK
> > -P-0000 leave_test : synchronization done...
> > [node090:26797] *** Process received signal ***
> > [node090:26797] Signal: Segmentation fault (11)
> > [node090:26797] Signal code: Address not mapped (1)
> > [node090:26797] Failing at address: 0x2
> > [node090:26797] [ 0] /lib64/libpthread.so.0 [0x395300e4c0]
> > [node090:26797] [ 1] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_add+0x6c1) [0x2b389ea74951]
> > [node090:26797] [ 2] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(ompi_ddt_create_indexed_block+0x1b3) [0x2b389ea750c3]
> > [node090:26797] [ 3] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi.so.0(MPI_Type_create_indexed_block+0xb8) [0x2b389ea9d848]
> > [node090:26797] [ 4] /cvos/shared/apps/openmpi/intel/64/1.3.1/lib64/libmpi_f77.so.0(mpi_type_create_indexed_block_f+0x38) [0x2b389e828780]
> > [node090:26797] [ 5] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(wffwritecg_+0xc29) [0x10823a9]
> > [node090:26797] [ 6] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(writewf_+0x1ef9) [0x107f0b7]
> > [node090:26797] [ 7] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(rwwf_+0x3e88) [0x107d1b4]
> > [node090:26797] [ 8] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(outwf_+0x2616) [0x5f5a72]
> > [node090:26797] [ 9] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(gstate_+0x15074) [0x4675cc]
> > [node090:26797] [10] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(driver_+0x740e) [0x44ea52]
> > [node090:26797] [11] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(MAIN__+0x52a6) [0x4448f6]
> > [node090:26797] [12] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip(main+0x2a) [0x43f642]
> > [node090:26797] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x395281d974]
> > [node090:26797] [14] /home/pcpm/waroquiers/583/abinit/5.8/bin/abinip [0x43f569]
> > [node090:26797] *** End of error message ***
> > --------------------------------------------------------------------------
> >
> > mpirun noticed that process rank 8 with PID 26797 on node node090
> > exited on
> > signal 11 (Segmentation fault).
> > --------------------------------------------------------------------------
> >
> >
>
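
The 4 GB figure quoted in the original message can be cross-checked
with a rough estimate. This sketch assumes a single k-point at Gamma
with istwfk 1 (full plane-wave sphere), one spin component, 16 bytes
per complex coefficient, and the standard count of plane waves inside
the cutoff sphere, npw ~ V (2 ecut)^(3/2) / (6 pi^2), in atomic units
(bohr, Hartree). File headers, eigenvalues and G-vector tables are
ignored, and the function names are hypothetical.

```python
import math


def estimate_npw(volume_bohr3, ecut_ha):
    """Approximate number of plane waves inside the kinetic-energy cutoff sphere."""
    return volume_bohr3 * (2.0 * ecut_ha) ** 1.5 / (6.0 * math.pi ** 2)


def estimate_wfk_bytes(volume_bohr3, ecut_ha, nband,
                       nkpt=1, nsppol=1, nspinor=1):
    """Approximate size of the wavefunction coefficients, 16 bytes each."""
    npw = estimate_npw(volume_bohr3, ecut_ha)
    return nkpt * nsppol * nspinor * nband * npw * 16


# Values taken from the quoted input: cubic cell, acell in bohr, ecut in Hartree
acell = 1.9465690950e01
volume = acell ** 3
size_bytes = estimate_wfk_bytes(volume, ecut_ha=32.0, nband=4480)
print(f"npw  ~ {estimate_npw(volume, 32.0):.0f}")
print(f"size ~ {size_bytes / 1e9:.2f} GB")
```

With the values from the quoted input, the estimate lands between 4 and
5 GB, which is consistent with the size mentioned in the thread.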