forum@abinit.org
Subject: The ABINIT Users Mailing List ( CLOSED )
- From: Matteo Giantomassi <gmatteo@pcpm.ucl.ac.be>
- To: forum@abinit.org
- Subject: Re: [abinit-forum] [BUG?] MPI Abinit 5 cannot calculate GW?
- Date: Tue, 15 May 2007 16:21:56 +0200 (MEST)
Quoting Chao Cao <cao@qtp.ufl.edu>:
> My pleasure... I did the modification, and now it stops during dataset
> #3. The error is actually very weird:
>
> k = 0.000 0.000 0.000
> Band E_lda <Vxclda> E(N-1) <Hhartree> SigX SigC[E(N-1)]
> Z dSigC/dE Sig[E(N)] DeltaE E(N)_pert E(N)_diago
> 1 -6.162 -10.105 -6.162 3.942 -7.066 4.193
> 1.000 0.000 -2.873 7.232 1.069 1.069
> 2 5.849 -10.669 5.849 16.518 -5.720 1.754
> 1.000 0.000 -3.966 6.702 12.552 12.545
> 3 5.849 -10.669 5.849 16.518 -5.713 1.752
> 1.000 0.000 -3.961 6.708 12.557 12.557
> 4 5.849 -10.669 5.849 16.518 -5.705 1.749
> 1.000 0.000 -3.955 6.713 12.563 12.568
> 5 8.371 -9.775 8.371 18.146 -1.436 -1.770
> 1.000 0.000 -3.206 6.569 14.940 14.918
> 6 8.371 -9.775 8.371 18.146 -1.436 -1.770
> 1.000 0.000 -3.206 6.570 14.940 14.940
> 7 8.371 -9.775 8.371 18.146 -1.435 -1.770
> 1.000 0.000 -3.206 6.569 14.940 14.963
> 8 8.966 -10.185 8.966 19.151 -1.319 -1.979
> 1.000 0.000 -3.298 6.888 15.853 15.854
>
> E^0_gap 2.521
> E^GW_gap 2.377
> DeltaE^GW_gap -0.144
>
> sigma : loop over k point, treating k point number 2
> model GW with PPM
> Self-Consistent on Energies and Wavefunctions
> calculating <nk|sigma|nk>
> k = 0.250 0.000 0.000
> bands n = from 1 to 8
>
> calculation status ( 64 to be completed):
> **************************CODE STOPS HERE******************************
>
> It seems that the first kpoint is done, but it failed to calculate for
> the 2nd kpoint (which I assume should be exactly the same). I put "-O0
> -g" in the optimization flags, but that was not helpful at all.
Hi, I do not think it is a problem related to the optimization flags.
It is probably a problem related to the array mpi_enreg%proc_distrb,
which is deallocated after the first k point.
Actually I am using my own version of the code to run calculations
in parallel and I do not have this kind of problem.
You have to modify the subroutine csigme.F90 in 15gw.
Just comment out the following line, situated at the end of the subroutine,
and let me know!
if(associated(mpi_enreg%proc_distrb)) deallocate(mpi_enreg%proc_distrb)
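To illustrate why releasing that array inside the k-point loop could stop the run at the second k point, here is a minimal standalone sketch. It is not the csigme.F90 code: the type mpi_sketch_type, the function run_loop and the array size are made up for illustration, and only the component name proc_distrb is taken from the line above.

```fortran
program proc_distrb_sketch
  implicit none

  ! Hypothetical stand-in for the ABINIT MPI datatype; only the
  ! component named in this thread is reproduced.
  type mpi_sketch_type
    integer, pointer :: proc_distrb(:) => null()
  end type mpi_sketch_type

  integer, parameter :: nkcalc = 2   ! two k points, as in the failing run

  write(*,*) 'deallocate inside the loop: k points done =', run_loop(.true.)
  write(*,*) 'deallocate after the loop:  k points done =', run_loop(.false.)

contains

  ! Returns how many k points found the distribution table still associated.
  integer function run_loop(free_inside_loop) result(ndone)
    logical, intent(in) :: free_inside_loop
    type(mpi_sketch_type) :: mpi_enreg
    integer :: ik

    allocate(mpi_enreg%proc_distrb(4))
    mpi_enreg%proc_distrb = 0
    ndone = 0

    do ik = 1, nkcalc
      ! A real code would index into the table here; the sketch only
      ! checks that the table is still there.
      if (.not. associated(mpi_enreg%proc_distrb)) exit
      ndone = ndone + 1
      if (free_inside_loop) then
        if (associated(mpi_enreg%proc_distrb)) deallocate(mpi_enreg%proc_distrb)
      end if
    end do

    ! Release the table once, after every k point has been treated.
    if (associated(mpi_enreg%proc_distrb)) deallocate(mpi_enreg%proc_distrb)
  end function run_loop

end program proc_distrb_sketch
```

With the deallocation inside the loop the sketch should report one k point done, and two when the deallocation is moved after the loop. In the real code the missing table would more likely cause a crash or a hang than a clean exit.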
> Also, I noticed something suspicious (at least to me): in the
> 21drive/sigma.F90:
>
>
> filnam=trim(dtfil%filnam_ds(4))//'_GW'
> open(21,file=filnam,status='unknown',form='formatted')
> write(21,*) sp%nkcalc
>
> filnam=trim(dtfil%filnam_ds(4))//'_SIG'
> open(22,file=filnam,status='unknown',form='formatted')
>
> filnam=trim(dtfil%filnam_ds(4))//'_SGR'
> open(23,file=filnam,status='unknown',form='formatted')
>
> shouldn't it be enclosed in an "if(me==0) then... end if" block?
Well, you are right, but these files are actually written
in the write_sigma_results subroutine. With the modifications
I suggested in the last mail, only the master processor calls that
subroutine, so no harm should be done.
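For reference, a minimal sketch of the master-only guard discussed here. It is not the actual sigma.F90 code: the program name and the file name sketch_GW are invented for illustration, and it assumes an MPI installation that provides the Fortran "mpi" module.

```fortran
program master_only_open
  use mpi
  implicit none
  integer :: ierr, me

  call MPI_Init(ierr)
  ! me is the rank of this process in MPI_COMM_WORLD (0 is the master).
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

  ! Only rank 0 opens and writes the formatted file, so the other
  ! processes never touch the same unit/file pair.
  if (me == 0) then
    open(unit=21, file='sketch_GW', status='unknown', form='formatted')
    write(21,*) 1          ! placeholder for a value such as sp%nkcalc
    close(21)
  end if

  ! Keep the processes in step before finishing.
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program master_only_open
```

Compiled with an MPI wrapper (for example mpif90) and run on several processes, only one copy of sketch_GW should be written, whatever the number of processes.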
By the way, the next minor release of Abinit is coming soon and there
have been some improvements in the GW part, although some things are still
under development.
So, what about also testing the new implementation as soon as it is
available on the web page?
Best,
Matteo Giantomassi
> otherwise it looks like all the threads are trying to open the same
> file.
>
> However, changing this doesn't eliminate the problem I mentioned...:s
>
>
> Chao Cao
> Matteo Giantomassi wrote:
> >
> > On Wed, 9 May 2007, Chao Cao wrote:
> >
> >> There are definitely more places to change...:) I got the following
> >> error after doing the modifications mentioned here:
> >>
> >> forrtl: severe (37): inconsistent record length, unit 26, file
> >> /scratch/ufhpc/ccao/test/t88/fort.26
> >> mpirun noticed that job rank 2 with PID 10384 on node "r4a-s16.local"
> >> exited on signal 13.
> >> [r4a-s35.local:30975] [0,0,1]-[0,0,0] mca_oob_tcp_msg_recv: readv
> >> failed with errno=104
> >
> > Dear Chao Cao,
> >
> > First of all thanks for your contribution to the debugging of the
> > GW-parallelism. We really need beta-testers since we are still working
> > on the implementation, and we have to be sure that the GW-code is
> > portable among different architectures, compilers and MPI implementations.
> >
> > It seems that the code crashes when trying to read the wavefunctions
> > from file.
> > The subroutine involved should be 15gw/cchi0.F90
> >
> > I remember that in version 5.3.4 there were some problems related to
> > the reading from file. Indeed the code should not read the
> > wavefunctions from file fort.26 since they are supposed to be in memory
> > (unless mkmem=0 is used in the input file).
> >
> > So try to modify 15gw/cchi0.F90 by inserting
> > wfnr_not_in_memory=.false. at the beginning of the file.
> > This should partially fix the problem you are encountering, but I
> > cannot exclude other problems arising in other parts of the code.
> >
> > Best Regards, Matteo Giantomassi
> >>
> >> And I'm still looking for the responsible part of the code... I am
> >> not familiar with abinit source code though, so it'll take a long
> >> time... :p
> >>
> >>
> >> Chao Cao
> >>
> >> Matteo Giantomassi wrote:
> >>>
> >>> On Wed, 9 May 2007, Chao Cao wrote:
> >>>
> >>>> Dear abinit users:
> >>>> I compiled the latest abinit (version 5.3.4, 5.2.4 was also tried) with
> >>>> mpi parallel support. It was compiled with pathscale compiler/ACML
> >>>> library/Open MPI 1.1.4. Everything compiled OK and all the
> >>>> sequential tests were passed. Parallel execution of abinit (abinip)
> >>>> for ground state calculations was also performed without error.
> >>>> However, when I test on GW calculations, the code would always stop
> >>>> with errors.
> >>>> For example, when I perform test_v4 t88, the code will stop after
> >>>> it goes into data set 2. In the log file (redirected from standard
> >>>> output), it stops at:
> >>>> ......
> >>>> ......
> >>>>
> >>>> End the ECHO of the ABINIT file header
> >>>>
> ===============================================================================
>
> >>>> Results from ABINIT code
> >>>> Ab-initio plane waves calculation
> >>>> Results from ABINIT code Ab-initio plane waves
>
> >>>> calculation number of
>
> >>>> electrons 8
> >>>> number of symmetries without inversion 24
> >>>> number of bands 30
> >>>> number of plane waves 89
> >>>> ......
> >>>> ......
> >>>> vkbsign: 1.0 1.0 0.0
> >>>>
> >>>>
> >>>> k eigenvalues [eV]
> >>>> 1 -6.16 5.85 5.85 5.85 8.37 8.37 8.37 8.97 13.42 13.85
> >>>>
> >>>> 2 -5.35 1.88 5.07 5.07 7.73 9.34 9.34 12.53 13.18 13.18
> >>>>
> >>>> 3 -3.79 -1.20 4.62 4.62 7.26 9.19 9.19 13.33 16.71 16.71
> >>>>
> >>>> 4 -5.07 2.33 3.91 3.91 6.90 8.90 11.60 11.60 13.71 15.07
> >>>>
> >>>> 5 -3.42 -0.57 2.24 3.60 7.28 10.25 11.49 11.78 15.93 16.53
> >>>>
> >>>> 6 -4.10 0.33 2.02 4.49 8.21 10.60 10.92 11.81 12.40 15.39
> >>>>
> >>>> 7 -2.00 -2.00 2.94 2.94 6.46 6.46 15.78 15.78 17.07 17.07
> >>>>
> >>>> 8 -1.84 -1.84 1.91 1.91 10.09 10.09 10.74 10.74 16.42 16.42
> >>>>
> >>>> 3 additional processes aborted (not shown)
> >>>> **********************************END OF LOG FILE************************************
> >>>>
> >>>>
> >>>> and in the mpi.out:
> >>>>
> >>>>
> >>>> lib-4091 : UNRECOVERABLE library error
> >>>> A WRITE operation was attempted on a file with no write permission.
> >>>>
> >>>> Encountered during a sequential formatted WRITE to unit 7
> >>>> Fortran unit 7 is connected to a sequential formatted text file:
> >>>> "fort.7"
> >>>> Current format: (i3,7x,10f7.2/50(10x,10f7.2/))
> >>>> ^
> >>>> Signal:6 info.si_errno:0(Success) si_code:-6()
> >>>> [0] func:/opt/psc/ompi/1.1.4/lib/libopal.so.0 [0x2a959ab02b]
> >>>> *** End of error message ***
> >>>>
> >>>>
> >>>>
> >>>> It looks to me that somewhere in the GW code, two or more MPI
> >>>> threads were trying to write to file "fort.7", and thus failed. Has
> >>>> anyone else encountered a problem like this? Any suggestions would be
> >>>> appreciated.
> >>>>
> >>> Hi,
> >>>
> >>> I think the problem is located in the src/15gw/rdkss.F90 subroutine
> >>> at line 463 (abinit version 5.3.4)
> >>>
> >>> if (nsppol==2) then
> >>>   write(6,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol),&
> >>>     (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>   write(ab_out,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol),&
> >>>     (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>> else
> >>>   write(6,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik,&
> >>>     (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>   write(ab_out,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik,&
> >>>     (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>> end if
> >>>
> >>> As you said, each processor/thread is trying to write to the main
> >>> output file and this causes the crash of the parallel run. You can
> >>> simply prevent all the other processors from writing to the main
> >>> output file by just adding the following statement
> >>>
> >>> if (me==0) then
> >>>
> >>>   if (nsppol==2) then
> >>>     write(6,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol),&
> >>>       (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>     write(ab_out,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol),&
> >>>       (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>   else
> >>>     write(6,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik,&
> >>>       (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>     write(ab_out,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik,&
> >>>       (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
> >>>   end if
> >>>
> >>> end if !of me==0
> >>>
> >>>
> >>> The same if statement must be inserted at line 514 of the same
> >>> subroutine (rdkss)
> >>>
> >>> Moreover it's safe to change line 1195 of 21drive/sigma.F90 as
> >>> follows :
> >>>
> >>> <OLD VERSION>
> >>> call write_sigma_results(sp,sr,ikcalc,ikibz,en)
> >>>
> >>> <NEW VERSION>
> >>> if (me==0) call write_sigma_results(sp,sr,ikcalc,ikibz,en)
> >>>
> >>>
> >>> Maybe there are other parts of the code where each processor is
> >>> trying to write to the main output file. We fixed such problems but you
> >>> have to wait for the next release.
> >>> For the moment try to run the automatic tests in parallel, and let
> >>> us know if you encounter other problems.
> >>>
> >>> Hope this helps,
> >>> Best Regards,
> >>> Matteo Giantomassi
> >>>
> >>>>
> >>>> Best,
> >>>>
> >>>>
> >>>>
> >>>> Chao Cao
> >>>>
> >>>> Quantum Theory Project,
> >>>> University of Florida
> >>>> Gainesville, FL 32608
> >>>>
> >>
> >>
>
>
----------------
Matteo Giantomassi
PCPM/FSA/UCL