forum@abinit.org
Subject: The ABINIT Users Mailing List ( CLOSED )
- From: Chao Cao <cao@qtp.ufl.edu>
- To: forum@abinit.org
- Subject: Re: [abinit-forum] [BUG?] MPI Abinit 5 cannot calculate GW?
- Date: Wed, 09 May 2007 16:47:39 -0400
There are definitely more places to change... :) After making the modifications mentioned here, I got the following error:
forrtl: severe (37): inconsistent record length, unit 26, file /scratch/ufhpc/ccao/test/t88/fort.26
mpirun noticed that job rank 2 with PID 10384 on node "r4a-s16.local" exited on signal 13.
[r4a-s35.local:30975] [0,0,1]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
I'm still looking for the responsible part of the code, but I am not familiar with the ABINIT source code, so it will take a while... :p
Chao Cao
Matteo Giantomassi wrote:
On Wed, 9 May 2007, Chao Cao wrote:
Dear abinit users:
Hi,
I compiled the latest ABINIT (version 5.3.4; 5.2.4 was also tried) with MPI parallel support, using the PathScale compiler, the ACML library and Open MPI 1.1.4. Everything compiled OK and all the sequential tests passed. Parallel ground-state calculations with abinip also ran without error. However, when I test GW calculations, the code always stops with errors.
For example, when I run test t88 of test_v4, the code stops after it enters dataset 2. The log file (redirected from standard output) ends with:
......
......
End the ECHO of the ABINIT file header
===============================================================================
Results from ABINIT code
Ab-initio plane waves calculation
number of electrons 8
number of symmetries without inversion 24
number of bands 30
number of plane waves 89
......
......
vkbsign: 1.0 1.0 0.0
k eigenvalues [eV]
1 -6.16 5.85 5.85 5.85 8.37 8.37 8.37 8.97 13.42 13.85
2 -5.35 1.88 5.07 5.07 7.73 9.34 9.34 12.53 13.18 13.18
3 -3.79 -1.20 4.62 4.62 7.26 9.19 9.19 13.33 16.71 16.71
4 -5.07 2.33 3.91 3.91 6.90 8.90 11.60 11.60 13.71 15.07
5 -3.42 -0.57 2.24 3.60 7.28 10.25 11.49 11.78 15.93 16.53
6 -4.10 0.33 2.02 4.49 8.21 10.60 10.92 11.81 12.40 15.39
7 -2.00 -2.00 2.94 2.94 6.46 6.46 15.78 15.78 17.07 17.07
8 -1.84 -1.84 1.91 1.91 10.09 10.09 10.74 10.74 16.42 16.42
3 additional processes aborted (not shown)
**********************************END OF LOG FILE************************************
And in mpi.out:
lib-4091 : UNRECOVERABLE library error
A WRITE operation was attempted on a file with no write permission.
Encountered during a sequential formatted WRITE to unit 7
Fortran unit 7 is connected to a sequential formatted text file: "fort.7"
Current format: (i3,7x,10f7.2/50(10x,10f7.2/))
^
Signal:6 info.si_errno:0(Success) si_code:-6()
[0] func:/opt/psc/ompi/1.1.4/lib/libopal.so.0 [0x2a959ab02b]
*** End of error message ***
It looks to me as if, somewhere in the GW code, two or more MPI processes were trying to write to the file "fort.7", and thus failed. Has anyone else encountered a problem like this? Any suggestions would be appreciated.
I think the problem is located at line 463 of the subroutine rdkss, in src/15gw/rdkss.F90 (ABINIT version 5.3.4):
if (nsppol==2) then
write(6,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol), (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
write(ab_out,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol), (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
else
write(6,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik, (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
write(ab_out,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik, (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
end if
As you said, each processor is trying to write to the main output file, and this causes the parallel run to crash. You can simply prevent all processors except the master (me==0) from writing to the main output file by wrapping these lines in the following if statement:
if (me==0) then
if (nsppol==2) then
write(6,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol), (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
write(ab_out,'(i3,a,10f7.2/50(10x,10f7.2/))') ik,stag(isppol), (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
else
write(6,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik, (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
write(ab_out,'(i3,7x,10f7.2/50(10x,10f7.2/))') ik, (Ha_eV*en(ik,ib,isppol),ib=min_band_proc,max_band_proc)
end if
end if !of me==0
The same if statement must be inserted at line 514 of the same subroutine (rdkss).
Moreover, it's safe to change line 1195 of 21drive/sigma.F90 as follows:
<OLD VERSION>
call write_sigma_results(sp,sr,ikcalc,ikibz,en)
<NEW VERSION>
if (me==0) call write_sigma_results(sp,sr,ikcalc,ikibz,en)
There may be other parts of the code where each processor tries to write to the main output file. We have fixed such problems, but you will have to wait for the next release.
For the moment, try running the automatic tests in parallel and let us know if you encounter other problems.
Hope this helps,
Best Regards,
Matteo Giantomassi
Best,
Chao Cao
Quantum Theory Project,
University of Florida
Gainesville, FL 32608