forum@abinit.org
Subject: The ABINIT Users Mailing List ( CLOSED )
- From: delaire@caltech.edu
- To: forum@abinit.org
- Subject: MPI_Comm_create(): Too many communicators
- Date: Wed, 23 Nov 2005 20:45:19 +0100
Hello,
I am encountering some MPI errors when running parallel ABINIT jobs under
mpich2. This happens when I run my newly g95-compiled ABINIT-4.6.5 with
g95-compiled mpich2 on our Debian Linux cluster.
First the good:
the g95-compiled ABINIT-4.6.5 passes all parallel tests in Test_paral (tests
A through J, with an adapted Run script). The only differences I get in the
fldiff reports are dates/timings and some minor differences in numerical
results.
Now the bad:
when I try to run larger jobs in parallel (these jobs have run in serial
without problems before), I get an error:
MPI_Comm_create(number): Too many communicators
when the program hits the response-function part of the calculation. This has
happened for several different input files. Note that this problem does not
seem to occur for the ground-state part of the calculations.
Here is what I found online on the meaning of this MPI error message:
------
0032-160 Too many communicators (number) in string, task number
Explanation: MPI is unable to create a new communicator because the maximum
number of simultaneous communicators would be exceeded.
User Response: Be sure to free unneeded communicators with MPI_Comm_free so
that they can be reused.
Error Class: MPI_ERR_COMM
------
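To make the quoted explanation concrete, here is a toy model in Python. It is not real MPI and not ABINIT code: the `CommPool` class, the `run_loop` helper, and the limit of 16 are all hypothetical, chosen only to show how a loop that creates a communicator on every pass without freeing it eventually hits a fixed limit, while a loop that frees each one does not.

```python
# Toy model only: this is NOT real MPI. It mimics a library that can hold
# at most `max_comms` communicators at the same time.

class TooManyCommunicators(RuntimeError):
    """Stand-in for the MPI 'Too many communicators' failure."""


class CommPool:
    """Hypothetical bounded pool of communicator handles."""

    def __init__(self, max_comms):
        self.max_comms = max_comms
        self.in_use = 0

    def create(self):
        # Plays the role of MPI_Comm_create: fails once the pool is full.
        if self.in_use >= self.max_comms:
            raise TooManyCommunicators("Too many communicators")
        self.in_use += 1
        return object()  # opaque stand-in for a communicator handle

    def free(self, handle):
        # Plays the role of MPI_Comm_free: the slot becomes reusable.
        self.in_use -= 1


def run_loop(pool, n_iterations, free_each_time):
    """Create one communicator per iteration; return iterations completed."""
    done = 0
    for _ in range(n_iterations):
        try:
            comm = pool.create()
        except TooManyCommunicators:
            break
        if free_each_time:
            pool.free(comm)
        done += 1
    return done


MAX_COMMS = 16  # arbitrary assumption; the real limit depends on the MPI library
leaky = run_loop(CommPool(MAX_COMMS), 100, free_each_time=False)
clean = run_loop(CommPool(MAX_COMMS), 100, free_each_time=True)
print(leaky, clean)
```

In this model the leaking loop stops as soon as the pool is full, whereas the loop that frees each handle completes all of its iterations. Whether ABINIT's response-function code follows the leaking pattern is only a guess based on the error message.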
Has anyone encountered this error before? Is there a way to make sure that
MPI communicators get freed, or a way to set parallelization options so that
this error is prevented?
I am appending below the input file I'm running as well as the tail of the
log file.
Thanks for any help,
Olivier.
log
#################################
-P-0000 cgwf3: WARNING -
-P-0000 New trial energy at line 3 = -5.546092E+02
-P-0000 is higher than former: -5.546092E+02
-P-0000
-P-0000 leave_test : synchronization done...
vtorho3: loop on k-points and spins done in parallel
vtorho3 : MPI_ALLREDUCE, buffer of size 915896 bytes
ETOT 52 -8.70108545352650E-02-3.865E-12 2.117E-15 5.405E-09
At SCF step 52 vres2 = 5.40E-09 < tolvrs= 1.00E-08 =>converged.
-P-0000 leave_test : synchronization done...
nstdy3: loop on k-points and spins done in parallel
-P-0000 leave_test : synchronization done...
================================================================================
----iterations are completed or convergence reached----
outwf : write wavefunction to file bccV_ph2o_DS3_1WF1
-P-0000 leave_test : synchronization done...
aborting job:
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(222): MPI_Comm_create(MPI_COMM_WORLD, group=0xc80101f8,
new_comm=0xfd3ad8) failed
MPI_Comm_create(120): Too many communicators
aborting job:
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(222): MPI_Comm_create(MPI_COMM_WORLD, group=0xc80101f8,
new_comm=0xf88e58) failed
MPI_Comm_create(120): Too many communicators
aborting job:
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(222): MPI_Comm_create(MPI_COMM_WORLD, group=0xc80101f8,
new_comm=0xf88e58) failed
MPI_Comm_create(120): Too many communicators
aborting job:
Fatal error in MPI_Comm_create: Other MPI error, error stack:
MPI_Comm_create(222): MPI_Comm_create(MPI_COMM_WORLD, group=0xc80101f8,
new_comm=0xfa9938) failed
MPI_Comm_create(120): Too many communicators
rank 2 in job 33 strongmad_33980 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
rank 1 in job 33 strongmad_33980 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 33 strongmad_33980 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
#################################
input
#################################
# pure V : computation of the phonon spectrum
ndtset 9
#Set 1 : ground state self-consistency
getwfk1 0 # Cancel default
kptopt1 3 # Automatic generation of k points, no symmetry
tolvrs1 1.0d-18 # SCF stopping criterion (modify default)
prtden1 1
nqpt1 0 # Cancel default
rfphon1 0 # Cancel default
#Q vectors for all datasets
#Complete set of symmetry-inequivalent qpt chosen to be commensurate
# with kpt mesh so that only one set of GS wave functions is needed.
#Generated automatically by running GS calculation with kptopt=1,
# nshift=0, shiftk=0 0 0 (to include gamma) and taking output kpt set
# file as qpt set. Set nstep=1 so only one iteration runs.
nqpt 1 # One qpt for each dataset (only 0 or 1 allowed)
# This is the default for all datasets and must
# be explicitly turned off for dataset 1.
#### grid corresponding to nqpt 4 4 4 reduced by symmetry:
qpt2 0 0 0
qpt3 0.25 0 0
qpt4 0.5 0 0
qpt5 0.25 0.25 0
qpt6 0.25 0.25 0.25
qpt7 -0.25 0.25 0.25
qpt8 0.5 0.5 0.25
qpt9 0.5 0.5 0.5
#Set 2: phonon calc at Gamma
kptopt2 2 # k-point set reduced only by time-reversal symmetry
#Sets 3-9 : Finite-wave-vector phonon calculations (defaults for all datasets)
getwfk 1 # Use GS wave functions from dataset1
kptopt 3 # Need full k-point set for finite-Q response
rfphon 1 # Do phonon response
rfatpol 1 1 # Treat displacements of all atoms
rfdir 1 1 1 # Do all directions (symmetry will be used)
tolvrs 1.0d-8 # This default is active for sets 2-9
#######################################################################
#Common input variables
#Definition of the unit cell
acell 3*5.8074 #
rprim -0.5 0.5 0.5 # BCC primitive vectors (to be scaled by acell)
0.5 -0.5 0.5
0.5 0.5 -0.5
#Definition of the atom types
ntypat 1 # There is only one type of atom
znucl 23 # The keyword "znucl" refers to the atomic number of the
# possible type(s) of atom. The pseudopotential(s)
# mentioned in the "files" file must correspond
# to the type(s) of atom.
#Definition of the atoms
natom 1 # There is one atom
typat 1 # The single atom is of type 1 (V)
xred 0.0 0.0 0.0
#Gives the number of bands explicitly (do not take the default)
# nband 4 check the importance of this
#Exchange-correlation functional
ixc 11 # GGA-PBE
#Definition of the planewave basis set
ecut 40.0 # Maximal kinetic energy cut-off, in Hartree
#Definition of the k-point grid
ngkpt 8 8 8
# nshiftk 2 # Use one copy of grid only (default)
# shiftk 0.25 0.25 0.25 # shift vectors to apply to the grid
# -0.25 -0.25 -0.25
nshiftk 1
shiftk 0 0 0
#Definition of occupation numbers
occopt 4 # "cold smearing" option for occupation of levels
tsmear 0.005
nband 10
#Definition of the SCF procedure
iscf 5 # Self-consistent calculation, using algorithm 5
nstep 100 # Maximal number of SCF cycles
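The comments in the input state that the q-points were chosen to be commensurate with the k-point mesh, so that one set of ground-state wave functions serves every dataset. A minimal Python sketch of that check follows, assuming the unshifted 8 8 8 mesh from `ngkpt` and `shiftk 0 0 0`: a q-point is commensurate when each component times the mesh division is an integer, so that k+q lands back on the mesh. The helper name `is_commensurate` is made up for this illustration and is not part of ABINIT.

```python
from fractions import Fraction

NGKPT = (8, 8, 8)  # from the input: ngkpt 8 8 8 with shiftk 0 0 0

# qpt2 .. qpt9 as listed in the input, kept as strings so the fractions are exact
QPTS = [
    ("0", "0", "0"),
    ("0.25", "0", "0"),
    ("0.5", "0", "0"),
    ("0.25", "0.25", "0"),
    ("0.25", "0.25", "0.25"),
    ("-0.25", "0.25", "0.25"),
    ("0.5", "0.5", "0.25"),
    ("0.5", "0.5", "0.5"),
]


def is_commensurate(q, ngkpt):
    """True if q * ngkpt is an integer vector, i.e. k+q stays on the k mesh."""
    return all((Fraction(c) * n).denominator == 1 for c, n in zip(q, ngkpt))


results = [is_commensurate(q, NGKPT) for q in QPTS]
for q, ok in zip(QPTS, results):
    print(" ".join(q), "commensurate" if ok else "NOT commensurate")
```

Exact fractions are used instead of floats so that the integer test does not depend on rounding.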