
forum - Re: [abinit-forum] steady increase in memory for band parallel job

forum@abinit.org

Subject: The ABINIT Users Mailing List ( CLOSED )




  • From: BOTTIN Francois <francois.bottin@cea.fr>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] steady increase in memory for band parallel job
  • Date: Fri, 04 Dec 2009 17:58:11 +0100
  • Organization: CEA-DAM

Hi Eric,

Concerning the changes, you just have to nullify the various pointers (the first time through) so that
they have a well-defined status at the beginning.
I hope this is the origin of the problem!

For bandpp, I don't know why the link does not appear in the full list of input variables.
You will find the documentation at:
http://www.abinit.org/documentation/helpfiles/for-v5.8/input_variables/vardev.html#bandpp
Using it lets you increase the size of the block without increasing the number of band processors.
Moreover, as the documentation says, "Put *bandpp*=2 when istwfk <http://www.abinit.org/documentation/helpfiles/for-v5.8/input_variables/vardev.html#istwfk>=2 (the time spent in FFTs is divided by two)".
This is well suited to your case (istwfk 2), ... provided your memory leak is removed!

Regards,
Francois
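
The bandpp remark above can be made concrete with a small sketch. It assumes, following the bandpp entry of the v5.8 documentation, that with paral_kgb=1 the LOBPCG block holds npband*bandpp bands and that nband must be a multiple of that block size. It also assumes that the number of MPI processes is npkpt*npband*npfft. The function names are hypothetical, and the nband values are invented for illustration, since the input file quoted in this thread does not set nband.

```python
def lobpcg_block_size(npband, bandpp):
    """Bands per LOBPCG block, assuming block size = npband * bandpp."""
    return npband * bandpp


def mpi_processes(npkpt, npband, npfft):
    """MPI processes needed, assuming the product npkpt * npband * npfft."""
    return npkpt * npband * npfft


def nband_fits(nband, npband, bandpp):
    """True if nband splits into whole blocks (assumed requirement)."""
    return nband % lobpcg_block_size(npband, bandpp) == 0


# npband, npfft and npkpt come from the input file quoted in this thread;
# bandpp 2 is the value suggested for istwfk=2.
block = lobpcg_block_size(npband=26, bandpp=2)
procs = mpi_processes(npkpt=1, npband=26, npfft=1)

# Hypothetical nband values, for illustration only.
ok_for_208 = nband_fits(nband=208, npband=26, bandpp=2)
ok_for_234 = nband_fits(nband=234, npband=26, bandpp=2)
```

Under these assumptions, going from bandpp 1 to bandpp 2 doubles the block from 26 to 52 bands while the job still runs on 26 processes, which is the point of the suggestion.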

Eric J. Walter wrote:

Francois,

I have read the link and will try implementing the changes.
Thanks for the suggestions on the input file. Is bandpp an input variable? Perhaps it is undocumented? I don't see it on:

http://www.abinit.org/documentation/helpfiles/for-v5.8/input_variables/keyhr.html

Thanks again,

Eric



BOTTIN Francois wrote:
Hi Eric and Walter,

By setting istwfk=1, your calculations become 2 times larger and generally 2 or 3 times slower.
So could you please check whether, after correcting 66_wfs/prep_kg_sym_do.F90 as explained in the section "danger
with pointer" at http://www.cs.rpi.edu/~szymansk/OOF90/bugs.html, the memory no longer increases
(using istwfk=2)?

I have substantially changed the way this is done in abinit5.9. The patch proposed above would let you wait
until a stable, production abinit6.0 release.

Concerning your input file, Eric: I suggest using wfoptalg 14, bandpp 2 (to increase performance) and iprcch 6 (for MD).

Regards,
Francois

Eric J. Walter wrote:

Hi,

I have also run a few tests and agree that the memory does not increase when istwfk = 1 is set. Looks like you have found the problem, Francois. Thanks!

Perhaps it is moot now, but I have not been able to determine why I cannot seem to get g95 + openmpi to
work with abinit. If someone has some suggestions....
Regards,

Eric

Winfried Lorenzen wrote:
Hi,

I have tested it now and it works. The memory does not increase any more with istwfk=1.

Are there any drawbacks from setting istwfk = 1?

Regards,
Winfried


On Thursday, 3 December 2009 at 18:01:08, BOTTIN Francois wrote:
Dear Eric and Manuel,

Thank you very much for your reports.
As we can see from your graph, Eric (which is very helpful; we can see
nline, nnsclo, nstep...),
the problem appears once inside and once outside LOBPCG per
electronic step.
I suspect the trouble comes from 66_wfs/prep_kg_sym_do.F90, which is called
i) once per step through prep_getghc.F90 inside LOBPCG, and
ii) once per step through prep_fourwf.F90 outside LOBPCG.
Within this routine, there are some pointers with the "save" attribute
but an undefined status (they are not nullified). In this case, I think
the "associated" intrinsic can't be used to check and deallocate the
pointers before reallocation
(see for example the section "danger with pointer" at
http://www.cs.rpi.edu/~szymansk/OOF90/bugs.html).
It seems that these pointers are never deallocated.

One piece of good news: this routine is only called when istwfk=2 (real
wavefunctions).
So could you please check whether your memory still increases
when using istwfk=1?

Also, let me know if you succeed in compiling and running in parallel with
g95.
Thanks again,
Regards
Francois

Eric J. Walter wrote:
Dear Francois,

I have put a copy of my input, output, log (called OUTFILE) and psps
at the following link:

http://piezo1.physics.wm.edu/~ewalter/Abinit_memoryleak/

NOTE: the input file is for a different system than the one I
attached previously (the atom species are
different). Please disregard the previous one.

I have tried compiling with g95... The serial code works fine, but
never showed the memory problem
anyway. The parallel version exits right after reading the psp
files. I am continuing to look into this problem.

I have also tried compiling with pathscale v3.0; the result is the
same as the Intel fortran result.

So far, my testing seems to show that the largest problem is in the
lobpcgwf / lobpcgccwf routine. In my
test, I run for 5 iterations and write the "free-(buffers+cache)"
result of "free -m" to the output file
(I have added call system('free -m') to various parts of the 5.8.4
code). The graph is posted at the same
url as above (called memgraph). You can clearly see 5 repeating
patterns, one for each iteration. In the
serial case using wfoptalg 4 (green), you can see that the memory usage
doesn't increase from iteration to iteration.
For the parallel case with the lobpcgwf routine (black), the
memory keeps increasing during each iteration.
However, when the lobpcgwf routine is commented out (red), the memory
increases less quickly.

I will continue to try and track down the problem here.

Regards,

Eric
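
The measurement described above can be post-processed with a short script. This is a minimal sketch, not the procedure actually used for memgraph. It assumes the older `free -m` layout that prints a "-/+ buffers/cache" row, and it reads the "used" column of that row. Which column was plotted is not stated in the message, so this is an assumption. The sample text and all numbers are invented, and the function names are hypothetical.

```python
# Hypothetical `free -m` output in the older procps layout; numbers invented.
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7983       5120       2863          0        210       1890
-/+ buffers/cache:       3020       4963
Swap:         2047          0       2047
"""


def used_minus_buffers_cache(free_output):
    """Memory used by applications in MB, i.e. used - buffers - cached."""
    lines = free_output.splitlines()
    # Prefer the summary row when free prints it.
    for line in lines:
        if line.startswith("-/+ buffers/cache:"):
            used, _free = line.split(":", 1)[1].split()
            return int(used)
    # Otherwise derive the same figure from the Mem: row.
    for line in lines:
        if line.startswith("Mem:"):
            _total, used, _free, _shared, buffers, cached = map(
                int, line.split()[1:7]
            )
            return used - buffers - cached
    raise ValueError("unrecognised free output")


def growth_per_iteration(samples):
    """Differences between consecutive per-iteration readings (MB)."""
    return [b - a for a, b in zip(samples, samples[1:])]


def looks_like_leak(samples, tolerance_mb=0):
    """True if every iteration ends higher than the one before it."""
    deltas = growth_per_iteration(samples)
    return bool(deltas) and all(d > tolerance_mb for d in deltas)
```

Taking one reading at the same point of each iteration and passing the list to looks_like_leak separates a steady increase (the black curve) from a pattern that repeats at the same level (the green curve).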

BOTTIN Francois wrote:
Hi,

Is it possible for you to compile abinit-5.8.4p with g95 in order to
track memory leaks?
Perhaps your code will stop at a well-defined line, reporting an
array as "already allocated"!
It would be very nice to get rid of this trouble.

If not, could you send your 3 pseudopotentials, output and log (of
proc 0), if these files are not too large?
Regards,
Francois Bottin

Eric J. Walter wrote:
Hi,

It appears as though my jobs are suffering from the same problem
outlined in this post from July of this year:

https://listes-2.sipr.ucl.ac.be/abinit.org/arc/forum/2009-07/msg00034.html


When I run the attached input file for ~14 hrs, the nodes I am using
run out of memory.
I am using RHEL AS4 on dual core / dual processor Opteron 2200 with
2 GB memory per core (= 8 GB total).
I am compiling with Intel Fortran 10.1 and OpenMPI-1.2.5. This
job's output file claims that the calculation
should require ~200 MB.

I have found that this increase only occurs when using band/fft
parallelization (kpt-parallel and serial runs don't
have this steady increase). Besides version 5.8.4p, I have also
tested versions 5.7.4 and 5.6.4; all three versions
seem to show this behavior. Version 5.4.4 does not have this
problem, but is also much slower, for me at least.

I have tried changing versions of openmpi (from 1.2.5 to 1.3.3);
this has no effect.

Has any progress been made on either finding the leak or another
cause of this problem?
Thanks in advance for any help you can give.

Eric J. Walter
Department of Physics
College of William and Mary

---------------------------------------------------
Here is my input file:
---------------------------------------------------
ionmov 8
noseinert 5.0d5
mditemp 300
dtion 50
toldfe 1d-4 eV
ntime 20
nsym 1
chkprim 0
kptopt 1
ngkpt 1 1 1
shiftk 0 0 0
occopt 3
tsmear 0.001
ecut 45
nstep 20
natom 80
ntypat 3
znucl 82 22 8
prtden 0
prtwf 0
wfoptalg 4
npfft 1
npband 26
npkpt 1
nloalg 4
fftalg 401
iprcch 4
intxc 0
fft_opt_lob 2
paral_kgb 1

typat 48*3 16*2 16*1
acell 3*20.599653
angdeg 3*56.39553912

xcart
-2.22069320900486E+00 -1.67615626136839E+00 1.02845225672214E+01
2.83216713500271E+00 -9.64895067984759E-01 1.06130211295748E+01
-3.07826006953531E-01 3.14017017198906E+00 1.02884003097039E+01
2.38571355304415E+00 1.94315354580616E+00 1.47667846684432E+01
-2.74032183791911E+00 1.44987913001174E+00 1.47058588105213E+01
4.24301353957222E-01 -3.02467161327066E+00 1.47270477618761E+01
4.37208548403877E-01 3.15701320773295E+00 1.63174232817734E+00
5.62754717842062E+00 3.65799392921832E+00 1.72482290066637E+00
2.56805130575382E+00 8.08007715450321E+00 1.75538816925331E+00
5.29946087386834E+00 6.62719979049667E+00 6.13774796347384E+00
-1.33476176608987E-01 6.02962682764747E+00 5.99150680371859E+00
3.25740331179827E+00 1.86629101076117E+00 5.92292965668993E+00
4.73721314332368E-01 -6.73544486518827E+00 1.53749001511875E+00
5.79909660316150E+00 -5.87516895113753E+00 1.52606696257784E+00
2.72661190097539E+00 -1.82547980521533E+00 1.72879478480763E+00
5.29308456737276E+00 -2.93742682967659E+00 5.98120816946857E+00
-1.02277107167427E-01 -3.47901581523419E+00 6.00815649851232E+00
3.22915540047783E+00 -7.91315859978646E+00 6.15697463378583E+00
3.20853423992771E+00 -1.70404358692119E+00 -6.97779629913824E+00
8.39651387970289E+00 -8.70615018894613E-01 -7.04102732844698E+00
5.39613269691849E+00 3.09966157186435E+00 -6.96223499113757E+00
8.17325435029689E+00 2.08765713206062E+00 -2.55251338829625E+00
2.82623828634786E+00 1.23322960474568E+00 -2.78889127839044E+00
6.11970263677967E+00 -3.17948832421889E+00 -2.34609526217649E+00
-7.91433373058299E+00 -1.81160718343090E+00 1.63075874446493E+00
-2.76694176578899E+00 -1.16579400400263E+00 1.74956240701256E+00
-5.81555263606813E+00 3.20412849389643E+00 1.72359061387011E+00
-2.79939696716810E+00 2.06735348854857E+00 6.15861322508217E+00
-8.31960266338488E+00 1.29355309750287E+00 5.93088060057318E+00
-5.10575950217045E+00 -2.98997425594191E+00 6.13900333992898E+00
-5.14327896212322E+00 3.29299293105111E+00 -7.09301147609970E+00
9.75650004473785E-02 3.78997253887449E+00 -6.91330351424288E+00
-2.92976440197528E+00 7.95211008443348E+00 -6.82517026645523E+00
-3.06355829923221E-01 6.76833507431424E+00 -2.77362691622797E+00
-5.53331641987666E+00 6.24968986891999E+00 -2.54578391594814E+00
-2.43999241882386E+00 1.80795237233396E+00 -2.67292251413726E+00
-5.14068521198933E+00 -6.68713377702076E+00 -6.86237306597303E+00
-6.12789514851627E-02 -5.98968660555720E+00 -6.97883625732929E+00
-2.99553503150223E+00 -1.83920851864552E+00 -6.83013099810669E+00
-1.80008738706434E-01 -2.89562125276255E+00 -2.65747984992241E+00
-5.56015371335734E+00 -3.55350951569161E+00 -2.53677784336301E+00
-2.23260539192524E+00 -7.88875944438475E+00 -2.59919279877736E+00
-2.31544564461024E+00 -1.60168511875929E+00 -1.57988254333016E+01
2.98206360781800E+00 -1.09336396682876E+00 -1.56367782886258E+01
-3.02653507337988E-01 3.20195569372773E+00 -1.56155029581190E+01
2.62978455935385E+00 1.90892236294971E+00 -1.12646465158811E+01
-2.74702844352294E+00 1.34758125055524E+00 -1.11329897640166E+01
3.92389502215297E-01 -3.14047927186374E+00 -1.13789217412000E+01
1.43789519936701E-01 8.95683744161872E-02 5.51308679103783E-02
3.43343139804145E-02 8.44171505400222E-02 1.31069788732053E+01
2.87456960507807E+00 4.84780682916232E+00 -8.54098683394433E+00
2.83533350827871E+00 4.91662126969937E+00 4.43850072950254E+00
2.95141488341089E+00 -4.80542390687403E+00 -8.59714703849626E+00
2.82501485426290E+00 -4.90110860896433E+00 4.40234973478692E+00
5.65634082404525E+00 -1.24851891953083E-02 -1.71524703303103E+01
5.66706037951402E+00 9.18065709228990E-02 -4.30774458733638E+00
-5.41740478793818E+00 1.03696823860055E-01 -8.46603127880005E+00
-5.52981835134213E+00 3.74816566677455E-02 4.36263803023674E+00
-2.73374622355905E+00 4.90669254100288E+00 -1.71056087884796E+01
-2.76475317059456E+00 4.97808833622533E+00 -4.26429469831174E+00
-2.71829980916229E+00 -4.81400423619281E+00 -1.71329000123123E+01
-2.78480871843135E+00 -4.79896324962904E+00 -4.16357601481035E+00
1.27116237724845E-01 5.57355728005296E-02 -2.58199597795567E+01
3.69147318259344E-02 1.01239700458947E-01 -1.28425126153755E+01
1.09315814613628E-01 2.02744841122104E-02 7.51595267544533E+00
9.28429983979248E-02 4.27836872194515E-02 2.02554355831334E+01
2.71495957648699E+00 4.91077718525781E+00 -8.53254132115559E-01
2.86538509412732E+00 4.97170584874535E+00 1.18102413232714E+01
3.09009565965701E+00 -4.82067158705315E+00 -1.15312475386682E+00
2.97965222241338E+00 -4.75118372415529E+00 1.17120712079335E+01
5.77396145656038E+00 1.07323885408925E-02 -9.98939448156541E+00
5.72978432984313E+00 -6.68544670080847E-03 3.19837381572586E+00
-5.80054029700240E+00 -3.59016258256638E-02 -1.07926803392243E+00
-5.41163712815481E+00 1.29254553589581E-01 1.16343816906280E+01
-2.62158506045681E+00 5.16922521499392E+00 -1.00544849478146E+01
-2.72015016982663E+00 4.98880486733811E+00 3.02504536215413E+00
-2.79050293194776E+00 -4.88384984087100E+00 -1.01240590641792E+01
-2.61941881955994E+00 -4.73421992471235E+00 3.07027251730700E+00
7.60779786630738E-02 2.19994741542859E-01 -1.87223237882607E+01
6.88549292157438E-02 -1.91082171805773E-02 -5.66940459435534E+00

vel
1.65929233257248E-04 -7.22899336586353E-06 2.26257574401143E-04
-4.41498838683481E-04 2.56696469185694E-05 3.31830998480140E-04
1.73881380446662E-04 -3.59274234563293E-04 2.29273421593672E-04
-7.02798094302199E-05 -2.60387333161425E-05 -2.65924872337890E-04
4.42536785165872E-04 -4.89827163710057E-04 -4.01073432982287E-04
1.95643632158351E-05 1.11653991217122E-03 -9.34094601194410E-04
-7.86319162726620E-05 1.98530438200753E-04 2.04688581300107E-04
3.00397575951297E-04 2.29731480818108E-04 -7.09537178564776E-05
-3.61460082461821E-04 2.48090453139150E-04 -5.49686708040239E-04
-8.68482177851845E-05 -1.08275834028475E-04 2.95658165452176E-05
-1.07412509904415E-04 -8.78012429709354E-05 1.54371802639068E-04
-2.09843740843053E-05 -3.01743488213807E-04 6.64260233761203E-05
-6.12496928754301E-05 1.36432817751052E-04 -9.40878503004121E-05
6.14104908278753E-04 1.68165113713754E-04 -5.12708955417303E-04
2.15443297114865E-04 -4.04241581459637E-04 2.39357619874606E-04
7.81430062976476E-05 1.59513106225212E-04 2.13405809058389E-04
-2.36012088016480E-04 1.38402390396325E-04 2.23418565305995E-04
-9.33414880283709E-05 1.20065355174723E-04 -5.40061134400772E-05
-2.80686746960019E-05 -3.62478762225168E-04 -2.99299750890718E-04
-4.40755174907394E-05 -1.32663871913867E-04 1.18770307640526E-04
-2.53558806288978E-04 4.01259820381485E-04 -2.81728409555392E-04
1.57361848603766E-04 3.41172196643058E-05 1.33022101611947E-04
4.01019710009567E-04 7.22573433787659E-07 -2.69262344059131E-04
2.99733136121017E-04 -6.68678893104353E-04 3.86496510503676E-04
3.95734899013453E-05 -2.80879543778806E-04 -2.77473937236590E-04
-2.93485746700284E-04 -1.00409592926771E-05 1.76552268367262E-04
-4.99995043152451E-04 5.34640365244488E-04 -4.16733250625040E-04
4.62540147703766E-04 2.69462418459427E-04 3.55370145754719E-04
2.11074019336284E-04 -5.09254836134716E-05 -6.07095017577757E-05
5.30486297370657E-05 -3.89516038918225E-04 4.59061055880998E-04
3.85969341185967E-05 -1.33647845375007E-04 1.45606759036967E-04
2.43990312107843E-04 1.62650517626908E-04 -1.51033371102077E-04
2.67081228198250E-04 -3.20863354500405E-04 5.19374366175999E-04
-3.10316561967053E-04 -1.85209360845414E-04 -4.03377994784274E-04
-3.01054183714974E-04 -1.79705768853539E-04 1.86267271696545E-05
2.66589291951177E-04 9.91038990779357E-04 -3.96599362099506E-04
4.08276960903040E-05 1.53432487219794E-04 1.22871735039336E-04
2.35590328814509E-04 2.76550673861414E-04 -1.90931581662904E-04
2.04967171400528E-04 -3.84948347563266E-04 1.08474133114518E-04
-6.97154912755278E-04 -4.71858129707306E-04 -3.50888447092869E-04
-1.24684436826259E-04 1.21627138377997E-04 1.68581444194268E-04
-7.70851773281092E-05 4.63714165425422E-05 1.81085183134115E-04
-3.17589866254135E-04 -5.50009675262097E-04 -4.30671530615520E-04
-1.57911296613577E-04 -4.54426250235638E-05 -4.92425558216806E-06
9.87305191973671E-06 1.80207048070693E-05 -1.79532949110393E-04
6.83671915911343E-05 -9.70105029125616E-05 1.82258963862073E-05
4.57857583174456E-04 -8.32122447198841E-05 -1.67508689699827E-04
1.94599797149855E-04 -2.23628588270493E-04 1.07780974532419E-04
-2.82971791158802E-05 4.68214184227891E-05 -8.36700118689787E-05
-4.24265368843893E-05 -3.21245875729528E-05 9.22433773212692E-05
2.51671799765403E-05 -1.77766618119046E-05 1.72494196852232E-05
6.11315117142581E-06 8.68163744077307E-05 5.67296042042437E-06
-7.70845081775687E-05 3.08561858749785E-05 1.52938314279569E-05
4.20820628111984E-05 3.70042677593407E-05 9.96812582225421E-05
1.40844649691300E-04 -7.97217755672014E-05 -1.31083237666120E-05
-5.81978615724723E-05 1.17490844352786E-04 4.40930378539056E-05
6.36078347459770E-05 6.69615793199388E-05 -8.39129585283473E-05
-6.07614472082574E-05 -4.37626480779298E-05 -4.29461393218481E-06
-6.36394987636312E-05 -3.12613972792605E-05 7.79309141070753E-05
2.39708852289351E-05 -9.39157455192732E-05 3.01944658201935E-05
-1.03662791966090E-05 3.87149926133824E-05 1.05050319203302E-04
1.10894432823969E-04 3.45992275343292E-05 -3.01976038789652E-05
-1.72253916094043E-05 3.96742670252505E-05 -5.25674305834915E-05
-8.48228304979956E-05 5.64908929641756E-05 4.47531393684651E-05
-1.47506234752193E-04 1.22556764449835E-04 -3.01480394930127E-04
-2.95445248698945E-04 -2.42094492146111E-04 2.66277938523650E-04
-7.77631153135254E-05 -1.09252568425864E-04 1.69153096659710E-04
2.60223872399644E-04 -7.70901517625762E-05 1.36621011724429E-04
4.79856202268419E-05 -3.58150632460491E-04 -1.54493852536122E-04
-7.14052845122851E-05 -2.69155466792118E-04 1.79749509700142E-04
1.45112055674323E-05 -1.80926466413756E-04 6.10956577883199E-05
9.60743159647168E-05 -1.00159903624474E-04 7.70024146584693E-05
-7.39105065049050E-05 -2.15075671717976E-04 2.20969494128146E-04
2.04020346009061E-04 1.56655746598131E-04 -1.77043062668635E-04
-4.03670312774535E-04 -5.52855555754880E-05 -1.18348120988931E-04
3.71601522478895E-04 4.88286475929969E-05 1.85192663041991E-04
4.16812192512686E-05 2.23401099965076E-04 1.46463434516355E-04
-3.25279267883869E-04 7.02921173153103E-05 1.05326094998430E-04
-6.05071951258716E-05 4.15488061404616E-04 1.16175008659989E-04
-1.93579262145139E-04 5.50540862463690E-05 4.37697584266472E-04






--
##############################################################
Francois Bottin tel: 01 69 26 41 73
CEA/DIF fax: 01 69 26 70 77
BP 12 Bruyeres-le-Chatel email: Francois.Bottin@cea.fr
##############################################################



