
forum - Re: [abinit-forum] steady increase in memory for band parallel job

forum@abinit.org

Subject: The ABINIT Users Mailing List ( CLOSED )


Re: [abinit-forum] steady increase in memory for band parallel job


  • From: BOTTIN Francois <francois.bottin@cea.fr>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] steady increase in memory for band parallel job
  • Date: Thu, 03 Dec 2009 18:01:08 +0100
  • Organization: CEA-DAM

Dear Eric and Manuel,

Thank you very much for your reports.
As your graph shows, Eric (it is very helpful: one can see nline, nnsclo, nstep...),
the problem appears once inside and once outside LOBPCG per electronic step.
I suspect the trouble comes from 66_wfs/prep_kg_sym_do.F90, which is called
i) once per step through prep_getghc.F90 inside LOBPCG, and
ii) once per step through prep_fourwf.F90 outside LOBPCG.
Within this routine, some pointers have the "save" attribute
but an undefined association status (they are not nullified). In this case, I think the "associated"
intrinsic cannot be used to check and deallocate the pointers before reallocation
(see for example the section "danger with pointer" at http://www.cs.rpi.edu/~szymansk/OOF90/bugs.html).
It seems that these pointers are never deallocated.

One piece of good news: this routine is only called when istwfk=2 (real wavefunctions).
So could you please check whether your memory still increases when using istwfk=1?

Also, let me know if you manage to compile and run in parallel with g95.
Thanks again,
Regards
Francois

Eric J. Walter wrote:

Dear Francois,

I have put a copy of my input, output, log (called OUTFILE) and psps at the following link:

http://piezo1.physics.wm.edu/~ewalter/Abinit_memoryleak/

NOTE: the input file is for a different system than the one I attached previously (the atom species are
different). Please disregard the previous one.

I have tried compiling with g95. The serial code works fine, but it never showed the memory problem
anyway. The parallel version exits right after reading the psp files. I am continuing to look into this problem.

I have also tried compiling with PathScale v3.0; the result is the same as with Intel Fortran.

So far, my testing seems to show that the largest problem is in the lobpcgwf / lobpcgccwf routines. In my
test, I run for 5 iterations and write the "free-(buffers+cache)" result of "free -m" to the output file
(I have added call system('free -m') to various parts of the 5.8.4 code). The graph is posted at the same
URL as above (called memgraph). You can clearly see 5 repeating patterns, one for each iteration. In the
serial case using wfoptalg 4 (green), you can see that the memory usage doesn't increase from iteration to
iteration. For the parallel run with the lobpcgwf routine (black), the memory keeps increasing during each iteration.
However, when the lobpcgwf routine is commented out (red), the memory increases less quickly.
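
The readings described above can be post-processed once the run has finished. What follows is only a minimal sketch in Python, assuming the older `free -m` layout with separate "buffers" and "cached" columns (newer versions of `free` print a combined "buff/cache" column instead). The function names and the sample numbers are hypothetical; they are not part of ABINIT and are not taken from the actual run.

```python
# Sketch: compute "used minus buffers and cache" from saved `free -m` text.
# Assumes the older procps layout with separate "buffers" and "cached" columns.

SAMPLE_FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7985       3012       4973          0        120       1500
-/+ buffers/cache:       1392       6593
Swap:         2047          0       2047
"""  # made-up numbers, for illustration only


def used_minus_cache_mb(free_output):
    """Return used - buffers - cached (in MB) from one `free -m` report."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()  # column names: total, used, free, ...
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = dict(zip(header, map(int, line.split()[1:])))
            return values["used"] - values["buffers"] - values["cached"]
    raise ValueError("no 'Mem:' line found in free output")


def baseline_growth_mb(samples_per_iteration):
    """Difference between the lowest reading of consecutive iterations.

    Each element of samples_per_iteration is the list of readings (MB)
    collected during one iteration.
    """
    floors = [min(samples) for samples in samples_per_iteration]
    return [later - earlier for earlier, later in zip(floors, floors[1:])]
```

Comparing the lowest reading of each iteration, rather than the peaks, separates a steady increase from the normal rise and fall within one iteration: a list of values near zero suggests no growth, while consistently positive values suggest memory that is never released.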

I will continue to try and track down the problem here.

Regards,

Eric




BOTTIN Francois wrote:
Hi,

Would it be possible for you to compile abinit-5.8.4p with g95 in order to track memory leaks?
Perhaps your code will stop at a well-defined line, reporting an array as "already allocated"!
It would be very nice to get rid of this trouble.

If not, could you send your 3 pseudopotentials, output and log (of proc 0), if these files are not too large?
Regards,
Francois Bottin


Eric J. Walter wrote:

Hi,

It appears as though my jobs are suffering from the same problem outlined in this post from July of this year:

https://listes-2.sipr.ucl.ac.be/abinit.org/arc/forum/2009-07/msg00034.html

When I run the attached input file for ~14 hrs, the nodes I am using run out of memory.
I am using RHEL AS4 on dual core / dual processor Opteron 2200 with 2 GB memory per core (= 8 GB total).
I am compiling with Intel Fortran 10.1 and OpenMPI-1.2.5. This job's output file claims that the calculation
should require ~200 MB.

I have found that this increase only occurs when using band/fft parallelism (kpt parallel and serial runs don't
have this steady increase). Besides version 5.8.4p, I have also tested versions 5.7.4 and 5.6.4; all three versions
seem to show this behavior. Version 5.4.4 does not have this problem, but is also much slower, for me at least.

I have tried changing versions of OpenMPI (from 1.2.5 to 1.3.3); this has no effect.

Has any progress been made on either finding the leak or another cause of this problem?
Thanks in advance for any help you can give.

Eric J. Walter
Department of Physics
College of William and Mary

---------------------------------------------------
Here is my input file:
---------------------------------------------------
ionmov 8
noseinert 5.0d5
mditemp 300
dtion 50
toldfe 1d-4 eV
ntime 20
nsym 1
chkprim 0
kptopt 1
ngkpt 1 1 1
shiftk 0 0 0
occopt 3
tsmear 0.001
ecut 45
nstep 20
natom 80
ntypat 3
znucl 82 22 8
prtden 0
prtwf 0
wfoptalg 4
npfft 1 npband 26
npkpt 1
nloalg 4
fftalg 401
iprcch 4
intxc 0
fft_opt_lob 2
paral_kgb 1

typat 48*3 16*2 16*1
acell 3*20.599653
angdeg 3*56.39553912

xcart
-2.22069320900486E+00 -1.67615626136839E+00 1.02845225672214E+01
2.83216713500271E+00 -9.64895067984759E-01 1.06130211295748E+01
-3.07826006953531E-01 3.14017017198906E+00 1.02884003097039E+01
2.38571355304415E+00 1.94315354580616E+00 1.47667846684432E+01
-2.74032183791911E+00 1.44987913001174E+00 1.47058588105213E+01
4.24301353957222E-01 -3.02467161327066E+00 1.47270477618761E+01
4.37208548403877E-01 3.15701320773295E+00 1.63174232817734E+00
5.62754717842062E+00 3.65799392921832E+00 1.72482290066637E+00
2.56805130575382E+00 8.08007715450321E+00 1.75538816925331E+00
5.29946087386834E+00 6.62719979049667E+00 6.13774796347384E+00
-1.33476176608987E-01 6.02962682764747E+00 5.99150680371859E+00
3.25740331179827E+00 1.86629101076117E+00 5.92292965668993E+00
4.73721314332368E-01 -6.73544486518827E+00 1.53749001511875E+00
5.79909660316150E+00 -5.87516895113753E+00 1.52606696257784E+00
2.72661190097539E+00 -1.82547980521533E+00 1.72879478480763E+00
5.29308456737276E+00 -2.93742682967659E+00 5.98120816946857E+00
-1.02277107167427E-01 -3.47901581523419E+00 6.00815649851232E+00
3.22915540047783E+00 -7.91315859978646E+00 6.15697463378583E+00
3.20853423992771E+00 -1.70404358692119E+00 -6.97779629913824E+00
8.39651387970289E+00 -8.70615018894613E-01 -7.04102732844698E+00
5.39613269691849E+00 3.09966157186435E+00 -6.96223499113757E+00
8.17325435029689E+00 2.08765713206062E+00 -2.55251338829625E+00
2.82623828634786E+00 1.23322960474568E+00 -2.78889127839044E+00
6.11970263677967E+00 -3.17948832421889E+00 -2.34609526217649E+00
-7.91433373058299E+00 -1.81160718343090E+00 1.63075874446493E+00
-2.76694176578899E+00 -1.16579400400263E+00 1.74956240701256E+00
-5.81555263606813E+00 3.20412849389643E+00 1.72359061387011E+00
-2.79939696716810E+00 2.06735348854857E+00 6.15861322508217E+00
-8.31960266338488E+00 1.29355309750287E+00 5.93088060057318E+00
-5.10575950217045E+00 -2.98997425594191E+00 6.13900333992898E+00
-5.14327896212322E+00 3.29299293105111E+00 -7.09301147609970E+00
9.75650004473785E-02 3.78997253887449E+00 -6.91330351424288E+00
-2.92976440197528E+00 7.95211008443348E+00 -6.82517026645523E+00
-3.06355829923221E-01 6.76833507431424E+00 -2.77362691622797E+00
-5.53331641987666E+00 6.24968986891999E+00 -2.54578391594814E+00
-2.43999241882386E+00 1.80795237233396E+00 -2.67292251413726E+00
-5.14068521198933E+00 -6.68713377702076E+00 -6.86237306597303E+00
-6.12789514851627E-02 -5.98968660555720E+00 -6.97883625732929E+00
-2.99553503150223E+00 -1.83920851864552E+00 -6.83013099810669E+00
-1.80008738706434E-01 -2.89562125276255E+00 -2.65747984992241E+00
-5.56015371335734E+00 -3.55350951569161E+00 -2.53677784336301E+00
-2.23260539192524E+00 -7.88875944438475E+00 -2.59919279877736E+00
-2.31544564461024E+00 -1.60168511875929E+00 -1.57988254333016E+01
2.98206360781800E+00 -1.09336396682876E+00 -1.56367782886258E+01
-3.02653507337988E-01 3.20195569372773E+00 -1.56155029581190E+01
2.62978455935385E+00 1.90892236294971E+00 -1.12646465158811E+01
-2.74702844352294E+00 1.34758125055524E+00 -1.11329897640166E+01
3.92389502215297E-01 -3.14047927186374E+00 -1.13789217412000E+01
1.43789519936701E-01 8.95683744161872E-02 5.51308679103783E-02
3.43343139804145E-02 8.44171505400222E-02 1.31069788732053E+01
2.87456960507807E+00 4.84780682916232E+00 -8.54098683394433E+00
2.83533350827871E+00 4.91662126969937E+00 4.43850072950254E+00
2.95141488341089E+00 -4.80542390687403E+00 -8.59714703849626E+00
2.82501485426290E+00 -4.90110860896433E+00 4.40234973478692E+00
5.65634082404525E+00 -1.24851891953083E-02 -1.71524703303103E+01
5.66706037951402E+00 9.18065709228990E-02 -4.30774458733638E+00
-5.41740478793818E+00 1.03696823860055E-01 -8.46603127880005E+00
-5.52981835134213E+00 3.74816566677455E-02 4.36263803023674E+00
-2.73374622355905E+00 4.90669254100288E+00 -1.71056087884796E+01
-2.76475317059456E+00 4.97808833622533E+00 -4.26429469831174E+00
-2.71829980916229E+00 -4.81400423619281E+00 -1.71329000123123E+01
-2.78480871843135E+00 -4.79896324962904E+00 -4.16357601481035E+00
1.27116237724845E-01 5.57355728005296E-02 -2.58199597795567E+01
3.69147318259344E-02 1.01239700458947E-01 -1.28425126153755E+01
1.09315814613628E-01 2.02744841122104E-02 7.51595267544533E+00
9.28429983979248E-02 4.27836872194515E-02 2.02554355831334E+01
2.71495957648699E+00 4.91077718525781E+00 -8.53254132115559E-01
2.86538509412732E+00 4.97170584874535E+00 1.18102413232714E+01
3.09009565965701E+00 -4.82067158705315E+00 -1.15312475386682E+00
2.97965222241338E+00 -4.75118372415529E+00 1.17120712079335E+01
5.77396145656038E+00 1.07323885408925E-02 -9.98939448156541E+00
5.72978432984313E+00 -6.68544670080847E-03 3.19837381572586E+00
-5.80054029700240E+00 -3.59016258256638E-02 -1.07926803392243E+00
-5.41163712815481E+00 1.29254553589581E-01 1.16343816906280E+01
-2.62158506045681E+00 5.16922521499392E+00 -1.00544849478146E+01
-2.72015016982663E+00 4.98880486733811E+00 3.02504536215413E+00
-2.79050293194776E+00 -4.88384984087100E+00 -1.01240590641792E+01
-2.61941881955994E+00 -4.73421992471235E+00 3.07027251730700E+00
7.60779786630738E-02 2.19994741542859E-01 -1.87223237882607E+01
6.88549292157438E-02 -1.91082171805773E-02 -5.66940459435534E+00

vel
1.65929233257248E-04 -7.22899336586353E-06 2.26257574401143E-04
-4.41498838683481E-04 2.56696469185694E-05 3.31830998480140E-04
1.73881380446662E-04 -3.59274234563293E-04 2.29273421593672E-04
-7.02798094302199E-05 -2.60387333161425E-05 -2.65924872337890E-04
4.42536785165872E-04 -4.89827163710057E-04 -4.01073432982287E-04
1.95643632158351E-05 1.11653991217122E-03 -9.34094601194410E-04
-7.86319162726620E-05 1.98530438200753E-04 2.04688581300107E-04
3.00397575951297E-04 2.29731480818108E-04 -7.09537178564776E-05
-3.61460082461821E-04 2.48090453139150E-04 -5.49686708040239E-04
-8.68482177851845E-05 -1.08275834028475E-04 2.95658165452176E-05
-1.07412509904415E-04 -8.78012429709354E-05 1.54371802639068E-04
-2.09843740843053E-05 -3.01743488213807E-04 6.64260233761203E-05
-6.12496928754301E-05 1.36432817751052E-04 -9.40878503004121E-05
6.14104908278753E-04 1.68165113713754E-04 -5.12708955417303E-04
2.15443297114865E-04 -4.04241581459637E-04 2.39357619874606E-04
7.81430062976476E-05 1.59513106225212E-04 2.13405809058389E-04
-2.36012088016480E-04 1.38402390396325E-04 2.23418565305995E-04
-9.33414880283709E-05 1.20065355174723E-04 -5.40061134400772E-05
-2.80686746960019E-05 -3.62478762225168E-04 -2.99299750890718E-04
-4.40755174907394E-05 -1.32663871913867E-04 1.18770307640526E-04
-2.53558806288978E-04 4.01259820381485E-04 -2.81728409555392E-04
1.57361848603766E-04 3.41172196643058E-05 1.33022101611947E-04
4.01019710009567E-04 7.22573433787659E-07 -2.69262344059131E-04
2.99733136121017E-04 -6.68678893104353E-04 3.86496510503676E-04
3.95734899013453E-05 -2.80879543778806E-04 -2.77473937236590E-04
-2.93485746700284E-04 -1.00409592926771E-05 1.76552268367262E-04
-4.99995043152451E-04 5.34640365244488E-04 -4.16733250625040E-04
4.62540147703766E-04 2.69462418459427E-04 3.55370145754719E-04
2.11074019336284E-04 -5.09254836134716E-05 -6.07095017577757E-05
5.30486297370657E-05 -3.89516038918225E-04 4.59061055880998E-04
3.85969341185967E-05 -1.33647845375007E-04 1.45606759036967E-04
2.43990312107843E-04 1.62650517626908E-04 -1.51033371102077E-04
2.67081228198250E-04 -3.20863354500405E-04 5.19374366175999E-04
-3.10316561967053E-04 -1.85209360845414E-04 -4.03377994784274E-04
-3.01054183714974E-04 -1.79705768853539E-04 1.86267271696545E-05
2.66589291951177E-04 9.91038990779357E-04 -3.96599362099506E-04
4.08276960903040E-05 1.53432487219794E-04 1.22871735039336E-04
2.35590328814509E-04 2.76550673861414E-04 -1.90931581662904E-04
2.04967171400528E-04 -3.84948347563266E-04 1.08474133114518E-04
-6.97154912755278E-04 -4.71858129707306E-04 -3.50888447092869E-04
-1.24684436826259E-04 1.21627138377997E-04 1.68581444194268E-04
-7.70851773281092E-05 4.63714165425422E-05 1.81085183134115E-04
-3.17589866254135E-04 -5.50009675262097E-04 -4.30671530615520E-04
-1.57911296613577E-04 -4.54426250235638E-05 -4.92425558216806E-06
9.87305191973671E-06 1.80207048070693E-05 -1.79532949110393E-04
6.83671915911343E-05 -9.70105029125616E-05 1.82258963862073E-05
4.57857583174456E-04 -8.32122447198841E-05 -1.67508689699827E-04
1.94599797149855E-04 -2.23628588270493E-04 1.07780974532419E-04
-2.82971791158802E-05 4.68214184227891E-05 -8.36700118689787E-05
-4.24265368843893E-05 -3.21245875729528E-05 9.22433773212692E-05
2.51671799765403E-05 -1.77766618119046E-05 1.72494196852232E-05
6.11315117142581E-06 8.68163744077307E-05 5.67296042042437E-06
-7.70845081775687E-05 3.08561858749785E-05 1.52938314279569E-05
4.20820628111984E-05 3.70042677593407E-05 9.96812582225421E-05
1.40844649691300E-04 -7.97217755672014E-05 -1.31083237666120E-05
-5.81978615724723E-05 1.17490844352786E-04 4.40930378539056E-05
6.36078347459770E-05 6.69615793199388E-05 -8.39129585283473E-05
-6.07614472082574E-05 -4.37626480779298E-05 -4.29461393218481E-06
-6.36394987636312E-05 -3.12613972792605E-05 7.79309141070753E-05
2.39708852289351E-05 -9.39157455192732E-05 3.01944658201935E-05
-1.03662791966090E-05 3.87149926133824E-05 1.05050319203302E-04
1.10894432823969E-04 3.45992275343292E-05 -3.01976038789652E-05
-1.72253916094043E-05 3.96742670252505E-05 -5.25674305834915E-05
-8.48228304979956E-05 5.64908929641756E-05 4.47531393684651E-05
-1.47506234752193E-04 1.22556764449835E-04 -3.01480394930127E-04
-2.95445248698945E-04 -2.42094492146111E-04 2.66277938523650E-04
-7.77631153135254E-05 -1.09252568425864E-04 1.69153096659710E-04
2.60223872399644E-04 -7.70901517625762E-05 1.36621011724429E-04
4.79856202268419E-05 -3.58150632460491E-04 -1.54493852536122E-04
-7.14052845122851E-05 -2.69155466792118E-04 1.79749509700142E-04
1.45112055674323E-05 -1.80926466413756E-04 6.10956577883199E-05
9.60743159647168E-05 -1.00159903624474E-04 7.70024146584693E-05
-7.39105065049050E-05 -2.15075671717976E-04 2.20969494128146E-04
2.04020346009061E-04 1.56655746598131E-04 -1.77043062668635E-04
-4.03670312774535E-04 -5.52855555754880E-05 -1.18348120988931E-04
3.71601522478895E-04 4.88286475929969E-05 1.85192663041991E-04
4.16812192512686E-05 2.23401099965076E-04 1.46463434516355E-04
-3.25279267883869E-04 7.02921173153103E-05 1.05326094998430E-04
-6.05071951258716E-05 4.15488061404616E-04 1.16175008659989E-04
-1.93579262145139E-04 5.50540862463690E-05 4.37697584266472E-04






--
##############################################################
Francois Bottin tel: 01 69 26 41 73
CEA/DIF fax: 01 69 26 70 77
BP 12 Bruyeres-le-Chatel email: Francois.Bottin@cea.fr
##############################################################



