forum - Re: [abinit-forum] steady increase in memory for band parallel job

forum@abinit.org

Subject: The ABINIT Users Mailing List ( CLOSED )

  • From: Winfried Lorenzen <knallio@gmx.de>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] steady increase in memory for band parallel job
  • Date: Thu, 3 Dec 2009 20:04:15 +0100

Hi,

I have tested it now and it works. The memory does not increase any more with
istwfk=1.

Are there any drawbacks from setting istwfk = 1?

Regards,
Winfried


On Thursday, 3 December 2009 at 18:01:08, BOTTIN Francois wrote:
> Dear Eric and Manuel,
>
> Thank you very much for your reports.
> As we can see from your graph, Eric (which is very helpful; we can see
> nline, nnsclo, nstep...),
> the problem appears once inside and once outside LOBPCG per
> electronic step.
> I suspect the trouble comes from 66_wfs/prep_kg_sym_do.F90, which is called
> i) once per step through prep_getghc.F90 inside LOBPCG, and
> ii) once per step through prep_fourwf.F90 outside LOBPCG.
> Within this routine, there are some pointers with the "save" attribute
> but an undefined status (they are not nullified). In this case, I think
> the "associated" intrinsic can't be used to check and deallocate the
> pointers before reallocation
> (see for example the section "danger with pointer" at
> http://www.cs.rpi.edu/~szymansk/OOF90/bugs.html).
> It seems that these pointers are never deallocated.
>
> One piece of good news: this routine is only called when istwfk=2 (real
> wavefunctions).
> So could you please check whether your memory still increases
> with istwfk=1?
>
> Also, let me know if you succeed in compiling and running in parallel with
> g95.
> Thanks again,
> Regards
> Francois
>
> Eric J. Walter wrote:
> > Dear Francois,
> >
> > I have put a copy of my input, output, log (called OUTFILE) and psps
> > at the following link:
> >
> > http://piezo1.physics.wm.edu/~ewalter/Abinit_memoryleak/
> >
> > NOTE: the input file is for a different system than the one I
> > attached previously (the atom species are
> > different). Please disregard the previous one.
> >
> > I have tried compiling with g95... The serial code works fine, but
> > never showed the memory problem
> > anyway. The parallel version exits right after reading the psp
> > files. I am continuing to look into this problem.
> >
> > I have also tried compiling with pathscale v3.0; the result is the
> > same as the Intel Fortran result.
> >
> > So far, my testing seems to show that the largest problem is in the
> > lobpcgwf / lobpcgccwf routine. In my test, I run for 5 iterations and
> > output the "free-(buffers+cache)" result of "free -m" to the output
> > file (I have added call system('free -m') to various parts of the
> > 5.8.4 code). The graph is posted at the same URL as above (called
> > memgraph). You can clearly see 5 repeating patterns, one for each
> > iteration. In the serial case using wfoptalg 4 (green), the memory
> > usage does not increase from iteration to iteration. For the parallel
> > run with the lobpcgwf routine (black), the memory keeps increasing
> > during each iteration. However, when the lobpcgwf routine is commented
> > out (red), the memory increases less quickly.
> >
> > I will continue to try and track down the problem here.
> >
> > Regards,
> >
> > Eric
> >
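The measurement Eric describes (logging "free -m" at several points and comparing one iteration with the next) could be post-processed along the following lines. This is only a sketch in Python, not anything from the ABINIT sources. It assumes the older "free" layout with a "Mem:" line whose columns are total, used, free, shared, buffers, cached, and it reads "free-(buffers+cache)" as used memory net of buffers and cache. The function names and all sample numbers are made up for illustration.

```python
def used_minus_cache_mb(free_output):
    """Return used - (buffers + cached), in MB, from old-style `free -m` text."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            # Assumed column order: total used free shared buffers cached
            total, used, free, shared, buffers, cached = map(int, line.split()[1:7])
            return used - (buffers + cached)
    raise ValueError("no 'Mem:' line found in free output")


def baseline_growth(samples_per_iteration):
    """Change in the lowest reading from each iteration to the next.

    Values that stay positive from iteration to iteration suggest memory
    that is never given back, as opposed to a pattern that simply repeats.
    """
    lows = [min(samples) for samples in samples_per_iteration]
    return [later - earlier for earlier, later in zip(lows, lows[1:])]


# Invented example of old-style `free -m` output (not taken from the thread).
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7800        183          0        120       5200
-/+ buffers/cache:       2480       5503
Swap:         2047          0       2047
"""

if __name__ == "__main__":
    print(used_minus_cache_mb(SAMPLE))
    # Three invented iterations, each with three readings in MB.
    print(baseline_growth([[900, 1200, 950], [1000, 1300, 1050], [1100, 1400, 1150]]))
```

Comparing the per-iteration minimum rather than the peak keeps temporary work arrays, which are allocated and released within an iteration, from being mistaken for a leak.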
> > BOTTIN Francois wrote:
> >> Hi,
> >>
> >> Is it possible for you to compile abinit-5.8.4p with g95 in order to
> >> track memory leaks?
> >> Perhaps your code will stop at a well-defined line, with an array
> >> reported as "already allocated"!
> >> It would be very nice to get rid of this trouble.
> >>
> >> If not, could you send your 3 pseudopotentials, output and log (of
> >> proc 0), if these files are not too large?
> >> Regards,
> >> Francois Bottin
> >>
> >> Eric J. Walter wrote:
> >>> Hi,
> >>>
> >>> It appears as though my jobs are suffering from the same problem
> >>> outlined in this post from July of this year:
> >>>
> >>> https://listes-2.sipr.ucl.ac.be/abinit.org/arc/forum/2009-07/msg00034.html
> >>>
> >>>
> >>> When I run the attached input file for ~14 hrs, the nodes I am using
> >>> run out of memory.
> >>> I am using RHEL AS4 on dual core / dual processor Opteron 2200 with
> >>> 2 GB memory per core (= 8 GB total).
> >>> I am compiling with Intel Fortran 10.1 and OpenMPI-1.2.5. This
> >>> job's output file claims that the calculation
> >>> should require ~200 MB.
> >>>
> >>> I have found that this increase only occurs when using band/fft
> >>> parallelization (kpt-parallel and serial runs don't
> >>> have this steady increase). Besides version 5.8.4p, I have also
> >>> tested versions 5.7.4 and 5.6.4; all three versions
> >>> seem to show this behavior. Version 5.4.4 does not have this
> >>> problem, but is also much slower, for me at least.
> >>>
> >>> I have tried changing versions of openmpi (from 1.2.5 to 1.3.3);
> >>> this has no effect.
> >>>
> >>> Has any progress been made on either finding the leak or another
> >>> cause to this problem?
> >>> Thanks in advance for any help you can give.
> >>>
> >>> Eric J. Walter
> >>> Department of Physics
> >>> College of William and Mary
> >>>
> >>> ---------------------------------------------------
> >>> Here is my input file:
> >>> ---------------------------------------------------
> >>> ionmov 8
> >>> noseinert 5.0d5
> >>> mditemp 300
> >>> dtion 50
> >>> toldfe 1d-4 eV
> >>> ntime 20
> >>> nsym 1
> >>> chkprim 0
> >>> kptopt 1
> >>> ngkpt 1 1 1
> >>> shiftk 0 0 0
> >>> occopt 3
> >>> tsmear 0.001
> >>> ecut 45
> >>> nstep 20
> >>> natom 80
> >>> ntypat 3
> >>> znucl 82 22 8
> >>> prtden 0
> >>> prtwf 0
> >>> wfoptalg 4
> >>> npfft 1
> >>> npband 26
> >>> npkpt 1
> >>> nloalg 4
> >>> fftalg 401
> >>> iprcch 4
> >>> intxc 0
> >>> fft_opt_lob 2
> >>> paral_kgb 1
> >>>
> >>> typat 48*3 16*2 16*1
> >>> acell 3*20.599653
> >>> angdeg 3*56.39553912
> >>>
> >>> xcart
> >>> -2.22069320900486E+00 -1.67615626136839E+00 1.02845225672214E+01
> >>> 2.83216713500271E+00 -9.64895067984759E-01 1.06130211295748E+01
> >>> -3.07826006953531E-01 3.14017017198906E+00 1.02884003097039E+01
> >>> 2.38571355304415E+00 1.94315354580616E+00 1.47667846684432E+01
> >>> -2.74032183791911E+00 1.44987913001174E+00 1.47058588105213E+01
> >>> 4.24301353957222E-01 -3.02467161327066E+00 1.47270477618761E+01
> >>> 4.37208548403877E-01 3.15701320773295E+00 1.63174232817734E+00
> >>> 5.62754717842062E+00 3.65799392921832E+00 1.72482290066637E+00
> >>> 2.56805130575382E+00 8.08007715450321E+00 1.75538816925331E+00
> >>> 5.29946087386834E+00 6.62719979049667E+00 6.13774796347384E+00
> >>> -1.33476176608987E-01 6.02962682764747E+00 5.99150680371859E+00
> >>> 3.25740331179827E+00 1.86629101076117E+00 5.92292965668993E+00
> >>> 4.73721314332368E-01 -6.73544486518827E+00 1.53749001511875E+00
> >>> 5.79909660316150E+00 -5.87516895113753E+00 1.52606696257784E+00
> >>> 2.72661190097539E+00 -1.82547980521533E+00 1.72879478480763E+00
> >>> 5.29308456737276E+00 -2.93742682967659E+00 5.98120816946857E+00
> >>> -1.02277107167427E-01 -3.47901581523419E+00 6.00815649851232E+00
> >>> 3.22915540047783E+00 -7.91315859978646E+00 6.15697463378583E+00
> >>> 3.20853423992771E+00 -1.70404358692119E+00 -6.97779629913824E+00
> >>> 8.39651387970289E+00 -8.70615018894613E-01 -7.04102732844698E+00
> >>> 5.39613269691849E+00 3.09966157186435E+00 -6.96223499113757E+00
> >>> 8.17325435029689E+00 2.08765713206062E+00 -2.55251338829625E+00
> >>> 2.82623828634786E+00 1.23322960474568E+00 -2.78889127839044E+00
> >>> 6.11970263677967E+00 -3.17948832421889E+00 -2.34609526217649E+00
> >>> -7.91433373058299E+00 -1.81160718343090E+00 1.63075874446493E+00
> >>> -2.76694176578899E+00 -1.16579400400263E+00 1.74956240701256E+00
> >>> -5.81555263606813E+00 3.20412849389643E+00 1.72359061387011E+00
> >>> -2.79939696716810E+00 2.06735348854857E+00 6.15861322508217E+00
> >>> -8.31960266338488E+00 1.29355309750287E+00 5.93088060057318E+00
> >>> -5.10575950217045E+00 -2.98997425594191E+00 6.13900333992898E+00
> >>> -5.14327896212322E+00 3.29299293105111E+00 -7.09301147609970E+00
> >>> 9.75650004473785E-02 3.78997253887449E+00 -6.91330351424288E+00
> >>> -2.92976440197528E+00 7.95211008443348E+00 -6.82517026645523E+00
> >>> -3.06355829923221E-01 6.76833507431424E+00 -2.77362691622797E+00
> >>> -5.53331641987666E+00 6.24968986891999E+00 -2.54578391594814E+00
> >>> -2.43999241882386E+00 1.80795237233396E+00 -2.67292251413726E+00
> >>> -5.14068521198933E+00 -6.68713377702076E+00 -6.86237306597303E+00
> >>> -6.12789514851627E-02 -5.98968660555720E+00 -6.97883625732929E+00
> >>> -2.99553503150223E+00 -1.83920851864552E+00 -6.83013099810669E+00
> >>> -1.80008738706434E-01 -2.89562125276255E+00 -2.65747984992241E+00
> >>> -5.56015371335734E+00 -3.55350951569161E+00 -2.53677784336301E+00
> >>> -2.23260539192524E+00 -7.88875944438475E+00 -2.59919279877736E+00
> >>> -2.31544564461024E+00 -1.60168511875929E+00 -1.57988254333016E+01
> >>> 2.98206360781800E+00 -1.09336396682876E+00 -1.56367782886258E+01
> >>> -3.02653507337988E-01 3.20195569372773E+00 -1.56155029581190E+01
> >>> 2.62978455935385E+00 1.90892236294971E+00 -1.12646465158811E+01
> >>> -2.74702844352294E+00 1.34758125055524E+00 -1.11329897640166E+01
> >>> 3.92389502215297E-01 -3.14047927186374E+00 -1.13789217412000E+01
> >>> 1.43789519936701E-01 8.95683744161872E-02 5.51308679103783E-02
> >>> 3.43343139804145E-02 8.44171505400222E-02 1.31069788732053E+01
> >>> 2.87456960507807E+00 4.84780682916232E+00 -8.54098683394433E+00
> >>> 2.83533350827871E+00 4.91662126969937E+00 4.43850072950254E+00
> >>> 2.95141488341089E+00 -4.80542390687403E+00 -8.59714703849626E+00
> >>> 2.82501485426290E+00 -4.90110860896433E+00 4.40234973478692E+00
> >>> 5.65634082404525E+00 -1.24851891953083E-02 -1.71524703303103E+01
> >>> 5.66706037951402E+00 9.18065709228990E-02 -4.30774458733638E+00
> >>> -5.41740478793818E+00 1.03696823860055E-01 -8.46603127880005E+00
> >>> -5.52981835134213E+00 3.74816566677455E-02 4.36263803023674E+00
> >>> -2.73374622355905E+00 4.90669254100288E+00 -1.71056087884796E+01
> >>> -2.76475317059456E+00 4.97808833622533E+00 -4.26429469831174E+00
> >>> -2.71829980916229E+00 -4.81400423619281E+00 -1.71329000123123E+01
> >>> -2.78480871843135E+00 -4.79896324962904E+00 -4.16357601481035E+00
> >>> 1.27116237724845E-01 5.57355728005296E-02 -2.58199597795567E+01
> >>> 3.69147318259344E-02 1.01239700458947E-01 -1.28425126153755E+01
> >>> 1.09315814613628E-01 2.02744841122104E-02 7.51595267544533E+00
> >>> 9.28429983979248E-02 4.27836872194515E-02 2.02554355831334E+01
> >>> 2.71495957648699E+00 4.91077718525781E+00 -8.53254132115559E-01
> >>> 2.86538509412732E+00 4.97170584874535E+00 1.18102413232714E+01
> >>> 3.09009565965701E+00 -4.82067158705315E+00 -1.15312475386682E+00
> >>> 2.97965222241338E+00 -4.75118372415529E+00 1.17120712079335E+01
> >>> 5.77396145656038E+00 1.07323885408925E-02 -9.98939448156541E+00
> >>> 5.72978432984313E+00 -6.68544670080847E-03 3.19837381572586E+00
> >>> -5.80054029700240E+00 -3.59016258256638E-02 -1.07926803392243E+00
> >>> -5.41163712815481E+00 1.29254553589581E-01 1.16343816906280E+01
> >>> -2.62158506045681E+00 5.16922521499392E+00 -1.00544849478146E+01
> >>> -2.72015016982663E+00 4.98880486733811E+00 3.02504536215413E+00
> >>> -2.79050293194776E+00 -4.88384984087100E+00 -1.01240590641792E+01
> >>> -2.61941881955994E+00 -4.73421992471235E+00 3.07027251730700E+00
> >>> 7.60779786630738E-02 2.19994741542859E-01 -1.87223237882607E+01
> >>> 6.88549292157438E-02 -1.91082171805773E-02 -5.66940459435534E+00
> >>>
> >>> vel
> >>> 1.65929233257248E-04 -7.22899336586353E-06 2.26257574401143E-04
> >>> -4.41498838683481E-04 2.56696469185694E-05 3.31830998480140E-04
> >>> 1.73881380446662E-04 -3.59274234563293E-04 2.29273421593672E-04
> >>> -7.02798094302199E-05 -2.60387333161425E-05 -2.65924872337890E-04
> >>> 4.42536785165872E-04 -4.89827163710057E-04 -4.01073432982287E-04
> >>> 1.95643632158351E-05 1.11653991217122E-03 -9.34094601194410E-04
> >>> -7.86319162726620E-05 1.98530438200753E-04 2.04688581300107E-04
> >>> 3.00397575951297E-04 2.29731480818108E-04 -7.09537178564776E-05
> >>> -3.61460082461821E-04 2.48090453139150E-04 -5.49686708040239E-04
> >>> -8.68482177851845E-05 -1.08275834028475E-04 2.95658165452176E-05
> >>> -1.07412509904415E-04 -8.78012429709354E-05 1.54371802639068E-04
> >>> -2.09843740843053E-05 -3.01743488213807E-04 6.64260233761203E-05
> >>> -6.12496928754301E-05 1.36432817751052E-04 -9.40878503004121E-05
> >>> 6.14104908278753E-04 1.68165113713754E-04 -5.12708955417303E-04
> >>> 2.15443297114865E-04 -4.04241581459637E-04 2.39357619874606E-04
> >>> 7.81430062976476E-05 1.59513106225212E-04 2.13405809058389E-04
> >>> -2.36012088016480E-04 1.38402390396325E-04 2.23418565305995E-04
> >>> -9.33414880283709E-05 1.20065355174723E-04 -5.40061134400772E-05
> >>> -2.80686746960019E-05 -3.62478762225168E-04 -2.99299750890718E-04
> >>> -4.40755174907394E-05 -1.32663871913867E-04 1.18770307640526E-04
> >>> -2.53558806288978E-04 4.01259820381485E-04 -2.81728409555392E-04
> >>> 1.57361848603766E-04 3.41172196643058E-05 1.33022101611947E-04
> >>> 4.01019710009567E-04 7.22573433787659E-07 -2.69262344059131E-04
> >>> 2.99733136121017E-04 -6.68678893104353E-04 3.86496510503676E-04
> >>> 3.95734899013453E-05 -2.80879543778806E-04 -2.77473937236590E-04
> >>> -2.93485746700284E-04 -1.00409592926771E-05 1.76552268367262E-04
> >>> -4.99995043152451E-04 5.34640365244488E-04 -4.16733250625040E-04
> >>> 4.62540147703766E-04 2.69462418459427E-04 3.55370145754719E-04
> >>> 2.11074019336284E-04 -5.09254836134716E-05 -6.07095017577757E-05
> >>> 5.30486297370657E-05 -3.89516038918225E-04 4.59061055880998E-04
> >>> 3.85969341185967E-05 -1.33647845375007E-04 1.45606759036967E-04
> >>> 2.43990312107843E-04 1.62650517626908E-04 -1.51033371102077E-04
> >>> 2.67081228198250E-04 -3.20863354500405E-04 5.19374366175999E-04
> >>> -3.10316561967053E-04 -1.85209360845414E-04 -4.03377994784274E-04
> >>> -3.01054183714974E-04 -1.79705768853539E-04 1.86267271696545E-05
> >>> 2.66589291951177E-04 9.91038990779357E-04 -3.96599362099506E-04
> >>> 4.08276960903040E-05 1.53432487219794E-04 1.22871735039336E-04
> >>> 2.35590328814509E-04 2.76550673861414E-04 -1.90931581662904E-04
> >>> 2.04967171400528E-04 -3.84948347563266E-04 1.08474133114518E-04
> >>> -6.97154912755278E-04 -4.71858129707306E-04 -3.50888447092869E-04
> >>> -1.24684436826259E-04 1.21627138377997E-04 1.68581444194268E-04
> >>> -7.70851773281092E-05 4.63714165425422E-05 1.81085183134115E-04
> >>> -3.17589866254135E-04 -5.50009675262097E-04 -4.30671530615520E-04
> >>> -1.57911296613577E-04 -4.54426250235638E-05 -4.92425558216806E-06
> >>> 9.87305191973671E-06 1.80207048070693E-05 -1.79532949110393E-04
> >>> 6.83671915911343E-05 -9.70105029125616E-05 1.82258963862073E-05
> >>> 4.57857583174456E-04 -8.32122447198841E-05 -1.67508689699827E-04
> >>> 1.94599797149855E-04 -2.23628588270493E-04 1.07780974532419E-04
> >>> -2.82971791158802E-05 4.68214184227891E-05 -8.36700118689787E-05
> >>> -4.24265368843893E-05 -3.21245875729528E-05 9.22433773212692E-05
> >>> 2.51671799765403E-05 -1.77766618119046E-05 1.72494196852232E-05
> >>> 6.11315117142581E-06 8.68163744077307E-05 5.67296042042437E-06
> >>> -7.70845081775687E-05 3.08561858749785E-05 1.52938314279569E-05
> >>> 4.20820628111984E-05 3.70042677593407E-05 9.96812582225421E-05
> >>> 1.40844649691300E-04 -7.97217755672014E-05 -1.31083237666120E-05
> >>> -5.81978615724723E-05 1.17490844352786E-04 4.40930378539056E-05
> >>> 6.36078347459770E-05 6.69615793199388E-05 -8.39129585283473E-05
> >>> -6.07614472082574E-05 -4.37626480779298E-05 -4.29461393218481E-06
> >>> -6.36394987636312E-05 -3.12613972792605E-05 7.79309141070753E-05
> >>> 2.39708852289351E-05 -9.39157455192732E-05 3.01944658201935E-05
> >>> -1.03662791966090E-05 3.87149926133824E-05 1.05050319203302E-04
> >>> 1.10894432823969E-04 3.45992275343292E-05 -3.01976038789652E-05
> >>> -1.72253916094043E-05 3.96742670252505E-05 -5.25674305834915E-05
> >>> -8.48228304979956E-05 5.64908929641756E-05 4.47531393684651E-05
> >>> -1.47506234752193E-04 1.22556764449835E-04 -3.01480394930127E-04
> >>> -2.95445248698945E-04 -2.42094492146111E-04 2.66277938523650E-04
> >>> -7.77631153135254E-05 -1.09252568425864E-04 1.69153096659710E-04
> >>> 2.60223872399644E-04 -7.70901517625762E-05 1.36621011724429E-04
> >>> 4.79856202268419E-05 -3.58150632460491E-04 -1.54493852536122E-04
> >>> -7.14052845122851E-05 -2.69155466792118E-04 1.79749509700142E-04
> >>> 1.45112055674323E-05 -1.80926466413756E-04 6.10956577883199E-05
> >>> 9.60743159647168E-05 -1.00159903624474E-04 7.70024146584693E-05
> >>> -7.39105065049050E-05 -2.15075671717976E-04 2.20969494128146E-04
> >>> 2.04020346009061E-04 1.56655746598131E-04 -1.77043062668635E-04
> >>> -4.03670312774535E-04 -5.52855555754880E-05 -1.18348120988931E-04
> >>> 3.71601522478895E-04 4.88286475929969E-05 1.85192663041991E-04
> >>> 4.16812192512686E-05 2.23401099965076E-04 1.46463434516355E-04
> >>> -3.25279267883869E-04 7.02921173153103E-05 1.05326094998430E-04
> >>> -6.05071951258716E-05 4.15488061404616E-04 1.16175008659989E-04
> >>> -1.93579262145139E-04 5.50540862463690E-05 4.37697584266472E-04
>
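For anyone trying to reproduce the run, the parallelisation variables in the input file above (npkpt, npband, npfft with paral_kgb 1) determine how many MPI processes the job expects. My understanding is that with band/FFT parallelisation the product of these values should match the number of processes, but that is an assumption here and should be checked against the documentation for the ABINIT version in use. The Python sketch below only reads the three values from a whitespace-separated fragment and multiplies them. The function name and the FRAGMENT constant are illustrative.

```python
def process_grid_size(input_text, names=("npkpt", "npband", "npfft")):
    """Read the integer following each variable name and return (sizes, product)."""
    tokens = input_text.split()
    sizes = {}
    for name in names:
        # The value is assumed to be the token right after the variable name.
        sizes[name] = int(tokens[tokens.index(name) + 1])
    product = 1
    for value in sizes.values():
        product *= value
    return sizes, product


# Parallelisation lines copied from the input file quoted in this thread.
FRAGMENT = """
npfft 1
npband 26
npkpt 1
paral_kgb 1
"""

if __name__ == "__main__":
    sizes, nproc = process_grid_size(FRAGMENT)
    print(sizes, nproc)
```

Because the text is split on whitespace, the same function works whether the variables sit one per line or several per line.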


