
forum - Re: [abinit-forum] steady increase in memory for band parallel job

forum@abinit.org

Subject: The ABINIT Users Mailing List ( CLOSED )


Re: [abinit-forum] steady increase in memory for band parallel job


  • From: Manuel Cotelo <mcotelo@gmail.com>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] steady increase in memory for band parallel job
  • Date: Tue, 1 Dec 2009 15:49:02 +0100

Hi,

I found the same memory problems with parallelization when I use
option paral_kgb=1.

I compiled ABINIT with different MPI implementations (OpenMPI and
MPICH) and compilers (GNU and Intel), and the problem is still there. I
also compiled ABINIT with the flag --with-mpi-level=1 to use the first
MPI standard, but that did not help either.

I agree with Eric when he said that the problem is in the lobpcgwf /
lobpcgccwf routine. When I use abinip without the option paral_kgb
(only parallelization over k-points), I don't have any memory leakage
problem.
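
To compare the two cases per process rather than per node, the resident
memory of each MPI rank can be sampled from /proc. This is a minimal
sketch in Python, assuming a Linux node; the function names and the
sample text are hypothetical and are not part of ABINIT.

```python
import re


def parse_vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the resident set size (kB) of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kb(handle.read())


# Hypothetical excerpt of a status file, used only to illustrate the parser.
SAMPLE_STATUS = "Name:\tabinip\nVmPeak:\t  300000 kB\nVmRSS:\t  204800 kB\n"
```

Calling read_rss_kb on each rank's PID at regular intervals, once with
paral_kgb=1 and once without, should show whether the growth is tied to
the band/FFT parallel run.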

I attach a picture of the cluster monitor, where the memory leakage is
evident.
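
The same trend can be put into numbers from `free -m` output, which is
what Eric logs in the quoted message below. This is a rough sketch in
Python, assuming the old six-column `free` layout (total, used, free,
shared, buffers, cached); newer versions of `free` print different
columns. The function names and sample figures are hypothetical.

```python
def used_minus_buffers_cache(free_output):
    """Return used - buffers - cached (MB) from the 'Mem:' line of old-style `free -m`."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # Assumed column order: total, used, free, shared, buffers, cached
            total, used, free, shared, buffers, cached = (int(x) for x in fields[1:7])
            return used - buffers - cached
    raise ValueError("no 'Mem:' line found")


def growth_per_sample(samples):
    """Return the difference between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical output, for illustration only (not taken from the runs in this thread).
SAMPLE_FREE = (
    "             total       used       free     shared    buffers     cached\n"
    "Mem:          7983       7500        483          0        120       5200\n"
    "Swap:         2047          0       2047\n"
)
```

With one sample per iteration, a leak shows up as consistently positive
values from growth_per_sample, while a healthy run stays near zero.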

Regards,
Manuel Cotelo

2009/12/1 Eric J. Walter <ejwalt@wm.edu>:
>
> Dear Francois,
>
> I have put a copy of my input, output, log (called OUTFILE) and psps at the
> following link:
>
> http://piezo1.physics.wm.edu/~ewalter/Abinit_memoryleak/
>
> NOTE:  the input file is for a different system than the one I attached
> previously (the atom species are
> different).  Please disregard the previous one.
>
> I have tried compiling with g95...  The serial code works fine, but never
> showed the memory problem
> anyway.  The parallel version exits right after reading the psp files.  I am
> continuing to look into this problem.
>
> I have also tried compiling with PathScale v3.0; the result is the same as
> the Intel Fortran result.
>
> So far, my testing seems to show that the largest problem is in the lobpcgwf
> / lobpcgccwf routine.  In my
> test, I run for 5 iterations and output the "free-(buffers+cache)" result of
>  "free -m" to the output file
> (I have added call system('free -m') to various parts of the 5.8.4 code).
>  The graph is posted at the same
> url as above (called memgraph).  You can clearly see 5 repeating patterns,
> one for each iteration.  In the
> serial case using wfoptalg 4 (green), you can see that the memory usage from
> iteration to iteration doesn't
> increase.  For the parallel case with the lobpcgwf routine (black), the memory
> keeps increasing during each iteration.
> However, when the lobpcgwf routine is commented out (red), the memory
> increases less quickly.
>
> I will continue to try and track down the problem here.
>
> Regards,
>
> Eric
>
>
>
>
> BOTTIN Francois wrote:
>>
>> Hi,
>>
>> Is it possible for you to compile abinit-5.8.4p with g95 in order to track
>> memory leaks?
>> Perhaps your code will stop at a well-defined line, with an array "already
>> allocated"!
>> It would be very nice to get rid of this trouble.
>>
>> If not, could you send your 3 pseudopotentials, output and log (of
>> proc 0), if these files are not too large?
>> Regards,
>> Francois Bottin
>>
>>
>> Eric J. Walter wrote:
>>>
>>> Hi,
>>>
>>> It appears as though my jobs are suffering from the same problem outlined
>>> in this post from July of this year:
>>>
>>>
>>> https://listes-2.sipr.ucl.ac.be/abinit.org/arc/forum/2009-07/msg00034.html
>>>
>>> When I run the attached input file for ~14 hrs, the nodes I am using run
>>> out of memory.
>>> I am using RHEL AS4 on dual core / dual processor Opteron 2200 with 2 GB
>>> memory per core (= 8 GB total).
>>> I am compiling with Intel Fortran 10.1 and OpenMPI-1.2.5.   This job's
>>> output file claims that the calculation
>>> should require ~200 MB.
>>>
>>> I have found that this increase only occurs when using band/fft parallel
>>> (kpt parallel and serial don't
>>> have this steady increase).  Besides version 5.8.4p, I have also tested
>>> versions 5.7.4 and 5.6.4; all three versions
>>> seem to show this behavior.  Version 5.4.4 does not have this problem,
>>> but is also much slower, for me at least.
>>>
>>> I have tried changing versions of openmpi (from 1.2.5 to 1.3.3), this has
>>> no effect.
>>>
>>> Has any progress been made on either finding the leak or another cause to
>>> this problem?
>>> Thanks in advance for any help you can give.
>>>
>>> Eric J. Walter
>>> Department of Physics
>>> College of William and Mary
>>>
>>> ---------------------------------------------------
>>> Here is my input file:
>>> ---------------------------------------------------
>>> ionmov 8
>>> noseinert 5.0d5
>>> mditemp 300
>>> dtion 50
>>> toldfe 1d-4 eV
>>> ntime 20
>>> nsym 1
>>> chkprim 0
>>> kptopt 1
>>> ngkpt 1 1 1
>>> shiftk 0 0 0
>>> occopt 3
>>> tsmear 0.001
>>> ecut 45
>>> nstep 20
>>> natom 80
>>> ntypat 3
>>> znucl   82 22 8
>>> prtden 0
>>> prtwf 0
>>> wfoptalg  4
>>> npfft 1  npband 26
>>> npkpt 1
>>> nloalg 4
>>> fftalg 401
>>> iprcch 4
>>> intxc 0 fft_opt_lob 2
>>> paral_kgb 1
>>>
>>> typat 48*3 16*2 16*1
>>> acell     3*20.599653
>>> angdeg    3*56.39553912
>>>
>>> xcart
>>> -2.22069320900486E+00 -1.67615626136839E+00  1.02845225672214E+01
>>> 2.83216713500271E+00 -9.64895067984759E-01  1.06130211295748E+01
>>> -3.07826006953531E-01  3.14017017198906E+00  1.02884003097039E+01
>>> 2.38571355304415E+00  1.94315354580616E+00  1.47667846684432E+01
>>> -2.74032183791911E+00  1.44987913001174E+00  1.47058588105213E+01
>>> 4.24301353957222E-01 -3.02467161327066E+00  1.47270477618761E+01
>>> 4.37208548403877E-01  3.15701320773295E+00  1.63174232817734E+00
>>> 5.62754717842062E+00  3.65799392921832E+00  1.72482290066637E+00
>>> 2.56805130575382E+00  8.08007715450321E+00  1.75538816925331E+00
>>> 5.29946087386834E+00  6.62719979049667E+00  6.13774796347384E+00
>>> -1.33476176608987E-01  6.02962682764747E+00  5.99150680371859E+00
>>> 3.25740331179827E+00  1.86629101076117E+00  5.92292965668993E+00
>>> 4.73721314332368E-01 -6.73544486518827E+00  1.53749001511875E+00
>>> 5.79909660316150E+00 -5.87516895113753E+00  1.52606696257784E+00
>>> 2.72661190097539E+00 -1.82547980521533E+00  1.72879478480763E+00
>>> 5.29308456737276E+00 -2.93742682967659E+00  5.98120816946857E+00
>>> -1.02277107167427E-01 -3.47901581523419E+00  6.00815649851232E+00
>>> 3.22915540047783E+00 -7.91315859978646E+00  6.15697463378583E+00
>>> 3.20853423992771E+00 -1.70404358692119E+00 -6.97779629913824E+00
>>> 8.39651387970289E+00 -8.70615018894613E-01 -7.04102732844698E+00
>>> 5.39613269691849E+00  3.09966157186435E+00 -6.96223499113757E+00
>>> 8.17325435029689E+00  2.08765713206062E+00 -2.55251338829625E+00
>>> 2.82623828634786E+00  1.23322960474568E+00 -2.78889127839044E+00
>>> 6.11970263677967E+00 -3.17948832421889E+00 -2.34609526217649E+00
>>> -7.91433373058299E+00 -1.81160718343090E+00  1.63075874446493E+00
>>> -2.76694176578899E+00 -1.16579400400263E+00  1.74956240701256E+00
>>> -5.81555263606813E+00  3.20412849389643E+00  1.72359061387011E+00
>>> -2.79939696716810E+00  2.06735348854857E+00  6.15861322508217E+00
>>> -8.31960266338488E+00  1.29355309750287E+00  5.93088060057318E+00
>>> -5.10575950217045E+00 -2.98997425594191E+00  6.13900333992898E+00
>>> -5.14327896212322E+00  3.29299293105111E+00 -7.09301147609970E+00
>>> 9.75650004473785E-02  3.78997253887449E+00 -6.91330351424288E+00
>>> -2.92976440197528E+00  7.95211008443348E+00 -6.82517026645523E+00
>>> -3.06355829923221E-01  6.76833507431424E+00 -2.77362691622797E+00
>>> -5.53331641987666E+00  6.24968986891999E+00 -2.54578391594814E+00
>>> -2.43999241882386E+00  1.80795237233396E+00 -2.67292251413726E+00
>>> -5.14068521198933E+00 -6.68713377702076E+00 -6.86237306597303E+00
>>> -6.12789514851627E-02 -5.98968660555720E+00 -6.97883625732929E+00
>>> -2.99553503150223E+00 -1.83920851864552E+00 -6.83013099810669E+00
>>> -1.80008738706434E-01 -2.89562125276255E+00 -2.65747984992241E+00
>>> -5.56015371335734E+00 -3.55350951569161E+00 -2.53677784336301E+00
>>> -2.23260539192524E+00 -7.88875944438475E+00 -2.59919279877736E+00
>>> -2.31544564461024E+00 -1.60168511875929E+00 -1.57988254333016E+01
>>> 2.98206360781800E+00 -1.09336396682876E+00 -1.56367782886258E+01
>>> -3.02653507337988E-01  3.20195569372773E+00 -1.56155029581190E+01
>>> 2.62978455935385E+00  1.90892236294971E+00 -1.12646465158811E+01
>>> -2.74702844352294E+00  1.34758125055524E+00 -1.11329897640166E+01
>>> 3.92389502215297E-01 -3.14047927186374E+00 -1.13789217412000E+01
>>> 1.43789519936701E-01  8.95683744161872E-02  5.51308679103783E-02
>>> 3.43343139804145E-02  8.44171505400222E-02  1.31069788732053E+01
>>> 2.87456960507807E+00  4.84780682916232E+00 -8.54098683394433E+00
>>> 2.83533350827871E+00  4.91662126969937E+00  4.43850072950254E+00
>>> 2.95141488341089E+00 -4.80542390687403E+00 -8.59714703849626E+00
>>> 2.82501485426290E+00 -4.90110860896433E+00  4.40234973478692E+00
>>> 5.65634082404525E+00 -1.24851891953083E-02 -1.71524703303103E+01
>>> 5.66706037951402E+00  9.18065709228990E-02 -4.30774458733638E+00
>>> -5.41740478793818E+00  1.03696823860055E-01 -8.46603127880005E+00
>>> -5.52981835134213E+00  3.74816566677455E-02  4.36263803023674E+00
>>> -2.73374622355905E+00  4.90669254100288E+00 -1.71056087884796E+01
>>> -2.76475317059456E+00  4.97808833622533E+00 -4.26429469831174E+00
>>> -2.71829980916229E+00 -4.81400423619281E+00 -1.71329000123123E+01
>>> -2.78480871843135E+00 -4.79896324962904E+00 -4.16357601481035E+00
>>> 1.27116237724845E-01  5.57355728005296E-02 -2.58199597795567E+01
>>> 3.69147318259344E-02  1.01239700458947E-01 -1.28425126153755E+01
>>> 1.09315814613628E-01  2.02744841122104E-02  7.51595267544533E+00
>>> 9.28429983979248E-02  4.27836872194515E-02  2.02554355831334E+01
>>> 2.71495957648699E+00  4.91077718525781E+00 -8.53254132115559E-01
>>> 2.86538509412732E+00  4.97170584874535E+00  1.18102413232714E+01
>>> 3.09009565965701E+00 -4.82067158705315E+00 -1.15312475386682E+00
>>> 2.97965222241338E+00 -4.75118372415529E+00  1.17120712079335E+01
>>> 5.77396145656038E+00  1.07323885408925E-02 -9.98939448156541E+00
>>> 5.72978432984313E+00 -6.68544670080847E-03  3.19837381572586E+00
>>> -5.80054029700240E+00 -3.59016258256638E-02 -1.07926803392243E+00
>>> -5.41163712815481E+00  1.29254553589581E-01  1.16343816906280E+01
>>> -2.62158506045681E+00  5.16922521499392E+00 -1.00544849478146E+01
>>> -2.72015016982663E+00  4.98880486733811E+00  3.02504536215413E+00
>>> -2.79050293194776E+00 -4.88384984087100E+00 -1.01240590641792E+01
>>> -2.61941881955994E+00 -4.73421992471235E+00  3.07027251730700E+00
>>> 7.60779786630738E-02  2.19994741542859E-01 -1.87223237882607E+01
>>> 6.88549292157438E-02 -1.91082171805773E-02 -5.66940459435534E+00
>>>
>>> vel
>>> 1.65929233257248E-04 -7.22899336586353E-06  2.26257574401143E-04
>>> -4.41498838683481E-04  2.56696469185694E-05  3.31830998480140E-04
>>> 1.73881380446662E-04 -3.59274234563293E-04  2.29273421593672E-04
>>> -7.02798094302199E-05 -2.60387333161425E-05 -2.65924872337890E-04
>>> 4.42536785165872E-04 -4.89827163710057E-04 -4.01073432982287E-04
>>> 1.95643632158351E-05  1.11653991217122E-03 -9.34094601194410E-04
>>> -7.86319162726620E-05  1.98530438200753E-04  2.04688581300107E-04
>>> 3.00397575951297E-04  2.29731480818108E-04 -7.09537178564776E-05
>>> -3.61460082461821E-04  2.48090453139150E-04 -5.49686708040239E-04
>>> -8.68482177851845E-05 -1.08275834028475E-04  2.95658165452176E-05
>>> -1.07412509904415E-04 -8.78012429709354E-05  1.54371802639068E-04
>>> -2.09843740843053E-05 -3.01743488213807E-04  6.64260233761203E-05
>>> -6.12496928754301E-05  1.36432817751052E-04 -9.40878503004121E-05
>>> 6.14104908278753E-04  1.68165113713754E-04 -5.12708955417303E-04
>>> 2.15443297114865E-04 -4.04241581459637E-04  2.39357619874606E-04
>>> 7.81430062976476E-05  1.59513106225212E-04  2.13405809058389E-04
>>> -2.36012088016480E-04  1.38402390396325E-04  2.23418565305995E-04
>>> -9.33414880283709E-05  1.20065355174723E-04 -5.40061134400772E-05
>>> -2.80686746960019E-05 -3.62478762225168E-04 -2.99299750890718E-04
>>> -4.40755174907394E-05 -1.32663871913867E-04  1.18770307640526E-04
>>> -2.53558806288978E-04  4.01259820381485E-04 -2.81728409555392E-04
>>> 1.57361848603766E-04  3.41172196643058E-05  1.33022101611947E-04
>>> 4.01019710009567E-04  7.22573433787659E-07 -2.69262344059131E-04
>>> 2.99733136121017E-04 -6.68678893104353E-04  3.86496510503676E-04
>>> 3.95734899013453E-05 -2.80879543778806E-04 -2.77473937236590E-04
>>> -2.93485746700284E-04 -1.00409592926771E-05  1.76552268367262E-04
>>> -4.99995043152451E-04  5.34640365244488E-04 -4.16733250625040E-04
>>> 4.62540147703766E-04  2.69462418459427E-04  3.55370145754719E-04
>>> 2.11074019336284E-04 -5.09254836134716E-05 -6.07095017577757E-05
>>> 5.30486297370657E-05 -3.89516038918225E-04  4.59061055880998E-04
>>> 3.85969341185967E-05 -1.33647845375007E-04  1.45606759036967E-04
>>> 2.43990312107843E-04  1.62650517626908E-04 -1.51033371102077E-04
>>> 2.67081228198250E-04 -3.20863354500405E-04  5.19374366175999E-04
>>> -3.10316561967053E-04 -1.85209360845414E-04 -4.03377994784274E-04
>>> -3.01054183714974E-04 -1.79705768853539E-04  1.86267271696545E-05
>>> 2.66589291951177E-04  9.91038990779357E-04 -3.96599362099506E-04
>>> 4.08276960903040E-05  1.53432487219794E-04  1.22871735039336E-04
>>> 2.35590328814509E-04  2.76550673861414E-04 -1.90931581662904E-04
>>> 2.04967171400528E-04 -3.84948347563266E-04  1.08474133114518E-04
>>> -6.97154912755278E-04 -4.71858129707306E-04 -3.50888447092869E-04
>>> -1.24684436826259E-04  1.21627138377997E-04  1.68581444194268E-04
>>> -7.70851773281092E-05  4.63714165425422E-05  1.81085183134115E-04
>>> -3.17589866254135E-04 -5.50009675262097E-04 -4.30671530615520E-04
>>> -1.57911296613577E-04 -4.54426250235638E-05 -4.92425558216806E-06
>>> 9.87305191973671E-06  1.80207048070693E-05 -1.79532949110393E-04
>>> 6.83671915911343E-05 -9.70105029125616E-05  1.82258963862073E-05
>>> 4.57857583174456E-04 -8.32122447198841E-05 -1.67508689699827E-04
>>> 1.94599797149855E-04 -2.23628588270493E-04  1.07780974532419E-04
>>> -2.82971791158802E-05  4.68214184227891E-05 -8.36700118689787E-05
>>> -4.24265368843893E-05 -3.21245875729528E-05  9.22433773212692E-05
>>> 2.51671799765403E-05 -1.77766618119046E-05  1.72494196852232E-05
>>> 6.11315117142581E-06  8.68163744077307E-05  5.67296042042437E-06
>>> -7.70845081775687E-05  3.08561858749785E-05  1.52938314279569E-05
>>> 4.20820628111984E-05  3.70042677593407E-05  9.96812582225421E-05
>>> 1.40844649691300E-04 -7.97217755672014E-05 -1.31083237666120E-05
>>> -5.81978615724723E-05  1.17490844352786E-04  4.40930378539056E-05
>>> 6.36078347459770E-05  6.69615793199388E-05 -8.39129585283473E-05
>>> -6.07614472082574E-05 -4.37626480779298E-05 -4.29461393218481E-06
>>> -6.36394987636312E-05 -3.12613972792605E-05  7.79309141070753E-05
>>> 2.39708852289351E-05 -9.39157455192732E-05  3.01944658201935E-05
>>> -1.03662791966090E-05  3.87149926133824E-05  1.05050319203302E-04
>>> 1.10894432823969E-04  3.45992275343292E-05 -3.01976038789652E-05
>>> -1.72253916094043E-05  3.96742670252505E-05 -5.25674305834915E-05
>>> -8.48228304979956E-05  5.64908929641756E-05  4.47531393684651E-05
>>> -1.47506234752193E-04  1.22556764449835E-04 -3.01480394930127E-04
>>> -2.95445248698945E-04 -2.42094492146111E-04  2.66277938523650E-04
>>> -7.77631153135254E-05 -1.09252568425864E-04  1.69153096659710E-04
>>> 2.60223872399644E-04 -7.70901517625762E-05  1.36621011724429E-04
>>> 4.79856202268419E-05 -3.58150632460491E-04 -1.54493852536122E-04
>>> -7.14052845122851E-05 -2.69155466792118E-04  1.79749509700142E-04
>>> 1.45112055674323E-05 -1.80926466413756E-04  6.10956577883199E-05
>>> 9.60743159647168E-05 -1.00159903624474E-04  7.70024146584693E-05
>>> -7.39105065049050E-05 -2.15075671717976E-04  2.20969494128146E-04
>>> 2.04020346009061E-04  1.56655746598131E-04 -1.77043062668635E-04
>>> -4.03670312774535E-04 -5.52855555754880E-05 -1.18348120988931E-04
>>> 3.71601522478895E-04  4.88286475929969E-05  1.85192663041991E-04
>>> 4.16812192512686E-05  2.23401099965076E-04  1.46463434516355E-04
>>> -3.25279267883869E-04  7.02921173153103E-05  1.05326094998430E-04
>>> -6.05071951258716E-05  4.15488061404616E-04  1.16175008659989E-04
>>> -1.93579262145139E-04  5.50540862463690E-05  4.37697584266472E-04
>>
>>
>
>
>

Attachment: graph.png
Description: PNG image



