forum@abinit.org
Subject: The ABINIT Users Mailing List ( CLOSED )
- From: Manuel Cotelo <mcotelo@gmail.com>
- To: forum@abinit.org
- Subject: Re: [abinit-forum] steady increase in memory for band parallel job
- Date: Tue, 1 Dec 2009 15:49:02 +0100
Hi,
I found the same memory problems with parallelization when I use
option paral_kgb=1.
I compiled ABINIT with different MPI implementations (OpenMPI and
MPICH) and compilers (gnu and intel) and the problem is still there. I
also compiled ABINIT with the flag --with-mpi-level=1 to use the first
MPI standard, but that made no difference.
I agree with Eric that the problem is in the lobpcgwf /
lobpcgccwf routine. When I use ABINIT without the option paral_kgb
(only parallelization over k-points), I don't have any memory leak
problem.
I have attached a picture of the cluster monitor, where the memory leak
is evident.
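For anyone who wants to put a number on the growth instead of reading it
off a monitor plot, here is a minimal sketch in Python. It is not part of
ABINIT; the function names and the 1 MB-per-sample threshold are
illustrative assumptions, and the samples are whatever memory readings
(in MB) you collect, for example one per SCF or MD step. It fits a
least-squares line to the samples and flags a clearly positive slope.

```python
from typing import Sequence


def memory_slope(samples_mb: Sequence[float]) -> float:
    """Least-squares slope (MB per sample) of memory usage vs. sample index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples_mb) / n
    # Covariance of (index, memory) and variance of the index.
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def looks_like_leak(samples_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """True if memory grows faster than min_slope_mb per sample (assumed threshold)."""
    return memory_slope(samples_mb) > min_slope_mb
```

A flat or noisy series gives a slope near zero, while a run that gains a
fixed amount per step gives a slope close to that amount.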
Regards,
Manuel Cotelo
2009/12/1 Eric J. Walter <ejwalt@wm.edu>:
>
> Dear Francois,
>
> I have put a copy of my input, output, log (called OUTFILE) and psps at the
> following link:
>
> http://piezo1.physics.wm.edu/~ewalter/Abinit_memoryleak/
>
> NOTE: the input file is for a different system than the one I attached
> previously (the atom species are
> different). Please disregard the previous one.
>
> I have tried compiling with g95... The serial code works fine, but never
> showed the memory problem
> anyway. The parallel version exits right after reading the psp files. I am
> continuing to look into this problem.
>
> I have also tried compiling with pathscale v3.0; the result is the same as
> the Intel fortran result.
>
> So far, my testing seems to show that the largest problem is in the lobpcgwf
> / lobpcgccwf routine. In my
> test, I run for 5 iterations and output the "free-(buffers+cache)" result of
> "free -m" to the output file
> (I have added call system('free -m') to various parts of the 5.8.4 code).
> The graph is posted at the same
> url as above (called memgraph). You can clearly see 5 repeating patterns,
> one for each iteration. In the
> serial case using wfoptalg4 (green), you can see that the memory usage from
> iteration to iteration doesn't
> increase. For the parallel with the lobpcgwf routine (black) the memory
> keeps increasing during each iteration.
> However, when the lobpcgwf routine is commented out (red) the memory
> increases less quickly.
>
> I will continue to try and track down the problem here.
>
> Regards,
>
> Eric
>
>
>
>
> BOTTIN Francois wrote:
>>
>> Hi,
>>
>> Is it possible for you to compile abinit-5.8.4p with g95 in order to track
>> memory leaks?
>> Perhaps your code will stop at a well defined line, with an array "already
>> allocated"!
>> It would be very nice to get rid of this trouble.
>>
>> If not, could you send your 3 pseudopotentials, output and log (of
>> proc 0), if these files are not too large.
>> Regards,
>> Francois Bottin
>>
>>
>> Eric J. Walter wrote:
>>>
>>> Hi,
>>>
>>> It appears as though my jobs are suffering from the same problem outlined
>>> in this post from July of this year:
>>>
>>>
>>> https://listes-2.sipr.ucl.ac.be/abinit.org/arc/forum/2009-07/msg00034.html
>>>
>>> When I run the attached input file for ~14 hrs, the nodes I am using run
>>> out of memory.
>>> I am using RHEL AS4 on dual core / dual processor Opteron 2200 with 2 GB
>>> memory per core (= 8 GB total).
>>> I am compiling with Intel Fortran 10.1 and OpenMPI-1.2.5. This job's
>>> output file claims that the calculation
>>> should require ~200 MB.
>>>
>>> I have found that this increase only occurs when using band/fft parallel
>>> (kpt parallel and serial don't
>>> have this steady increase). Besides version 5.8.4p, I have also tested
>>> versions 5.7.4 and 5.6.4; all three versions
>>> seem to show this behavior. Version 5.4.4 does not have this problem,
>>> but is also much slower, for me at least.
>>>
>>> I have tried changing versions of openmpi (from 1.2.5 to 1.3.3); this has
>>> no effect.
>>>
>>> Has any progress been made on either finding the leak or another cause of
>>> this problem?
>>> Thanks in advance for any help you can give.
>>>
>>> Eric J. Walter
>>> Department of Physics
>>> College of William and Mary
>>>
>>> ---------------------------------------------------
>>> Here is my input file:
>>> ---------------------------------------------------
>>> ionmov 8 noseinert 5.0d5 mditemp 300
>>> dtion 50 toldfe 1d-4 eV
>>> ntime 20 nsym 1 chkprim 0
>>> kptopt 1 ngkpt 1 1 1 shiftk 0 0 0
>>> occopt 3 tsmear 0.001
>>> ecut 45 nstep 20 natom 80
>>> ntypat 3 znucl 82 22 8
>>> prtden 0
>>> prtwf 0 wfoptalg 4
>>> npfft 1 npband 26
>>> npkpt 1
>>> nloalg 4
>>> fftalg 401
>>> iprcch 4
>>> intxc 0 fft_opt_lob 2
>>> paral_kgb 1
>>>
>>> typat 48*3 16*2 16*1
>>> acell 3*20.599653
>>> angdeg 3*56.39553912
>>>
>>> xcart
>>> -2.22069320900486E+00 -1.67615626136839E+00 1.02845225672214E+01
>>> 2.83216713500271E+00 -9.64895067984759E-01 1.06130211295748E+01
>>> -3.07826006953531E-01 3.14017017198906E+00 1.02884003097039E+01
>>> 2.38571355304415E+00 1.94315354580616E+00 1.47667846684432E+01
>>> -2.74032183791911E+00 1.44987913001174E+00 1.47058588105213E+01
>>> 4.24301353957222E-01 -3.02467161327066E+00 1.47270477618761E+01
>>> 4.37208548403877E-01 3.15701320773295E+00 1.63174232817734E+00
>>> 5.62754717842062E+00 3.65799392921832E+00 1.72482290066637E+00
>>> 2.56805130575382E+00 8.08007715450321E+00 1.75538816925331E+00
>>> 5.29946087386834E+00 6.62719979049667E+00 6.13774796347384E+00
>>> -1.33476176608987E-01 6.02962682764747E+00 5.99150680371859E+00
>>> 3.25740331179827E+00 1.86629101076117E+00 5.92292965668993E+00
>>> 4.73721314332368E-01 -6.73544486518827E+00 1.53749001511875E+00
>>> 5.79909660316150E+00 -5.87516895113753E+00 1.52606696257784E+00
>>> 2.72661190097539E+00 -1.82547980521533E+00 1.72879478480763E+00
>>> 5.29308456737276E+00 -2.93742682967659E+00 5.98120816946857E+00
>>> -1.02277107167427E-01 -3.47901581523419E+00 6.00815649851232E+00
>>> 3.22915540047783E+00 -7.91315859978646E+00 6.15697463378583E+00
>>> 3.20853423992771E+00 -1.70404358692119E+00 -6.97779629913824E+00
>>> 8.39651387970289E+00 -8.70615018894613E-01 -7.04102732844698E+00
>>> 5.39613269691849E+00 3.09966157186435E+00 -6.96223499113757E+00
>>> 8.17325435029689E+00 2.08765713206062E+00 -2.55251338829625E+00
>>> 2.82623828634786E+00 1.23322960474568E+00 -2.78889127839044E+00
>>> 6.11970263677967E+00 -3.17948832421889E+00 -2.34609526217649E+00
>>> -7.91433373058299E+00 -1.81160718343090E+00 1.63075874446493E+00
>>> -2.76694176578899E+00 -1.16579400400263E+00 1.74956240701256E+00
>>> -5.81555263606813E+00 3.20412849389643E+00 1.72359061387011E+00
>>> -2.79939696716810E+00 2.06735348854857E+00 6.15861322508217E+00
>>> -8.31960266338488E+00 1.29355309750287E+00 5.93088060057318E+00
>>> -5.10575950217045E+00 -2.98997425594191E+00 6.13900333992898E+00
>>> -5.14327896212322E+00 3.29299293105111E+00 -7.09301147609970E+00
>>> 9.75650004473785E-02 3.78997253887449E+00 -6.91330351424288E+00
>>> -2.92976440197528E+00 7.95211008443348E+00 -6.82517026645523E+00
>>> -3.06355829923221E-01 6.76833507431424E+00 -2.77362691622797E+00
>>> -5.53331641987666E+00 6.24968986891999E+00 -2.54578391594814E+00
>>> -2.43999241882386E+00 1.80795237233396E+00 -2.67292251413726E+00
>>> -5.14068521198933E+00 -6.68713377702076E+00 -6.86237306597303E+00
>>> -6.12789514851627E-02 -5.98968660555720E+00 -6.97883625732929E+00
>>> -2.99553503150223E+00 -1.83920851864552E+00 -6.83013099810669E+00
>>> -1.80008738706434E-01 -2.89562125276255E+00 -2.65747984992241E+00
>>> -5.56015371335734E+00 -3.55350951569161E+00 -2.53677784336301E+00
>>> -2.23260539192524E+00 -7.88875944438475E+00 -2.59919279877736E+00
>>> -2.31544564461024E+00 -1.60168511875929E+00 -1.57988254333016E+01
>>> 2.98206360781800E+00 -1.09336396682876E+00 -1.56367782886258E+01
>>> -3.02653507337988E-01 3.20195569372773E+00 -1.56155029581190E+01
>>> 2.62978455935385E+00 1.90892236294971E+00 -1.12646465158811E+01
>>> -2.74702844352294E+00 1.34758125055524E+00 -1.11329897640166E+01
>>> 3.92389502215297E-01 -3.14047927186374E+00 -1.13789217412000E+01
>>> 1.43789519936701E-01 8.95683744161872E-02 5.51308679103783E-02
>>> 3.43343139804145E-02 8.44171505400222E-02 1.31069788732053E+01
>>> 2.87456960507807E+00 4.84780682916232E+00 -8.54098683394433E+00
>>> 2.83533350827871E+00 4.91662126969937E+00 4.43850072950254E+00
>>> 2.95141488341089E+00 -4.80542390687403E+00 -8.59714703849626E+00
>>> 2.82501485426290E+00 -4.90110860896433E+00 4.40234973478692E+00
>>> 5.65634082404525E+00 -1.24851891953083E-02 -1.71524703303103E+01
>>> 5.66706037951402E+00 9.18065709228990E-02 -4.30774458733638E+00
>>> -5.41740478793818E+00 1.03696823860055E-01 -8.46603127880005E+00
>>> -5.52981835134213E+00 3.74816566677455E-02 4.36263803023674E+00
>>> -2.73374622355905E+00 4.90669254100288E+00 -1.71056087884796E+01
>>> -2.76475317059456E+00 4.97808833622533E+00 -4.26429469831174E+00
>>> -2.71829980916229E+00 -4.81400423619281E+00 -1.71329000123123E+01
>>> -2.78480871843135E+00 -4.79896324962904E+00 -4.16357601481035E+00
>>> 1.27116237724845E-01 5.57355728005296E-02 -2.58199597795567E+01
>>> 3.69147318259344E-02 1.01239700458947E-01 -1.28425126153755E+01
>>> 1.09315814613628E-01 2.02744841122104E-02 7.51595267544533E+00
>>> 9.28429983979248E-02 4.27836872194515E-02 2.02554355831334E+01
>>> 2.71495957648699E+00 4.91077718525781E+00 -8.53254132115559E-01
>>> 2.86538509412732E+00 4.97170584874535E+00 1.18102413232714E+01
>>> 3.09009565965701E+00 -4.82067158705315E+00 -1.15312475386682E+00
>>> 2.97965222241338E+00 -4.75118372415529E+00 1.17120712079335E+01
>>> 5.77396145656038E+00 1.07323885408925E-02 -9.98939448156541E+00
>>> 5.72978432984313E+00 -6.68544670080847E-03 3.19837381572586E+00
>>> -5.80054029700240E+00 -3.59016258256638E-02 -1.07926803392243E+00
>>> -5.41163712815481E+00 1.29254553589581E-01 1.16343816906280E+01
>>> -2.62158506045681E+00 5.16922521499392E+00 -1.00544849478146E+01
>>> -2.72015016982663E+00 4.98880486733811E+00 3.02504536215413E+00
>>> -2.79050293194776E+00 -4.88384984087100E+00 -1.01240590641792E+01
>>> -2.61941881955994E+00 -4.73421992471235E+00 3.07027251730700E+00
>>> 7.60779786630738E-02 2.19994741542859E-01 -1.87223237882607E+01
>>> 6.88549292157438E-02 -1.91082171805773E-02 -5.66940459435534E+00
>>>
>>> vel
>>> 1.65929233257248E-04 -7.22899336586353E-06 2.26257574401143E-04
>>> -4.41498838683481E-04 2.56696469185694E-05 3.31830998480140E-04
>>> 1.73881380446662E-04 -3.59274234563293E-04 2.29273421593672E-04
>>> -7.02798094302199E-05 -2.60387333161425E-05 -2.65924872337890E-04
>>> 4.42536785165872E-04 -4.89827163710057E-04 -4.01073432982287E-04
>>> 1.95643632158351E-05 1.11653991217122E-03 -9.34094601194410E-04
>>> -7.86319162726620E-05 1.98530438200753E-04 2.04688581300107E-04
>>> 3.00397575951297E-04 2.29731480818108E-04 -7.09537178564776E-05
>>> -3.61460082461821E-04 2.48090453139150E-04 -5.49686708040239E-04
>>> -8.68482177851845E-05 -1.08275834028475E-04 2.95658165452176E-05
>>> -1.07412509904415E-04 -8.78012429709354E-05 1.54371802639068E-04
>>> -2.09843740843053E-05 -3.01743488213807E-04 6.64260233761203E-05
>>> -6.12496928754301E-05 1.36432817751052E-04 -9.40878503004121E-05
>>> 6.14104908278753E-04 1.68165113713754E-04 -5.12708955417303E-04
>>> 2.15443297114865E-04 -4.04241581459637E-04 2.39357619874606E-04
>>> 7.81430062976476E-05 1.59513106225212E-04 2.13405809058389E-04
>>> -2.36012088016480E-04 1.38402390396325E-04 2.23418565305995E-04
>>> -9.33414880283709E-05 1.20065355174723E-04 -5.40061134400772E-05
>>> -2.80686746960019E-05 -3.62478762225168E-04 -2.99299750890718E-04
>>> -4.40755174907394E-05 -1.32663871913867E-04 1.18770307640526E-04
>>> -2.53558806288978E-04 4.01259820381485E-04 -2.81728409555392E-04
>>> 1.57361848603766E-04 3.41172196643058E-05 1.33022101611947E-04
>>> 4.01019710009567E-04 7.22573433787659E-07 -2.69262344059131E-04
>>> 2.99733136121017E-04 -6.68678893104353E-04 3.86496510503676E-04
>>> 3.95734899013453E-05 -2.80879543778806E-04 -2.77473937236590E-04
>>> -2.93485746700284E-04 -1.00409592926771E-05 1.76552268367262E-04
>>> -4.99995043152451E-04 5.34640365244488E-04 -4.16733250625040E-04
>>> 4.62540147703766E-04 2.69462418459427E-04 3.55370145754719E-04
>>> 2.11074019336284E-04 -5.09254836134716E-05 -6.07095017577757E-05
>>> 5.30486297370657E-05 -3.89516038918225E-04 4.59061055880998E-04
>>> 3.85969341185967E-05 -1.33647845375007E-04 1.45606759036967E-04
>>> 2.43990312107843E-04 1.62650517626908E-04 -1.51033371102077E-04
>>> 2.67081228198250E-04 -3.20863354500405E-04 5.19374366175999E-04
>>> -3.10316561967053E-04 -1.85209360845414E-04 -4.03377994784274E-04
>>> -3.01054183714974E-04 -1.79705768853539E-04 1.86267271696545E-05
>>> 2.66589291951177E-04 9.91038990779357E-04 -3.96599362099506E-04
>>> 4.08276960903040E-05 1.53432487219794E-04 1.22871735039336E-04
>>> 2.35590328814509E-04 2.76550673861414E-04 -1.90931581662904E-04
>>> 2.04967171400528E-04 -3.84948347563266E-04 1.08474133114518E-04
>>> -6.97154912755278E-04 -4.71858129707306E-04 -3.50888447092869E-04
>>> -1.24684436826259E-04 1.21627138377997E-04 1.68581444194268E-04
>>> -7.70851773281092E-05 4.63714165425422E-05 1.81085183134115E-04
>>> -3.17589866254135E-04 -5.50009675262097E-04 -4.30671530615520E-04
>>> -1.57911296613577E-04 -4.54426250235638E-05 -4.92425558216806E-06
>>> 9.87305191973671E-06 1.80207048070693E-05 -1.79532949110393E-04
>>> 6.83671915911343E-05 -9.70105029125616E-05 1.82258963862073E-05
>>> 4.57857583174456E-04 -8.32122447198841E-05 -1.67508689699827E-04
>>> 1.94599797149855E-04 -2.23628588270493E-04 1.07780974532419E-04
>>> -2.82971791158802E-05 4.68214184227891E-05 -8.36700118689787E-05
>>> -4.24265368843893E-05 -3.21245875729528E-05 9.22433773212692E-05
>>> 2.51671799765403E-05 -1.77766618119046E-05 1.72494196852232E-05
>>> 6.11315117142581E-06 8.68163744077307E-05 5.67296042042437E-06
>>> -7.70845081775687E-05 3.08561858749785E-05 1.52938314279569E-05
>>> 4.20820628111984E-05 3.70042677593407E-05 9.96812582225421E-05
>>> 1.40844649691300E-04 -7.97217755672014E-05 -1.31083237666120E-05
>>> -5.81978615724723E-05 1.17490844352786E-04 4.40930378539056E-05
>>> 6.36078347459770E-05 6.69615793199388E-05 -8.39129585283473E-05
>>> -6.07614472082574E-05 -4.37626480779298E-05 -4.29461393218481E-06
>>> -6.36394987636312E-05 -3.12613972792605E-05 7.79309141070753E-05
>>> 2.39708852289351E-05 -9.39157455192732E-05 3.01944658201935E-05
>>> -1.03662791966090E-05 3.87149926133824E-05 1.05050319203302E-04
>>> 1.10894432823969E-04 3.45992275343292E-05 -3.01976038789652E-05
>>> -1.72253916094043E-05 3.96742670252505E-05 -5.25674305834915E-05
>>> -8.48228304979956E-05 5.64908929641756E-05 4.47531393684651E-05
>>> -1.47506234752193E-04 1.22556764449835E-04 -3.01480394930127E-04
>>> -2.95445248698945E-04 -2.42094492146111E-04 2.66277938523650E-04
>>> -7.77631153135254E-05 -1.09252568425864E-04 1.69153096659710E-04
>>> 2.60223872399644E-04 -7.70901517625762E-05 1.36621011724429E-04
>>> 4.79856202268419E-05 -3.58150632460491E-04 -1.54493852536122E-04
>>> -7.14052845122851E-05 -2.69155466792118E-04 1.79749509700142E-04
>>> 1.45112055674323E-05 -1.80926466413756E-04 6.10956577883199E-05
>>> 9.60743159647168E-05 -1.00159903624474E-04 7.70024146584693E-05
>>> -7.39105065049050E-05 -2.15075671717976E-04 2.20969494128146E-04
>>> 2.04020346009061E-04 1.56655746598131E-04 -1.77043062668635E-04
>>> -4.03670312774535E-04 -5.52855555754880E-05 -1.18348120988931E-04
>>> 3.71601522478895E-04 4.88286475929969E-05 1.85192663041991E-04
>>> 4.16812192512686E-05 2.23401099965076E-04 1.46463434516355E-04
>>> -3.25279267883869E-04 7.02921173153103E-05 1.05326094998430E-04
>>> -6.05071951258716E-05 4.15488061404616E-04 1.16175008659989E-04
>>> -1.93579262145139E-04 5.50540862463690E-05 4.37697584266472E-04
>>
>>
>
>
>
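The measurement Eric describes in the quoted message (logging the used
memory net of buffers and cache from "free -m" at several points in each
iteration) could be post-processed with something like the sketch below.
It assumes the older procps layout of "free -m", with separate "buffers"
and "cached" columns; the function name is hypothetical, and the numbers
in the sample text are made up purely for illustration.

```python
# Made-up sample in the older "free -m" layout (values in MB).
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7500        483          0        200       5000
-/+ buffers/cache:       2300       5683
Swap:         2047          0       2047
"""


def used_minus_buffers_cache(free_output: str) -> int:
    """Return used - buffers - cached (MB) from the 'Mem:' row of `free -m`."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()  # column names: total, used, free, ...
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(v) for v in line.split()[1:]]
            row = dict(zip(header, values))
            return row["used"] - row["buffers"] - row["cached"]
    raise ValueError("no 'Mem:' row found")


if __name__ == "__main__":
    print(used_minus_buffers_cache(SAMPLE))  # 7500 - 200 - 5000 = 2300
```

Newer versions of "free" merge these columns into "buff/cache", so the
column names used here would need adjusting on a current system.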
Attachment:
graph.png
Description: PNG image