
Re: [abinit-forum] parallelism over bands in ABINIT


  • From: "Guillaume Dumont" <dumont.guillaume@gmail.com>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] parallelism over bands in ABINIT
  • Date: Wed, 29 Nov 2006 18:08:27 -0500

Hi,

I have noticed in my output files (for the gold case) that, when the number of processors is increased, the inwffil routine starts to become very time consuming, around 30% of the total time at 144 CPUs. Any idea what could cause this routine to take that much time?

regards

On 11/28/06, Anglade Pierre-Matthieu <anglade@gmail.com> wrote:
Hi

ncache is a hard-coded parameter of Prof. S. Goedecker's FFT library. To increase it you have to edit each of the following files manually,
in src/lib01fftnew:
accrho.F90
forw.F90
back_wf.F90
forw_wf.F90
applypot.F90
back.F90

Despite the warning, it is possible that this operation won't reduce the efficiency of the FFT, because the present value was chosen quite a long time ago. So it might be useful for you to tune it to your Itanium's very large cache.
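
For illustration only (I quote from memory, so the exact statement may differ a little from one routine to another; n1, n2, n3 stand for the FFT grid dimensions used inside those routines), the relevant piece of each file looks roughly like:

  ! hard-coded size (in words) of the work array used to cache-block the 1-d FFTs
  ncache=4*1024
  ! this test prints the "ncache has to be enlarged" message; raising the constant
  ! above (e.g. to 16*1024 for a large cache) and recompiling removes it
  if (ncache/(4*max(n1,n2,n3)) < 1) then
    write(*,*) ' ncache has to be enlarged to be able to hold at least one 1-d FFT of each size'
    stop
  end if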

regards

PMA

On 11/28/06, Guillaume Dumont <dumont.guillaume@gmail.com> wrote:
Dear Dr Anglade and Bottin,

No, I haven't looked in the code yet... I'm working on that. I'll try with g95 and let you know if I find anything interesting...

Another quick question: I tried to run another case... a GaAsN quantum well with 35 layers of GaAs and one layer of GaN.

natom 108
nband 230
ngfft 45 45 1875

I ran it on 50 processors and got this error message on the 48th processor:

  ncache has to be enlarged to be able to hold at
  least one 1-d FFT of each size even though this will
  reduce the performance for shorter transform lengths


What does this mean? Is ncache an input variable or do I have to modify the code and recompile? Would changing ngfft to 1875 45 45 or 45 1875 45 help?


Thanks


On 11/28/06, Anglade Pierre-Matthieu <anglade@gmail.com> wrote:
PS: one of the simplest ways to discover possible memory leaks is to do a run with a binary compiled with g95. At the end of the run it will report the names of the routines where memory was not deallocated.


On 11/28/06, Anglade Pierre-Matthieu <anglade@gmail.com> wrote:
>Is there a reason why the code has such a great memory need? Why did the code run for 2 SCF
>cycles and then crash?

At an early development stage of the band parallelism there were a lot of memory leaks in lobpcgxx. It is possible that some of them remain. Have you checked for this?

A very simple way in f90 to get rid of memory leaks when the memory scheme is complex is to add, at the end of the routines, some statements like
if(allocated(XX)) deallocate(XX)
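
For example, a minimal sketch (the routine and array names here are just illustrative, not actual ABINIT routines):

  subroutine cleanup_example(n)
    implicit none
    integer, intent(in) :: n
    double precision, allocatable :: work(:), gram(:,:)
    allocate(work(n), gram(n,n))
    ! ... computation using work and gram ...
    ! guard the deallocations so nothing is left allocated when the routine returns
    if (allocated(work)) deallocate(work)
    if (allocated(gram)) deallocate(gram)
  end subroutine cleanup_example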

regards

PMA


On 11/27/06, Guillaume Dumont <dumont.guillaume@gmail.com> wrote:
Oops I forgot the attachments...


On 11/27/06, Guillaume Dumont <dumont.guillaume@gmail.com> wrote:
Dear Dr Bottin,

I tried to reproduce your superlinear scaling up to 144 CPUs. Here are the results. The scaling is superlinear up to 54 CPUs for your gold case. However, keeping the number of processors constant, some sets of npband and npfft do not give the superlinear behavior (see the graph speedup.eps).
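
To be concrete about what I mean by a "set": at a fixed processor count the runs differ only in how the processors are split between bands and FFT. Two purely illustrative ways of factorizing 54 processors, assuming nproc = npband x npfft (these are only examples of the kind of split I varied, not the actual input files), would be:

  npband 27   npfft 2    # 27 x 2 = 54
  npband 6    npfft 9    #  6 x 9 = 54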

In the superlinear regime most of the time is spent in the lobpcgxx routine, but as the number of processors increases more and more time is spent in gstate->kpgsph.

I also noticed that the memory requirement is proportional to the number of processors (memory.eps). This causes problems for cases where you need more than the memory accessible to a single processor. For example, I tried to run a total energy calculation on a 216-atom GaAsN supercell with nband 480 and ngfft 180 180 180. I was able to run it on 32 processors and it did 2 SCF cycles and then crashed with an error message indicating that the memory needed exceeded the available memory.

Is there a reason why the code has such a great memory need? Why did the code run for 2 SCF cycles and then crash? Shouldn't it allocate all the memory before doing the calculation? (Memory leaks?)

This calculation needs a little more than 4 GB for a single-processor run.

To answer your other questions:

> In the cases of both Au and GaAsN systems? For gold, the code is two
> times faster (if I remember correctly) with the -O3 flag compilation.

I did not test the gold case with the -O2 flag, but I'll let you know when I do.

> Does the lobpcg part in these two systems weigh equally? In Au, the
> lobpcg part corresponds approximately to the total time. Its perfect
> scaling gives the superlinear behaviour of ABINIT.

> Does your FFT part (fourwf) strongly increase (more than 2 times)
> between 1 and 32 processors? And what is its weight? Even if this FFT is
> strongly optimized, the scaling does not remain linear.

Unfortunately, some of the calculations were done with timopt 2 instead of -1 or -2, so I cannot answer this question yet.
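
If I understand the timing variable correctly, rerunning with something like

  timopt -1

in the input file should print the detailed per-routine breakdown (including fourwf) at the end of the run, so I should be able to answer this afterwards.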


Regards,

--
Guillaume Dumont
=========================
guillaume.dumont.1@umontreal.ca
dumont.guillaume@gmail.com
(514) 341 5298
(514) 343 6111 ext. 13279



--
Guillaume Dumont
=========================
guillaume.dumont.1@umontreal.ca
dumont.guillaume@gmail.com
(514) 341 5298
(514) 343 6111 ext. 13279




--
Pierre-Matthieu Anglade



--
Pierre-Matthieu Anglade



--
Guillaume Dumont
=========================
guillaume.dumont.1@umontreal.ca
dumont.guillaume@gmail.com
(514) 341 5298
(514) 343 6111 ext. 13279



--
Pierre-Matthieu Anglade



--
Guillaume Dumont
=========================
guillaume.dumont.1@umontreal.ca
dumont.guillaume@gmail.com
(514) 341 5298
(514) 343 6111 ext. 13279

