
Re: [abinit-forum] abinit 5.8.4 /bigdft and CUDA


  • From: Luigi Genovese <luigi.genovese@gmail.com>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] abinit 5.8.4 /bigdft and CUDA
  • Date: Tue, 22 Sep 2009 09:09:46 +0200

Hi Jyh-Shyong,

for the Ca case, what you found is normal: CUDA acceleration is enabled
only for 3D periodic boundary conditions at the moment. We should really
make the code stop cleanly in these cases.

The GPU tests should work, though.
In your run, you can see from the line

Total electronic charge= 0.000000000000

that the card did not execute anything when the density was calculated.
When that happens, it usually means a card was left in a corrupted state
by an earlier buggy run.
To solve this problem you can either reboot the machine (annoying) or
unload and reload the nvidia driver.
The latter method also works flawlessly on the nodes of hybrid supercomputers.
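
For reference, a minimal sketch of the driver reload, assuming a standard
Linux node with root access, the stock "nvidia" kernel module, and no other
process using the cards (adapt the commands to your system):

    # as root, with no CUDA jobs running on the node
    rmmod nvidia        # unload the NVIDIA kernel module
    modprobe nvidia     # load it again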

To summarize: try the GPU case again. The reference output you should
obtain is in the tests/Refs/GPU.out.ref file.
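
One possible way to check it, assuming you run from the tests/GPU directory
as in your log (the exact paths may differ in your installation):

    ../../bin/cluster | tee GPU.out        # run the GPU test case
    diff GPU.out ../Refs/GPU.out.ref       # compare with the reference output

Small numerical differences in the last digits are expected.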

Let me know

Luigi

On 9/22/09, Jyh-Shyong <c00jsh00@nchc.org.tw> wrote:
> Hi, Luigi,
>
> I re-built bigdft-1.3.0 with the following configuration:
>
> export CC=icc
> export LIBS="-libverbs -libumad -libcommon -lpthread"
> export CXX=icpc
> export FC=ifort
> export FCFLAGS="-assume 2underscore -O3"
> ./configure --prefix=/package/chem/bigdft --enable-cuda-gpu \
>     --with-mpi-include="-I/opt/mvapich2/intel/include" \
>     --with-mpi-ldflags="-L/opt/mvapich2/intel/lib64" \
>     --with-mpi-libs="-lmpich -lfmpich" \
>     --with-cuda-path=/opt/cuda --with-lib-cutils=/opt/cuda/lib \
>     --with-ext-linalg="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -ldl -lsysfs" \
>     --with-ext-linalg-path=-L/opt/intel/mkl/lib
>
> and ran a test case, Ca2, with the following GPU.config:
> USE_SHARED=1
>
> MPI_TASKS_PER_NODE=1
> NUM_GPU=1
>
> GPU_CPUS_AFF_0=0,1,2,3
> GPU_CPUS_AFF_1=4,5,6,7
>
>
> USE_GPU_BLAS=1
> USE_GPU_CONV=1
>
> I got the following output:
> ...
> Shifted atomic positions, Atomic Units:                grid spacing units:
>    1   Ca   9.25000E+00  9.25000E+00  9.16541E+00      18.500  18.500  18.331
>    2   Ca   9.25000E+00  9.25000E+00  1.48346E+01      18.500  18.500  29.669
> Shift of=  9.25000E+00  9.25000E+00  9.16541E+00   Grid Spacings=  0.50  0.50  0.50
> Box Sizes=  1.85000E+01  1.85000E+01  2.40000E+01      37  37  48
> Extremes for the high resolution grid points:   11<26  11<26  10<38
> wavelet localization is ON
> ------------------------------------------------ Poisson Kernel Creation
> Calculating Poisson solver kernel, free BC...done.
> Memory occ. per proc. (Bytes):  Density=11943936  Kernel=12261192  Full Grid Arrays=11201400
> ------------------------------------------- Wavefunctions Descriptors Creation
> Coarse resolution grid: Number of segments= 1528   points= 40132
> Fine resolution grid:   Number of segments=  396   points=  4624
> ------------------------------------------------ PSP Projectors Creation
> Type   Name   Number of atoms   Number of projectors
>    1     Ca                 2                     14
> ------ On-the-fly projectors application
> Total number of projectors =     28
> Total number of components = 633584
> Percent of zero components =      7
> ------------------------------------------------------- Memory Estimation
> Number of atoms= 2   Number of orbitals= 2   Sim. Box Dimensions= 37 37 48
> Estimation performed for 1 processors.
> Memory occupation for principal arrays:
> Poisson Solver Kernel (K): 11 MB 710 KB
> Poisson Solver Density (D): 11 MB 400 KB
> Single Wavefunction for one orbital: 0 MB 567 KB
> All Wavefunctions for each processor: 2 MB 218 KB
> Wavefunctions + DIIS per proc (W): 15 MB 500 KB
> Nonlocal Pseudopotential Arrays (P): 4 MB 854 KB
> Arrays of full uncompressed grid (U): 10 MB 699 KB
> Estimation of Memory requirements for principal code sections:
>  Kernel calculation | Density Construction | Poisson Solver | Hamiltonian application
>        ~11*K        |     ~W+(~3)*U+P      |   ~8*D+K+W+P   |       ~W+(~3)*U+P
>        128MB        |         57MB         |      123MB     |          61MB
> The overall memory requirement needed for this calculation is thus: 128 MB
> By reducing the DIIS history and/or increasing the number of processors the
> amount of memory can be reduced, but for this system it will never be less than 46 MB
> Wavefunctions memory occupation per processor (Bytes): 1160000
> ion-ion interaction energy 7.05569614400000E-01
> --------------------------------------------------- Ionic Potential Creation
> total ionic charge, leaked charge   -4.000000000000   0.000E+00
> PSolver, free BC, dimensions: 105 105 127   proc 1   ixc: 0 ...done.
> ------------------------------------------------ Input Wavefunctions Creation
> Input wavefunction data for atom Ca NOT found, automatic generation... done.
> Generating 2 Atomic Input Orbitals
> Wavefunctions memory occupation per processor (Bytes): 0
> Calculating AIO wavefunctions... done.
> Wavefunctions memory occupation per processor (Bytes): 1160000
> Writing wavefunctions in wavelet form done.
> Deviation from normalization of the imported orbitals 2.18E-03
> CUDA error: unspecified launch failure, line 100
> **** ERROR *** : c_cuda_gpu_recv_pi STREAM
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>
>
> In the directory tests/GPU, I ran the command
>
> ../../bin/cluster
>
> and got the following output:
> ...
> chem@gch:/package/chem/bigdft/tests/GPU> ../../bin/cluster
> [BigDFT ASCII-art banner]                                       (Ver 1.3.0)
> ------------------------------------------------------------------------------------
> |          Daubechies Wavelets for DFT Pseudopotential Calculations                |
> ------------------------------------------------------------------------------------
>                            The Journal of Chemical Physics 129, 014109 (2008)
> Number of atoms = 8
> Number of atom types= 1
> Atoms of type 1 are Si
> Waiting for registration of all process...
> OK, all process has semaphores : 1414758410
> Logic CPU : 0 with 0
> static repartition
> Unix process (not MPI) 0 has GPU : 0
> Check card on all nodes....
> Spin-polarised calculation: NO
> ===================== BigDFT Wavefunction Optimization ===============
> inputPsiId= 0
> ------------------------------------------------------------- Input Parameters
>     System Choice      Resolution Radii          SCF Iteration     Finite Size Corr.
>  Max. hgrid = 0.450  | Coarse Wfs.= 6.00 | Wavefns Conv.= 1.0E-04 | Calculate= F
>       XC id=  1      |   Fine Wfs.= 8.00 | Max. N. Iter.=  50x10  | Extension= 0.0
> total charge=  0     |                   | CG Prec.Steps=  7      |  CG Steps= 30
>  elec. field=0.0E+00 |                   | DIIS Hist. N.=  6
> Geom. Code=  P  |  Box Sizes (Bohr) =  1.02609E+01  1.02609E+01  1.02609E+01
> ------------------------------------------------------------ System Properties
> Atom   N.Electr.  PSP Code   Radii: Coarse     Fine   CoarsePSP   Calculated  File
>   Si       4         2              1.71389  0.50000    0.91005       X
> ------------------------------ Pseudopotential coefficients (Upper Triangular)
> Atom Name    rloc       C1        C2        C3        C4
>     Si      0.44000  -6.91363
>       l=0     rl        h1j       h2j       h3j
>            0.42433    3.20813   0.00000
>                                 2.58888
>       l=1     rl        h1j       h2j       h3j
>            0.48536    2.65622
> Total Number of Electrons 32
> Total Number of Orbitals 16
> occup(1:16)= 2.0000
> Shifted atomic positions, Atomic Units:                grid spacing units:
>    1   Si   0.00000E+00  0.00000E+00  0.00000E+00       0.000   0.000   0.000
>    2   Si   5.13043E+00  5.13043E+00  0.00000E+00      12.000  12.000   0.000
>    3   Si   5.13043E+00  0.00000E+00  5.13043E+00      12.000   0.000  12.000
>    4   Si   0.00000E+00  5.13043E+00  5.13043E+00       0.000  12.000  12.000
>    5   Si   2.56521E+00  2.56521E+00  2.56521E+00       6.000   6.000   6.000
>    6   Si   7.69564E+00  7.69564E+00  2.56521E+00      18.000  18.000   6.000
>    7   Si   7.69564E+00  2.56521E+00  7.69564E+00      18.000   6.000  18.000
>    8   Si   2.56521E+00  7.69564E+00  7.69564E+00       6.000  18.000  18.000
> Shift of=  0.00000E+00  0.00000E+00  0.00000E+00   Grid Spacings=  0.43  0.43  0.43
> Box Sizes=  1.02609E+01  1.02609E+01  1.02609E+01      23  23  23
> Extremes for the high resolution grid points:   0<23  0<23  0<23
> wavelet localization is OFF
> ------------------------------------------------ Poisson Kernel Creation
> Poisson solver for periodic BC, no kernel calculation...done.
> Memory occ. per proc. (Bytes):  Density=884736  Kernel=125000  Full Grid Arrays=884736
> ------------------------------------------- Wavefunctions Descriptors Creation
> Coarse resolution grid: Number of segments= 576   points= 13824
> Fine resolution grid:   Number of segments= 826   points= 13448
> ------------------------------------------------ PSP Projectors Creation
> Type   Name   Number of atoms   Number of projectors
>    1     Si                 8                      5
> ------ On-the-fly projectors application
> Total number of projectors =     40
> Total number of components = 185880
> Percent of zero components =     22
> ------------------------------------------------------- Memory Estimation
> Number of atoms= 8   Number of orbitals= 16   Sim. Box Dimensions= 23 23 23
> Estimation performed for 1 processors.
> Memory occupation for principal arrays:
> Poisson Solver Kernel (K): 0 MB 920 KB
> Poisson Solver Density (D): 0 MB 864 KB
> Single Wavefunction for one orbital: 0 MB 844 KB
> All Wavefunctions for each processor: 26 MB 366 KB
> Wavefunctions + DIIS per proc (W): 184 MB 514 KB
> Nonlocal Pseudopotential Arrays (P): 1 MB 429 KB
> Arrays of full uncompressed grid (U): 0 MB 96 KB
> Estimation of Memory requirements for principal code sections:
>  Kernel calculation | Density Construction | Poisson Solver | Hamiltonian application
>        ~11*K        |     ~W+(~3)*U+P      |   ~8*D+K+W+P   |       ~W+(~3)*U+P
>          9MB        |        187MB         |      193MB     |         187MB
> The overall memory requirement needed for this calculation is thus: 193 MB
> By reducing the DIIS history and/or increasing the number of processors the
> amount of memory can be reduced, but for this system it will never be less than 4 MB
> Wavefunctions memory occupation per processor (Bytes): 13818880
> ion-ion interaction energy -3.46440265472504E+01
> --------------------------------------------------- Ionic Potential Creation
> total ionic charge, leaked charge   -32.000000000000   0.000E+00
> PSolver, periodic BC, dimensions: 48 48 48   proc 1   ixc: 0 ...done.
> ------------------------------------------------ Input Wavefunctions Creation
> Input wavefunction data for atom Si NOT found, automatic generation... done.
> Generating 32 Atomic Input Orbitals
> Wavefunctions memory occupation per processor (Bytes): 0
> Calculating AIO wavefunctions... done.
> Wavefunctions memory occupation per processor (Bytes): 27637760
> Writing wavefunctions in wavelet form done.
> Deviation from normalization of the imported orbitals 6.90E-02
> Calculation of charge density... done. Total electronic charge= 0.000000000000
> PSolver, periodic BC, dimensions: 48 48 48 proc 1 ixc: 1 ...
>
> invcb : BUG -
> Fast computation of inverse cubic root failed.
> exiting...
>
>
>
>
> Something is wrong.
> Here is the status of GPUs on the computer:
>
> chem@gch:/package/chem/bigdft/tests/GPU> deviceQuery
>
> CUDA Device Query (Runtime API) version (CUDART static linking)
> There are 4 devices supporting CUDA
>
> Device 0: "Tesla C1060"
> CUDA Capability Major revision number: 1
> CUDA Capability Minor revision number: 3
> Total amount of global memory: 4294705152 bytes
> Number of multiprocessors: 30
> Number of cores: 240
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 16384 bytes
> Total number of registers available per block: 16384
> Warp size: 32
> Maximum number of threads per block: 512
> Maximum sizes of each dimension of a block: 512 x 512 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
> Maximum memory pitch: 262144 bytes
> Texture alignment: 256 bytes
> Clock rate: 1.44 GHz
> Concurrent copy and execution: Yes
> Run time limit on kernels: No
> Integrated: No
> Support host page-locked memory mapping: Yes
> Compute mode: Exclusive (only one
> host thread at a time can use this device)
>
> Device 1: "Tesla C1060"
> CUDA Capability Major revision number: 1
> CUDA Capability Minor revision number: 3
> Total amount of global memory: 4294705152 bytes
> Number of multiprocessors: 30
> Number of cores: 240
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 16384 bytes
> Total number of registers available per block: 16384
> Warp size: 32
> Maximum number of threads per block: 512
> Maximum sizes of each dimension of a block: 512 x 512 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
> Maximum memory pitch: 262144 bytes
> Texture alignment: 256 bytes
> Clock rate: 1.44 GHz
> Concurrent copy and execution: Yes
> Run time limit on kernels: No
> Integrated: No
> Support host page-locked memory mapping: Yes
> Compute mode: Exclusive (only one
> host thread at a time can use this device)
>
> Device 2: "Tesla C1060"
> CUDA Capability Major revision number: 1
> CUDA Capability Minor revision number: 3
> Total amount of global memory: 4294705152 bytes
> Number of multiprocessors: 30
> Number of cores: 240
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 16384 bytes
> Total number of registers available per block: 16384
> Warp size: 32
> Maximum number of threads per block: 512
> Maximum sizes of each dimension of a block: 512 x 512 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
> Maximum memory pitch: 262144 bytes
> Texture alignment: 256 bytes
> Clock rate: 1.44 GHz
> Concurrent copy and execution: Yes
> Run time limit on kernels: No
> Integrated: No
> Support host page-locked memory mapping: Yes
> Compute mode: Exclusive (only one
> host thread at a time can use this device)
>
> Device 3: "Tesla C1060"
> CUDA Capability Major revision number: 1
> CUDA Capability Minor revision number: 3
> Total amount of global memory: 4294705152 bytes
> Number of multiprocessors: 30
> Number of cores: 240
> Total amount of constant memory: 65536 bytes
> Total amount of shared memory per block: 16384 bytes
> Total number of registers available per block: 16384
> Warp size: 32
> Maximum number of threads per block: 512
> Maximum sizes of each dimension of a block: 512 x 512 x 64
> Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
> Maximum memory pitch: 262144 bytes
> Texture alignment: 256 bytes
> Clock rate: 1.44 GHz
> Concurrent copy and execution: Yes
> Run time limit on kernels: No
> Integrated: No
> Support host page-locked memory mapping: Yes
> Compute mode: Exclusive (only one
> host thread at a time can use this device)
>
> Test PASSED
>
>
> Thanks for your help.
>
> Jyh-Shyong Ho
>
>
>
>
>
> Luigi Genovese wrote:
>> On Mon, Sep 21, 2009 at 1:29 PM, Jyh-Shyong <c00jsh00@nchc.org.tw> wrote:
>>
>>> Luigi Genovese wrote:
>>>
>>>> Did you run "make clean" before?
>>>>
>>>> It seems that the previously compiled routines were still without the
>>>> second underscore.
>>>>
>>>> Let me know
>>>>
>>>> Luigi
>>>>
>>>>
>>>>
>>> Hi Luigi,
>>>
>>> Thanks, I tried again from the source and it works: the compilation of
>>> bigdft 1.3 with CUDA and MPI completed without error.
>>>
>>> However, the configure script of abinit-5.8 does not have the
>>> --enable-cuda-gpu option for bigdft. How do I specify this option when I
>>> configure abinit so that the bigdft plugin can be built with CUDA support?
>>> By the way, each node of my GPU cluster has 2 or 4 C1060 GPU cards. I
>>> wonder if bigdft can use more than one GPU, and if so, where is the
>>> device assigned?
>>>
>>
>> Yes, there is an example in the tests/GPU subdirectory.
>> The GPUs can be specified in the GPU.config file, which must be present in
>> the same directory as the input.dft file (the BigDFT standalone version
>> does not use the same input files as abinit).
>>
>>
>> If you want to run a calculation with 8 MPI processes and 2 GPUs per
>> node, you have to put these lines in GPU.config:
>>
>> USE_SHARED=1
>>
>> MPI_TASKS_PER_NODE=8
>> NUM_GPU=2
>>
>> GPU_CPUS_AFF_0=0,1,2,3
>> GPU_CPUS_AFF_1=4,5,6,7
>>
>> USE_GPU_BLAS=1
>> USE_GPU_CONV=1
>>
>> In that way the cores with different IDs on the node (the IDs are
>> hard-coded on the node) are associated (affinity) with different cards.
>> In this example, CPUs 0,1,2,3 will share GPU 0 as a co-processor, and
>> CPUs 4,5,6,7 will share GPU 1. Calculations and data transfers are then
>> executed asynchronously to avoid overlapping.
>>
>> Another approach is a static assignment (USE_SHARED=0), but in that case
>> the number of tasks must be lower than the number of GPUs, and the
>> affinity settings are ignored.
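>>
>> For illustration only, a minimal sketch of what a static-assignment
>> GPU.config could look like, reusing the keys from the example above
>> (the task and GPU counts are placeholders, to be adapted to your nodes):
>>
>> USE_SHARED=0
>>
>> MPI_TASKS_PER_NODE=1
>> NUM_GPU=2
>>
>> USE_GPU_BLAS=1
>> USE_GPU_CONV=1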
>>
>> I hope it will help
>>
>> Luigi
>>
>>
>>> Best regards
>>>
>>> Jyh-Shyong Ho
>>>
>>>
>>>
>>
>>
>
>


