Re: [abinit-forum] abinit 5.8.4 /bigdft and CUDA

  • From: Jyh-Shyong <c00jsh00@nchc.org.tw>
  • To: forum@abinit.org
  • Subject: Re: [abinit-forum] abinit 5.8.4 /bigdft and CUDA
  • Date: Tue, 22 Sep 2009 10:58:24 +0800

Hi, Luigi,

I re-built bigdft-1.3.0 with the following configuration:

export CC=icc
export LIBS="-libverbs -libumad -libcommon -lpthread"
export CXX=icpc
export FC=ifort
export FCFLAGS="-assume 2underscore -O3"
./configure -prefix=/package/chem/bigdft --enable-cuda-gpu \
  --with-mpi-include="-I/opt/mvapich2/intel/include" \
  --with-mpi-ldflags="-L/opt/mvapich2/intel/lib64" \
  --with-mpi-libs="-lmpich -lfmpich" \
  --with-cuda-path=/opt/cuda --with-lib-cutils=/opt/cuda/lib \
  --with-ext-linalg="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -ldl -lsysfs" \
  --with-ext-linalg-path=-L/opt/intel/mkl/lib

and ran the Ca2 test case with the following GPU.config:
USE_SHARED=1

MPI_TASKS_PER_NODE=1
NUM_GPU=1

GPU_CPUS_AFF_0=0,1,2,3
GPU_CPUS_AFF_1=4,5,6,7


USE_GPU_BLAS=1
USE_GPU_CONV=1

I got the following output:
...
Shifted atomic positions, Atomic Units: grid spacing units:
1 Ca 9.25000E+00 9.25000E+00 9.16541E+00 18.500 18.500 18.331
2 Ca 9.25000E+00 9.25000E+00 1.48346E+01 18.500 18.500 29.669
Shift of= 9.25000E+00 9.25000E+00 9.16541E+00 Grid Spacings= 0.50 0.50 0.50
Box Sizes= 1.85000E+01 1.85000E+01 2.40000E+01 37 37 48
Extremes for the high resolution grid points: 11<26 11<26 10<38
wavelet localization is ON
------------------------------------------------------------ Poisson Kernel Creation
Calculating Poisson solver kernel, free BC...done.
Memory occ. per proc. (Bytes): Density=11943936 Kernel=12261192
Full Grid Arrays=11201400
------------------------------------------------- Wavefunctions Descriptors Creation
Coarse resolution grid: Number of segments= 1528 points= 40132
Fine resolution grid: Number of segments= 396 points= 4624
------------------------------------------------------------ PSP Projectors Creation
Type Name Number of atoms Number of projectors
1 Ca 2 14
------ On-the-fly projectors application
Total number of projectors = 28
Total number of components = 633584
Percent of zero components = 7
------------------------------------------------------------------ Memory Estimation
Number of atoms= 2 Number of orbitals= 2 Sim. Box Dimensions= 37 37 48
Estimation performed for 1 processors.
Memory occupation for principal arrays:
Poisson Solver Kernel (K): 11 MB 710 KB
Poisson Solver Density (D): 11 MB 400 KB
Single Wavefunction for one orbital: 0 MB 567 KB
All Wavefunctions for each processor: 2 MB 218 KB
Wavefunctions + DIIS per proc (W): 15 MB 500 KB
Nonlocal Pseudopotential Arrays (P): 4 MB 854 KB
Arrays of full uncompressed grid (U): 10 MB 699 KB
Estimation of Memory requirements for principal code sections:
Kernel calculation | Density Construction | Poisson Solver | Hamiltonian application
~11*K | ~W+(~3)*U+P | ~8*D+K+W+P | ~W+(~3)*U+P
128MB | 57MB | 123MB | 61MB
The overall memory requirement needed for this calculation is thus: 128 MB
By reducing the DIIS history and/or increasing the number of processors the amount of
memory can be reduced but for this system it will never be less than 46 MB
Wavefunctions memory occupation per processor (Bytes): 1160000
ion-ion interaction energy 7.05569614400000E-01
----------------------------------------------------------- Ionic Potential Creation
total ionic charge, leaked charge -4.000000000000 0.000E+00
PSolver, free BC, dimensions: 105 105 127 proc 1 ixc: 0 ...done.
------------------------------------------------------- Input Wavefunctions Creation
Input wavefunction data for atom Ca NOT found, automatic generation... done.
Generating 2 Atomic Input Orbitals
Wavefunctions memory occupation per processor (Bytes): 0
Calculating AIO wavefunctions... done.
Wavefunctions memory occupation per processor (Bytes): 1160000
Writing wavefunctions in wavelet form done.
Deviation from normalization of the imported orbitals 2.18E-03
CUDA error: unspecified launch failure, line 100
**** ERROR *** : c_cuda_gpu_recv_pi STREAM
forrtl: severe (174): SIGSEGV, segmentation fault occurred


In the directory tests/GPU, I ran the command

../../bin/cluster

and got the following output:
...
chem@gch:/package/chem/bigdft/tests/GPU> ../../bin/cluster
[BigDFT ASCII-art banner]                                          (Ver 1.3.0)
------------------------------------------------------------------------------------
| Daubechies Wavelets for DFT Pseudopotential Calculations |
------------------------------------------------------------------------------------
The Journal of Chemical Physics 129, 014109 (2008)
Number of atoms = 8
Number of atom types= 1
Atoms of type 1 are Si
Waiting for registration of all process...
OK, all process has semaphores : 1414758410
Logic CPU : 0 with 0
static repartition
Unix process (not MPI) 0 has GPU : 0
Check card on all nodes....
Spin-polarised calculation: NO
===================== BigDFT Wavefunction Optimization =============== inputPsiId= 0
------------------------------------------------------------------- Input Parameters
System Choice Resolution Radii SCF Iteration Finite Size Corr.
Max. hgrid = 0.450 | Coarse Wfs.= 6.00 | Wavefns Conv.= 1.0E-04 | Calculate= F
XC id= 1 | Fine Wfs.= 8.00 | Max. N. Iter.= 50x10 | Extension= 0.0
total charge= 0 | | CG Prec.Steps= 7 | CG Steps= 30
elec. field=0.0E+00 | | DIIS Hist. N.= 6
Geom. Code= P | Box Sizes (Bohr) = 1.02609E+01 1.02609E+01 1.02609E+01
------------------------------------------------------------------ System Properties
 Atom    N.Electr.  PSP Code   Radii: Coarse      Fine   CoarsePSP   Calculated   File
  Si         4          2             1.71389   0.50000    0.91005        X
 ------------------------------------ Pseudopotential coefficients (Upper Triangular)
 Atom Name     rloc        C1        C2        C3        C4
    Si       0.44000  -6.91363
      l=0       rl        h1j       h2j       h3j
              0.42433   3.20813   0.00000
                                  2.58888
      l=1       rl        h1j       h2j       h3j
              0.48536   2.65622
Total Number of Electrons 32
Total Number of Orbitals 16
occup(1:16)= 2.0000
Shifted atomic positions, Atomic Units: grid spacing units:
1 Si 0.00000E+00 0.00000E+00 0.00000E+00 0.000 0.000 0.000
2 Si 5.13043E+00 5.13043E+00 0.00000E+00 12.000 12.000 0.000
3 Si 5.13043E+00 0.00000E+00 5.13043E+00 12.000 0.000 12.000
4 Si 0.00000E+00 5.13043E+00 5.13043E+00 0.000 12.000 12.000
5 Si 2.56521E+00 2.56521E+00 2.56521E+00 6.000 6.000 6.000
6 Si 7.69564E+00 7.69564E+00 2.56521E+00 18.000 18.000 6.000
7 Si 7.69564E+00 2.56521E+00 7.69564E+00 18.000 6.000 18.000
8 Si 2.56521E+00 7.69564E+00 7.69564E+00 6.000 18.000 18.000
Shift of= 0.00000E+00 0.00000E+00 0.00000E+00 Grid Spacings= 0.43 0.43 0.43
Box Sizes= 1.02609E+01 1.02609E+01 1.02609E+01 23 23 23
Extremes for the high resolution grid points: 0<23 0<23 0<23
wavelet localization is OFF
------------------------------------------------------------ Poisson Kernel Creation
Poisson solver for periodic BC, no kernel calculation...done.
Memory occ. per proc. (Bytes): Density=884736 Kernel=125000
Full Grid Arrays=884736
------------------------------------------------- Wavefunctions Descriptors Creation
Coarse resolution grid: Number of segments= 576 points= 13824
Fine resolution grid: Number of segments= 826 points= 13448
------------------------------------------------------------ PSP Projectors Creation
Type Name Number of atoms Number of projectors
1 Si 8 5
------ On-the-fly projectors application
Total number of projectors = 40
Total number of components = 185880
Percent of zero components = 22
------------------------------------------------------------------ Memory Estimation
Number of atoms= 8 Number of orbitals= 16 Sim. Box Dimensions= 23 23 23
Estimation performed for 1 processors.
Memory occupation for principal arrays:
Poisson Solver Kernel (K): 0 MB 920 KB
Poisson Solver Density (D): 0 MB 864 KB
Single Wavefunction for one orbital: 0 MB 844 KB
All Wavefunctions for each processor: 26 MB 366 KB
Wavefunctions + DIIS per proc (W): 184 MB 514 KB
Nonlocal Pseudopotential Arrays (P): 1 MB 429 KB
Arrays of full uncompressed grid (U): 0 MB 96 KB
Estimation of Memory requirements for principal code sections:
Kernel calculation | Density Construction | Poisson Solver | Hamiltonian application
~11*K | ~W+(~3)*U+P | ~8*D+K+W+P | ~W+(~3)*U+P
9MB | 187MB | 193MB | 187MB
The overall memory requirement needed for this calculation is thus: 193 MB
By reducing the DIIS history and/or increasing the number of processors the amount of
memory can be reduced but for this system it will never be less than 4 MB
Wavefunctions memory occupation per processor (Bytes): 13818880
ion-ion interaction energy -3.46440265472504E+01
----------------------------------------------------------- Ionic Potential Creation
total ionic charge, leaked charge -32.000000000000 0.000E+00
PSolver, periodic BC, dimensions: 48 48 48 proc 1 ixc: 0 ...done.
------------------------------------------------------- Input Wavefunctions Creation
Input wavefunction data for atom Si NOT found, automatic generation... done.
Generating 32 Atomic Input Orbitals
Wavefunctions memory occupation per processor (Bytes): 0
Calculating AIO wavefunctions... done.
Wavefunctions memory occupation per processor (Bytes): 27637760
Writing wavefunctions in wavelet form done.
Deviation from normalization of the imported orbitals 6.90E-02
Calculation of charge density... done. Total electronic charge= 0.000000000000
PSolver, periodic BC, dimensions: 48 48 48 proc 1 ixc: 1 ...

invcb : BUG -
Fast computation of inverse cubic root failed.
exiting...




Something is wrong.
Here is the status of the GPUs on this computer:

chem@gch:/package/chem/bigdft/tests/GPU> deviceQuery

CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA

Device 0: "Tesla C1060"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)

Device 1: "Tesla C1060"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)

Device 2: "Tesla C1060"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)

Device 3: "Tesla C1060"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.44 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)

Test PASSED


Thanks for your help.

Jyh-Shyong Ho





Luigi Genovese wrote:
On Mon, Sep 21, 2009 at 1:29 PM, Jyh-Shyong <c00jsh00@nchc.org.tw> wrote:
Luigi Genovese wrote:
Did you run "make clean" before?

It seems that the previously compiled routines were still built without the
second underscore.

Let me know

Luigi


Hi Luigi,

Thanks, I tried again from the source and it works: the compilation of
bigdft 1.3 with CUDA and MPI completed without error.

However, the configure script of abinit-5.8 does not have the
--enable-cuda-gpu option for bigdft. How do I specify this option when I
configure abinit so that the bigdft plugin can be built with CUDA support?
By the way, each node of my GPU cluster has 2 or 4 C1060 GPU cards. Can
bigdft use more than one GPU, and if so, where is the device assignment
specified?

Yes, there is an example in the tests/GPU subdirectory.
The GPUs are specified in the GPU.config file, which must be present in
the same directory as the input.dft file (the BigDFT standalone version
does not use the same input files as abinit).


If you want to run a calculation for a system with 8 MPI processes and
2 GPUs per node, you have to put these lines in GPU.config:

USE_SHARED=1

MPI_TASKS_PER_NODE=8
NUM_GPU=2

GPU_CPUS_AFF_0=0,1,2,3
GPU_CPUS_AFF_1=4,5,6,7

USE_GPU_BLAS=1
USE_GPU_CONV=1

In that way, cores with different IDs within the node (the IDs are fixed
on the node) are bound (affinity) to different cards.
In this example, CPUs 0,1,2,3 will share the same GPU (0) as a co-processor.
Calculations and data transfers are then executed asynchronously so that
they do not overlap.
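
As an illustration of that asynchronous pattern, here is a minimal CUDA
sketch (generic CUDA, not BigDFT's actual GPU code; the kernel, buffer
sizes and names are placeholders): a stream carries an asynchronous
host-to-device copy, a kernel launch and a copy back, and the host only
waits at the final synchronization point.

#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel standing in for a real operation (convolution, BLAS call, ...).
__global__ void scale(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main()
{
    const int n = 1 << 20;
    double *h = 0, *d = 0;
    cudaStream_t s;
    cudaMallocHost(&h, n * sizeof(double));  // page-locked host buffer, required for async copies
    cudaMalloc(&d, n * sizeof(double));
    cudaStreamCreate(&s);
    for (int i = 0; i < n; ++i) h[i] = 1.0;

    // Copy in, compute, copy out: all queued on one stream, asynchronous w.r.t. the host.
    cudaMemcpyAsync(d, h, n * sizeof(double), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(double), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);                // wait here for the queued work to finish

    // Any launch or copy failure is reported at this point, e.g. the
    // "unspecified launch failure" string seen in the Ca2 run earlier in this thread.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}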

Another approach is static assignment (USE_SHARED=0), but in that case the
number of tasks must be lower than the number of GPUs, and the affinity
settings are ignored.
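
For example, a static-assignment GPU.config could look like the sketch
below (illustrative only, following the description above: strictly fewer
MPI tasks per node than GPUs, and no GPU_CPUS_AFF_* lines since they are
ignored in this mode):

USE_SHARED=0

MPI_TASKS_PER_NODE=1
NUM_GPU=2

USE_GPU_BLAS=1
USE_GPU_CONV=1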

I hope it will help

Luigi

Best regards

Jyh-Shyong Ho






