Challenges in GPU porting: the Quantum ESPRESSO case
Filippo SPIGA 1,2
1 High Performance Computing Service, University of Cambridge
2 Quantum ESPRESSO Foundation

«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)

Outline
• Overview
• The Quantum ESPRESSO project
• PWSCF GPU porting
• Challenges in GPU porting
• Performance evaluation
• Final considerations: performance, correctness and sustainability
QUANTUM ESPRESSO
What is QUANTUM ESPRESSO?
• QUANTUM ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• QUANTUM ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• QUANTUM ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development
What can QUANTUM ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics); electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudo-potentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• Wannier interpolations
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
plus several post-processing tools!
QUANTUM ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web sites (QUANTUM-ESPRESSO.ORG, QE-FORGE.ORG)
• 1 user mailing list, 1 developer mailing list
• 24 international schools and training courses (1000+ participants)
The development model
• QUANTUM ESPRESSO is not a monolithic application, but an integrated ecosystem thriving around a small number of core components developed and maintained by a small number of developers
• the ecosystem is designed so as to be alien-friendly: a number of third-party QE-compatible applications and add-ons, often designed to be code-agnostic, are distributed with QE (notable examples include wannier90, Yambo, EPW, WanT, XCrysDen, ...)
• the environment that allows the ecosystem to prosper is provided by the QE-FORGE.ORG platform, freely available to researchers and developers from all over the world
Quantum ESPRESSO package portfolio
[Diagram: Wannier90, WanT, PLUMED, SaX, Yambo, TDDFPT, Xspectra, GIPAW, GWL, NEB, PHonon, PWscf, Atomic, PWCOND and CP arranged around the CORE modules]
QE Scalability (2010)
[Figures from P. Giannozzi et al., J. Phys.: Condens. Matter 21 (2009) 395502]
• CP (Aβ-peptide in water: 838 atoms and 2311 electrons, gamma point). Figure 3: scalability for medium-size calculations (CP code): CPU time (s) per electronic time step and speedup with respect to 32 processors, as a function of the number of processors and of the number of task groups, on an IBM BlueGene/P and on an SGI Altix; 22.1 x 22.9 x 19.9 Å³ cell, ultrasoft pseudopotentials, Γ point, 25 and 250 Ry cutoff for the orbitals and the charge density respectively.
• PWscf (PSIWAT: 587 atoms, 2552 electrons, 4 k-points; CNT: 1532 atoms, 5232 electrons, gamma point). Figure 4: scalability for large-scale calculations: wall time and speedup as a function of the number of processors. PSIWAT: PWscf code, npool = 4, ntask = 4, on a Cray XT4; gold surface covered by thiols in interaction with water, 10.59 x 20.53 x 32.66 Å³ cell. CNT (1): PWscf code, ntask = 4, on a Cray XT4; porphyrin-functionalized nanotube, Γ point. CNT (2): CP code on a Cray XT3, same system as CNT (1). Times for PSIWAT and CNT (1) are for 10 and 2 self-consistency iterations respectively; times for CNT (2) are for 10 electronic steps plus one Car-Parrinello step, divided by 2 so that they fall in the same range as CNT (1).
PWSCF GPU PORTING
Plane Wave Self-Consistent Field (PWSCF)
The solution of the Kohn-Sham equations requires the diagonalization of the matrix H_KS, whose matrix elements are

Kinetic energy:
$$ \langle k+G|T|k+G'\rangle = \frac{\hbar^2}{2m}\,(k+G)^2\,\delta_{G,G'} $$

Hartree term:
$$ \langle k+G|V_H|k+G'\rangle = V_H(G-G') = 4\pi e^2\,\frac{n(G-G')}{|G-G'|^2} $$

Exchange-correlation:
$$ \langle k+G|V_{xc}|k+G'\rangle = \mathrm{FT}\!\left[V_{xc}(r)\right](G-G') $$
PWscf parallelization hierarchy
• Images: only for Nudged Elastic Band (NEB) calculations
• K-points: distribution over k-points (if more than one); scalability over k-points (if more than one); no memory scaling
• Plane waves (BGRP_COMM): distribution of wave-function coefficients; distribution of real-grid points; good memory scaling, good overall scalability, load balancing
• Linear algebra / task groups: iterative diagonalization (fully parallel or serial); smart grouping of 3D FFTs to reduce compulsory MPI communications
• Multi-threaded kernels: OpenMP handled explicitly or implicitly; extends the scaling on multi-core machines with "limited" memory
Simplified PWSCF life-cycle
• Setup of the initial physical quantities
• Self-consistency loop:
  – calculation of wavefunctions
  – diagonalization of the Hamiltonian matrix, for each k-point  [3D-FFT + GEMM + LAPACK]
  – wavefunction ortho-normalization
  – calculation of the charge density  [3D-FFT + GEMM]
  – calculation of the new potentials  [3D-FFT]
  – calculation of the new non-local potentials
• When self-consistency is achieved: calculation of new forces (and stresses)
• Structure optimization: while forces and stresses are not zero, calculation of new atomic positions (and lattice parameters) followed by a new self-consistency cycle
A#"lucky"#star>ng#point# AUSURF,#112#Au#atoms,#2#k\point#
1#SCF#itera>on#
1#full#SCF#cycle# 14#
GPU developments
• MPI-GPU binding, GPU memory management
• NEWD → CUDA NEWD (multiple kernels combined)
• ADDUSDENS → CUDA ADDUSDENS
• VLOC_PSI → CUDA VLOC_PSI (CUDA kernels + CUFFT)
• BLAS3 *GEMM → phiGEMM library (CUBLAS)
• (serial) LAPACK → MAGMA library
VLOC_PSI acts over distributed data; NEWD/ADDUSDENS act over local data
CUDA kernels
• ADDUSDENS
  – compute-bound kernel
  – best performance measured: 20x* (realistic? 9x~10x)
• NEWD
  – compute-bound kernels
  – best performance measured: 7.2x* (realistic? 3x~4x)
• VLOC_PSI
  – memory-bound kernels
  – best (serial) performance measured: 9x (realistic? ...)
• All the data is moved to GPU memory at once
• External loops over atomic species are kept on the CPU side
QE-GPU configure capability
Depending on the options detected/enabled at configure time (CUDA? phiGEMM? profiling? MAGMA? "fast CUDA"? ELPA?), the build falls into one of:
• CPU-only with GEMM profile capability
• GPU-CPU with serial diagonalization (MAGMA)
• GPU-CPU with serial diagonalization (MAGMA) and advanced optimizations*
• GPU-CPU with parallel diagonalization based on ScaLAPACK (CPU only)
• GPU-CPU with parallel diagonalization based on ELPA (CPU only)
• GPU-CPU with parallel diagonalization based on ELPA (CPU + GPU*)
CHALLENGES IN GPU PORTING
The#"Condensed#Ma~er#Physics"#happy#family# CASTEP#and#QE#shares...# •
similar#data#distribu>on#(by#plane#waves,#by#k\point,#by#bands,#...)#
•
similar#MPI#communica>on#pa~erns#(MPI_alltoall,#MPI_reduce)#
•
similar#constrains#in#scalability#
# GPU#matches#between#CASTEP#and#QE#...# •
Block%Davidson%solver%with%density%mixing%!#eigen\solvers,#vloc_psi#
•
Support%of%ultrasoA/normCconserving%pseudopotenDal%!#newd#addusdens#
•
Geometry%opDmizaDon%!#stress#forces#calcula>ons#
•
BLAS#3#opera>ons#!#CUBLAS,#phiGEMM# 19#
QE ad-hoc parallel 3D-FFT
There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4Ecut).
A single 3D-FFT is divided into independent 1D-FFTs:
• transform along z: ~ Nx Ny / 5 FFTs along z
• parallel transpose: ~ Nx Ny Nz / (5 Np) data exchanged per PE
• transform along y: ~ Nx Nz / 2 FFTs along y
• transform along x: Ny Nz FFTs along x
Data are not contiguous and not "trivially" distributed across processors (PE 0 ... PE 3 in the diagram).
Zeros are not transformed; the different cut-offs preserve accuracy.
H * psi
compute/update H * psi:
  compute kinetic and non-local term (in G space)
    complexity: Ni x (N x Ng + Ng x N x Np)
  Loop over (not converged) bands:
    FFT (psi) to R space
      complexity: Ni x Nb x FFT(Nr)
    compute V * psi
      complexity: Ni x Nb x Nr
    FFT (V * psi) back to G space
      complexity: Ni x Nb x FFT(Nr)
  compute Vexx
    complexity: Ni x Nc x Nq x Nb x (5 x Nr + 2 x FFT(Nr))
where: N = 2 x Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may be different from Nk).
Task-group parallelization
Without task groups, each processor P0..P3 takes part in every 3D FFT:

do i = 1, n
  compute 3D FFT( psi(i) )
end do

With task groups G0..G3, wave functions are merged and transformed in batches (see the communicator-splitting sketch below):

do i = 1, n/groupsize
  merge( psi( i ), psi( i + 1 ) )
  compute groupsize 3D FFTs (at the same time)
end do
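As an illustration of how such groups can be formed in practice, here is a minimal MPI sketch (not QE's actual implementation; communicator, variable names and the group size are invented) that splits the global communicator into task groups so that `groupsize` processes can cooperate on a batch of 3D FFTs while different groups work on different bands:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int groupsize = 4;          /* hypothetical: ranks per task group    */
    int color = rank / groupsize;     /* ranks 0..3 -> group 0, 4..7 -> 1, ... */

    /* Each task group gets its own communicator; the all-to-all of the
       distributed 3D FFT then happens inside this smaller communicator,
       while different groups transform different bands concurrently.    */
    MPI_Comm tg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &tg_comm);

    int tg_rank, tg_size;
    MPI_Comm_rank(tg_comm, &tg_rank);
    MPI_Comm_size(tg_comm, &tg_size);
    printf("world rank %d -> task group %d (rank %d of %d)\n",
           rank, color, tg_rank, tg_size);

    MPI_Comm_free(&tg_comm);
    MPI_Finalize();
    return 0;
}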
Parallel VLOC_PSI
[Diagram] PSI → FFT G→R (CPU) or CUFFT G→R (GPU) → PSIC; multiple LOCAL grids to compute; the DISTRIBUTED data are assembled with an "MPI_Allgatherv", the products with the local potential are computed, then redistributed with an "MPI_Allscatterv"; PSIC → FFT R→G (CPU) or CUFFT R→G (GPU) → HPSI.
Overlapping is possible!
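To make the serial building block concrete, the following is a minimal, self-contained CUDA/cuFFT sketch of the vloc_psi pattern (transform ψ to real space, multiply by the local potential, transform back). Grid sizes, plan handling and the vr potential array are illustrative assumptions, not QE-GPU's actual code:

#include <cufft.h>
#include <cuda_runtime.h>

// Pointwise multiply psic(r) by the real local potential vr(r).
__global__ void apply_vloc(cufftDoubleComplex *psic, const double *vr, int nr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nr) {
        psic[i].x *= vr[i];
        psic[i].y *= vr[i];
    }
}

// psic and vr are device arrays of size nx*ny*nz; psic holds psi on the FFT
// grid in G space on entry and V_loc * psi in G space on return.
void vloc_psi_gpu(cufftDoubleComplex *psic, const double *vr,
                  int nx, int ny, int nz) {
    int nr = nx * ny * nz;
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    // G -> R
    cufftExecZ2Z(plan, psic, psic, CUFFT_INVERSE);
    // multiply by the local potential in real space
    apply_vloc<<<(nr + 255) / 256, 256>>>(psic, vr, nr);
    // R -> G (cuFFT transforms are unnormalized; the 1/nr factor
    // would normally be absorbed elsewhere)
    cufftExecZ2Z(plan, psic, psic, CUFFT_FORWARD);

    cufftDestroy(plan);
}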
Parallel VLOC_PSI - limitations
[Figure: scaling limitation data for grid/process counts 512, 256 and 64]
Dependencies,"workflows,"and"itera= What#scien>sts#do?#
Courtesy#of#Prof.#Nicola%Marzari%(THEOS,#EPFL)#
GPU#"#
NO#GPU### 25#
Next developments / priorities
TOP PRIORITY:
• specific NVIDIA Kepler optimizations → NVIDIA Dev Tech
• non-collinear spin-orbit magnetization
• PHonon using OpenACC
"LESS" PRIORITY:
• first GPU-accelerated CP (special focus on DFORCE)
• alternatives for the current eigen-solvers (ELPA+GPU? Lanczos?)
and... DOCUMENTATION!
PERFORMANCE EVALUATION
AUSURF112, serial
[Bar chart: measured speedups of 7.81x, 5.54x, 3.49x and 3.54x]
Tests run early 2012
Performance and power consumption (serial)
[Charts for AUSURF112 k-point, Shilu-3 and Water-on-Calcite on a C2050, comparing 6 OMP, 6 MPI, and 6 OMP + 1 GPU runs: time (s) and power (kW/h). Measured speedups of 3.67x, 3.1x and 3.2x, with energy reductions of 57%, 58% and 54%.]
Tests run early 2012
MGST-hex
216 atoms of {Ge, Mn, Te, Sb}, gamma-only (courtesy of Zhang W. - RWTH Aachen)
[Chart: speedups of 2.43x, 2.96x and 3.38x]
Tests run early 2012
CdSe159
159 atoms of {Cd, Se}, gamma-only (courtesy of Calzolari A. - CNR/NANO)
[Left: walltime of 1 full SCF for CdSe-159, CPU only vs CPU+GPU, versus total number of cores from 2 (16) to 14 (112); CPU+GPU speedups of 2.52x, 2.40x, 2.17x, 2.06x, 2.19x and 1.89x. Right: speedup vs total number of CPU cores, CPU only / CPU+GPU / ideal.]
Tests run early 2012
IRMOF-M11
130 atoms of {O, Zn, C, H}, 1 k-point (courtesy of Clima S. - IMEC)
[Chart: IR-MOF (130), walltime of full SCF (x3), CPU only vs CPU+GPU, versus total number of cores from 1 (8) to 8 (64); CPU+GPU speedups of 1.87x, 1.79x, 1.97x and 1.37x]
Tests run early 2012
Hitting the limit
130 atoms of {O, Zn, C, H}, 1 k-point (courtesy of Clima S. - IMEC)
[Charts: IR-MOF (130), 3 relax steps, runtime broken down into init / electrons / updatepot / forces / stress for CPU-only (C) and CPU+GPU (G) runs at 1 (8) to 8 (64) cores, with percentage labels (26-45%) on the dominant sections. Overall speedups drop with respect to the SCF-only case: 1.49x (vs 1.87x), 1.46x (vs 1.79x), 1.26x (vs 1.41x).]
Tests run early 2012
Hitting the limit
489 atoms of {Cd, Se}, 3(+1/2) SCF steps with STRESS and FORCES, 24 MPI x 6 OMP (144 cores) @ PLX (CINECA)
[Chart: speedups of 3.2x and ~1.73x]
Tests run early 2012
Cheating using parallelism: __USE_3D_FFT
120 atoms of {Bi, Fe, O, LS, Mn}, reduced to 8 k-points (courtesy of Ferreira R. - Rio de Janeiro Federal Univ.)
# pools = # MPI processes (dimension of each pool = 1, k-points distributed in "round-robin")

Compute nodes and execution time [s] for 10 self-consistency cycles:
• 2 x iDataPlex DX360M3, dual Xeon E5645 6-core 2.40 GHz (24 cores): 52057.22
• 2 x iDataPlex DX360M3, dual Xeon E5645 6-core 2.40 GHz (24 cores) + 4 NVIDIA 2070 (__USE_3D_FFT): 10029.1
Speed-up: 5.2x
Tests run early 2012
GPU-accelerated PWscf: when and why?
• Physical system criteria
  – "crazy" amount of atoms ( O(1000) ) in gamma-only (or 1~2 k-points)
  – reasonable amount of atoms ( O(100) )
  – small amount of atoms ( O(10) ) but many k-points
  – O(10)~O(100) atoms but with a lot of atomic species
• Hardware criteria*
  – workstations equipped with commodity GTX and/or high-end TESLA GPUs
  – reasonable HPC clusters (O(100) nodes, 1:1 ratio of GPUs versus CPU sockets)
• Workflow criteria
  – long SCF energy minimization
  – high throughput
FINAL CONSIDERATIONS: PERFORMANCE, CORRECTNESS AND SUSTAINABILITY
What drives (my) agenda?
Sustainability, Performance, Correctness
Sustainability vs Performance vs Correctness
• CUDA or not CUDA? (a Shakespearean dilemma)
• Third-party libraries hide complexity but also introduce "noise"
• Reduction operations (explicitly managed or implicitly inherited) are the nightmare
• Bugs can be very tricky and not easy to detect
• An embedded unit-test suite is needed, but also an Acceptance Testing Procedure
• The need for certified QE-GPU benchmarks*
• Guaranteeing performance portability is very difficult
OpenACC: yes, no, maybe?
PROs:
• easy learning curve
• fast first deployment
• acceptable performance compromises
CONs:
• code has to be rewritten to suit the accelerator
• lack of direct control over the generated GPU code
• memory sharing and competition
and now ... OpenMP 4.0 RC2 with directives for attached accelerators
(pseudo-random) Personal thoughts
• Training users is challenging (we kept everything simple but...)
• User support is time consuming
• CUDA is easier than most people think
• Focus on fine-grain performance ONLY at the final stage
• There is not only the parallel world
• No more old-fashioned porting
The Optimization Dilemma
Is it worth spending time and effort optimizing the code for the latest available architecture, using the latest cool set of features provided by the most recent version of the CUDA toolkit/driver?
Who are the users? Are they users or developers? What hardware do they have? What is their scientific workflow? What are the scientific challenges they face? What is their budget? What ROI do they expect?
THANK YOU FOR YOUR ATTENTION!
Links:
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/
Porting the ONETEP Linear-Scaling Quantum Chemistry Code to Massively Parallel Heterogeneous Architectures
Karl Wilkinson Skylaris Group, University of Southampton
2
Outline 1. Computational Chemistry 2. Heterogeneous Architectures and Programming models 3. Order-N Electronic Total Energy Package (ONETEP) 4. GPU Code Developments and Performance 5. Hybrid MPI-OpenMP Parallelism 6. Future Work 7. Acknowledgements
3
COMPUTATIONAL CHEMISTRY
4
Computational Chemistry
! Trade-offs:
  – Accuracy: low (molecular mechanics) to high (ab initio)
  – System size: low (10's of atoms) to high (millions of atoms)
  – Level of sampling: low (single point energy) to high (free energy surface)
! Drug design
  – Lead structure design
  – Protein-lead structure interactions
! Materials
  – Epoxidised linseed oil - ship hulls
! Catalysis
! Energy research
  – Enzyme-biomass interactions
5
HETEROGENEOUS ARCHITECTURES AND PROGRAMMING MODELS
Heterogeneous Compute Architectures
! Heterogeneous: multiple distinct types of processors
  – More than 10% of the top 500 supercomputers use some form of accelerator
  – Typically Central Processing Units (CPUs) coupled with Graphical Processing Units (GPUs)
  – Common in high-powered workstations
  – Intel Xeon Phis emerging
! Development considerations
  – Generally need to adapt code to take advantage of the architecture
  – A single code will contain multiple algorithms, each executed on the most suitable processing unit
  – May result in new bottlenecks (PCIe data transfer)
6
Parallelism in Heterogeneous Machines
[Diagram] A machine consists of O(100-10,000) nodes; each node has CPU sockets (multicore CPUs with 8-16 cores each) and PCIe sockets hosting GPU-based accelerator boards with O(10) streaming multiprocessors of 32-192 CUDA cores (SP/DP) each.
Titan: 18,688 nodes; 18,688 CPUs (hexadecacore) = 299,008 CPU cores; 18,688 K20X GPUs (14 MPs each) = 261,632 multiprocessors; 560,640 total cores.
Memory in Heterogeneous Machines
! Complex memory hierarchy: CPU sockets (connected by QPI) with system RAM; GPUs with global GPU memory, multiprocessor memory and core caches; host-device transfer is the typical data-transfer bottleneck
! GPUs designed to hide the movement of data on the GPU itself
! Major issues:
  – Host-device data transfer
  – CPU speed for data-parallel execution
  – GPU speed for serial execution
! Hardware manufacturers are addressing these issues, creating uncertainty over future hardware:
  – CPUs becoming more like GPUs (increasing core count)
  – GPUs gaining CPU-like functionality (Project Denver)
  – Hybrid "system-on-chip" models
9
Motivation for choosing OpenACC
! Hardware considerations
  – Multiple hardware models already available
  – Uncertainty over future hardware models
! Development considerations
  – Complexity of hardware: the average developer is a domain specialist
  – Porting existing code rather than writing from scratch
  – Avoid changing the original version of the code
  – Avoid mixing languages
  – Avoid reinventing the wheel
  – Ensure maintainability of code
! Solutions
  – Ensure portability and future-proofing
  – Abstractions help to protect the developer
  – Pragmas and directives: PGI Accelerator, OpenACC
10
ORDER-N ELECTRONIC TOTAL ENERGY PACKAGE (ONETEP)
11
ONETEP ! Commercial code, distributed by Accelrys as part of Materials Studio ! Actively developed by a number of research groups –
Central SVN repository
! Standard Fortran 95 ! 233,032 lines of code ! Parallel design from onset – Message Passing Interface –
Aiming to extend beyond MPI parallelism
12
ONETEP ! Highly accurate linear-scaling quantum chemistry package – – –
Number of atoms Number of cores Maintain number of iterations
! Wide range of functionality – Applicable to wide range of problems – Used for materials science and biological systems
! Scales to tens of thousands of atoms ! Runs on hundreds of cores using MPI
13
ONETEP – Brief Theory ! Utilizes a reformulation of Density Functional Theory in terms of the one-particle density matrix – –
Broad strokes: The key parameter is the electron density Electrons exist within atomic orbitals, orbitals represented by mathematical functions
! Scaling with number of atoms achieved through exploitation of the “nearsightedness” of electrons in many atom systems – –
Localized density kernel, K, and functions Not sums over all possible pairs
14
ONETEP – Methodology
! Iterative process
  – Optimize the NGWFs (φ(r))
  – Optimize the density kernel (K)
! Strictly localized non-orthogonal generalized Wannier functions (NGWFs) (φ(r)) used to represent atomic orbitals
ONETEP – Methodology
! NGWFs are expressed on a real-space grid
  – Technically: a psinc basis set, equivalent to a plane-wave basis set
! Parallel scaling achieved through decomposition of data across processors
! Three types of data: atomic, simulation cell, sparse matrix
! Localized operations are performed within FFT boxes (computationally efficient)
! FFT box: 3x the diameter of the largest localized function
[Diagram: simulation cell with N1/N2 directions and an FFT box around atom A]
ONETEP – Data distribution and movement ! Atomic (FFT box) operations produce data for simulation cell operations and vice versa
! Memory efficient data abstraction required
16
ONETEP – Data distribution and movement ! The parallelepiped (PPD) is the basic unit of data used in this process
! Data contiguity within PPDs is inconsistent relative to other representations
! Natural barrier to initial areas of accelerated code
17
18
ONETEP – Profiling ! Small chemical system by ONETEP standards (Tennis ball dimer: 181 atoms)
! Basic calculation
! Representative level of accuracy ! Most common density kernel optimization scheme ! Single iteration of NGWF optimizations (over represents NGWF gradient step)
19
CODE DEVELOPMENTS
Scope of the Developments
20
! Focus on atom localized operations: FFT boxes ! Focus on core algorithms that will benefit full range of functionality ! Maintain data structures and current communications schemes
! Minimize disruption to rest of code
Submitted to Journal of Computational Chemistry
21
Naive Implementation ! Simple switch from CPU FFT library (MKL/FFTW/ACML) to a GPU library (CUFFT) ! No consideration given to data transfer optimization ! Little/no benefit
! Extend to other areas of code…
22
Implementation Notes
! Hybrid PGI Accelerator/CUDA Fortran
  – PGI Accelerator for kernels
  – CUDA Fortran for data transfers
! CUFFT library
! PGI Accelerator / OpenACC
  – Very straightforward code, no complex use of pragmas
! Asynchronous data transfers by default in "pure" PGI Accelerator
  – The hybrid model breaks this
23
Data Transfer Implementation
24
Charge Density (36% of runtime)
Diagonal elements of the one-particle density matrix, using n(r) = ρ(r; r):
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Calculate product and sum
7. Deposit charge density data in simulation cell
25
Charge Density (36% of runtime) - Data transfer to GPU not shown
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space

!$acc data region
!$acc region
coarse_work(1:n1,1:n2,1:n3) = scalefac * cmplx( &
    fftbox_batch(:,:,:,is,3,batch_count), &
    fftbox_batch(:,:,:,is,4,batch_count), kind=DP)
!$acc end region

! Fourier transform to reciprocal space on coarse grid
call cufftExec(cufftplan_coarse, planType, coarse_work, &
               coarse_work, CUFFT_FORWARD)
26
Charge Density (36% of runtime)
4. Up-sample to fine grid (avoid aliasing errors)
• Populate the corners of a cube: the coarse array goes into the corners of the (zero-filled) fine array
• Typical dimensions of the coarse grid: 75-125 (odd numbers); double this for the fine grid
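As an illustration of the corner-populate step (not ONETEP's actual routine; array names and the odd-size index split are assumptions), a CUDA kernel for this reciprocal-space zero-padding might look like:

#include <cuComplex.h>

// Copy a coarse reciprocal-space box (nc^3, odd nc) into the corners of a
// zero-initialized fine box (nf^3, e.g. nf = 2*nc), preserving FFT index
// ordering: indices 0..(nc+1)/2-1 are non-negative frequencies, the rest
// are negative frequencies and must land at the top end of each fine axis.
__global__ void upsample_corners(const cuDoubleComplex *coarse,
                                 cuDoubleComplex *fine,
                                 int nc, int nf) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // coarse linear index
    if (i >= nc * nc * nc) return;

    int cx = i % nc, cy = (i / nc) % nc, cz = i / (nc * nc);
    int half = (nc + 1) / 2;                         // # non-negative freqs

    int fx = (cx < half) ? cx : cx + (nf - nc);
    int fy = (cy < half) ? cy : cy + (nf - nc);
    int fz = (cz < half) ? cz : cz + (nf - nc);

    fine[(size_t)fz * nf * nf + (size_t)fy * nf + fx] = coarse[i];
}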
27
5. Transform to real space 6. Calculate product and sum (if common atom) 7. Deposit charge density data in simulation cell
28
Charge Density Performance ! Tested on Iridis3, Emerald and Nvidia PSG cluster ! Detailed timings are blocking –
No asynchronous kernel execution and data transfer
! Data transfer stages (2 and 7) problematic, larger quantity of data in step 7
29
Charge Density Performance ! Clear shift in bottlenecks to data transfer stages
Local Potential Integrals
1. Extract NGWF
2. Populate FFT box
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Extract local potential FFT box from simulation cell
7. Apply local potential
8. Transform to reciprocal space
9. Down-sample to coarse grid
10. Transform to real space
11. Extract PPDs that overlap with the first NGWF
12. Extract PPDs that overlap with the second NGWF
13. Calculate the dot product of the two sets of PPDs
[Diagram: host-to-GPU transfers of the NGWF data, FFTs, up-sample to fine grid, apply potential, FFT, down-sample to coarse grid, FFT, extract PPDs, GPU-to-host transfer of the resulting integral ∫ φ*(r) V_loc(r) φ(r) dr]
Performance - Bottlenecks
[Chart: relative cost of stages]
Local Potential Integrals
• More data transfers
Kinetic integrals
• Only the coarse grid
NGWF gradient
• Partly accelerated
• Significantly more complex
• Data transfer at stage 4
31
32
Performance Summary
! Performance of 1 GPU vs 1 CPU core (unfair), times in seconds (speedup):

                            Xeon E5620   Tesla M2050   Tesla M2090   Kepler K20
Charge Density                1221.1     340.5 (3.6)   271.1 (4.5)   225.4 (5.4)
Local Potential Integrals      861.3     301.2 (2.9)   242.2 (3.6)   219.1 (3.9)
Kinetic Integrals              268.9     127.0 (2.1)   112.4 (2.4)   131.0 (2.1)
NGWF Gradient                 1120.8     636.1 (1.8)   521.5 (2.1)   509.1 (2.2)

! Total calculation time:
  – 1 GPU is 2-3.5x faster than 1 CPU core, equivalent to 4 CPU cores
33
Scaling Performance
! Good scaling up to 4 GPUs - seen for all areas of accelerated code
[Stacked-bar chart: time (s) split into charge density, local potential integrals, kinetic integrals, NGWF gradient and other, for 4x CPU core, 1x/2x/4x M2050, 1x/2x/4x M2090 and 1x/2x/4x K20 configurations; hardware details in backup]
34
Understanding Poor Scaling
! Neglected thread affinity
  – Quick Path Interconnect
! Not the only problem
  – Sheer size of data transfers

Total  GPUs per  Number    GPUs on bus 0  GPUs on bus 1  Total calc.  Scaling rel.  % of
GPUs   node      of nodes  (max. 3)       (max. 5)       time (s)     to 1 GPU      ideal
  1       1          1          1              0           5655.9         1.0        100
  8       8          1          3              5           1737.5         3.3         41
 24       8          3          3              5            637.9         8.9         37
 24       6          4          3              3            620.6         9.1         38
 24       4          6          2              2            471.9        12.0         50
 24       2         12          1              1            362.9        15.6         65
35
Strong Scaling
! Detailed scaling tests on Emerald: 8 GPUs and 2 CPUs per node
! Comparing against a single node - impressive results
! Unfortunately not the full story...
  – Data transfer on a node saturating the available PCIe bandwidth?
[Plots: speedup vs number of nodes and vs number of GPUs for a 495-atom cellulose nanofibril and a 1717-atom beta amyloid fibril, against the ideal line]
36
Conclusions ! Overall acceleration of a factor of 2 to 3x relative to single CPU core ! Equivalent performance to 4 CPU cores ! Data transfer serious bottleneck – –
No asynchronous data transfer Running in lock-step due to blocking MPI communications
! Strong scaling is hit hard by limited PCIe bandwidth
! Promising results, much more to do with obvious first target
37
HYBRID MPI-OPENMP PARALLELISM
38
Motivation
! Hard limit of 1 atom per core
! 10 atoms per core is optimal
  – Communications-workload balance
! Memory demands mean only 50% of cores are used on some machines (HECToR)
! Scaling limited to hundreds of cores
! Myself:
  – Charge density
  – Local potential integrals
  – Kinetic integrals
  – NGWF gradient
! Dr. Nick Hine:
  – Sparse algebra
  – Projectors
  – Testing and modification for IBM BlueGene/Q at the Hartree Centre
39
Method ! Level above that parallelized by GPU developments
40
Initial Results
! Charge density only
! Tested on a 16-core workstation
! 181-atom tennis ball dimer

Total time, % of optimal (rows: MPI processes, columns: OpenMP threads)
             1       2       4       8      16
   1      100.0    77.2    51.8    31.2    15.4
   2       96.4    71.1    47.1    24.4
   4       90.7    64.2    37.5
   8       78.5    50.5
  16       59.6

Charge density, % of optimal (rows: MPI processes, columns: OpenMP threads)
             1       2       4       8      16
   1      100.0    93.9    78.9    56.5    29.4
   2       96.6    85.4    69.8    41.7
   4       96.7    83.6    55.4
   8       86.3    65.0
  16       67.0
41
Latest Results: 1728-atom silicon oxide

Fixed core count (MPI processes x OpenMP threads):
  16 x  1   Out of memory
   8 x  2   Out of time (>3600 s)
   4 x  4   1045 s
   2 x  8    617 s
   1 x 16    673 s

OpenMP scaling at 4 MPI processes:
   4 x  1   2711 s
   4 x  4   1046 s
   4 x  8    784 s

Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.
42
Latest Results: 12,000-atom beta amyloid fibril
(Each MPI process uses 16 OpenMP threads)
[Plot: speedup vs total number of cores (4000-16000) for sparse products, row sums and FFT-box operations, against the ideal line]
Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.
43
Summary
! Significant performance increase
  – Scaling
  – Memory usage
  – cores > atoms now possible
! Excellent scaling up to 4 OpenMP threads; memory-bandwidth limited on Blue Joule above 8 cores
! HECToR performance promising
44
FUTURE WORK
45
Future Work
Improved GPU code:
• OpenACC
• Asynchronous data transfer
• Loop unrolling
• Extend to other areas
• GPU overloading
• Batched 1D FFTs
Hybrid MPI-OpenMP code:
• Extend to "data parallelism"
• Extend to other areas
Heterogeneous execution and algorithmic improvements:
• Dynamic/mixed precision
• Increased truncation
• Hybrid FFT/interpolation
Large-scale ab initio molecular dynamics on machines such as: TITAN (AMD CPU + NVIDIA K20X), Stampede (Intel CPU + Intel Xeon Phi), Mira (BlueGene/Q)
46
Acknowledgements ! Chris-Kriton Skylaris –
University of Southampton
! Nick Hine –
University of Cambridge
! E-Infrastructure South consortium – –
Istvan Reguly Emerald
! iSolutions –
Iridis3
! NVIDIA – –
Mark Berger Timothy Lanfear
! PGI – –
Mathew Colgrove Pat Brooks
! EPSRC
47
Precision • Optimized energies exceed the level of precision required for chemical accuracy (milliHartree) • Difference in calculated energies relative to the CPU code comparable to differences resulting from the number of MPI threads in the CPU code
48
Dynamic Precision • GPUs have a larger number of single precision (SP) than double precision (DP) cores. • Other codes (GAMESS-UK and Terachem) utilize dynamic precision. • Switch between SP and DP determined by a convergence threshold.
49
[Plot: energy difference (Eh) vs iteration number (1-24), log scale from 1E+02 down to 1E-12, comparing CPU DP, GPU DP and dynamic-precision runs with switching thresholds of 1x10E-5, 5x10E-6, 1x10E-6 and 1x10E-7]
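A minimal sketch of the switching idea (hypothetical function names, thresholds and stub iterations; not ONETEP's implementation): run the single-precision path until the energy change per iteration drops below a threshold, then continue in double precision.

#include <math.h>
#include <stdio.h>
#include <stdbool.h>

/* Stand-ins for one SP / DP optimization iteration; in a real code these
   would launch single- or double-precision GPU kernels. Here they simply
   damp the energy toward a fixed point so the driver can be exercised.   */
static double iterate_sp(double e) { return e * 0.5; }
static double iterate_dp(double e) { return e * 0.5; }

/* Dynamic-precision driver: iterate in SP until the per-iteration energy
   change falls below switch_tol, then continue in DP down to conv_tol.   */
static double optimize(double energy, double switch_tol, double conv_tol, int max_iter) {
    bool use_dp = false;
    for (int it = 0; it < max_iter; ++it) {
        double next  = use_dp ? iterate_dp(energy) : iterate_sp(energy);
        double delta = fabs(next - energy);
        energy = next;
        if (!use_dp && delta < switch_tol) use_dp = true;   /* promote to DP */
        else if (use_dp && delta < conv_tol) break;         /* converged     */
    }
    return energy;
}

int main(void) {
    printf("final energy: %g\n", optimize(1.0, 1e-5, 1e-10, 100));
    return 0;
}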
Nike Dattani Two GPU implementations in MATLAB: The Chaos Game Representation for genetic sequences, and Feynman integrals for open quantum systems dynamics
8 May 2012

Quantum computers
Photosynthesis: more efficient energy sources
Quantum communication devices / high-temperature superconductors / quantum measurement devices
Bird navigation: more accurate navigation instruments
Cancer prevention
Smelling: design of better perfumes
http://blog.accelereyes.com/blog/2010/04/05/converting-matlab-loops-to-gpu-code/
Time mesh (indexed by k): (tk=0 , tk=1 , tk=2 , tk=3 , … , tk=N) At each time the system is in a state sk: (sk=0 , sk=1, sk=2 , sk=3 , … , sk=N) Each set {sk} is called a “Feynman path” Each Feynman path has an amplitude:
s0 s1 s2 s3 s4 = 11111 0.493 11112 0.917 ………………..... 21247 0.476 ………………..... MMMMM 0.851 Finally: Add the amplitudes for which the final state is the same, and calculate the amplitudes for the next time step (and repeat) If ‘memory’ is 5 steps, paths like 222222 and 213927 should be combined (ie. the Feynman amplitudes are 5th – order markovian, in this case)
For a 2-state system, 1st order markovian We have an amplitude array: And a propagator matrix: We want:
Which, by definition is:
For a 2-state system, 1st order markovian We have: ,
For a 2-state system, 2nd order markovian We have an amplitude array: And a propagator matrix: We want:
Which, by definition is:
For a 2-state system, 2nd order markovian We have: ,
For a 3-state system, 1st order markovian We have an amplitude array: And a propagator matrix:
We want:
Which, by definition is:
For a 3-state system, 1st order markovian We have: ,
For an M-state system, Δkmax order markovian We have an amplitude array:
And a propagator matrix:
We want:
Which, by definition is:
For an M-state system, Δkmax order markovian We have:
,
Most obvious MATLAB algorithm uses the function KRON Instead of KRON , restructure to use EXPAND: 2x speed up Instead of EXPAND, restructure to use REPMAT: 3x speed up (6x total) Instead of REPMAT, restructure to use BSXFUN: 2x speed up (12x total)
268 MB of data: 4x speed up (48x speed up in total) 1.07 GB of data: 6x speed up (72x speed up in total) 4.29 GB of data: doesn’t fit on the GPU
Benchmark timings (seconds) for increasing memory depth Δkmax:

Skynet:
Δkmax       CPU          GPU        CPU/GPU
  2       0.057081      1.5377      0.0371
  3       0.065062      3.2356      0.0201
  4       0.091603      2.1099      0.043
  5       0.19423       1.2279      0.1581
  6       0.61296       1.2023      0.5098
  7       2.416         1.335       1.8097
  8      16.6221        1.6903      9.8338
  9      59.2344        4.6461     12.7492
 10     282.8173       15.6012     18.1279
 11    1166.5199       67.1675     17.3673
 12    4695.0871          -           -

24 December 2012, on Zen (C2070 with 6 GB of VRAM):
Δkmax    CPU time (s)   GPU time (s)   Speed-up
  2        0.0550         1.1172
  3        0.0624         1.1196
  4        0.0890         1.1118
  5        0.1961         1.0741
  6        0.6004         1.0791
  7        2.2036         1.1396        x 2
  8        9.3467         1.5705        x 6
  9       39.542          4.3065        x 9
 10      184.822         14.5570        x 12.7
 11      738.747         57.7692        x 12.8
 12     2902.89          (data doesn't fit on GPU)

Jade:
Δkmax       CPU          GPU        CPU/GPU
  2       0.052856      0.85784     0.0616
  3       0.051091      0.85151     0.0600
  4       0.070343      0.84949     0.0828
  5       0.13912       0.85151     0.1634
  6       0.38894       0.85712     0.4538
  7       1.3665        0.92836     1.4720
  8       5.8251        1.5124      3.8516
  9      26.099         5.03        5.1887
 10      94.6453       18.3151      5.1676
 11     385.6105       77.1665      4.9971
 12    1598.0806          -           -
On CPU – took 1238 seconds:
On GPU – took 107 seconds:
Mike Giles, Wes Armour & Andrew Richards Oxford Supercomputing Centre
OeRC Mihai Duta , Albert Solernou , Steven Young, Mark Hayton, Istvan Reguly For being some of the best computing staff I’ve ever worked with !
GPU Implementation of Lattice Boltzmann Method with Immersed Boundary: observations and results
Alistair Revell
University of Manchester, Modelling and Simulation Centre - Twitter: @Dr Revell
The Oxford e-Research Centre Many-Core Seminar Series, 5th June 2013

Acknowledgements
• Colleagues contributing to this work:
  • Mr. Mark Mawson (PhD student @ Univ. Manchester)
  • Dr. Julien Favier (Univ. Marseilles)
  • Pr. Alfredo Pinelli (City Univ. London; prev. CIEMAT, Madrid)
  • Mr. Pedro Valero Lara (PhD student @ CIEMAT)
  • Mr. George Leaver (Visualization, Univ. Manchester)
• Thanks to the UKCOMES community (UK Consortium on Mesoscale Engineering Sciences), which led to this opportunity.
Overview of Seminar
• Lattice-Boltzmann method: an overview of the method; some GPU-related optimizations; validation / results
• Immersed Boundary method: overview of the method; implementation of IB into LB; some GPU optimisations; results
• Realtime simulation: OpenGL tweaks; live demos
Lattice Boltzmann Method: Introduction
The Lattice Boltzmann Method might be considered to be a 'mesoscale' approach.
• Macroscale approaches:
  • In the limit of low Knudsen number one can assume a 'continuum'; the Navier-Stokes equations can be applied to infinitesimal elements.
• Microscale approaches:
  • In the limit of high Knudsen number one might resort to Molecular Dynamics, but this approach is impractical for macroscale applications.
• Mesoscale:
  • Broadly, one can consider that the Lattice Boltzmann Method operates between these constraints: it can be extended to macroscale problems whilst retaining a strong underlying element of particle behaviour.
LBM
LBM+IBM(+GPU)
realtime LBM
[Raabe, 2004, ]
Lattice Boltzmann Method: Introduction
LBM+GPU
LBM
Lattice Boltzmann Method: Introduction
• Focus on a distribution of particles f(r, c, t)
  • Characterises the particles without realising their individual dynamics
  • Instead considers the distribution of particle velocities
• Macroscopic quantities are recovered by integration of f
• For a given equilibrium gas, one can obtain the Maxwell-Boltzmann distribution function:
$$ f^{(eq)}(c) = 4\pi c^2 \left(\frac{m}{2\pi kT}\right)^{3/2} e^{-\frac{mc^2}{2kT}} $$
• The Boltzmann equation describes the return of f(r, c, t) to equilibrium conditions f^{(eq)}(r, c, t); it is the cornerstone of the Lattice Boltzmann Method:
$$ f(r + c\,dt,\; c + F\,dt,\; t + dt)\,dr\,dc - f(r, c, t)\,dr\,dc = \Omega(f)\,dc\,dt $$
Lattice Boltzmann Method: Introduction
• LBM has its origins in Lattice Gas Cellular Automata
• Hardy, Pomeau, de Pazzis [Hardy et al., 1973]
  • proposed a square lattice arrangement with basic conservation laws applied
  • particles can move only along one of the directions, only to the next node in one timestep; no two particles at the same site can move in the same direction
  • though not rotationally invariant, and produced 'square vortices'!
• Frisch, Hasslacher, Pomeau [Frisch et al., 1986]
  • proposed a hexagonal lattice; ensured realistic fluid dynamics
  • included a 'random element' and used look-up tables
Lattice Boltzmann Method: Introduction
• The Boltzmann equation can be written as:
$$ \partial_t f + c \cdot \nabla_r f + F \cdot \nabla_c f = \Omega $$
• We use the BGK collision operator from [Bhatnagar et al., 1954] to approximate Ω:
$$ \Omega = \frac{1}{\tau}\left( f^{(eq)} - f \right) $$
  • represents relaxation back to the equilibrium distribution in a timescale τ
• And use an expansion of f^{(eq)} that is 2nd-order accurate:
$$ f^{(eq)} = \left(\frac{1}{2\pi c_s^2}\right) e^{-\frac{c^2}{2c_s^2}} \left[ 1 + \frac{c\cdot u}{c_s^2} + \frac{(c\cdot u)^2}{2 c_s^4} - \frac{u^2}{2 c_s^2} \right] $$
Lattice Boltzmann Method: Introduction
• DnQm models are used, for n spatial dimensions and m discrete velocities
• Spatial discretization on the lattice is provided by Gaussian quadrature
• Here we use D2Q9 and D3Q19
• Force term implemented following [Guo et al., 2002]:
$$ f_i(x + e_i,\, t+1) = f_i(x, t) - \frac{1}{\tau}\left( f_i(x,t) - f_i^{(0)}(x,t) \right) + \left(1 - \frac{1}{2\tau}\right) \omega_i \left( \frac{e_i - u}{c_s^2} + \frac{c_i \cdot u}{c_s^4}\, c_i \right) \cdot f_{ib} $$
• f_ib is eventually used for the immersed boundary
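For concreteness, here is a minimal CUDA device function for the standard discrete D2Q9 equilibrium (the lattice counterpart of the expansion above, with weights w_i and c_s^2 = 1/3); this is the textbook form, not the authors' code:

// Standard D2Q9 lattice: weights w[i] and velocities (cx[i], cy[i]); c_s^2 = 1/3.
__constant__ double w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
__constant__ int    cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int    cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

// Discrete equilibrium for direction i at density rho and velocity (ux, uy):
// f_i^eq = w_i * rho * [1 + 3(c_i.u) + 4.5(c_i.u)^2 - 1.5 u^2]
__device__ double f_eq(int i, double rho, double ux, double uy) {
    double cu = cx[i] * ux + cy[i] * uy;      // c_i . u
    double u2 = ux * ux + uy * uy;            // |u|^2
    return w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u2);
}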
Validity of LBM for different flows
• Based on the 13-moment theory of [Grad, 1949]: the distribution function may be expanded over velocity and space using orthonormal Hermite polynomials [Shan et al., 2006]:
$$ f^{N}(x, c, t) \approx \omega(c) \sum_{n=0}^{N} \frac{1}{n!}\, a^{(n)}\, H^{(n)}(c), \qquad a^{(n)} = \int f\, H^{(n)}(c)\, dc $$
• Coefficients of the Hermite polynomials match the moments of the macroscopic variables [Latt, 2013]
Different lattice models
[Diagrams: D2Q9, D2Q17, D3Q19, D3Q39 [Nie and Chen, 2009]]
Boundary Conditions
• 1st-order bounce-back conditions are the simplest
• 2nd-order Zou-He conditions have been implemented [Zou, 1997]
  • some 'issues' at corners and along edges where the problem is 'underdefined'
Lattice Boltzmann Method: Algorithm
1. initialise
2. compute forces
3. compute equilibrium function
4. stream & collide  (collide is local, stream is non-local)
5. impose boundary conditions
6. compute macroscopic quantities
7. loop to 2
Validation of 3D solver & boundary conditions
• Poiseuille flow [plots: single precision and double precision]
• Lid-driven cavity - 2D: [Ghia et al., 1982]; 3D: [Jiang et al., 1994]
Implementation on GPU

Hardware Tested
Feature                               GK104: K5000M       GK110: K20c
cores (SMX x cores/SMX)               1344 (7 x 192)      2496 (13 x 192)
regs / thread                         63                  255
DRAM                                  4 GB                4.7 GB
SP/DP ratio                           24:1                3:1
Peak performance (single precision)   1.6 TFLOPS          3.5 TFLOPS
DRAM bandwidth (measured)             66 GB/s             143 GB/s
Lattice Boltzmann Method: Algorithm
PUSH:
1. initialise
2. compute forces
3. compute f(eq)
4. collide (local)
5. stream (non-local), i.e. requires synchronisation
6. impose bcs.
7. compute macroscopic quantities

PULL (see [Rinaldi et al., 2012]):
1. initialise
2. compute forces
3. compute f(eq)
4. collide
5. stream, i.e. read values from host into new location
6. impose bcs.
7. compute macroscopic quantities
Lattice Boltzmann Method: GPU implementation

CPU implementation: push

for (int dir = 0; dir < 9; ++dir) {
    Xnew = X + cx[dir];                       // Stream x  -- PUSH
    Ynew = Y + cy[dir];                       // Stream y
    pop[dir][Xnew][Ynew] = pop_old[dir][X][Y] * (1 - omega)
                         + pop_eq[dir] * omega
                         + force_latt[dir];   // Collide
}

GPU implementation: pull

int size = Nx * Ny;
for (dir = 0; dir < 9; ++dir) {
    Xnew = X - d_cx[dir];                     // Stream x  -- PULL
    Ynew = Y - d_cy[dir];
    pop_local[dir] = pop_old[dir * size + Ynew * Nx + Xnew];
}
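Wrapped into an actual kernel, the pull update might look like the following minimal CUDA sketch (one thread per lattice node, D2Q9, periodic wrap, BGK collide; array and variable names are assumptions, not the authors' code):

// Fused pull-stream + BGK collide for a D2Q9 lattice, one thread per node.
// f_old / f_new are flattened as [dir * Nx * Ny + y * Nx + x].
__constant__ int    d_cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int    d_cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
__constant__ double d_w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                               1.0/36, 1.0/36, 1.0/36, 1.0/36};

__global__ void stream_collide_pull(const double *f_old, double *f_new,
                                    int Nx, int Ny, double omega) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= Nx || y >= Ny) return;

    int    size = Nx * Ny;
    double f[9], rho = 0.0, ux = 0.0, uy = 0.0;

    // Pull: read each population from the node it streams in from (periodic wrap).
    for (int i = 0; i < 9; ++i) {
        int xs = (x - d_cx[i] + Nx) % Nx;
        int ys = (y - d_cy[i] + Ny) % Ny;
        f[i] = f_old[i * size + ys * Nx + xs];
        rho += f[i];
        ux  += f[i] * d_cx[i];
        uy  += f[i] * d_cy[i];
    }
    ux /= rho;
    uy /= rho;

    // BGK collide against the D2Q9 equilibrium (c_s^2 = 1/3).
    double u2 = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        double cu  = d_cx[i] * ux + d_cy[i] * uy;
        double feq = d_w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u2);
        f_new[i * size + y * Nx + x] = f[i] * (1.0 - omega) + feq * omega;
    }
}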
Memory Arrangement
• Code is parallelized such that one thread performs the complete LBM algorithm at one spatial location f(x)_i in the fluid domain.
• Each thread stores values of f(x)_i and an integer tag denoting boundary type. Density ρ and velocity u_i are only stored if output.
• All is stored in a struct within registers to minimise high-latency access to DRAM once initially loaded.
• Within DRAM it is common practise to 'flatten' multiple-dimension arrays into a single dimension to reduce additional dereferencing: f[dir * Nx * Ny + Y * Nx + X]
Instruction Level Parallelism
More ILP at the expense of occupancy improves performance [Volkov, 2010]
• Operating on all indices of f in one thread helps to hide latency
• Use struct-of-arrays access to f; coalesced access therefore depends only on the x-component of the discrete velocity direction (c)
  [Diagram: no x-component of c in the SM vs x-component of c = 1 in the SM]
• Cache hit rate is low as we don't have repeat accesses (< 7% in L2)
• ... so instead maximise register use (which lowers occupancy)
Maximising use of registers
• A maximum of 65536 registers per SMX, which can host up to 2048 threads:
$$ \frac{\text{registers}}{\text{thread}} \times \frac{\text{threads}}{\text{block}} \times \frac{\text{blocks}}{\text{SMX}} \le 65536 $$
• In 3D we need 19 loads for f(x) and 19 stores; we also need 1 integer (boundaries) and 4 macroscopic quantities, so a total of 43 registers is needed per thread
• For high registers/thread, large block sizes are impractical: e.g. a block size of 1024 means only a single block would run; [Obrecht et al., 2013] recommend a maximum block size of 256
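As a rough worked example of that budget (my arithmetic, not a figure from the slides): with 43 registers per thread, the 65536-register file of one SMX supports at most

$$ \left\lfloor \frac{65536}{43} \right\rfloor = 1524 \ \text{resident threads} \ (< 2048), \qquad \left\lfloor \frac{1524}{256} \right\rfloor = 5 \ \text{blocks of 256 threads} \;\Rightarrow\; 1280 \ \text{threads per SMX}. $$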
Shared memory / shuffle operation on Keplers
[Diagrams: shared memory vs shuffle]
• Large block sizes are impractical for LBM
• ILP is more efficient than shared memory or shuffle
Overall Performance
• Peak of 814 MLUPS and 402 MLUPS for the K20c and K5000M respectively
• vs a theoretical max of 892 MLUPS and 412 MLUPS based on measured bandwidth
• Compared well to other implementations (albeit on other hardware): [Rinaldi et al., 2012] and [Astorino et al., 2011] use shared memory
realtime LBM
LBM
LBM+IBM(+GPU)
Immersed Boundary method
LBM+GPU
realtime LBM
Immersed Boundaries (1977-)
Fluid equations are solved on an Eulerian mesh. The geometrical complexity is embedded using a Lagrangian mesh via a system of singular forces.
• Original motivation of Charles Peskin [Peskin, 1977]
• An alternative to a body-fitted mesh
  • Define an arbitrary mesh shape
  • Preserve an efficient high-order (Cartesian) solver
• Not a replacement, but a valuable tool
Original vs current approach
• Peskin idea: model the boundary as a system of inextensible springs & dampers
  • The elastic forces that restore the true and actual boundary position are supplied to the rhs of the momentum equations in terms of body forces.
  • [Peskin, 1977]: applied to a complete heart fluid system
• [Pinelli et al., 2010] (no springs) has been introduced and applied (CIEMAT): no stiffness constant k
• The modified approach has some advantages:
  – moving boundary
  – deformable boundary
• Sharing a drawback:
  – pressure correction error
Immersed Boundary: basics (I)
The momentum equation with a boundary body force:
$$ \frac{\partial u}{\partial t} = \mathrm{rhs} + f, \qquad f = \frac{u^{(d)} - u^{n}}{\Delta t} - \mathrm{rhs} \quad \forall\, \vec{x} \in S $$
Along S the boundary condition is u = u^(d).
Algorithm:
1. given f, update the position of the Lagrangian markers (boundary shape)
2. advance the momentum equation without boundary-induced body forces (giving u* on S)
3. compute f as a function of the difference u^(d) - u*
4. repeat the momentum advancement with f
5. compute the pressure correction, project the velocity field
6. goto 1.
Immersed Boundary: basics (II)
In discrete form in 2D:
Interpolation:
$$ U^{*}(X_l) = \sum_{i,j} u^{*}(x_{i,j})\, h(x_{i,j} - X_l)\, \Delta x^2 $$
Spread (convolution):
$$ f(x_{i,j}) = \sum_{l \in \Omega_l} F(X_l)\, h(x_{i,j} - X_l)\, \epsilon_l $$
• Epsilon is the key to the accuracy
  • it can be considered to represent the physical width of the surface
  • its computation guarantees that interpolation(spread(F)) = F
Coupling IB with LB
• Bounceback does not offer high accuracy
  • assumes a 'stair-step' surface
  • and is problematic for moving/flexible boundaries
• Inherent suitability of LB to IB
  • Lattice already uniform Cartesian
  • Poisson-free → no pressure-correction drawback
Lattice Boltzmann Method: Algorithm (see [Favier et al., 2013] for full details)
1. initialise
2. set forces to zero
3. find velocity field in absence of object (call LBM)
4. compute support (not required if not moving)
5. find epsilon (costly, and also not required if not moving)
6. interpolate fluid velocity onto Lagrangian markers
7. compute required force for object
8. spread force onto lattice
9. (solve other equations: collision, tension, bending, inextensibility)
10. find velocity field with object (call LBM again)
11. compute macroscopic quantities
Rigid Particles: validation 1
Falling particle under gravity, ρp/ρf = 8, Re = 165; starting at rest, no-slip walls, gravity along x.
Validation against finite-difference DNS [Uhlmann, 2005].
[Plots: x-position and settling velocity]

Rigid Particles: validation 2
• Kissing/tumbling particles [Uhlmann, 2005]
• Impulsively moved flat plate [Koumoutsakos and Shiels, 1996]
Immersed boundary + GPU
• Transactions for each point of the object are coalesced
  • each point only needs information about itself
• Transactions moving data between fluid and boundary are random, with much higher cache use
• Transfer from host to the IBM kernel can be started simultaneously to hide latency
  • exploit the capability to overlap memory transfers with execution
• The spreading operation is a problem (atomic add)
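A minimal CUDA sketch of why the spread step needs atomics (names, the kernel support and the simplified weight function are assumptions, not the authors' implementation): several Lagrangian markers may add force to the same fluid node, so concurrent threads must accumulate with atomicAdd.

// Spread marker forces onto the fluid grid: one thread per Lagrangian marker.
// Several markers can touch the same node, hence atomicAdd on the target cells.
// atomicAdd on double requires compute capability 6.0+; on older GPUs a float
// buffer or a compare-and-swap loop would be used instead.
__global__ void spread_forces(const double *Fx, const double *Fy,   // marker forces
                              const double *Xm, const double *Ym,   // marker positions
                              int n_markers,
                              double *fx, double *fy,               // fluid force field
                              int Nx, int Ny) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l >= n_markers) return;

    int i0 = (int)floor(Xm[l]) - 1;          // 4x4 support around the marker
    int j0 = (int)floor(Ym[l]) - 1;

    for (int j = j0; j < j0 + 4; ++j) {
        for (int i = i0; i < i0 + 4; ++i) {
            if (i < 0 || i >= Nx || j < 0 || j >= Ny) continue;
            // Simple tent-shaped weight standing in for the regularized delta.
            double wx = 1.0 - 0.5 * fabs(Xm[l] - i);
            double wy = 1.0 - 0.5 * fabs(Ym[l] - j);
            if (wx <= 0.0 || wy <= 0.0) continue;
            double w = wx * wy;
            atomicAdd(&fx[j * Nx + i], w * Fx[l]);
            atomicAdd(&fy[j * Nx + i], w * Fy[l]);
        }
    }
}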
Small vs. large objects
• Objects are treated according to size, i.e. the number of Lagrangian markers (nlag)
• For small objects (nlag < 1024) we assign one block per object
  • generally the case in 2D
• For large objects (nlag > 1024) we need to launch a kernel for each object
  • e.g. for a sphere of radius r = 20 we need nlag ≈ 4000
LBM
Flexible beating filament I
Following the method of [Huang et al., 2007]:
• Solid momentum:
$$ \rho \frac{\partial^2 X}{\partial t^2} = \frac{\partial}{\partial s}\left(T \frac{\partial X}{\partial s}\right) - \frac{\partial^2}{\partial s^2}\left(K_B \frac{\partial^2 X}{\partial s^2}\right) + \rho g - F_h $$
• Inextensibility condition:
$$ \frac{\partial X}{\partial s} \cdot \frac{\partial X}{\partial s} = 1 $$
• Incompressibility condition:
$$ \nabla \cdot u = 0 $$
• Immersed boundary (hydrodynamic force):
$$ F_h = \mathrm{Interpol}(f) $$
• Kinematic condition:
$$ \frac{\partial X}{\partial t} = \mathrm{Interpol}(u) $$
• f is the force required by the fluid to verify the b.c.; F_h is the hydrodynamic force resulting from having applied the b.c.
• Apply the inextensibility condition to the filament
Flexible beating filament II
Application to the Lattice-Boltzmann solver:
• Staggered discretization of the Lagrangian space (X and T)
• X^{n+1} from the discretized solid-momentum update (involving (X^{n+1} - 2X^n + X^{n-1})/Δt², D_s(T^{n+1} D_s X^{n+1}), the bending term and the forces F_b and F), subject to the discrete inextensibility constraint D_s X^{n+1} · D_s X^{n+1} = 1
• Tension computed via an iterative Newton-Raphson loop (by computing the exact Jacobian)
• For the initial guess, it is possible to derive a very good estimate for the tension by incorporating the inextensibility condition in the momentum equations: this yields an equation for T^{n+1} of the form A T^{n+1} = rhs, where A is a tridiagonal matrix and the rhs involves ∂Ẋⁿ/∂s, ∂Xⁿ/∂s · dF_hⁿ/ds and the bending term
Flexible filaments: validation 1 (see [Favier et al., 2013] for full details)
• Without fluid (falling under its own weight)
• With fluid (for different rigidities we observe the correct 'kick' of the free end)

Flexible filaments: validation 2 (see [Favier et al., 2013] for full details)
• Filament interactions, 2 filaments: correct phase dependence on the separation of the filaments

Flexible filaments: investigation
• Cylinder coated with filaments: drag reduction [Favier et al., 2009], [Revell et al., 2013]
• EU-funded project on this topic just started: PELSKIN (temporary web address: http://195.83.116.187/pelskin_web/)
LBM
LBM+IBM(+GPU)
Realtime LBM
LBM+GPU
realtime LBM
LBM
Realtime output
Several OpenGL visualization techniques are implemented:
• Contour colour map (for velocity magnitude) is stored on the device
• Image Based Flow Visualization (IBFV) simulates advection of particles through an unsteady vector field (the macroscopic u field)
  • instead of particle seeding, which can be hit or miss, noise textures are used to represent a dense set of particles, which are advected forwards using texture mapping
  • the previous frame M is textured onto a distorted mesh and blended with a random noise texture N according to a blending factor α:
$$ M_i(x) = (1 - \alpha)\, M_{i-1}\big(x + u_i(x, t_i)\,\Delta t\big) + \alpha\, N_i(x) $$
  • velocities are used to displace mesh vertices using forward integration; the noise texture does not affect the flow; the mesh resolution may be coarser than the lattice
• Dye injection uses a similar method and can be more intuitive
LBM+GPU
LBM
LBM+IBM(+GPU)
Uses for realtime?
• Initially started out as a gimmick, used for teaching and open days
• Attracting increased attention in some areas
  • training for medical surgical procedures
  • for complex industrial applications where 'intuition' is missing
LBM
Conclusions
• LBM solver implemented on Kepler hardware with some optimizations
  • 814 MLUPS peak on K20c
• IB-LB solver validated for rigid and flexible geometry
  • various flow physics investigations underway, in the context of the EU project PELSKIN and a BBSRC project on 'protein manufacturability'
• Realtime version of the solver available
  • with various visualization options
  • exploring potential applications
References I
Huang, W., Shin, S., and Sung, H. (2007). Simulation of flexible filaments in a uniform flow by the immersed boundary method. Journal of Computational Physics, 226(2):2206-2228.
Hardy, J., Pomeau, Y., and de Pazzis, O. (1973). Time evolution of a two-dimensional model system. Journal of Mathematical Physics, 14:1746-1752.
Guo, Z., Zheng, C., and Shi, B. (2002). Discrete lattice effects on the forcing term in the lattice Boltzmann method. Physical Review E, 65(4).
Grad, H. (1949). On the kinetic theory of rarefied gases. Communications on Pure and Applied Mathematics.
Ghia, U., Ghia, K., and Shin, C. (1982). High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method. Journal of Computational Physics.
Frisch, U., Hasslacher, B., and Pomeau, Y. (1986). Lattice-gas automata for the Navier-Stokes equation. Physical Review Letters.
Favier, J., Revell, A., and Pinelli, A. (2013). A lattice Boltzmann - immersed boundary method to simulate the fluid interaction with moving and slender flexible objects. HAL, hal-00822044.
Favier, J., Dauptain, A., Basso, D., and Bottaro, A. (2009). Passive separation control using a self-adaptive hairy coating. Journal of Fluid Mechanics, 627:451-483.
Bhatnagar, P., Gross, E., and Krook, M. (1954). A model for collision processes in gases. I: Small amplitude processes in charged and neutral one-component systems. Physical Review, 94:511-525.
Astorino, M., Sagredo, J. B., and Quarteroni, A. (2011). A modular lattice Boltzmann solver for GPU computing processors.
References II
Rinaldi, P., Dari, E., Vénere, M., and Clausse, A. (2012). A lattice-Boltzmann solver for 3D fluid simulation on GPU. Simulation Modelling Practice and Theory, 25:163-171.
Revell, A., Favier, J., and Pinelli, A. (2013). Drag reduction of flow around a cylinder with attached flexible filaments. In ERCOFTAC International Symposium on Unsteady Separation in Fluid-Structure Interaction.
Raabe, D. (2004). Overview of the lattice Boltzmann method for nano- and microscale fluid dynamics in materials science and engineering. Modelling Simul. Mater. Sci. Eng., 12.
Pinelli, A., Naqavi, I., Piomelli, U., and Favier, J. (2010). Immersed-boundary methods for general finite-difference and finite-volume Navier-Stokes solvers. Journal of Computational Physics, 229(24):9073-9091.
Peskin, C. S. (1977). Numerical analysis of blood flow in the heart. Journal of Computational Physics, 25(3):220-252.
Obrecht, C., Kuznik, F., Tourancheau, B., and Roux, J.-J. (2013). Efficient GPU implementation of the linearly interpolated bounce-back boundary condition. Computers & Mathematics with Applications.
Nie, X. B., Shan, X., and Chen, H. (2009). Lattice-Boltzmann/finite-difference hybrid simulation of transonic flow. AIAA Paper 2009-139.
Latt, J. (2013). Introduction to lattice Boltzmann method. http://www.youtube.com/watch?v=I82uCa7SHSQ.
Koumoutsakos, P. and Shiels, D. (1996). Simulations of the viscous flow normal to an impulsively started and uniformly accelerated flat plate. Journal of Fluid Mechanics, 328:177-227.
Jiang, B., Lin, T., and Povinelli, L. (1994). Large-scale computation of incompressible viscous flow by least-squares finite element method. Computer Methods in Applied Mechanics and Engineering, 114.
References III
Zou, Q. and He, X. (1997). On pressure and velocity boundary conditions for the lattice Boltzmann BGK model. Phys. Fluids, 9(6).
Volkov, V. (2010). Better performance at lower occupancy. In Proceedings of the GPU Technology Conference, GTC, volume 10.
Uhlmann, M. (2005). An immersed boundary method with direct forcing for the simulation of particulate flows. Journal of Computational Physics, 209(2):448-476.
Shan, X., Yuan, X.-F., and Chen, H. (2006). Kinetic theory representation of hydrodynamics: a way beyond the Navier-Stokes equation. Journal of Fluid Mechanics, 550(1):413-441.