Challenges in GPU porting: the Quantum ESPRESSO case
Filippo SPIGA 1,2
1 High Performance Computing Service, University of Cambridge
2 Quantum ESPRESSO Foundation

«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)

Outline
• Overview
• The Quantum ESPRESSO project
• PWSCF GPU porting
• Challenges in GPU porting
• Performance evaluation
• Final considerations: performance, correctness and sustainability
QUANTUM ESPRESSO
What is QUANTUM ESPRESSO?
• QUANTUM ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• QUANTUM ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• QUANTUM ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development
What can QUANTUM ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics); electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudo-potentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• Wannier interpolations
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
plus several post-processing tools!
QUANTUM ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web sites (QUANTUM-ESPRESSO.ORG, QE-FORGE.ORG)
• 1 user mailing list, 1 developer mailing list
• 24 international schools and training courses (1000+ participants)
The development model
• QUANTUM ESPRESSO is not a monolithic application, but an integrated ecosystem thriving around a small number of core components developed and maintained by a small number of developers
• the ecosystem is designed so as to be alien-friendly: a number of third-party QE-compatible applications and add-ons, often designed to be code-agnostic, are distributed with QE (notable examples include wannier90, Yambo, EPW, WanT, XCrysDen, ...)
• the environment that allows the ecosystem to prosper is provided by the QE-FORGE.ORG platform, freely available to researchers and developers from all over the world
Quantum ESPRESSO package portfolio
[Diagram: Wannier90, WanT, PLUMED, SaX, Yambo, TDDFPT, Xspectra, GIPAW, GWL, NEB, PHonon, PWscf, Atomic, PWCOND and CP arranged around the CORE modules]
QE Scalability (2010)
[Figures from P. Giannozzi et al., J. Phys.: Condens. Matter 21 (2009) 395502]
• CP (Aβ-peptide in water: 838 atoms and 2311 electrons, gamma point). Figure 3: scalability for medium-size calculations (CP code): CPU time (s) per electronic time step and speedup with respect to 32 processors, as a function of the number of processors and of the number of task groups, on an IBM BlueGene/P and on an SGI Altix; 22.1 x 22.9 x 19.9 Å³ cell, ultrasoft pseudopotentials, Γ point, 25 and 250 Ry cutoff for the orbitals and the charge density respectively.
• PWscf (PSIWAT: 587 atoms, 2552 electrons, 4 k-points; CNT: 1532 atoms, 5232 electrons, gamma point). Figure 4: scalability for large-scale calculations: wall time and speedup as a function of the number of processors. PSIWAT: PWscf code, npool = 4, ntask = 4, on a Cray XT4; gold surface covered by thiols in interaction with water, 10.59 x 20.53 x 32.66 Å³ cell. CNT (1): PWscf code, ntask = 4, on a Cray XT4; porphyrin-functionalized nanotube, Γ point. CNT (2): CP code on a Cray XT3, same system as CNT (1). Times for PSIWAT and CNT (1) are for 10 and 2 self-consistency iterations respectively; times for CNT (2) are for 10 electronic steps plus one Car-Parrinello step, divided by 2 so that they fall in the same range as CNT (1).
PWSCF GPU PORTING
Plane Wave Self-Consistent Field (PWSCF)
The solution of the Kohn-Sham equations requires the diagonalization of the matrix H_KS, whose matrix elements are

Kinetic energy:
$$ \langle k+G|T|k+G'\rangle = \frac{\hbar^2}{2m}\,(k+G)^2\,\delta_{G,G'} $$

Hartree term:
$$ \langle k+G|V_H|k+G'\rangle = V_H(G-G') = 4\pi e^2\,\frac{n(G-G')}{|G-G'|^2} $$

Exchange-correlation:
$$ \langle k+G|V_{xc}|k+G'\rangle = \mathrm{FT}\!\left[V_{xc}(r)\right](G-G') $$
PWscf parallelization hierarchy
• Images: only for Nudged Elastic Band (NEB) calculations
• K-points: distribution over k-points (if more than one); scalability over k-points (if more than one); no memory scaling
• Plane waves (BGRP_COMM): distribution of wave-function coefficients; distribution of real-grid points; good memory scaling, good overall scalability, load balancing
• Linear algebra / task groups: iterative diagonalization (fully parallel or serial); smart grouping of 3D FFTs to reduce compulsory MPI communications
• Multi-threaded kernels: OpenMP handled explicitly or implicitly; extends the scaling on multi-core machines with "limited" memory
Simplified PWSCF life-cycle
• Setup of the initial physical quantities
• Self-consistency loop:
  – calculation of wavefunctions
  – diagonalization of the Hamiltonian matrix, for each k-point  [3D-FFT + GEMM + LAPACK]
  – wavefunction ortho-normalization
  – calculation of the charge density  [3D-FFT + GEMM]
  – calculation of the new potentials  [3D-FFT]
  – calculation of the new non-local potentials
• When self-consistency is achieved: calculation of new forces (and stresses)
• Structure optimization: while forces and stresses are not zero, calculation of new atomic positions (and lattice parameters) followed by a new self-consistency cycle
A#"lucky"#star>ng#point# AUSURF,#112#Au#atoms,#2#k\point#
1#SCF#itera>on#
1#full#SCF#cycle# 14#
GPU developments
• MPI-GPU binding, GPU memory management
• NEWD → CUDA NEWD (multiple kernels combined)
• ADDUSDENS → CUDA ADDUSDENS
• VLOC_PSI → CUDA VLOC_PSI (CUDA kernels + CUFFT)
• BLAS3 *GEMM → phiGEMM library (CUBLAS)
• (serial) LAPACK → MAGMA library
VLOC_PSI acts over distributed data; NEWD/ADDUSDENS act over local data
CUDA kernels
• ADDUSDENS
  – compute-bound kernel
  – best performance measured: 20x* (realistic? 9x~10x)
• NEWD
  – compute-bound kernels
  – best performance measured: 7.2x* (realistic? 3x~4x)
• VLOC_PSI
  – memory-bound kernels
  – best (serial) performance measured: 9x (realistic? ...)
• All the data is moved to GPU memory at once
• External loops over atomic species are kept on the CPU side
QE-GPU configure capability
Depending on the options detected/enabled at configure time (CUDA? phiGEMM? profiling? MAGMA? "fast CUDA"? ELPA?), the build falls into one of:
• CPU-only with GEMM profile capability
• GPU-CPU with serial diagonalization (MAGMA)
• GPU-CPU with serial diagonalization (MAGMA) and advanced optimizations*
• GPU-CPU with parallel diagonalization based on ScaLAPACK (CPU only)
• GPU-CPU with parallel diagonalization based on ELPA (CPU only)
• GPU-CPU with parallel diagonalization based on ELPA (CPU + GPU*)
CHALLENGES IN GPU PORTING
The#"Condensed#Ma~er#Physics"#happy#family# CASTEP#and#QE#shares...# •
similar#data#distribu>on#(by#plane#waves,#by#k\point,#by#bands,#...)#
•
similar#MPI#communica>on#pa~erns#(MPI_alltoall,#MPI_reduce)#
•
similar#constrains#in#scalability#
# GPU#matches#between#CASTEP#and#QE#...# •
Block%Davidson%solver%with%density%mixing%!#eigen\solvers,#vloc_psi#
•
Support%of%ultrasoA/normCconserving%pseudopotenDal%!#newd#addusdens#
•
Geometry%opDmizaDon%!#stress#forces#calcula>ons#
•
BLAS#3#opera>ons#!#CUBLAS,#phiGEMM# 19#
QE ad-hoc parallel 3D-FFT
There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4Ecut).
A single 3D-FFT is divided into independent 1D-FFTs:
• transform along z: ~ Nx Ny / 5 FFTs along z
• parallel transpose: ~ Nx Ny Nz / (5 Np) data exchanged per PE
• transform along y: ~ Nx Nz / 2 FFTs along y
• transform along x: Ny Nz FFTs along x
Data are not contiguous and not "trivially" distributed across processors (PE 0 ... PE 3 in the diagram).
Zeros are not transformed; the different cut-offs preserve accuracy.
H * psi
compute/update H * psi:
  compute kinetic and non-local term (in G space)
    complexity: Ni x (N x Ng + Ng x N x Np)
  Loop over (not converged) bands:
    FFT (psi) to R space
      complexity: Ni x Nb x FFT(Nr)
    compute V * psi
      complexity: Ni x Nb x Nr
    FFT (V * psi) back to G space
      complexity: Ni x Nb x FFT(Nr)
  compute Vexx
    complexity: Ni x Nc x Nq x Nb x (5 x Nr + 2 x FFT(Nr))
where: N = 2 x Nb (Nb = number of valence bands), Ng = number of G vectors, Ni = number of Davidson iterations, Np = number of PP projectors, Nr = size of the 3D FFT grid, Nq = number of q-points (may be different from Nk).
Task-group parallelization
Without task groups, each processor P0..P3 takes part in every 3D FFT:

do i = 1, n
  compute 3D FFT( psi(i) )
end do

With task groups G0..G3, wave functions are merged and transformed in batches (see the communicator-splitting sketch below):

do i = 1, n/groupsize
  merge( psi( i ), psi( i + 1 ) )
  compute groupsize 3D FFTs (at the same time)
end do
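As an illustration of how such groups can be formed in practice, here is a minimal MPI sketch (not QE's actual implementation; communicator, variable names and the group size are invented) that splits the global communicator into task groups so that `groupsize` processes can cooperate on a batch of 3D FFTs while different groups work on different bands:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int groupsize = 4;          /* hypothetical: ranks per task group    */
    int color = rank / groupsize;     /* ranks 0..3 -> group 0, 4..7 -> 1, ... */

    /* Each task group gets its own communicator; the all-to-all of the
       distributed 3D FFT then happens inside this smaller communicator,
       while different groups transform different bands concurrently.    */
    MPI_Comm tg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &tg_comm);

    int tg_rank, tg_size;
    MPI_Comm_rank(tg_comm, &tg_rank);
    MPI_Comm_size(tg_comm, &tg_size);
    printf("world rank %d -> task group %d (rank %d of %d)\n",
           rank, color, tg_rank, tg_size);

    MPI_Comm_free(&tg_comm);
    MPI_Finalize();
    return 0;
}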
Parallel VLOC_PSI
[Diagram] PSI → FFT G→R (CPU) or CUFFT G→R (GPU) → PSIC; multiple LOCAL grids to compute; the DISTRIBUTED data are assembled with an "MPI_Allgatherv", the products with the local potential are computed, then redistributed with an "MPI_Allscatterv"; PSIC → FFT R→G (CPU) or CUFFT R→G (GPU) → HPSI.
Overlapping is possible!
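To make the serial building block concrete, the following is a minimal, self-contained CUDA/cuFFT sketch of the vloc_psi pattern (transform ψ to real space, multiply by the local potential, transform back). Grid sizes, plan handling and the vr potential array are illustrative assumptions, not QE-GPU's actual code:

#include <cufft.h>
#include <cuda_runtime.h>

// Pointwise multiply psic(r) by the real local potential vr(r).
__global__ void apply_vloc(cufftDoubleComplex *psic, const double *vr, int nr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nr) {
        psic[i].x *= vr[i];
        psic[i].y *= vr[i];
    }
}

// psic and vr are device arrays of size nx*ny*nz; psic holds psi on the FFT
// grid in G space on entry and V_loc * psi in G space on return.
void vloc_psi_gpu(cufftDoubleComplex *psic, const double *vr,
                  int nx, int ny, int nz) {
    int nr = nx * ny * nz;
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    // G -> R
    cufftExecZ2Z(plan, psic, psic, CUFFT_INVERSE);
    // multiply by the local potential in real space
    apply_vloc<<<(nr + 255) / 256, 256>>>(psic, vr, nr);
    // R -> G (cuFFT transforms are unnormalized; the 1/nr factor
    // would normally be absorbed elsewhere)
    cufftExecZ2Z(plan, psic, psic, CUFFT_FORWARD);

    cufftDestroy(plan);
}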
Parallel VLOC_PSI - limitations
[Figure: scaling limitation data for grid/process counts 512, 256 and 64]
Dependencies,"workflows,"and"itera= What#scien>sts#do?#
Courtesy#of#Prof.#Nicola%Marzari%(THEOS,#EPFL)#
GPU#"#
NO#GPU### 25#
Next developments / priorities
TOP PRIORITY:
• specific NVIDIA Kepler optimizations → NVIDIA Dev Tech
• non-collinear spin-orbit magnetization
• PHonon using OpenACC
"LESS" PRIORITY:
• first GPU-accelerated CP (special focus on DFORCE)
• alternatives for the current eigen-solvers (ELPA+GPU? Lanczos?)
and... DOCUMENTATION!
PERFORMANCE EVALUATION
AUSURF112, serial
[Bar chart: measured speedups of 7.81x, 5.54x, 3.49x and 3.54x]
Tests run early 2012
Performance and power consumption (serial)
[Charts for AUSURF112 k-point, Shilu-3 and Water-on-Calcite on a C2050, comparing 6 OMP, 6 MPI, and 6 OMP + 1 GPU runs: time (s) and power (kW/h). Measured speedups of 3.67x, 3.1x and 3.2x, with energy reductions of 57%, 58% and 54%.]
Tests run early 2012
MGST-hex
216 atoms of {Ge, Mn, Te, Sb}, gamma-only (courtesy of Zhang W. - RWTH Aachen)
[Chart: speedups of 2.43x, 2.96x and 3.38x]
Tests run early 2012
CdSe159
159 atoms of {Cd, Se}, gamma-only (courtesy of Calzolari A. - CNR/NANO)
[Left: walltime of 1 full SCF for CdSe-159, CPU only vs CPU+GPU, versus total number of cores from 2 (16) to 14 (112); CPU+GPU speedups of 2.52x, 2.40x, 2.17x, 2.06x, 2.19x and 1.89x. Right: speedup vs total number of CPU cores, CPU only / CPU+GPU / ideal.]
Tests run early 2012
IRMOF-M11
130 atoms of {O, Zn, C, H}, 1 k-point (courtesy of Clima S. - IMEC)
[Chart: IR-MOF (130), walltime of full SCF (x3), CPU only vs CPU+GPU, versus total number of cores from 1 (8) to 8 (64); CPU+GPU speedups of 1.87x, 1.79x, 1.97x and 1.37x]
Tests run early 2012
Hitting the limit
130 atoms of {O, Zn, C, H}, 1 k-point (courtesy of Clima S. - IMEC)
[Charts: IR-MOF (130), 3 relax steps, runtime broken down into init / electrons / updatepot / forces / stress for CPU-only (C) and CPU+GPU (G) runs at 1 (8) to 8 (64) cores, with percentage labels (26-45%) on the dominant sections. Overall speedups drop with respect to the SCF-only case: 1.49x (vs 1.87x), 1.46x (vs 1.79x), 1.26x (vs 1.41x).]
Tests run early 2012
Hitting the limit
489 atoms of {Cd, Se}, 3(+1/2) SCF steps with STRESS and FORCES, 24 MPI x 6 OMP (144 cores) @ PLX (CINECA)
[Chart: speedups of 3.2x and ~1.73x]
Tests run early 2012
Cheating using parallelism: __USE_3D_FFT
120 atoms of {Bi, Fe, O, LS, Mn}, reduced to 8 k-points (courtesy of Ferreira R. - Rio de Janeiro Federal Univ.)
# pools = # MPI processes (dimension of each pool = 1, k-points distributed in "round-robin")

Compute nodes and execution time [s] for 10 self-consistency cycles:
• 2 x iDataPlex DX360M3, dual Xeon E5645 6-core 2.40 GHz (24 cores): 52057.22
• 2 x iDataPlex DX360M3, dual Xeon E5645 6-core 2.40 GHz (24 cores) + 4 NVIDIA 2070 (__USE_3D_FFT): 10029.1
Speed-up: 5.2x
Tests run early 2012
GPU-accelerated PWscf: when and why?
• Physical system criteria
  – "crazy" amount of atoms ( O(1000) ) in gamma-only (or 1~2 k-points)
  – reasonable amount of atoms ( O(100) )
  – small amount of atoms ( O(10) ) but many k-points
  – O(10)~O(100) atoms but with a lot of atomic species
• Hardware criteria*
  – workstations equipped with commodity GTX and/or high-end TESLA GPUs
  – reasonable HPC clusters (O(100) nodes, 1:1 ratio of GPUs versus CPU sockets)
• Workflow criteria
  – long SCF energy minimization
  – high throughput
FINAL CONSIDERATIONS: PERFORMANCE, CORRECTNESS AND SUSTAINABILITY
What drives (my) agenda?
Sustainability, Performance, Correctness
Sustainability vs Performance vs Correctness
• CUDA or not CUDA? (a Shakespearean dilemma)
• Third-party libraries hide complexity but also introduce "noise"
• Reduction operations (explicitly managed or implicitly inherited) are the nightmare
• Bugs can be very tricky and not easy to detect
• An embedded unit-test suite is needed, but also an Acceptance Testing Procedure
• The need for certified QE-GPU benchmarks*
• Guaranteeing performance portability is very difficult
OpenACC: yes, no, maybe?
PROs:
• easy learning curve
• fast first deployment
• acceptable performance compromises
CONs:
• code has to be rewritten to suit the accelerator
• lack of direct control over the generated GPU code
• memory sharing and competition
and now ... OpenMP 4.0 RC2 with directives for attached accelerators
(pseudo-random) Personal thoughts
• Training users is challenging (we kept everything simple but...)
• User support is time consuming
• CUDA is easier than most people think
• Focus on fine-grain performance ONLY at the final stage
• There is not only the parallel world
• No more old-fashioned porting
The Optimization Dilemma
Is it worth spending time and effort optimizing the code for the latest available architecture, using the latest cool set of features provided by the most recent version of the CUDA toolkit/driver?
Who are the users? Are they users or developers? What hardware do they have? What is their scientific workflow? What are the scientific challenges they face? What is their budget? What ROI do they expect?
THANK YOU FOR YOUR ATTENTION!
Links:
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/
Porting the ONETEP Linear-Scaling Quantum Chemistry Code to Massively Parallel Heterogeneous Architectures
Karl Wilkinson Skylaris Group, University of Southampton
2
Outline 1. Computational Chemistry 2. Heterogeneous Architectures and Programming models 3. Order-N Electronic Total Energy Package (ONETEP) 4. GPU Code Developments and Performance 5. Hybrid MPI-OpenMP Parallelism 6. Future Work 7. Acknowledgements
3
COMPUTATIONAL CHEMISTRY
4
Computational Chemistry
! Trade-offs:
  – Accuracy: low (molecular mechanics) to high (ab initio)
  – System size: low (10's of atoms) to high (millions of atoms)
  – Level of sampling: low (single point energy) to high (free energy surface)
! Drug design
  – Lead structure design
  – Protein-lead structure interactions
! Materials
  – Epoxidised linseed oil - ship hulls
! Catalysis
! Energy research
  – Enzyme-biomass interactions
5
HETEROGENEOUS ARCHITECTURES AND PROGRAMMING MODELS
Heterogeneous Compute Architectures
! Heterogeneous: multiple distinct types of processors
  – More than 10% of the top 500 supercomputers use some form of accelerator
  – Typically Central Processing Units (CPUs) coupled with Graphical Processing Units (GPUs)
  – Common in high-powered workstations
  – Intel Xeon Phis emerging
! Development considerations
  – Generally need to adapt code to take advantage of the architecture
  – A single code will contain multiple algorithms, each executed on the most suitable processing unit
  – May result in new bottlenecks (PCIe data transfer)
6
Parallelism in Heterogeneous Machines
[Diagram] A machine consists of O(100-10,000) nodes; each node has CPU sockets (multicore CPUs with 8-16 cores each) and PCIe sockets hosting GPU-based accelerator boards with O(10) streaming multiprocessors of 32-192 CUDA cores (SP/DP) each.
Titan: 18,688 nodes; 18,688 CPUs (hexadecacore) = 299,008 CPU cores; 18,688 K20X GPUs (14 MPs each) = 261,632 multiprocessors; 560,640 total cores.
Memory in Heterogeneous Machines
! Complex memory hierarchy: CPU sockets (connected by QPI) with system RAM; GPUs with global GPU memory, multiprocessor memory and core caches; host-device transfer is the typical data-transfer bottleneck
! GPUs designed to hide the movement of data on the GPU itself
! Major issues:
  – Host-device data transfer
  – CPU speed for data-parallel execution
  – GPU speed for serial execution
! Hardware manufacturers are addressing these issues, creating uncertainty over future hardware:
  – CPUs becoming more like GPUs (increasing core count)
  – GPUs gaining CPU-like functionality (Project Denver)
  – Hybrid "system-on-chip" models
9
Motivation for choosing OpenACC
! Hardware considerations
  – Multiple hardware models already available
  – Uncertainty over future hardware models
! Development considerations
  – Complexity of hardware: the average developer is a domain specialist
  – Porting existing code rather than writing from scratch
  – Avoid changing the original version of the code
  – Avoid mixing languages
  – Avoid reinventing the wheel
  – Ensure maintainability of code
! Solutions
  – Ensure portability and future-proofing
  – Abstractions help to protect the developer
  – Pragmas and directives: PGI Accelerator, OpenACC
10
ORDER-N ELECTRONIC TOTAL ENERGY PACKAGE (ONETEP)
11
ONETEP ! Commercial code, distributed by Accelrys as part of Materials Studio ! Actively developed by a number of research groups –
Central SVN repository
! Standard Fortran 95 ! 233,032 lines of code ! Parallel design from onset – Message Passing Interface –
Aiming to extend beyond MPI parallelism
12
ONETEP ! Highly accurate linear-scaling quantum chemistry package – – –
Number of atoms Number of cores Maintain number of iterations
! Wide range of functionality – Applicable to wide range of problems – Used for materials science and biological systems
! Scales to tens of thousands of atoms ! Runs on hundreds of cores using MPI
13
ONETEP – Brief Theory ! Utilizes a reformulation of Density Functional Theory in terms of the one-particle density matrix – –
Broad strokes: The key parameter is the electron density Electrons exist within atomic orbitals, orbitals represented by mathematical functions
! Scaling with number of atoms achieved through exploitation of the “nearsightedness” of electrons in many atom systems – –
Localized density kernel, K, and functions Not sums over all possible pairs
14
ONETEP – Methodology
! Iterative process
  – Optimize the NGWFs (φ(r))
  – Optimize the density kernel (K)
! Strictly localized non-orthogonal generalized Wannier functions (NGWFs) (φ(r)) used to represent atomic orbitals
ONETEP – Methodology
! NGWFs are expressed on a real-space grid
  – Technically: a psinc basis set, equivalent to a plane-wave basis set
! Parallel scaling achieved through decomposition of data across processors
! Three types of data: atomic, simulation cell, sparse matrix
! Localized operations are performed within FFT boxes (computationally efficient)
! FFT box: 3x the diameter of the largest localized function
[Diagram: simulation cell with N1/N2 directions and an FFT box around atom A]
ONETEP – Data distribution and movement ! Atomic (FFT box) operations produce data for simulation cell operations and vice versa
! Memory efficient data abstraction required
16
ONETEP – Data distribution and movement ! The parallelepiped (PPD) is the basic unit of data used in this process
! Data contiguity within PPDs is inconsistent relative to other representations
! Natural barrier to initial areas of accelerated code
17
18
ONETEP – Profiling ! Small chemical system by ONETEP standards (Tennis ball dimer: 181 atoms)
! Basic calculation
! Representative level of accuracy ! Most common density kernel optimization scheme ! Single iteration of NGWF optimizations (over represents NGWF gradient step)
19
CODE DEVELOPMENTS
Scope of the Developments
20
! Focus on atom localized operations: FFT boxes ! Focus on core algorithms that will benefit full range of functionality ! Maintain data structures and current communications schemes
! Minimize disruption to rest of code
Submitted to Journal of Computational Chemistry
21
Naive Implementation ! Simple switch from CPU FFT library (MKL/FFTW/ACML) to a GPU library (CUFFT) ! No consideration given to data transfer optimization ! Little/no benefit
! Extend to other areas of code…
22
Implementation Notes
! Hybrid PGI Accelerator/CUDA Fortran
  – PGI Accelerator for kernels
  – CUDA Fortran for data transfers
! CUFFT library
! PGI Accelerator / OpenACC
  – Very straightforward code, no complex use of pragmas
! Asynchronous data transfers by default in "pure" PGI Accelerator
  – The hybrid model breaks this
23
Data Transfer Implementation
24
Charge Density (36% of runtime)
Diagonal elements of the one-particle density matrix, using n(r) = ρ(r; r):
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Calculate product and sum
7. Deposit charge density data in simulation cell
25
Charge Density (36% of runtime) - Data transfer to GPU not shown
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space

!$acc data region
!$acc region
coarse_work(1:n1,1:n2,1:n3) = scalefac * cmplx( &
    fftbox_batch(:,:,:,is,3,batch_count), &
    fftbox_batch(:,:,:,is,4,batch_count), kind=DP)
!$acc end region

! Fourier transform to reciprocal space on coarse grid
call cufftExec(cufftplan_coarse, planType, coarse_work, &
               coarse_work, CUFFT_FORWARD)
26
Charge Density (36% of runtime)
4. Up-sample to fine grid (avoid aliasing errors)
• Populate the corners of a cube: the coarse array goes into the corners of the (zero-filled) fine array
• Typical dimensions of the coarse grid: 75-125 (odd numbers); double this for the fine grid
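As an illustration of the corner-populate step (not ONETEP's actual routine; array names and the odd-size index split are assumptions), a CUDA kernel for this reciprocal-space zero-padding might look like:

#include <cuComplex.h>

// Copy a coarse reciprocal-space box (nc^3, odd nc) into the corners of a
// zero-initialized fine box (nf^3, e.g. nf = 2*nc), preserving FFT index
// ordering: indices 0..(nc+1)/2-1 are non-negative frequencies, the rest
// are negative frequencies and must land at the top end of each fine axis.
__global__ void upsample_corners(const cuDoubleComplex *coarse,
                                 cuDoubleComplex *fine,
                                 int nc, int nf) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // coarse linear index
    if (i >= nc * nc * nc) return;

    int cx = i % nc, cy = (i / nc) % nc, cz = i / (nc * nc);
    int half = (nc + 1) / 2;                         // # non-negative freqs

    int fx = (cx < half) ? cx : cx + (nf - nc);
    int fy = (cy < half) ? cy : cy + (nf - nc);
    int fz = (cz < half) ? cz : cz + (nf - nc);

    fine[(size_t)fz * nf * nf + (size_t)fy * nf + fx] = coarse[i];
}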
27
5. Transform to real space 6. Calculate product and sum (if common atom) 7. Deposit charge density data in simulation cell
28
Charge Density Performance ! Tested on Iridis3, Emerald and Nvidia PSG cluster ! Detailed timings are blocking –
No asynchronous kernel execution and data transfer
! Data transfer stages (2 and 7) problematic, larger quantity of data in step 7
29
Charge Density Performance ! Clear shift in bottlenecks to data transfer stages
Local Potential Integrals
1. Extract NGWF
2. Populate FFT box
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Extract local potential FFT box from simulation cell
7. Apply local potential
8. Transform to reciprocal space
9. Down-sample to coarse grid
10. Transform to real space
11. Extract PPDs that overlap with the first NGWF
12. Extract PPDs that overlap with the second NGWF
13. Calculate the dot product of the two sets of PPDs
[Diagram: host-to-GPU transfers of the NGWF data, FFTs, up-sample to fine grid, apply potential, FFT, down-sample to coarse grid, FFT, extract PPDs, GPU-to-host transfer of the resulting integral ∫ φ*(r) V_loc(r) φ(r) dr]
Performance - Bottlenecks
[Chart: relative cost of stages]
Local Potential Integrals
• More data transfers
Kinetic integrals
• Only the coarse grid
NGWF gradient
• Partly accelerated
• Significantly more complex
• Data transfer at stage 4
31
32
Performance Summary
! Performance of 1 GPU vs 1 CPU core (unfair), times in seconds (speedup):

                            Xeon E5620   Tesla M2050   Tesla M2090   Kepler K20
Charge Density                1221.1     340.5 (3.6)   271.1 (4.5)   225.4 (5.4)
Local Potential Integrals      861.3     301.2 (2.9)   242.2 (3.6)   219.1 (3.9)
Kinetic Integrals              268.9     127.0 (2.1)   112.4 (2.4)   131.0 (2.1)
NGWF Gradient                 1120.8     636.1 (1.8)   521.5 (2.1)   509.1 (2.2)

! Total calculation time:
  – 1 GPU is 2-3.5x faster than 1 CPU core, equivalent to 4 CPU cores
33
Scaling Performance
! Good scaling up to 4 GPUs - seen for all areas of accelerated code
[Stacked-bar chart: time (s) split into charge density, local potential integrals, kinetic integrals, NGWF gradient and other, for 4x CPU core, 1x/2x/4x M2050, 1x/2x/4x M2090 and 1x/2x/4x K20 configurations; hardware details in backup]
34
Understanding Poor Scaling
! Neglected thread affinity
  – Quick Path Interconnect
! Not the only problem
  – Sheer size of data transfers

Total  GPUs per  Number    GPUs on bus 0  GPUs on bus 1  Total calc.  Scaling rel.  % of
GPUs   node      of nodes  (max. 3)       (max. 5)       time (s)     to 1 GPU      ideal
  1       1          1          1              0           5655.9         1.0        100
  8       8          1          3              5           1737.5         3.3         41
 24       8          3          3              5            637.9         8.9         37
 24       6          4          3              3            620.6         9.1         38
 24       4          6          2              2            471.9        12.0         50
 24       2         12          1              1            362.9        15.6         65
35
Strong Scaling
! Detailed scaling tests on Emerald: 8 GPUs and 2 CPUs per node
! Comparing against a single node - impressive results
! Unfortunately not the full story...
  – Data transfer on a node saturating the available PCIe bandwidth?
[Plots: speedup vs number of nodes and vs number of GPUs for a 495-atom cellulose nanofibril and a 1717-atom beta amyloid fibril, against the ideal line]
36
Conclusions ! Overall acceleration of a factor of 2 to 3x relative to single CPU core ! Equivalent performance to 4 CPU cores ! Data transfer serious bottleneck – –
No asynchronous data transfer Running in lock-step due to blocking MPI communications
! Strong scaling is hit hard by limited PCIe bandwidth
! Promising results, much more to do with obvious first target
37
HYBRID MPI-OPENMP PARALLELISM
38
Motivation
! Hard limit of 1 atom per core
! 10 atoms per core is optimal
  – Communications-workload balance
! Memory demands mean only 50% of cores are used on some machines (HECToR)
! Scaling limited to hundreds of cores
! Myself:
  – Charge density
  – Local potential integrals
  – Kinetic integrals
  – NGWF gradient
! Dr. Nick Hine:
  – Sparse algebra
  – Projectors
  – Testing and modification for IBM BlueGene/Q at the Hartree Centre
39
Method ! Level above that parallelized by GPU developments
40
Initial Results
! Charge density only
! Tested on a 16-core workstation
! 181-atom tennis ball dimer

Total time, % of optimal (rows: MPI processes, columns: OpenMP threads)
             1       2       4       8      16
   1      100.0    77.2    51.8    31.2    15.4
   2       96.4    71.1    47.1    24.4
   4       90.7    64.2    37.5
   8       78.5    50.5
  16       59.6

Charge density, % of optimal (rows: MPI processes, columns: OpenMP threads)
             1       2       4       8      16
   1      100.0    93.9    78.9    56.5    29.4
   2       96.6    85.4    69.8    41.7
   4       96.7    83.6    55.4
   8       86.3    65.0
  16       67.0
41
Latest Results: 1728-atom silicon oxide

Fixed core count (MPI processes x OpenMP threads):
  16 x  1   Out of memory
   8 x  2   Out of time (>3600 s)
   4 x  4   1045 s
   2 x  8    617 s
   1 x 16    673 s

OpenMP scaling at 4 MPI processes:
   4 x  1   2711 s
   4 x  4   1046 s
   4 x  8    784 s

Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.
42
Latest Results: 12,000-atom beta amyloid fibril
(Each MPI process uses 16 OpenMP threads)
[Plot: speedup vs total number of cores (4000-16000) for sparse products, row sums and FFT-box operations, against the ideal line]
Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.
43
Summary
! Significant performance increase
  – Scaling
  – Memory usage
  – cores > atoms now possible
! Excellent scaling up to 4 OpenMP threads; memory-bandwidth limited on Blue Joule above 8 cores
! HECToR performance promising
44
FUTURE WORK
45
Future Work
Improved GPU code:
• OpenACC
• Asynchronous data transfer
• Loop unrolling
• Extend to other areas
• GPU overloading
• Batched 1D FFTs
Hybrid MPI-OpenMP code:
• Extend to "data parallelism"
• Extend to other areas
Heterogeneous execution and algorithmic improvements:
• Dynamic/mixed precision
• Increased truncation
• Hybrid FFT/interpolation
Large-scale ab initio molecular dynamics on machines such as: TITAN (AMD CPU + NVIDIA K20X), Stampede (Intel CPU + Intel Xeon Phi), Mira (BlueGene/Q)
46
Acknowledgements ! Chris-Kriton Skylaris –
University of Southampton
! Nick Hine –
University of Cambridge
! E-Infrastructure South consortium – –
Istvan Reguly Emerald
! iSolutions –
Iridis3
! NVIDIA – –
Mark Berger Timothy Lanfear
! PGI – –
Mathew Colgrove Pat Brooks
! EPSRC
47
Precision • Optimized energies exceed the level of precision required for chemical accuracy (milliHartree) • Difference in calculated energies relative to the CPU code comparable to differences resulting from the number of MPI threads in the CPU code
48
Dynamic Precision • GPUs have a larger number of single precision (SP) than double precision (DP) cores. • Other codes (GAMESS-UK and Terachem) utilize dynamic precision. • Switch between SP and DP determined by a convergence threshold.
49
[Plot: energy difference (Eh) vs iteration number (1-24), log scale from 1E+02 down to 1E-12, comparing CPU DP, GPU DP and dynamic-precision runs with switching thresholds of 1x10E-5, 5x10E-6, 1x10E-6 and 1x10E-7]
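A minimal sketch of the switching idea (hypothetical function names, thresholds and stub iterations; not ONETEP's implementation): run the single-precision path until the energy change per iteration drops below a threshold, then continue in double precision.

#include <math.h>
#include <stdio.h>
#include <stdbool.h>

/* Stand-ins for one SP / DP optimization iteration; in a real code these
   would launch single- or double-precision GPU kernels. Here they simply
   damp the energy toward a fixed point so the driver can be exercised.   */
static double iterate_sp(double e) { return e * 0.5; }
static double iterate_dp(double e) { return e * 0.5; }

/* Dynamic-precision driver: iterate in SP until the per-iteration energy
   change falls below switch_tol, then continue in DP down to conv_tol.   */
static double optimize(double energy, double switch_tol, double conv_tol, int max_iter) {
    bool use_dp = false;
    for (int it = 0; it < max_iter; ++it) {
        double next  = use_dp ? iterate_dp(energy) : iterate_sp(energy);
        double delta = fabs(next - energy);
        energy = next;
        if (!use_dp && delta < switch_tol) use_dp = true;   /* promote to DP */
        else if (use_dp && delta < conv_tol) break;         /* converged     */
    }
    return energy;
}

int main(void) {
    printf("final energy: %g\n", optimize(1.0, 1e-5, 1e-10, 100));
    return 0;
}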
Nike Dattani Two GPU implementations in MATLAB: The Chaos Game Representation for genetic sequences, and Feynman integrals for open quantum systems dynamics
8 May 2012

Quantum computers
Photosynthesis: more efficient energy sources
Quantum communication devices / high-temperature superconductors / quantum measurement devices
Bird navigation: more accurate navigation instruments
Cancer prevention
Smelling: design of better perfumes
http://blog.accelereyes.com/blog/2010/04/05/converting-matlab-loops-to-gpu-code/
Time mesh (indexed by k): (tk=0 , tk=1 , tk=2 , tk=3 , … , tk=N) At each time the system is in a state sk: (sk=0 , sk=1, sk=2 , sk=3 , … , sk=N) Each set {sk} is called a “Feynman path” Each Feynman path has an amplitude:
s0 s1 s2 s3 s4 = 11111 0.493 11112 0.917 ………………..... 21247 0.476 ………………..... MMMMM 0.851 Finally: Add the amplitudes for which the final state is the same, and calculate the amplitudes for the next time step (and repeat) If ‘memory’ is 5 steps, paths like 222222 and 213927 should be combined (ie. the Feynman amplitudes are 5th – order markovian, in this case)
For a 2-state system, 1st order markovian We have an amplitude array: And a propagator matrix: We want:
Which, by definition is:
For a 2-state system, 1st order markovian We have: ,
For a 2-state system, 2nd order markovian We have an amplitude array: And a propagator matrix: We want:
Which, by definition is:
For a 2-state system, 2nd order markovian We have: ,
For a 3-state system, 1st order markovian We have an amplitude array: And a propagator matrix:
We want:
Which, by definition is:
For a 3-state system, 1st order markovian We have: ,
For an M-state system, Δkmax order markovian We have an amplitude array:
And a propagator matrix:
We want:
Which, by definition is:
For an M-state system, Δkmax order markovian We have:
,
Most obvious MATLAB algorithm uses the function KRON Instead of KRON , restructure to use EXPAND: 2x speed up Instead of EXPAND, restructure to use REPMAT: 3x speed up (6x total) Instead of REPMAT, restructure to use BSXFUN: 2x speed up (12x total)
268 MB of data: 4x speed up (48x speed up in total) 1.07 GB of data: 6x speed up (72x speed up in total) 4.29 GB of data: doesn’t fit on the GPU
Benchmark timings (seconds) for increasing memory depth Δkmax:

Skynet:
Δkmax       CPU          GPU        CPU/GPU
  2       0.057081      1.5377      0.0371
  3       0.065062      3.2356      0.0201
  4       0.091603      2.1099      0.043
  5       0.19423       1.2279      0.1581
  6       0.61296       1.2023      0.5098
  7       2.416         1.335       1.8097
  8      16.6221        1.6903      9.8338
  9      59.2344        4.6461     12.7492
 10     282.8173       15.6012     18.1279
 11    1166.5199       67.1675     17.3673
 12    4695.0871          -           -

24 December 2012, on Zen (C2070 with 6 GB of VRAM):
Δkmax    CPU time (s)   GPU time (s)   Speed-up
  2        0.0550         1.1172
  3        0.0624         1.1196
  4        0.0890         1.1118
  5        0.1961         1.0741
  6        0.6004         1.0791
  7        2.2036         1.1396        x 2
  8        9.3467         1.5705        x 6
  9       39.542          4.3065        x 9
 10      184.822         14.5570        x 12.7
 11      738.747         57.7692        x 12.8
 12     2902.89          (data doesn't fit on GPU)

Jade:
Δkmax       CPU          GPU        CPU/GPU
  2       0.052856      0.85784     0.0616
  3       0.051091      0.85151     0.0600
  4       0.070343      0.84949     0.0828
  5       0.13912       0.85151     0.1634
  6       0.38894       0.85712     0.4538
  7       1.3665        0.92836     1.4720
  8       5.8251        1.5124      3.8516
  9      26.099         5.03        5.1887
 10      94.6453       18.3151      5.1676
 11     385.6105       77.1665      4.9971
 12    1598.0806          -           -
On CPU – took 1238 seconds:
On GPU – took 107 seconds:
Mike Giles, Wes Armour & Andrew Richards Oxford Supercomputing Centre
OeRC Mihai Duta , Albert Solernou , Steven Young, Mark Hayton, Istvan Reguly For being some of the best computing staff I’ve ever worked with !
GPU Implementation of Lattice Boltzmann Method with Immersed Boundary: observations and results
Alistair Revell
University of Manchester, Modelling and Simulation Centre - Twitter: @Dr Revell
The Oxford e-Research Centre Many-Core Seminar Series, 5th June 2013

Acknowledgements
• Colleagues contributing to this work:
  • Mr. Mark Mawson (PhD student @ Univ. Manchester)
  • Dr. Julien Favier (Univ. Marseilles)
  • Pr. Alfredo Pinelli (City Univ. London; prev. CIEMAT, Madrid)
  • Mr. Pedro Valero Lara (PhD student @ CIEMAT)
  • Mr. George Leaver (Visualization, Univ. Manchester)
• Thanks to the UKCOMES community (UK Consortium on Mesoscale Engineering Sciences), which led to this opportunity.
Overview of Seminar
• Lattice-Boltzmann method: an overview of the method; some GPU-related optimizations; validation / results
• Immersed Boundary method: overview of the method; implementation of IB into LB; some GPU optimisations; results
• Realtime simulation: OpenGL tweaks; live demos
Lattice Boltzmann Method: Introduction
The Lattice Boltzmann Method might be considered to be a 'mesoscale' approach.
• Macroscale approaches:
  • In the limit of low Knudsen number one can assume a 'continuum'; the Navier-Stokes equations can be applied to infinitesimal elements.
• Microscale approaches:
  • In the limit of high Knudsen number one might resort to Molecular Dynamics, but this approach is impractical for macroscale applications.
• Mesoscale:
  • Broadly, one can consider that the Lattice Boltzmann Method operates between these constraints: it can be extended to macroscale problems whilst retaining a strong underlying element of particle behaviour.
LBM
LBM+IBM(+GPU)
realtime LBM
[Raabe, 2004, ]
Lattice Boltzmann Method: Introduction
LBM+GPU
LBM
Lattice Boltzmann Method: Introduction
• Focus on a distribution of particles f(r, c, t)
  • Characterises the particles without realising their individual dynamics
  • Instead considers the distribution of particle velocities
• Macroscopic quantities are recovered by integration of f
• For a given equilibrium gas, one can obtain the Maxwell-Boltzmann distribution function:
$$ f^{(eq)}(c) = 4\pi c^2 \left(\frac{m}{2\pi kT}\right)^{3/2} e^{-\frac{mc^2}{2kT}} $$
• The Boltzmann equation describes the return of f(r, c, t) to equilibrium conditions f^{(eq)}(r, c, t); it is the cornerstone of the Lattice Boltzmann Method:
$$ f(r + c\,dt,\; c + F\,dt,\; t + dt)\,dr\,dc - f(r, c, t)\,dr\,dc = \Omega(f)\,dc\,dt $$
Lattice Boltzmann Method: Introduction
• LBM has its origins in Lattice Gas Cellular Automata
• Hardy, Pomeau, de Pazzis [Hardy et al., 1973]
  • proposed a square lattice arrangement with basic conservation laws applied
  • particles can move only along one of the directions, only to the next node in one timestep; no two particles at the same site can move in the same direction
  • though not rotationally invariant, and produced 'square vortices'!
• Frisch, Hasslacher, Pomeau [Frisch et al., 1986]
  • proposed a hexagonal lattice; ensured realistic fluid dynamics
  • included a 'random element' and used look-up tables
Lattice Boltzmann Method: Introduction
• The Boltzmann equation can be written as:
$$ \partial_t f + c \cdot \nabla_r f + F \cdot \nabla_c f = \Omega $$
• We use the BGK collision operator from [Bhatnagar et al., 1954] to approximate Ω:
$$ \Omega = \frac{1}{\tau}\left( f^{(eq)} - f \right) $$
  • represents relaxation back to the equilibrium distribution in a timescale τ
• And use an expansion of f^{(eq)} that is 2nd-order accurate:
$$ f^{(eq)} = \left(\frac{1}{2\pi c_s^2}\right) e^{-\frac{c^2}{2c_s^2}} \left[ 1 + \frac{c\cdot u}{c_s^2} + \frac{(c\cdot u)^2}{2 c_s^4} - \frac{u^2}{2 c_s^2} \right] $$
Lattice Boltzmann Method: Introduction
• DnQm models are used, for n spatial dimensions and m discrete velocities
• Spatial discretization on the lattice is provided by Gaussian quadrature
• Here we use D2Q9 and D3Q19
• Force term implemented following [Guo et al., 2002]:
$$ f_i(x + e_i,\, t+1) = f_i(x, t) - \frac{1}{\tau}\left( f_i(x,t) - f_i^{(0)}(x,t) \right) + \left(1 - \frac{1}{2\tau}\right) \omega_i \left( \frac{e_i - u}{c_s^2} + \frac{c_i \cdot u}{c_s^4}\, c_i \right) \cdot f_{ib} $$
• f_ib is eventually used for the immersed boundary
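For concreteness, here is a minimal CUDA device function for the standard discrete D2Q9 equilibrium (the lattice counterpart of the expansion above, with weights w_i and c_s^2 = 1/3); this is the textbook form, not the authors' code:

// Standard D2Q9 lattice: weights w[i] and velocities (cx[i], cy[i]); c_s^2 = 1/3.
__constant__ double w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
__constant__ int    cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int    cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

// Discrete equilibrium for direction i at density rho and velocity (ux, uy):
// f_i^eq = w_i * rho * [1 + 3(c_i.u) + 4.5(c_i.u)^2 - 1.5 u^2]
__device__ double f_eq(int i, double rho, double ux, double uy) {
    double cu = cx[i] * ux + cy[i] * uy;      // c_i . u
    double u2 = ux * ux + uy * uy;            // |u|^2
    return w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u2);
}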
Validity of LBM for different flows
• Based on the 13-moment theory of [Grad, 1949]: the distribution function may be expanded over velocity and space using orthonormal Hermite polynomials [Shan et al., 2006]:
$$ f^{N}(x, c, t) \approx \omega(c) \sum_{n=0}^{N} \frac{1}{n!}\, a^{(n)}\, H^{(n)}(c), \qquad a^{(n)} = \int f\, H^{(n)}(c)\, dc $$
• Coefficients of the Hermite polynomials match the moments of the macroscopic variables [Latt, 2013]
Different lattice models
[Diagrams: D2Q9, D2Q17, D3Q19, D3Q39 [Nie and Chen, 2009]]
Boundary Conditions
• 1st-order bounce-back conditions are the simplest
• 2nd-order Zou-He conditions have been implemented [Zou, 1997]
  • some 'issues' at corners and along edges where the problem is 'underdefined'
Lattice Boltzmann Method: Algorithm
1. initialise
2. compute forces
3. compute equilibrium function
4. stream & collide  (collide is local, stream is non-local)
5. impose boundary conditions
6. compute macroscopic quantities
7. loop to 2
Validation of 3D solver & boundary conditions
• Poiseuille flow [plots: single precision and double precision]
• Lid-driven cavity - 2D: [Ghia et al., 1982]; 3D: [Jiang et al., 1994]
Implementation on GPU

Hardware Tested
Feature                               GK104: K5000M       GK110: K20c
cores (SMX x cores/SMX)               1344 (7 x 192)      2496 (13 x 192)
regs / thread                         63                  255
DRAM                                  4 GB                4.7 GB
SP/DP ratio                           24:1                3:1
Peak performance (single precision)   1.6 TFLOPS          3.5 TFLOPS
DRAM bandwidth (measured)             66 GB/s             143 GB/s
Lattice Boltzmann Method: Algorithm
PUSH:
1. initialise
2. compute forces
3. compute f(eq)
4. collide (local)
5. stream (non-local), i.e. requires synchronisation
6. impose bcs.
7. compute macroscopic quantities

PULL (see [Rinaldi et al., 2012]):
1. initialise
2. compute forces
3. compute f(eq)
4. collide
5. stream, i.e. read values from host into new location
6. impose bcs.
7. compute macroscopic quantities
Lattice Boltzmann Method: GPU implementation

CPU implementation: push

for (int dir = 0; dir < 9; ++dir) {
    Xnew = X + cx[dir];                       // Stream x  -- PUSH
    Ynew = Y + cy[dir];                       // Stream y
    pop[dir][Xnew][Ynew] = pop_old[dir][X][Y] * (1 - omega)
                         + pop_eq[dir] * omega
                         + force_latt[dir];   // Collide
}

GPU implementation: pull

int size = Nx * Ny;
for (dir = 0; dir < 9; ++dir) {
    Xnew = X - d_cx[dir];                     // Stream x  -- PULL
    Ynew = Y - d_cy[dir];
    pop_local[dir] = pop_old[dir * size + Ynew * Nx + Xnew];
}
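Wrapped into an actual kernel, the pull update might look like the following minimal CUDA sketch (one thread per lattice node, D2Q9, periodic wrap, BGK collide; array and variable names are assumptions, not the authors' code):

// Fused pull-stream + BGK collide for a D2Q9 lattice, one thread per node.
// f_old / f_new are flattened as [dir * Nx * Ny + y * Nx + x].
__constant__ int    d_cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int    d_cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
__constant__ double d_w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                               1.0/36, 1.0/36, 1.0/36, 1.0/36};

__global__ void stream_collide_pull(const double *f_old, double *f_new,
                                    int Nx, int Ny, double omega) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= Nx || y >= Ny) return;

    int    size = Nx * Ny;
    double f[9], rho = 0.0, ux = 0.0, uy = 0.0;

    // Pull: read each population from the node it streams in from (periodic wrap).
    for (int i = 0; i < 9; ++i) {
        int xs = (x - d_cx[i] + Nx) % Nx;
        int ys = (y - d_cy[i] + Ny) % Ny;
        f[i] = f_old[i * size + ys * Nx + xs];
        rho += f[i];
        ux  += f[i] * d_cx[i];
        uy  += f[i] * d_cy[i];
    }
    ux /= rho;
    uy /= rho;

    // BGK collide against the D2Q9 equilibrium (c_s^2 = 1/3).
    double u2 = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        double cu  = d_cx[i] * ux + d_cy[i] * uy;
        double feq = d_w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * u2);
        f_new[i * size + y * Nx + x] = f[i] * (1.0 - omega) + feq * omega;
    }
}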
Memory Arrangement
• Code is parallelized such that one thread performs the complete LBM algorithm at one spatial location f(x)_i in the fluid domain.
• Each thread stores values of f(x)_i and an integer tag denoting boundary type. Density ρ and velocity u_i are only stored if output.
• All is stored in a struct within registers to minimise high-latency access to DRAM once initially loaded.
• Within DRAM it is common practise to 'flatten' multiple-dimension arrays into a single dimension to reduce additional dereferencing: f[dir * Nx * Ny + Y * Nx + X]
Instruction Level Parallelism
More ILP at the expense of occupancy improves performance [Volkov, 2010]
• Operating on all indices of f in one thread helps to hide latency
• Use struct-of-arrays access to f; coalesced access therefore depends only on the x-component of the discrete velocity direction (c)
  [Diagram: no x-component of c in the SM vs x-component of c = 1 in the SM]
• Cache hit rate is low as we don't have repeat accesses (< 7% in L2)
• ... so instead maximise register use (which lowers occupancy)
Maximising use of registers
• A maximum of 65536 registers per SMX, which can host up to 2048 threads:
$$ \frac{\text{registers}}{\text{thread}} \times \frac{\text{threads}}{\text{block}} \times \frac{\text{blocks}}{\text{SMX}} \le 65536 $$
• In 3D we need 19 loads for f(x) and 19 stores; we also need 1 integer (boundaries) and 4 macroscopic quantities, so a total of 43 registers is needed per thread
• For high registers/thread, large block sizes are impractical: e.g. a block size of 1024 means only a single block would run; [Obrecht et al., 2013] recommend a maximum block size of 256
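As a rough worked example of that budget (my arithmetic, not a figure from the slides): with 43 registers per thread, the 65536-register file of one SMX supports at most

$$ \left\lfloor \frac{65536}{43} \right\rfloor = 1524 \ \text{resident threads} \ (< 2048), \qquad \left\lfloor \frac{1524}{256} \right\rfloor = 5 \ \text{blocks of 256 threads} \;\Rightarrow\; 1280 \ \text{threads per SMX}. $$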
Shared memory / shuffle operation on Keplers
[Diagrams: shared memory vs shuffle]
• Large block sizes are impractical for LBM
• ILP is more efficient than shared memory or shuffle
Overall Performance
• Peak of 814 MLUPS and 402 MLUPS for the K20c and K5000M respectively
• vs a theoretical max of 892 MLUPS and 412 MLUPS based on measured bandwidth
• Compared well to other implementations (albeit on other hardware): [Rinaldi et al., 2012] and [Astorino et al., 2011] use shared memory
realtime LBM
LBM
LBM+IBM(+GPU)
Immersed Boundary method
LBM+GPU
realtime LBM
Immersed Boundaries (1977-)
Fluid equations are solved on an Eulerian mesh. The geometrical complexity is embedded using a Lagrangian mesh via a system of singular forces.
• Original motivation of Charles Peskin [Peskin, 1977]
• An alternative to a body-fitted mesh
  • Define an arbitrary mesh shape
  • Preserve an efficient high-order (Cartesian) solver
• Not a replacement, but a valuable tool
Original vs current approach
• Peskin idea: model the boundary as a system of inextensible springs & dampers
  • The elastic forces that restore the true and actual boundary position are supplied to the rhs of the momentum equations in terms of body forces.
  • [Peskin, 1977]: applied to a complete heart fluid system
• [Pinelli et al., 2010] (no springs) has been introduced and applied (CIEMAT): no stiffness constant k
• The modified approach has some advantages:
  – moving boundary
  – deformable boundary
• Sharing a drawback:
  – pressure correction error
Immersed Boundary: basics (I)
The momentum equation with a boundary body force:
$$ \frac{\partial u}{\partial t} = \mathrm{rhs} + f, \qquad f = \frac{u^{(d)} - u^{n}}{\Delta t} - \mathrm{rhs} \quad \forall\, \vec{x} \in S $$
Along S the boundary condition is u = u^(d).
Algorithm:
1. given f, update the position of the Lagrangian markers (boundary shape)
2. advance the momentum equation without boundary-induced body forces (giving u* on S)
3. compute f as a function of the difference u^(d) - u*
4. repeat the momentum advancement with f
5. compute the pressure correction, project the velocity field
6. goto 1.
Immersed Boundary: basics (II)
In discrete form in 2D:
Interpolation:
$$ U^{*}(X_l) = \sum_{i,j} u^{*}(x_{i,j})\, h(x_{i,j} - X_l)\, \Delta x^2 $$
Spread (convolution):
$$ f(x_{i,j}) = \sum_{l \in \Omega_l} F(X_l)\, h(x_{i,j} - X_l)\, \epsilon_l $$
• Epsilon is the key to the accuracy
  • it can be considered to represent the physical width of the surface
  • its computation guarantees that interpolation(spread(F)) = F
Coupling IB with LB
• Bounceback does not offer high accuracy
  • assumes a 'stair-step' surface
  • and is problematic for moving/flexible boundaries
• Inherent suitability of LB to IB
  • Lattice already uniform Cartesian
  • Poisson-free → no pressure-correction drawback
Lattice Boltzmann Method: Algorithm (see [Favier et al., 2013] for full details)
1. initialise
2. set forces to zero
3. find velocity field in absence of object (call LBM)
4. compute support (not required if not moving)
5. find epsilon (costly, and also not required if not moving)
6. interpolate fluid velocity onto Lagrangian markers
7. compute required force for object
8. spread force onto lattice
9. (solve other equations: collision, tension, bending, inextensibility)
10. find velocity field with object (call LBM again)
11. compute macroscopic quantities
Rigid Particles: validation 1
Falling particle under gravity, ρp/ρf = 8, Re = 165; starting at rest, no-slip walls, gravity along x.
Validation against finite-difference DNS [Uhlmann, 2005].
[Plots: x-position and settling velocity]

Rigid Particles: validation 2
• Kissing/tumbling particles [Uhlmann, 2005]
• Impulsively moved flat plate [Koumoutsakos and Shiels, 1996]
Immersed boundary + GPU
• Transactions for each point of the object are coalesced
  • each point only needs information about itself
• Transactions moving data between fluid and boundary are random, with much higher cache use
• Transfer from host to the IBM kernel can be started simultaneously to hide latency
  • exploit the capability to overlap memory transfers with execution
• The spreading operation is a problem (atomic add)
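A minimal CUDA sketch of why the spread step needs atomics (names, the kernel support and the simplified weight function are assumptions, not the authors' implementation): several Lagrangian markers may add force to the same fluid node, so concurrent threads must accumulate with atomicAdd.

// Spread marker forces onto the fluid grid: one thread per Lagrangian marker.
// Several markers can touch the same node, hence atomicAdd on the target cells.
// atomicAdd on double requires compute capability 6.0+; on older GPUs a float
// buffer or a compare-and-swap loop would be used instead.
__global__ void spread_forces(const double *Fx, const double *Fy,   // marker forces
                              const double *Xm, const double *Ym,   // marker positions
                              int n_markers,
                              double *fx, double *fy,               // fluid force field
                              int Nx, int Ny) {
    int l = blockIdx.x * blockDim.x + threadIdx.x;
    if (l >= n_markers) return;

    int i0 = (int)floor(Xm[l]) - 1;          // 4x4 support around the marker
    int j0 = (int)floor(Ym[l]) - 1;

    for (int j = j0; j < j0 + 4; ++j) {
        for (int i = i0; i < i0 + 4; ++i) {
            if (i < 0 || i >= Nx || j < 0 || j >= Ny) continue;
            // Simple tent-shaped weight standing in for the regularized delta.
            double wx = 1.0 - 0.5 * fabs(Xm[l] - i);
            double wy = 1.0 - 0.5 * fabs(Ym[l] - j);
            if (wx <= 0.0 || wy <= 0.0) continue;
            double w = wx * wy;
            atomicAdd(&fx[j * Nx + i], w * Fx[l]);
            atomicAdd(&fy[j * Nx + i], w * Fy[l]);
        }
    }
}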
Small vs. large objects
• Objects are treated according to size, i.e. the number of Lagrangian markers (nlag)
• For small objects (nlag < 1024) we assign one block per object
  • generally the case in 2D
• For large objects (nlag > 1024) we need to launch a kernel for each object
  • e.g. for a sphere of radius r = 20 we need nlag ≈ 4000
LBM
Flexible beating filament I
Following the method of [Huang et al., 2007]:
• Solid momentum:
$$ \rho \frac{\partial^2 X}{\partial t^2} = \frac{\partial}{\partial s}\left(T \frac{\partial X}{\partial s}\right) - \frac{\partial^2}{\partial s^2}\left(K_B \frac{\partial^2 X}{\partial s^2}\right) + \rho g - F_h $$
• Inextensibility condition:
$$ \frac{\partial X}{\partial s} \cdot \frac{\partial X}{\partial s} = 1 $$
• Incompressibility condition:
$$ \nabla \cdot u = 0 $$
• Immersed boundary (hydrodynamic force):
$$ F_h = \mathrm{Interpol}(f) $$
• Kinematic condition:
$$ \frac{\partial X}{\partial t} = \mathrm{Interpol}(u) $$
• f is the force required by the fluid to verify the b.c.; F_h is the hydrodynamic force resulting from having applied the b.c.
• Apply the inextensibility condition to the filament
Flexible beating filament II
Application to the Lattice-Boltzmann solver:
• Staggered discretization of the Lagrangian space (X and T)
• X^{n+1} from the discretized solid-momentum update (involving (X^{n+1} - 2X^n + X^{n-1})/Δt², D_s(T^{n+1} D_s X^{n+1}), the bending term and the forces F_b and F), subject to the discrete inextensibility constraint D_s X^{n+1} · D_s X^{n+1} = 1
• Tension computed via an iterative Newton-Raphson loop (by computing the exact Jacobian)
• For the initial guess, it is possible to derive a very good estimate for the tension by incorporating the inextensibility condition in the momentum equations: this yields an equation for T^{n+1} of the form A T^{n+1} = rhs, where A is a tridiagonal matrix and the rhs involves ∂Ẋⁿ/∂s, ∂Xⁿ/∂s · dF_hⁿ/ds and the bending term
Flexible filaments: validation 1 (see [Favier et al., 2013] for full details)
• Without fluid (falling under its own weight)
• With fluid (for different rigidities we observe the correct 'kick' of the free end)

Flexible filaments: validation 2 (see [Favier et al., 2013] for full details)
• Filament interactions, 2 filaments: correct phase dependence on the separation of the filaments

Flexible filaments: investigation
• Cylinder coated with filaments: drag reduction [Favier et al., 2009], [Revell et al., 2013]
• EU-funded project on this topic just started: PELSKIN (temporary web address: http://195.83.116.187/pelskin_web/)
LBM
LBM+IBM(+GPU)
Realtime LBM
LBM+GPU
realtime LBM
LBM
Realtime output
Several OpenGL visualization techniques are implemented:
• Contour colour map (for velocity magnitude) is stored on the device
• Image Based Flow Visualization (IBFV) simulates advection of particles through an unsteady vector field (the macroscopic u field)
  • instead of particle seeding, which can be hit or miss, noise textures are used to represent a dense set of particles, which are advected forwards using texture mapping
  • the previous frame M is textured onto a distorted mesh and blended with a random noise texture N according to a blending factor α:
$$ M_i(x) = (1 - \alpha)\, M_{i-1}\big(x + u_i(x, t_i)\,\Delta t\big) + \alpha\, N_i(x) $$
  • velocities are used to displace mesh vertices using forward integration; the noise texture does not affect the flow; the mesh resolution may be coarser than the lattice
• Dye injection uses a similar method and can be more intuitive
LBM+GPU
LBM
LBM+IBM(+GPU)
Uses for realtime?
• Initially started out as a gimmick, used for teaching and open days
• Attracting increased attention in some areas
  • training for medical surgical procedures
  • for complex industrial applications where 'intuition' is missing
LBM
Conclusions
• LBM solver implemented on Kepler hardware with some optimizations
  • 814 MLUPS peak on K20c
• IB-LB solver validated for rigid and flexible geometry
  • various flow physics investigations underway, in the context of the EU project PELSKIN and a BBSRC project on 'protein manufacturability'
• Realtime version of the solver available
  • with various visualization options
  • exploring potential applications
References I
Huang, W., Shin, S., and Sung, H. (2007). Simulation of flexible filaments in a uniform flow by the immersed boundary method. Journal of Computational Physics, 226(2):2206-2228.
Hardy, J., Pomeau, Y., and de Pazzis, O. (1973). Time evolution of a two-dimensional model system. Journal of Mathematical Physics, 14:1746-1752.
Guo, Z., Zheng, C., and Shi, B. (2002). Discrete lattice effects on the forcing term in the lattice Boltzmann method. Physical Review E, 65(4).
Grad, H. (1949). On the kinetic theory of rarefied gases. Communications on Pure and Applied Mathematics.
Ghia, U., Ghia, K., and Shin, C. (1982). High-Re solutions for incompressible flow using the Navier-Stokes equations and a multigrid method. Journal of Computational Physics.
Frisch, U., Hasslacher, B., and Pomeau, Y. (1986). Lattice-gas automata for the Navier-Stokes equation. Physical Review Letters.
Favier, J., Revell, A., and Pinelli, A. (2013). A lattice Boltzmann - immersed boundary method to simulate the fluid interaction with moving and slender flexible objects. HAL, hal-00822044.
Favier, J., Dauptain, A., Basso, D., and Bottaro, A. (2009). Passive separation control using a self-adaptive hairy coating. Journal of Fluid Mechanics, 627:451-483.
Bhatnagar, P., Gross, E., and Krook, M. (1954). A model for collision processes in gases. I: Small amplitude processes in charged and neutral one-component systems. Physical Review, 94:511-525.
Astorino, M., Sagredo, J. B., and Quarteroni, A. (2011). A modular lattice Boltzmann solver for GPU computing processors.
References II
Rinaldi, P., Dari, E., Vénere, M., and Clausse, A. (2012). A lattice-Boltzmann solver for 3D fluid simulation on GPU. Simulation Modelling Practice and Theory, 25:163-171.
Revell, A., Favier, J., and Pinelli, A. (2013). Drag reduction of flow around a cylinder with attached flexible filaments. In ERCOFTAC International Symposium on Unsteady Separation in Fluid-Structure Interaction.
Raabe, D. (2004). Overview of the lattice Boltzmann method for nano- and microscale fluid dynamics in materials science and engineering. Modelling Simul. Mater. Sci. Eng., 12.
Pinelli, A., Naqavi, I., Piomelli, U., and Favier, J. (2010). Immersed-boundary methods for general finite-difference and finite-volume Navier-Stokes solvers. Journal of Computational Physics, 229(24):9073-9091.
Peskin, C. S. (1977). Numerical analysis of blood flow in the heart. Journal of Computational Physics, 25(3):220-252.
Obrecht, C., Kuznik, F., Tourancheau, B., and Roux, J.-J. (2013). Efficient GPU implementation of the linearly interpolated bounce-back boundary condition. Computers & Mathematics with Applications.
Nie, X. B., Shan, X., and Chen, H. (2009). Lattice-Boltzmann/finite-difference hybrid simulation of transonic flow. AIAA Paper 2009-139.
Latt, J. (2013). Introduction to lattice Boltzmann method. http://www.youtube.com/watch?v=I82uCa7SHSQ.
Koumoutsakos, P. and Shiels, D. (1996). Simulations of the viscous flow normal to an impulsively started and uniformly accelerated flat plate. Journal of Fluid Mechanics, 328:177-227.
Jiang, B., Lin, T., and Povinelli, L. (1994). Large-scale computation of incompressible viscous flow by least-squares finite element method. Computer Methods in Applied Mechanics and Engineering, 114.
References III
Zou, Q. and He, X. (1997). On pressure and velocity boundary conditions for the lattice Boltzmann BGK model. Phys. Fluids, 9(6).
Volkov, V. (2010). Better performance at lower occupancy. In Proceedings of the GPU Technology Conference, GTC, volume 10.
Uhlmann, M. (2005). An immersed boundary method with direct forcing for the simulation of particulate flows. Journal of Computational Physics, 209(2):448-476.
Shan, X., Yuan, X.-F., and Chen, H. (2006). Kinetic theory representation of hydrodynamics: a way beyond the Navier-Stokes equation. Journal of Fluid Mechanics, 550(1):413-441.