
OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. Any correspondence concerning this service should be sent to the repository administrator: tech-oatao@listes-diff.inp-toulouse.fr

This is an author's version published in: https://oatao.univ-toulouse.fr/26710

To cite this version: Dufresne, Yann and Moureau, Vincent and Lartigue, Ghislain and Simonin, Olivier. A massively parallel CFD/DEM approach for reactive gas-solid flows in complex geometries using unstructured meshes. (2020) Computers and Fluids, 198, 104402. ISSN 0045-7930.

Official URL: https://doi.org/10.1016/j.compfluid.2019.104402

A massively parallel CFD/DEM approach for reactive gas-solid flows in complex geometries using unstructured meshes

Yann Dufresne a,∗, Vincent Moureau a, Ghislain Lartigue a, Olivier Simonin b

a CORIA-UMR6614, Normandie Université, CNRS, INSA and UniRouen, Rouen, 76000, France
b Institut de Mécanique des Fluides de Toulouse (IMFT), Université de Toulouse, CNRS, INPT, UPS, Toulouse, France

Abstract

Despite having been thoroughly described in various simple configurations, the study of gas-fluidized systems in a CFD/DEM (Discrete Element Method) formalism becomes challenging as the computational domain size and complexity rise. For a while, attention has been drawn to the design of physical models for fluid-particle interactions, but a recent challenge for numerical tools has been to take advantage of the increasing power of distributed memory machines in order to simulate realistic industrial systems. Furthermore, unstructured meshes are appealing for their ability to describe complex geometries and to perform local refinements, but lead to significant coding effort involving sophisticated algorithms. In an attempt to design a numerical tool able to cope with these limitations, the methodology presented here proposes an efficient non-blocking algorithm for massive parallelism management, as well as an exhaustive contact scheme to deal with arbitrarily complex geometries, all to be operated on unstructured meshes. The aim is two-fold: (i) to assist larger scale codes in their endeavor to close the solid stress tensor, for example; (ii) to pave the way for complex industrial-scale system modeling using DEM. The methodology is successfully applied to a pilot-scale fluidized bed gathering 9.6 M spherical particles and enables reaching interesting physical times using reasonable computational resources.

1. Introduction

In Fluidized Bed Reactors (FBR), the fluidization regime occurs when the fluid that passes through the granular material exceeds the minimum fluidization velocity. In this regime, the drag force applied to the solid grains counterbalances gravity, which leads to a strong mixing of the fluid and solid phases. This mixing ensures efficient heat and mass transfers across the reactor and minimizes temperature and species concentration gradients in the fluidized region. These properties are particularly beneficial in the fields of metallurgy, energy and the chemical industry, for instance in large-scale operations such as chemical synthesis, coating or drying [1]. Low-temperature combustion with high conversion efficiency and low pollutant emissions, such as nitrogen oxides, is one of the numerous achievements of FBR. In the past, a lack of understanding of the complex dynamic behavior of such devices has been pointed out [2] as one of the causes of the severe difficulties in their design and scale-up [3]. Thus, much time and resource is spent on the building of preliminary tests on pilot-scale reactors that will lead to the design of the final industrial-scale reactor by means of empirical processes [4].

Nomenclature for non-obvious or recurrent abbreviations (by order of appearance in the text): DEM, Discrete Element Method; ELGRP, Mesh Element Group; PTGRP, Particle Group; INTCOMM, Internal Communicator; EXTCOMM, External Communicator; MPI, Message Passing Interface; PTEXTCOMM, Particle External Communicator; PGTS, Particle Group To Send; VR, Voronoi Region; F, E, V, boundary Face, Edge, Vertex; BFG, Boundary Face Group; BSBFG, Bounding Sphere of Boundary Face Group; BSF, Bounding Sphere of Face.

∗ Corresponding author. E-mail addresses: yann.dufresne@insa-rouen.fr, yann.dufresne@coria.fr

Computational Fluid Dynamics (CFD) has already contributed to the understanding of many elementary physical principles through numerous studies on various system sizes, ranging from the study of heat and mass transfer at the particle scale [5] to the modeling of complete industrial units [6]. The prime difficulty resides in the large spectrum of length and time scales involved. Indeed, even in industrial-scale systems where the ratio of the reactor size to the solid particle diameter is very large, the fluidization regime features macroscopic structures such as recirculations, particle clusters and gas bubbles, whose dynamics prediction strongly depends on the microscopic description of particle contacts in particular.

Today, the most promising framework for the modeling of industrial units remains the Two Fluid Model (TFM), also referred to as the Euler-Euler method, in which it is assumed that both the gas and the particle phase are inter-penetrating continua [6]. Its underlying assumption is the existence of a separation of scales: the size of the averaging region is much larger than the particle scale. This class of methods is computationally effective, but the establishment of an accurate continuous description of the solid phase is challenging and its formulation requires semi-empirical closures and detailed validations. On the other hand, the Discrete Element Method (DEM), also referred to as the discrete particle method, allows for a more detailed description of particle-particle and particle-wall interactions. This deterministic approach finds its origins in the molecular dynamics methods initiated by Alder and Wainwright [7] and has been benefiting from their advances ever since. In CFD/DEM, or Euler-Lagrange methods, the gas phase is still considered continuous and its time evolution is obtained from a classical CFD-type Eulerian code, but the particles are described individually assuming that their motion obeys Newton's second law of motion, which is solved using standard schemes for ordinary differential equations. This level of modeling, designated as meso-scale, still requires closures for drag, collision and other forces as a CFD grid cell typically contains up to a few tens of particles, but its advantages lie in its ability to account for the particle-wall and particle-particle interactions in a more realistic manner than Euler-Euler methods.

For the time being, apart from the closures still needed when using CFD/DEM, two main factors limit its utilization for realistic industrial system studies: (i) the solving of the momentum balance for each particle gives rise to substantial costs that can only be overcome by means of optimized parallelism management, and (ii) industrial system geometries are often composed of cylindrical and irregular parts that prevent the use of conventional Cartesian meshes and necessitate a proper methodology to treat particle-wall contacts. Reaching sufficient computational performance in CFD/DEM simulations serves two purposes: the first is to develop closure laws which can represent the effective averaged interactions in larger scale models such as TFM, and the second is to pave the way for pilot and industrial scale system simulations in the long run.

Many open-source or commercial CFD/DEM packages have already shown good capabilities for simulating such systems or more complex ones. Among them, one can cite the NGA [8] and MFIX-DEM [9] parallel solvers, which are both capable of simulating reactive flows based on Cartesian meshes. Other codes relying on unstructured meshes are built on the coupling of one solver dedicated to the fluid phase and another to the solid phase, such as OpenFoam®+LIGGGHTS® [10] and Fluent®+EDEM CFD® [11]. This study presents the design of a massively parallel code for simulating both phases on unstructured meshes. Concerning complex geometries, contrary to the algorithm suggested by Lin and Canny [12] implemented in the popular I-Collide [13] collision detection package, the method proposed in this work is able to return the measure of a particle penetration depth into the wall, while being simpler than the Voronoi-clip algorithm [14], which is designed for arbitrary complex 3D polyhedra collisions. This code can also work in reacting conditions.

An approach combining DEM to represent the solid phase with Large-Eddy Simulation (LES) equations solved on an Eulerian unstructured grid for the fluid phase has been implemented in the finite-volume code YALES2 [15], a LES and DNS (Direct Numerical Simulation) solver based on unstructured meshes. This code solves the low-Mach number Navier-Stokes equations for turbulent reactive flows using a time-staggered projection method for constant [16] or variable density flows [17].

There is abundant literature on the subject of the different existing models for drag [18], collision force [19] and other closures that may be used for turbulence or heat transfer modeling. These discussions do not fall within the scope of this work, which focuses on a methodology for performance increase. Thus, only elementary models are used in the present work. Furthermore, as heat transfer neither plays a significant role in code performance nor involves extra specific numerical methodology, our attention turns to the study of an isothermal gas-solid dense fluidized bed experimented at the University of Birmingham [20].

In this context, this paper is organized in seven parts. The Euler–Lagrange formalism is first described for both the gaseous and the particle phase in Section 2. Some noteworthy features of the YALES2 code are then briefly introduced in Section 3. The purpose of Section 4 is to present an efficient algorithm for parallelism management. Then, a viable manner to treat spherical particle contacts with arbitrary complex geometries is presented in Section 5. The main case under study is described in Section 6. Finally, the performances of the code are measured in Section 7. Useful abbreviations can be found in footnote 1.

2. The Euler–Lagrange formalism

This section exposes the main models and numerics used for solving the low-Mach number Navier–Stokes equations derived for granular flows in a LES framework. Then, a description of the closures and numerics for solid phase modeling is presented. The coupling between the phases is provided in the Appendix A, including the interpolation/projection technique and the description of filtering steps suited for unstructured meshes.

2.1. Gas phase modeling

The LES governing equations for granular flows are obtained from the filtering of the unsteady, low-Mach number Navier–Stokes equations, taking the local fluid and solid fractions into account. Further details concerning the volume filtering operations can be found in [21]. The governing equations for mass conservation, momentum transport, sensible enthalpy transport and species transport finally read:

$$\frac{\partial}{\partial t}\left(\varepsilon \bar{\rho}\right) + \nabla \cdot \left(\varepsilon \bar{\rho}\, \tilde{u}\right) = 0, \quad (1)$$

$$\frac{\partial}{\partial t}\left(\varepsilon \bar{\rho}\, \tilde{u}\right) + \nabla \cdot \left(\varepsilon \bar{\rho}\, \tilde{u}\tilde{u}\right) = -\nabla \bar{P} + \nabla \cdot \left(\varepsilon \bar{\tau}\right) + \varepsilon \bar{\rho}\, g + F_{pf}, \quad (2)$$

$$\frac{\partial}{\partial t}\left(\varepsilon \bar{\rho}\, \tilde{h}_s\right) + \nabla \cdot \left(\varepsilon \bar{\rho}\, \tilde{u}\tilde{h}_s\right) = \nabla \cdot \left(\frac{\mu_t}{Pr_t} \nabla \tilde{h}_s\right) + \frac{dP_0}{dt} + \nabla \cdot \left(\varepsilon \lambda \nabla \tilde{T}\right) + \varepsilon \dot{\omega}_T + Q_{pf}, \quad (3)$$

$$\frac{\partial}{\partial t}\left(\varepsilon \bar{\rho}\, \tilde{Y}_k\right) + \nabla \cdot \left(\varepsilon \bar{\rho}\, \tilde{u}\tilde{Y}_k\right) = \nabla \cdot \left(\frac{\mu_t}{Sc_{k,t}} \nabla \tilde{Y}_k\right) + \nabla \cdot \left(\varepsilon \bar{\rho}\, D_k \nabla \tilde{Y}_k\right) + \varepsilon \dot{\omega}_k. \quad (4)$$

u, ρ, μ, P, h_s, P_0, T, λ, D_k, Y_k, ε are the gas velocity, density, dynamic viscosity, dynamic pressure, sensible enthalpy, thermodynamic pressure, temperature, thermal conductivity, diffusion coefficient, mass fraction of species k, and fluid fraction, respectively. $\dot{\omega}_k$ is the chemical source term and $\dot{\omega}_T$ the enthalpy source term. The turbulent variables noted μ_t, Pr_t and Sc_{k,t} are the gas turbulent viscosity, turbulent Prandtl number and turbulent Schmidt number of species k. The viscous strain tensor $\bar{\tau}$ is calculated as:

$$\bar{\tau} = \left(\mu + \mu_t\right)\left[\nabla \tilde{u} + \nabla \tilde{u}^T - \frac{2}{3}\left(\nabla \cdot \tilde{u}\right) I\right], \quad (5)$$

where I is the identity tensor. F_{pf} and Q_{pf} are the momentum source term and the heat source due to the coupling with particles, respectively. There is no species transfer between gas and particles. Details concerning the computation of these terms can be found in the Appendix A. These equations are supplemented by the ideal gas Equation-Of-State (EOS):

$$P_0 = \bar{\rho}\, \tilde{r}\, \tilde{T} \quad \text{with} \quad \tilde{r} = \sum_{k \in S} \frac{R\, Y_k}{W_k}, \quad (6)$$

with r being the ideal gas mass constant, R being the ideal gas constant, W_k the molar mass of species k, and S being the set of species.

For the sake of clarity, the fluid filtered quantities $\tilde{u}$, $\bar{\rho}$, $\bar{P}$, $\tilde{h}_s$, $\tilde{T}$, $\tilde{Y}_k$, $\bar{\tau}$, $\tilde{r}$, F_{pf} and Q_{pf} will be written u, ρ, P, h_s, T, Y_k, τ, r, F_{pf} and Q_{pf} in the following sections.
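As a minimal illustration of Eq. (6), the sketch below evaluates the mixture gas constant and uses it to recover the gas density at a given operating pressure. It is a simplified stand-alone example: the pure-nitrogen composition and the temperature value are assumptions of this illustration, not quantities taken from the solver.

```python
# Minimal sketch of the ideal-gas EOS, Eq. (6): P0 = rho * r * T with r = sum_k R*Y_k/W_k.
# Species data below are illustrative, not taken from the paper.
R_UNIVERSAL = 8.314  # universal gas constant, J/(mol K)

def mixture_gas_constant(mass_fractions, molar_masses):
    """Mass-specific gas constant r [J/(kg K)] of the mixture."""
    return sum(R_UNIVERSAL * y_k / molar_masses[k] for k, y_k in mass_fractions.items())

# Pure nitrogen (W = 0.028 kg/mol) at P0 = 12 bar and an assumed T = 300 K:
r = mixture_gas_constant({"N2": 1.0}, {"N2": 0.028})   # ~297 J/(kg K)
rho = 12.0e5 / (r * 300.0)                             # density from the EOS, ~13.5 kg/m3
print(r, rho)
```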

Eqs. (2)–(4) are integrated using an explicit variable density solver providing a fully mass, momentum and enthalpy conserving time advancement.

2.2. Particle phase modeling

The translational motion of a particle's center of gravity and its rotational motion around the center of gravity can be fully described by the following system of equations given by Newton's second law, assuming spherical particles of constant mass with a high solid/gas density ratio:

$$m_p \frac{d u_p}{dt} = F_D + F_G + F_P + F_C \quad \text{with} \quad \frac{d x_p}{dt} = u_p, \quad (7)$$

$$I_p \frac{d \omega_p}{dt} = M_D + M_C, \quad (8)$$

where m_p, u_p, x_p, I_p and ω_p are the particle mass, velocity, position, moment of inertia and angular velocity, F_D is the drag force, F_G = m_p g is the gravity force and F_P = −V_p ∇P@p is the pressure gradient force. In the last term, V_p is the particle's volume and ∇P@p is the local pressure gradient interpolated at the center of the particle. As in many dense gas-fluidized bed cases, a soft-sphere model [22] is employed, in which particles are allowed to overlap other particles or walls in a controlled manner. A resulting contact force F_C accounting for particle-particle and particle-wall repulsion is thus added in the momentum balance of each particle. M_C is the torque of the contact force F_C and M_D is the torque of fluid drag forces. The particle temperature evolution is given by:

$$m_p C_{p,p} \frac{d T_p}{dt} = Q_F, \quad (9)$$

where C_{p,p} and T_p are the particle mass heat capacity and temperature, and Q_F is the heat flux exchanged with the fluid.

The source terms for particles F_D and M_D are calculated using a combination of the Ergun [23] and Wen and Yu [24] drag laws, and a closure from the work of Dennis [25], respectively. The closures used for the computation of Q_F won't be detailed in this study, which focuses on an isothermal application. The relation between F_D, F_P and F_{pf}, between Q_F and Q_{pf}, as well as details concerning the interpolation kernels, are given in the Appendix A. A second-order explicit Runge-Kutta (RK2) algorithm is used to advance the particles in time. The use of a soft-sphere model demands that Δt_p < T_C, where Δt_p is the particle time step and T_C is a contact time described in Section 2.2.1. In this work, Δt_p = T_C/10 was considered, to ensure a reasonable precision.
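The sub-stepping constraint and a second-order explicit Runge-Kutta update can be sketched as follows. This is a minimal illustration only: the precise RK2 variant used by the solver is not specified here, the `acceleration` callback stands in for the drag, gravity, pressure-gradient and contact closures, and the parameter values are purely illustrative.

```python
import numpy as np

def contact_time(k_n, gamma_n, m_eff):
    """Contact duration T_C = pi / sqrt(omega_0^2 - gamma_n^2), with omega_0^2 = k_n / m_eff."""
    return np.pi / np.sqrt(k_n / m_eff - gamma_n**2)

def rk2_advance(x, u, acceleration, dt):
    """One explicit midpoint (RK2) step for dx/dt = u, du/dt = a(x, u)."""
    a1 = acceleration(x, u)
    x_mid, u_mid = x + 0.5 * dt * u, u + 0.5 * dt * a1
    a2 = acceleration(x_mid, u_mid)
    return x + dt * u_mid, u + dt * a2

# Particle time step chosen as a fraction of the contact time, dt_p = T_C / 10 (illustrative values).
k_n, gamma_n, m_p = 1.0e3, 50.0, 1.0e-6
dt_p = contact_time(k_n, gamma_n, 0.5 * m_p) / 10.0  # effective mass m_p/2 for two equal particles
```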

2.2.1. Modeling of collisions

The total collision force F_C acting on particle a is computed as the sum of all pair-wise forces f^{col}_{ba} exerted by the N_p particles and N_w walls in contact. As particles and walls are treated similarly during collisions, the b index refers to both:

$$F_C = \sum_{b=1}^{N_p + N_w} f^{col}_{ba} \quad \text{with} \quad f^{col}_{ba} = f^{col}_{n,b\to a} + f^{col}_{t,b\to a}. \quad (10)$$

Fig. 1. Soft sphere representation of two particles undergoing collision.

Here a linear-spring/dashpot [22] model is used along with a simple Coulomb sliding model accounting for the normal f^{col}_{n,b→a} and tangential f^{col}_{t,b→a} components of the contact force, respectively, as in the work of Capecelatro [21]. For one particle (or wall) b acting on a particle a:

$$f^{col}_{n,b\to a} = -k_n \delta_{ab}\, n_{ab} - 2\gamma_n M_{ab}\, u_{ab,n} \quad \text{and} \quad f^{col}_{t,b\to a} = \mu_{tan} \left\lVert f^{col}_{n,b\to a} \right\rVert t_{ab} \quad \text{if } \delta_{ab} > 0, \quad \text{0 else}. \quad (11)$$

Fig. 1 shows a representation of two colliding particles. k_n is the normal spring stiffness, γ_n is the normal damping parameter, and μ_{tan} is the friction coefficient. The term δ_{ab} = r_a + r_b − ‖x_b − x_a‖ is defined as the overlap between the a and b entities, expressed using each particle radius r_p and center coordinates x_p. The system effective mass M_{ab} is expressed as M_{ab} = (1/m_a + 1/m_b)^{-1}. The unit normal vector n_{ab} from particle a towards entity b and a unit tangential vector t_{ab} are defined using the particles' relative position and velocity. n_{ab} and t_{ab} are calculated as follows:

$$n_{ab} = \frac{x_b - x_a}{\lVert x_b - x_a \rVert} \quad \text{and} \quad t_{ab} = \frac{u_{ab} - u_{ab,n}}{\lVert u_{ab} - u_{ab,n} \rVert} \ \text{if } \lVert u_{ab} - u_{ab,n} \rVert > 0, \quad \text{0 else}. \quad (12)$$

The relative velocity of the colliding system at the contact point u_{ab} is written:

$$u_{ab} = \left(u_a - u_b\right) + \left(r_a \omega_a + r_b \omega_b\right) \wedge n_{ab}. \quad (13)$$

Its normal component is then given by u_{ab,n} = (u_{ab} · n_{ab}) n_{ab}. Using Newton's third law yields an analytical expression for the system's natural frequency ω_0 = \sqrt{k_n/M_{ab}} and the contact time [26]:

$$T_C = \frac{\pi}{\sqrt{\omega_0^2 - \gamma_n^2}}. \quad (14)$$

As the particles are all spherical with homogeneous density, the moment of inertia simply is I_p = m_p d_p^2/10, and the total torque M_C applied by all entities b in contact with a particle a only depends on the tangential component of the individual contact forces:

$$M_C = r_a \sum_{b=1}^{N_p + N_w} n_{ab} \wedge f^{col}_{t,b\to a}. \quad (15)$$

In case of a particle-wall collision, the wall is considered as a particle with infinite mass and null radius.
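A compact sketch of the pair-wise contact model of Eqs. (10)-(13) is given below for two spheres. It follows the sign conventions of Eq. (11) as written above; the function signature and the use of NumPy arrays are assumptions of this illustration, not the YALES2 implementation.

```python
import numpy as np

def pair_contact_force(xa, xb, ua, ub, wa, wb, ra, rb, ma, mb, k_n, gamma_n, mu_tan):
    """Linear spring-dashpot normal force and Coulomb sliding tangential force (Eqs. (11)-(13))
    exerted by entity b on particle a. Both components are zero when there is no overlap."""
    delta = ra + rb - np.linalg.norm(xb - xa)              # overlap delta_ab
    if delta <= 0.0:
        return np.zeros(3), np.zeros(3)
    n_ab = (xb - xa) / np.linalg.norm(xb - xa)             # unit normal from a towards b, Eq. (12)
    m_eff = 1.0 / (1.0 / ma + 1.0 / mb)                    # effective mass M_ab
    u_ab = (ua - ub) + np.cross(ra * wa + rb * wb, n_ab)   # relative velocity at contact, Eq. (13)
    u_n = np.dot(u_ab, n_ab) * n_ab                        # normal component u_ab,n
    u_t = u_ab - u_n
    t_norm = np.linalg.norm(u_t)
    t_ab = u_t / t_norm if t_norm > 0.0 else np.zeros(3)   # unit tangential vector, Eq. (12)
    f_n = -k_n * delta * n_ab - 2.0 * gamma_n * m_eff * u_n
    f_t = mu_tan * np.linalg.norm(f_n) * t_ab              # sign convention as in Eq. (11)
    return f_n, f_t
```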

The search for potential collision partners is accelerated by the use of a standard linked-cell data structure [27]. This Cartesian grid, superimposed on the unstructured Eulerian mesh, is dynamically computed. The description of this usual step has been omitted.
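A minimal sketch of such a linked-cell search is given below; the cubic cell size (typically at least one particle diameter) and the dictionary-based storage are assumptions of this illustration.

```python
from collections import defaultdict
import numpy as np

def build_linked_cells(positions, cell_size):
    """Bin particle indices into a Cartesian grid of cubic cells of edge `cell_size`."""
    cells = defaultdict(list)
    for i, x in enumerate(positions):
        cells[tuple((x // cell_size).astype(int))].append(i)
    return cells

def candidate_pairs(positions, cell_size):
    """Yield particle index pairs whose cells are identical or adjacent (27-cell stencil)."""
    cells = build_linked_cells(positions, cell_size)
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                        for i in members:
                            if i < j:   # each pair reported once
                                yield i, j
```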

3. Specific features of the YALES2 solver

In this section, some properties of the unstructured mesh partitioning used in YALES2 are highlighted. The specific data architecture strongly influences the methodologies that are to be discussed at a later stage, and is thus also presented. Further detail concerning these features can be found in [15].

3.1. Two-level domain decomposition for unstructured grids

As mentioned previously, the low-Mach number Navier–Stokes equations are solved on unstructured meshes in order to fully benefit from high-performance computing on massively parallel machines. A two-level domain decomposition (Double Domain Decomposition, abbreviated DDD) is employed and organized as follows: at a high level, mesh cells are dispatched over processors. It consists in splitting the computational domain into sub-meshes that are assigned to each computational core. At the lower level, at the processor scale, mesh cells are gathered in cell groups called ELement GRouPs (ELGRPs), as sketched in Fig. 2. This double domain decomposition allows for easily optimizing the use of processor memory for cache-aware algorithms and may also be exploited by deflation algorithms [28]. In 3D cases, ELGRPs typically contain O(10^3) cells. Following the same pattern, particles located in an ELGRP are stored in ParTicle GRouPs (PTGRPs) containing up to 500 particles each, here again to improve performance.

3.2. Data structures

Using DDD reinforces the need to work with a specific data structure. Indeed, in this context, each ELGRP stands for an individual mesh block, but communications occur between ELGRPs when computing gradients or for residual assembly. Thus, besides classical inter-processor connectivities, some geometrical elements, such as nodes, faces or edges, need to exchange data inside the core during the communication steps. Another data structure is therefore needed to connect the geometrical elements at the border of the ELGRPs. Rather than making use of a ghost cell method, an INTernal COMMunicator (INTCOMM) that contains a copy of all the nodes, faces or edges involved in the communications inside or outside the cores is deployed. EXTernal COMMunicators (EXTCOMMs) also contain their own copy of the nodes, faces and edges that are located at the interface between other computing cores. This architecture is depicted in Fig. 3, where the boundaries are also represented.

Fig. 2. Double Domain Decomposition (DDD). The highlighted elements are participating in the communications inside and outside each processor, and those surrounded in black are participating in the communications between processors. Extracted from [15].

Fig. 3. Internal (INTCOMM) and External (EXTCOMM) communicators corresponding to Fig. 2.

4. Parallelism management

Moving towards pilot-scale CFD/DEM simulations imposes that a satisfactory scalability on massively parallel machines be reached. Parallel simulations require special treatment for particles, as collisions might occur between some of them although they are dispatched on different processor domains that have no connection. To cope with this requirement, and in accordance with the data structures for unstructured meshes explored in Section 3.2, a ghost particle method is used in a Message Passing Interface (MPI) paradigm. MPI parallel domain decomposition is indeed an attractive option to parallelize CFD/DEM problems, especially with the emergence of massively parallel distributed memory systems and for its high scalability possibilities even for large numbers of processors. Note that a combination of a CFD code executed on CPUs (Central Processing Units) and a DEM code executed on GPUs (Graphics Processing Units) has been reported as a promising high-performance method for coupled CFD/DEM simulations [29]. This section tackles the design of an efficient parallel strategy using MPI domain decomposition.

The currently implemented global algorithm is sketched in Fig. 4 and can be summarized as follows: first, ghost particles are identified using a cell halo surrounding each processor domain. Next, the necessary data of particles belonging to the cell halo are packed and exchanged between the involved processors. Finally, after unpacking on each processor, ghost particles are treated as local particles to treat collisions. The following subsections detail each step of this algorithm. All tests have been run on the Birmingham fluidized bed case (see Section 6), on the Curie supercomputer from CEA in France. Unless otherwise specified, statistics were collected over 1 s of physical time, and started after having initiated fluidization for 2.5 s. It will be seen in Section 6 that these time scales are sufficient to ensure that both the bed height and the pressure loss across the bed are oscillating around their mean value. This leads us to think that this case is relevant enough for performance measurements.

Fig. 4. Global algorithm for parallelism management on unstructured meshes. The two first steps are detailed in this work.

Fig. 5. Flagging of one layer of elements at the interface of two processor domains. Left: on an unstructured mesh, the red particle p does not belong to any flagged element. Right: on a Cartesian mesh this instruction is sufficient to identify ghost particles. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. Flagging of layers of elements at the interface of two processor domains with a distance instruction. Left: on an unstructured mesh, all particles p are now identified. Right: on a Cartesian mesh this instruction leads to the same result.

4.1. Initialization: cell halo identification

The first step consists in defining a cell halo around each processor domain, in order to identify the closest particles on neighbor processors. A better selection will limit the number of ghost particles to be exchanged, thus facilitating the building of the collision partners' detection grid and increasing the collision force computation speed.

The cell halo identification can be straightforwardly achieved on regular Cartesian meshes, and in most cases one layer of cells is a sufficient criterion for the halo building and no collision can be missed. However, when dealing with mesh size heterogeneities encountered on o-grid or unstructured meshes for instance, this criterion can either lead to an excessive halo width sacrificing the code performance, or to missing many contacts, impacting the physical meaning of the simulation, as sketched in Fig. 5. This underlines the need for an adaptive element flagging method, which should ideally be based on an exact distance computation. As computing the exact distance between a mesh node and a processor domain border would involve numerous calculation steps that could necessitate inter-processor communications, the implemented approach uses an approximate distance. The algorithm is inspired by fast-marching algorithms developed for Level-Set Methods [30,31]. In particular, it relies on the mapping of the surface of the processor domain border using points called markers, whose coordinates are automatically generated. The general idea is to propagate these coordinates from node to node, from the processors' interface towards neighboring processors. Each step of this algorithm is thoroughly detailed in the Appendix B.

Performing these steps allows, at the beginning of a simulation, identifying the elements required for ghost particle treatment, as any element containing a node closer to the processor domain border than a particle radius is flagged (see Fig. 6). It also works on any mesh element type: tetrahedra, hexahedra, prisms, pyramids, and hybrid meshes.
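A much simplified sketch of this idea is given below: the coordinates of the nearest interface marker are propagated from node to node over the mesh connectivity graph, in the spirit of fast-marching methods, and the sweep stops once the approximate distance exceeds the halo width. The graph representation and the priority-queue sweep are assumptions of this illustration, not the actual implementation detailed in Appendix B.

```python
import heapq
import numpy as np

def approximate_wall_distance(node_xyz, node_neighbors, interface_markers, halo_width):
    """Propagate the coordinates of the closest interface marker from node to node and
    return, for each node, an approximate distance to the processor-domain border.
    Propagation stops once the distance exceeds `halo_width` (the particle radius here)."""
    dist = {n: np.inf for n in node_xyz}
    closest = {}
    heap = []
    for n in node_xyz:                      # seed: interface nodes carry one or more markers
        for m in interface_markers.get(n, ()):
            d = np.linalg.norm(node_xyz[n] - m)
            if d < dist[n]:
                dist[n], closest[n] = d, m
                heapq.heappush(heap, (d, n))
    while heap:                             # Dijkstra-like sweep re-evaluating the marker distance
        d, n = heapq.heappop(heap)
        if d > dist[n] or d > halo_width:
            continue
        for nb in node_neighbors[n]:
            d_nb = np.linalg.norm(node_xyz[nb] - closest[n])
            if d_nb < dist[nb]:
                dist[nb], closest[nb] = d_nb, closest[n]
                heapq.heappush(heap, (d_nb, nb))
    return dist
```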

As a result, Table 1 yields the relative CPU time measured for the main steps of the collision force computation algorithm on the slowest processor for different cell halo identification methods. The reference is the adaptive method studied in this section. The three other methods correspond to simpler ones in which the given instruction is to identify the closest, the two closest and the three closest layers of elements for the halo. For this test the Birmingham fluidized bed (see Section 6) was run on 512 processors. It should be noticed that in this case the particle radius targeted by the adaptive method is supposed to be smaller than the average mesh cell size; otherwise these results are expected to be different. The obtained values clearly indicate a strong dependency on the number of identified cells, as it obviously results in different quantities of ghost particles that have to be exchanged between processors, then located in the detection grid and finally treated for collisions. In the tested case it appears that the contact force computation is the most sensitive part of the algorithm. A thorough analysis of the identified cells reveals that the method using one layer of elements may identify additional cells in negative curvature border areas, hence the slightly better results obtained with the adaptive method.

To benefit from this efficient cell halo identification, a new data structure called ParTicle EXTernal COMMunicator (PTEXTCOMM) is created, playing the same role as the previously mentioned EXTCOMM but dedicated to the particles. At this point, the new architecture is sketched in Fig. 7. On each processor, one PTEXTCOMM is allocated for each other processor impacted during the cell halo definition, which will support point-to-point MPI exchanges during the simulation.

Table 1. Influence of the cell halo identification method on the CPU time for various operations. Relative CPU time for some key phases, measured on the slowest processor.

Method used              | Ghost particle communications | Potential collision partners identification | Particle-particle contact force computation
adaptive method          | 1.00                          | 1.00                                        | 1.00
one layer of elements    | 1.04                          | 1.07                                        | 1.35
two layers of elements   | 1.69                          | 1.66                                        | 1.98
three layers of elements | 2.63                          | 2.55                                        | 3.07

Fig. 7. Improved communicators structure containing Particle EXTernal COMMunicators (PTEXTCOMMs) and ghost ParTicle GRouPS (GHOST PTGRPS).

To make sure that for any processor #n sharing a PTEXTCOMM with a processor #m the reverse is true as well, a pre-communication step is performed when the last processor exits the cell halo definition loop. As shown in Fig. 7, each PTEXTCOMM can contain several dedicated PTGRPS called PGTS (Particle Group To Send), whose role is detailed in Section 4.2. They also contain information about elements flagged during cell halo building. It is important to notice that the list of PTEXTCOMMs can differ from the list of EXTCOMMs, as the cell halo size can locally exceed the mesh element size and thus impact distant processors.

4.2. Ghost particle treatment

When the collision force computation is needed, i.e. at each RK step, the particles' data have been updated and ghost particles have to be exchanged. Ghost particles are identified using the cell halo surrounding each processor domain, as shown in Fig. 8 for the processor ranked #1 in a cylindrical geometry discretized with an unstructured mesh. This halo has been defined in Section 4.1.

Before computing the collision force on a processor #p, the following treatment is applied:

1. The particles of any processor #n located in a mesh element which belongs to the halo of processor #p are copied into the corresponding PTEXTCOMM of #n and dispatched among its PGTS.
2. Each processor #n sends its PGTS to processor #p.
3. Processor #p stores all the received particles in ghost PTGRPs.

Eventually, ghost particles are treated by processor #p as local particles when computing contact forces, before all ghost particles are discarded to prepare for the next time step. Simulations of the Birmingham fluidized bed performed on 512 processors have shown that a naive coding of the ghost particle exchange could lead to it taking 45% of the total simulation time. This section focuses on the implementation of an efficient parallel strategy to reduce the cost of the second step of the previous algorithm.

A packing strategy, consisting in arranging all the PGTS of a PTEXTCOMM into a unique vector before sending it, is employed here to circumvent the latency problem of small messages (detailed in the Appendix C), as shown in Fig. 9. When the reception of all particle data packets is done, an unpacking step rebuilds the ghost PTGRPs from them. In order to avoid numerous memory (de)allocation operations, the allocated size of a pack is only enlarged when it is not sufficient and is never downsized, targeting buffer reuse. Fig. 10 displays the distribution of the number of MPI messages as a function of the message sizes for two strategies: the one without packing of the halo data corresponds to a naive coding in which, for each PGTS of a PTEXTCOMM, each particle data array is sent individually, as well as its size for preliminary memory allocation. The other strategy involves the aforementioned packing/unpacking of the halo data. Records come from runs performed on the Birmingham configuration on 512 processors over 30 solver iterations. It is clear that the naive coding leads to very large amounts of messages: approximately one thousand times more than the packing/unpacking strategy. These messages are also much smaller in the first strategy, roughly one thousand times smaller, and more than 40% of the total amount are 4 B messages, whereas the largest ones are 12 kB messages. As regards the packing strategy, the observed message sizes correspond to the packing of discrete numbers of PGTS, here ranging from one PGTS (representing 45 kB messages containing up to 500 particles), which are the most represented, up to twenty-nine PGTS (representing 1.28 MB messages containing up to 14,500 particles).

Fig. 8. Ghost particle method principle shown for a part of a cylindrical domain. The different processor domains are colored accordingly. The processor of interest is ranked #1 and its closest neighbors are ranked #2, #3, #4, #5 and #6. A particle entering the white cell halo around #1 will be sent to #1 as a ghost particle by the processor it belongs to.

Fig. 9. Before packing, a PTEXTCOMM contains two PGTS. Each PGTS is composed of as many arrays as particle data. The packing consists in arranging all these arrays in one unique vector which is the pack to send, thus simplifying MPI communications.

Fig. 10. Distribution of the number of MPI messages as a function of the message sizes for two strategies: without packing of the halo data, and with packing of the halo data. Records come from runs performed on the Birmingham configuration on 512 processors over 30 solver iterations. The sums under each curve indicate the total amount of messages exchanged.
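The packing idea can be sketched with mpi4py as follows: all particle-data arrays of all PGTS of one PTEXTCOMM are concatenated into a single contiguous buffer, so that one size message and one data message are exchanged per neighbor instead of one message per array. The container names and the float64 layout are assumptions of this illustration.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def pack_ptextcomm(pgts):
    """Concatenate every particle-data array of every PGTS into one flat float64 buffer."""
    chunks = [np.asarray(arr, dtype=np.float64).ravel() for group in pgts for arr in group]
    return np.concatenate(chunks) if chunks else np.empty(0, dtype=np.float64)

def exchange_halo(pgts_to_send, neighbor_ranks):
    """One pack per neighbor: exchange sizes first, then the packs themselves."""
    packs = {r: pack_ptextcomm(pgts_to_send[r]) for r in neighbor_ranks}
    sizes = {r: np.array([packs[r].size], dtype=np.int64) for r in neighbor_ranks}
    recv_sizes = {r: np.empty(1, dtype=np.int64) for r in neighbor_ranks}
    reqs = [comm.Isend([sizes[r], MPI.INT64_T], dest=r, tag=0) for r in neighbor_ranks]
    reqs += [comm.Irecv([recv_sizes[r], MPI.INT64_T], source=r, tag=0) for r in neighbor_ranks]
    MPI.Request.Waitall(reqs)
    recv_bufs = {r: np.empty(int(recv_sizes[r][0]), dtype=np.float64) for r in neighbor_ranks}
    reqs = [comm.Isend([packs[r], MPI.DOUBLE], dest=r, tag=1) for r in neighbor_ranks]
    reqs += [comm.Irecv([recv_bufs[r], MPI.DOUBLE], source=r, tag=1) for r in neighbor_ranks]
    MPI.Request.Waitall(reqs)
    return recv_bufs  # to be unpacked into ghost PTGRPs on the receiver's side
```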

The total time spent in the communications is estimated as a function of the message size in Fig. 11. The total time is calculated as follows for the strategy without packing, where k denotes a message size:

$$\text{total time}(k) = \tau_{latency}(k) \times \text{number of messages}(k), \quad (16)$$

where τ_{latency} is the global latency, and as follows for the strategy with packing:

$$\text{total time}(k) = \left[\tau_{latency}(k) + \tau_{pack}(k) + \tau_{unpack}(k)\right] \times \text{number of messages}(k), \quad (17)$$

where τ_{pack} and τ_{unpack} stand for the CPU cost of the packing and unpacking operations, respectively. The performances of the Curie supercomputer's network are assessed in the Appendix C in order to quantify τ_{latency}(k), as well as τ_{pack}(k) and τ_{unpack}(k). Here the global latency is taken from extra-node communications (see Fig. C.35), and the packing/unpacking costs are taken with preliminary particle data selection (see Fig. C.36).

It can be observed that even when accounting for the cost of the packing and unpacking steps of each message, the second strategy is still approximately 12 times quicker than the naive coding without data packing. It should be noted that these calculations only give maximum times because the underlying hypothesis is that exchanges only occur one at a time, while in a real simulation some are done simultaneously. It can also be argued that even on an ideal network with null latency and infinite bandwidth, messages cannot be treated concurrently at the time of their reception, hence additional contention that should be avoided. Eventually, these results are all in favor of an MPI strategy involving fewer data packets to exchange. Simulations of the Birmingham fluidized bed running on 512 processors show that MPI communications could represent up to 45% of the total simulation cost without special treatment of the data exchanges. Results using the presented packing/unpacking strategy exhibit a communication cost divided by 3, allowing the overall simulation to run 30% faster.

Fig. 11. Theoretical maximum time spent in communications as a function of the message sizes for two strategies: without packing of the halo data, and with packing of the halo data. As an indication, the case with packing but only accounting for the cost of communications is also shown. Records come from runs performed on the Birmingham configuration on 512 processors over 30 solver iterations. The sums close to each curve indicate the total amount of time.

To accelerate the treatment of data packets, a fully asynchronous algorithm featuring computation/communication overlap is implemented, as sketched in Fig. 12. The objective of such a method is to perform on-the-fly packing and unpacking operations in order to overlap them with the communication times due to global latency. It is divided into two nested parts, the first one being the exchange of pack sizes to allocate the necessary memory on the receiver's side, and the second one being the exchange of the actual packs. The main idea is to probe the non-blocking receive requests in order to perform the packing and unpacking operations as soon as some data are available, while waiting for the next ones to be completed. By checking the size of a pack to send and the one of the pack to receive, any PTEXTCOMM empty of particles is discarded from the second part of the algorithm, as well as all the PTEXTCOMMs that would have to exchange particles with it.
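A stripped-down mpi4py sketch of the overlap idea is shown below: non-blocking receives are posted up front and each incoming pack is unpacked as soon as it completes, while the remaining messages are still in flight; empty exchanges are skipped beforehand. The two-phase structure (sizes exchanged earlier, then packs) follows the description above, while the function names and the `unpack` callback are illustrative assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def exchange_and_unpack(packs, neighbor_ranks, recv_sizes, unpack):
    """Non-blocking pack exchange with on-the-fly unpacking (communication/computation overlap).
    `packs[r]` is the flat buffer to send to rank r, `recv_sizes[r]` the already-exchanged
    incoming size, and `unpack(rank, buffer)` rebuilds the ghost PTGRPs."""
    active = [r for r in neighbor_ranks if packs[r].size > 0 or recv_sizes[r] > 0]
    send_reqs = [comm.Isend([packs[r], MPI.DOUBLE], dest=r, tag=1)
                 for r in active if packs[r].size > 0]
    recv_ranks = [r for r in active if recv_sizes[r] > 0]
    recv_bufs = {r: np.empty(recv_sizes[r], dtype=np.float64) for r in recv_ranks}
    recv_reqs = [comm.Irecv([recv_bufs[r], MPI.DOUBLE], source=r, tag=1) for r in recv_ranks]
    pending = list(range(len(recv_reqs)))
    while pending:
        # Wait for whichever reception completes first and unpack it immediately,
        # overlapping the unpacking work with the remaining in-flight messages.
        idx = MPI.Request.Waitany(recv_reqs)
        unpack(recv_ranks[idx], recv_bufs[recv_ranks[idx]])
        pending.remove(idx)
    MPI.Request.Waitall(send_reqs)
```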

Theoretically, this algorithm should reveal its full potential in cases where the packing and unpacking computational costs are close to the global latency. Indeed, in this configuration, the packing and unpacking operations can occur between two reception completions without any time loss. The capacity of this algorithm to provide computation/communication overlap can be assessed by comparing its performance with that of a blocking coding also including the packing and unpacking features. In the latter case, pack sizes are all exchanged in an orderly manner before packing is carried out, then actual packs are treated the same way before unpacking is performed. Run on 512 processors, the Birmingham fluidized bed case demonstrates that the computational cost for ghost particle treatment decreased by a factor 2.6 when using the asynchronous method along with packing/unpacking, compared to packing/unpacking with blocking communications, therefore providing a gain of 11.5% on the overall simulation time.

5. Complex boundaries management

As industrial systems often contain non-planar boundaries, such as cylindrical parts or more complex elements like pipe junctions, a special treatment is required to treat particle-wall contacts. Several options have been proposed by different authors to address this problem. Among them, the simplest method is that of glued particles used to approximate geometric surfaces and thus treat particle-wall interactions in the same fashion as particle-particle interactions ([32], [33]). However, this simplification suffers from a lack of accuracy as it doesn't represent complex shapes exactly, especially in the vicinity of convex parts. It can also result in uncontrolled wall roughness and larger computational overheads associated with the use of additional particles [34]. Further coding effort can also be needed for surface particle generation [35].

The explicit methods for the treatment of the contact between complex objects can be of two types. "Simplex-based" algorithms treat a polyhedron as the convex hull of a point set and perform operations on simplices defined by subsets of these points [14]. Among these, the Discrete Function Representation algorithm proposed by Williams [36] allows the treatment of numerous varieties of shapes but may imply a fine discretization with a consequent set of points for edgier bodies. The iterative algorithm originating from the work of Gilbert, Johnson & Keerthi [37], which has served as a basis for several other methods, may be the most famous representative of this type of methods. On the contrary, "feature-based" algorithms treat a polyhedron as a set of points, segments and faces. The finite wall method studied by Kremmer [38] is a good candidate, but starts with the assumption that the boundary surfaces can be discretized into triangular elements, the positions and dimensions of which are known and controllable, which is not the case in general CFD simulations. It also requires an empirical "shrink factor" to be defined. The popular algorithm suggested by Lin and Canny (Lin–Canny algorithm) [12], implemented in the I-Collide [13] collision detection package, is a "feature-based" algorithm designed for arbitrary complex 3D polyhedra collisions. It is based on the existence of a unique decomposition of the wall geometry into Voronoi regions. The Lin–Canny algorithm raises problems due to its lack of robustness, and is not able to return the measure of the penetration depth, therefore it is not suited to a soft-sphere model implementation. It has been improved by Mirtich (Voronoi-clip algorithm) [14] in order to overcome these limitations, however it can still only treat spheres by tessellating them. Note that analytical contacts can be elegantly resolved for some particular shapes [39], but to the authors' knowledge, this option offers few prospects for general 3D applications.

Here is thus proposed an algorithm for detecting the interactions between a spherical particle and an arbitrarily complex geometric surface and mesh in the framework of the DEM, consistent with massive parallelism. This last point is of particular importance, as this aspect is seldom addressed in the literature. It relies on the fact that a particle can collide with only three types of geometrical entities: either a vertex (V), an edge (E), or a face (F), or with any combination of these objects simultaneously in any fashion. It thus belongs to the "feature-based" algorithms. It doesn't require any input parameter nor preprocessing of the geometry, and doesn't use any iterative process. It is also only based on the present state of the contact configuration (it is an "exhaustive" scheme [36]), and also relies on a Voronoi decomposition. The global algorithm is sketched in Fig. 13. As for most of the abovementioned methods, a first phase of spatial sorting seeks to avoid an all-to-all body comparison by culling the number of objects which are potential contactors of a given particle. In a further stage, all possible contact conditions, including contact with Fs, Es and Vs (Faces, Edges and Vertices), are explicitly determined. The following subsections detail these steps the other way around, as the last ones are actually at the core of the method. All tests have been run on the Curie super-computer from CEA in France. Unless otherwise specified, statistics were collected over 1 s of physical time, and started after having initiated fluidization for 2.5 s.

Fig. 12. Flow chart of the final asynchronous algorithm for ghost particle treatment featuring packing/unpacking of the halo data and communication/computation overlap, with the pack sizes communication parts and the pack communication parts distinguished by color.

5.1. Use of Voronoi regions

Voronoi regions are used for their ability to yield an object's closest boundary feature(s) and then the shortest distance between this object and the boundary. The definition of the Voronoi regions for several geometrical features is given: for a feature X ∈ {F, E, V} on a polyhedron, the Voronoi Region VR(X) is the set of points outside the polyhedron that are closer to X than to any other feature on the polyhedron. The Voronoi regions collectively cover the entire space outside the polyhedron. Examples of VR are shown in Fig. 14.

It stems from the building of the boundary features' VRs that:

- The number of planes delimiting VRs(F) is equal to the number of edges of F, say three for a triangle, and the normals to each of these planes are given by the normals of each edge contained in the plane defined by F.
- The number of planes delimiting VRs(V) is equal to the number of edges connected to V, and the normals to each of these planes are simply equal to the direction vectors carried by each edge. This number can a priori reach any value.
- VRs(E) are all limited by four planes. The normals to two of them are given by the normals of E contained in the planes defined by each face connected by E. The two others are obtained by taking the direction vector of E and its opposite.

Fig. 13. Global algorithm for arbitrarily complex geometries management.

Fig. 14. Voronoi regions of convex node (VR(V)), convex edge (VR(E)) and face (VR(F)).

It appears that the knowledge of the VR normals of each boundary feature should be sufficient to identify an object's closest boundary feature in convex parts, and eventually, all VRs(E) and VRs(V) normals are built and stored. The VR normals of the features that are common to several processors are entirely known to each of these processors. To identify an object's closest point on the boundary, the following methodology is retained. Here the example of a particle of radius r_p and center P approaching a boundary composed of several triangular faces is taken:

1. Projection P′ of the point P onto the plane defined by the first boundary face F.
2. Computation of the distance d_pF between P and P′.
3. In case of overlap (d_pF < r_p), determination of whether P′ belongs to F or not. To this end, the coordinates of P′ are expressed in the face's barycentric coordinates. The full description of the operation is in the Appendix D.
4. In case P′ belongs to F, then the particle is actually colliding with F, its shortest distance to the boundary is d_pF and the contact treatment can be applied (see Section 5.2). In this case the algorithm moves on to the next boundary face. Otherwise the contact between P and any E ∈ F has to be checked.
5. To check a particle-edge contact, the distance d_pE between the particle and the line defined by the direction vector of E is calculated first.
6. In case of overlap (d_pE < r_p), the belonging of P to VR(E) is checked by performing dot products between each VR(E) normal and the appropriate vector for P, as sketched in Fig. 15, so that:

$$P \in VR(E) \quad \text{if} \quad \forall i \in [1; 4], \quad p_i \cdot e_i < 0. \quad (18)$$

For an edge, p_3 is equal to p_2. To quicken these operations, P is first assumed to belong to VR(E), then each dot product is consecutively checked and the test ends as soon as one gives a positive result.
7. In case P ∈ VR(E), the particle is actually colliding with E, its shortest distance to the boundary is d_pE and the contact treatment can be applied (see Section 5.2). In this case the algorithm moves on to the next boundary face. Otherwise the contact between P and any V ∈ F has to be checked.
8. To check a particle-vertex contact, the distance d_pV between the particle and V is calculated first.
9. In case of overlap (d_pV < r_p), the belonging of P to VR(V) is checked by performing dot products between each VR(V) normal and the appropriate vector for P, as sketched in Fig. 15, so that:

$$P \in VR(V) \quad \text{if} \quad \forall i \in [1; \text{number of edges connected to } V], \quad p_i \cdot e_i < 0. \quad (19)$$

To quicken these operations, P ∈ VR(V) is first assumed to be true, then each dot product is consecutively checked and the test ends as soon as one gives a positive result.
10. In case P ∈ VR(V), the particle is actually colliding with V, its shortest distance to the boundary is d_pV and the contact treatment can be applied (see Section 5.2). In this case the algorithm moves on to the next boundary face.

Fig. 15. On the left, VR(V) normals v_1, v_2 and v_3 are shown along with the approaching particle's corresponding vectors p_1, p_2 and p_3 for dot product calculations. On the right, VR(E) normals e_1, e_2, e_3 and e_4 are shown along with the approaching particle's corresponding vectors p_1, p_2 and p_4 for dot product calculations. p_3 is equal to p_2.

This algorithm allows several simultaneous contacts with any kind of boundary feature in convex geometrical parts. Furthermore, in concave areas such as the one depicted in Fig. 16, the vector orientation invariably prevents the particle from accounting for the concave feature E for collision, while allowing both side faces, which has a physical meaning. Fig. 16 also reveals a good behavior of the algorithm in more complex cases that can occur for nodes which have more than three connected edges. By providing suitable exit conditions, it also prevents contacts from being detected with several entities belonging to the same face: indeed, when a contact is going to be treated between a particle and the face F, no further tests are performed for Es and Vs ∈ F. Also, as soon as a particle is found overlapping an E ∈ F, the remaining Es and Vs are discarded and the algorithm repeats for the next face. In a last case, as soon as a particle is found overlapping a V ∈ F, the remaining Vs are discarded and the algorithm is repeated for the next face. These exits are thus essential for computational cost saving. Lastly, it appears that for contacts with nodes and edges, each connected face can detect the contact, resulting in a repulsion force accounted for several times instead of once. To cope with this limitation, all the contact forces exerted by edges are divided by two, while contact forces exerted by nodes are divided by the node's number of connected faces. This multiplicity is computed in a parallel fashion.

Fig. 16. Left part: classical concave case. The particle is overlapping faces F1, F2 and edge E in pale red areas. The algorithm for the use of VRs will detect that P ∈ VR(F1), P ∈ VR(F2) but P ∉ VR(E). Hence, repulsion forces on the particle will be calculated for F1 and F2 even if the particle actually overlaps E. Right part: neither convex nor concave case featuring an angle > 180° and an angle < 180°. The particle is overlapping all edges and vertex V. The algorithm will detect that P ∈ VR(E1), P ∈ VR(E2) but P ∉ VR(V), P ∉ VR(E3) and P ∉ VR(E4). Hence, repulsion forces on the particle will be calculated for E1 and E2 only. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

5.2. Contact resolution

Having identified the type of contact point for a particle, the algorithm finishes with the calculation of the effective repulsive forces and torques exerted on the particle. As shown in Eq. (10), the total collision force exerted on a particle in contact with a wall is taken as the sum of the forces exerted by each colliding feature of the wall. These particle-wall forces are composed of normal and tangential components written in the same fashion as for particle-particle contacts (see Eq. (11) and Eq. (15)), by treating the wall parts as a particle with null radius and infinite mass. The parameters k_n, γ_n and μ_{tan} can be set to different values depending on the type of contact, either particle-particle or particle-wall. To describe the repulsion force exerted on a particle by a boundary feature, a unit normal vector and a measure of the interpenetration distance (overlap) between the sphere and the wall element are to be yielded. As in a majority of works, contacts are here treated considering a unique contact point, even though the actual overlapping parts may reveal an extensive set of possibilities.

Referring to the different contact cases sketched in Fig. 17, the treatment of particle-face contact is trivial and consists in building a unit vector n_pF normal to the face plane and an overlapping distance δ_pF. Let N_E be the number of edges of a face; n_pF and δ_pF are obtained by:

$$n_{pF} = \frac{n^*_{pF}}{\lVert n^*_{pF} \rVert} \quad \text{with} \quad n^*_{pF} = \frac{1}{N_E} \sum_{i \in N_E} e_i \wedge e_{i+1}, \quad (20)$$

$$\delta_{pF} = r_p - \left(x_F - x_p\right) \cdot n_{pF}, \quad (21)$$

where e_i is the direction vector of edge E_i, oriented in such a manner that e_i ∧ e_{i+1} yields a vector oriented towards the outside of the domain, and x_F is the face center coordinates. As n_pF is unique, its value is stored in the appropriate data structure. The treatment of particle-edge contact consists in building a unit vector n_pE normal to the edge, oriented from the particle center to the edge, and an overlapping distance δ_pE. Let E_i be this edge, composed of points A and B:

$$n_{pE} = \frac{n^*_{pE}}{\lVert n^*_{pE} \rVert} \quad \text{with} \quad n^*_{pE} = \left[\left(x_p - x_A\right) \cdot e_i\right] e_i - \left(x_p - x_A\right), \quad (22)$$

$$\delta_{pE} = r_p - \left[\left(\left(x_p - x_A\right) \cdot e_i\right) e_i - \left(x_p - x_A\right)\right] \cdot n_{pE}, \quad (23)$$

where e_i = x_B − x_A is the direction vector of E_i. The treatment of particle-vertex contact consists in building a unit vector n_pV oriented from the particle center to the vertex and an overlapping distance δ_pV such that:

$$n_{pV} = \frac{n^*_{pV}}{\lVert n^*_{pV} \rVert} \quad \text{with} \quad n^*_{pV} = x_V - x_p, \quad (24)$$

$$\delta_{pV} = r_p - \left(x_V - x_p\right) \cdot n_{pV}, \quad (25)$$

where x_V are the vertex coordinates. In this formalism, it can be noticed that the resolution of a particle-edge contact is tantamount to a particle-face contact whose face plane would be orthogonal to n_pE. Equally, the resolution of a particle-vertex contact is tantamount to a particle-face contact whose face plane would be orthogonal to n_pV.

Fig. 17. A particle of radius r_p with center coordinates x_p overlapping (here unreasonably) a vertex (left), an edge (center) and a face (right). Unit normal vectors considered for collision force computation are indicated by n_pV, n_pE and n_pF, respectively. Interpenetration distances are indicated by δ_pV, δ_pE and δ_pF, respectively. For these three features, the method used considers the contact in the same manner as the one with the pale red plane, which is orthogonal to the normal vector and contains the boundary feature. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 18. Non-dimensional overlap (left) and non-dimensional normal velocity (right) as a function of the non-dimensional time during a particle-face, a particle-edge and a particle-vertex contact. The RK2 method with Δt_p = T_C/100 is used. Comparison with the analytical method.

The time evolution of a particle's overlap and translational velocity during the contact with a face, an edge and a vertex have been plotted in Fig. 18. Results are tested against the following non-dimensional analytical solution of the contact equation (see [4]) and exhibit the expected behaviors:

$$\delta^*_{ab}(t) = \frac{\delta_{ab}(t)}{\delta^{max}_{ab}} = \frac{\omega_0}{\Omega_0} \exp\left[\gamma_n\left(\frac{1}{\Omega_0} \arcsin\left(\frac{\Omega_0}{\omega_0}\right) - t\right)\right] \sin\left(\Omega_0 t\right), \quad (26)$$

$$u^*_{ab,n}(t) = \frac{u_{ab,n}(t)}{u^0_{ab,n}} = \frac{1}{\Omega_0}\, e^{-\gamma_n t}\left[\Omega_0 \cos\left(\Omega_0 t\right) - \gamma_n \sin\left(\Omega_0 t\right)\right], \quad (27)$$

with

$$\Omega_0 = \sqrt{\omega_0^2 - \gamma_n^2}, \quad (28)$$

and u^0_{ab,n} being the particle's initial normal velocity.
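Eqs. (26)-(28) can be transcribed directly for verification purposes, for instance to reproduce the analytical curves of Fig. 18; the parameter values in the short check below are illustrative only.

```python
import numpy as np

def analytical_contact(t, omega0, gamma_n):
    """Non-dimensional overlap and normal velocity during a linear spring-dashpot contact,
    Eqs. (26)-(28), with Omega0 = sqrt(omega0^2 - gamma_n^2)."""
    Omega0 = np.sqrt(omega0**2 - gamma_n**2)
    t_max = np.arcsin(Omega0 / omega0) / Omega0        # instant of maximum overlap
    delta_star = (omega0 / Omega0) * np.exp(gamma_n * (t_max - t)) * np.sin(Omega0 * t)
    u_star = np.exp(-gamma_n * t) * (Omega0 * np.cos(Omega0 * t)
                                     - gamma_n * np.sin(Omega0 * t)) / Omega0
    return delta_star, u_star

# Illustrative check over one contact time T_C = pi/Omega0:
omega0, gamma_n = 1.0e4, 2.0e3
T_C = np.pi / np.sqrt(omega0**2 - gamma_n**2)
t = np.linspace(0.0, T_C, 5)
print(analytical_contact(t, omega0, gamma_n))
```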

5.3. Spatial sorting and null object detector

Numerous works interested in collisions involving complex shapes report drastic costs, and it clearly appears that setting a list of potential contacts between objects is of paramount importance to prevent a vast majority of useless operations from being performed. In this regard, various methods of spatial sorting such as the grid method, the octree technique, and the body-based approach have been reported in the literature [36]. This overcost is particularly pronounced in the case of so-called exhaustive contact schemes, which make no a priori assumptions about the problem evolution and reason based only on the present state of the geometry, such as the presented approach. As an example, the case of the Birmingham fluidized bed run on 512 processors shows that more than 99.9% of the total CPU time would be dedicated to the treatment of boundaries in a brute-force approach for which all particles have to check collisions with every boundary face. However, a deeper analysis of this case indicates that only 1% of the particles are actually colliding with a boundary feature at a given instant, thus promising improvement prospects if a spatial sorting step is used to discard particles distant from the walls. In this case, it also reveals that each colliding particle hardly hits more than one object at a given instant, say barely 0.0005% of the total amount of boundary features, approximately. A part of the algorithm referred to as the null object detector is thus mandatory in order to quickly discard a particle's farthest objects. First, the null object detector is described.

Fig. 19. Cylindrical domain meshed with tetrahedra. A Boundary Face Group (BFG) with its Bounding Sphere (BSBFG) is represented in blueish colors. Some orange Bounding Sphere of Faces (BSFs) are also visible.

A priori, the VR belonging tests introduced in Section 5.1 would have to be performed for each particle and for each V, E and F of all boundaries. To minimize the cost of this search, a new data structure is created from groups of adjacent wall faces belonging to the same boundary, called Boundary Face Groups (BFG, see Fig. 19). This additional coloring is obtained thanks to the METIS library [40]. The associated improved data structure is sketched in Fig. 20. These BFGs contain all the necessary boundary metrics and connectivities along with the VR normals, computed in a parallel fashion. They also support other relevant geometrical data that are used for quick distance checking:

- The center x_BSBFG and radius r_BSBFG of each BFG Bounding Sphere (BSBFG) are computed using the BFG nodes' mean coordinates and the distance between the center and the most remote BFG node, respectively, and stored.
- The center x_BSF and radius r_BSF of each Bounding Sphere of Face (BSF) are also computed using the face barycenter and its distance to the farthest node of the face, respectively, and stored.

These preliminary operations find their justification in the fact that checking the intersection between two spheres is simple, but also among the quickest tests. In the literature, this is often referred to as the "sphere-tree" technique [41]. It consists in prioritizing the tests by using sets of spheres that describe the three-dimensional surface of an object at different levels of detail. In this study, a two-level hierarchy is employed: the BSBFGs stand for the first level, each one composed of several BSFs, which constitute the second level.

Fig. 20. Improved communicators structure containing Particle External communicators (PTEXTCOMMs) and Boundary Face Groups (BFGs).

Because of the various geometrical and mesh configurations that can occur in complex cases, an object may be found very close to a boundary that doesn't share elements with the processor it is located in. To cope with this fact, it is conceivable to dispatch all BFGs on all processors, so that no omission is allowed. Implementing this solution could however result in unnecessary tests, all the more since most particles reside in the bottom of the reactor in the majority of fluidized-bed systems. The option selected is to rely on the list of processors sharing PTEXTCOMMs. By using an additional method resembling the one explored in Section 4.1, this makes it possible to identify the closest BFGs of the closest processors. The identified BFGs are then exchanged to constitute the ghost BFGs of each processor. In the case of a static mesh, these calculations are performed once during the solver initialization. As a preliminary analysis to the algorithm introduced in Section 5.1, the following method, referred to as the null object detector, allows identifying a particle's closest boundary faces relying on the local and ghost BFGs:

1. The particle of interest p, of center coordinates x_p and radius r_p, loops over local and ghost BFGs. Distant BFGs are discarded if ‖x_BSBFG − x_p‖² > (r_BSBFG + r_p)².
2. For each intersected BFG, p loops over all its faces' BSF. Distant faces are discarded if ‖x_BSF − x_p‖² > (r_BSF + r_p)².
3. Eventually, only the faces whose bounding sphere intersects p are treated by the algorithm presented in Section 5.1.

All x_BSBFG, r_BSBFG, x_BSF and r_BSF having been precalculated, only squared distances have to be quantified during the run, thus avoiding costly square roots.
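The two-level, square-root-free culling can be sketched as follows; the dictionary-based layout of the BFG data is an assumption of this illustration.

```python
import numpy as np

def spheres_intersect_sq(x_a, r_a, x_b, r_b):
    """Sphere-sphere intersection using squared distances only (no square root)."""
    d = x_a - x_b
    return np.dot(d, d) <= (r_a + r_b) ** 2

def candidate_faces(x_p, r_p, bfgs):
    """Null object detector: keep only the faces whose bounding sphere intersects the
    particle, after discarding whole Boundary Face Groups by their bounding sphere.
    `bfgs` is a list of dicts with keys 'center', 'radius', 'faces' (center, radius, id)."""
    kept = []
    for bfg in bfgs:
        if not spheres_intersect_sq(x_p, r_p, bfg["center"], bfg["radius"]):
            continue                      # the whole BFG is too far from the particle
        for (x_bsf, r_bsf, face_id) in bfg["faces"]:
            if spheres_intersect_sq(x_p, r_p, x_bsf, r_bsf):
                kept.append(face_id)      # only these faces go through the VR tests of Section 5.1
    return kept
```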

A spatial sorting step is added in order to prevent unnecessary bounding sphere intersection tests. Indeed, an optimal sorting would be able to discard all the particles whose distance to the wall exceeds their radius. In the same fashion as the cell halo identification dealt with in Section 4.1, this very first step focuses on flagging layers of mesh elements covering physical boundaries during the simulation initialization, so that only the particles located in these elements will be treated by the previous null object detector during the run. The problem can thus be formulated in the same terms: this close-boundary element flagging can be easily operated on Cartesian meshes, but requires further coding effort to deal with unstructured meshes, as mesh size heterogeneities are frequently encountered. Here again, the first option consists in flagging successive layers of cells in order to ensure sufficient identification, but without yielding certainty on the distance criterion, this method can result in the flagging of numerous unwanted cells in addition. On the contrary, the implemented approach focuses on local exact wall distance calculation, allowing the flagging of more elements in refined areas and fewer in coarse ones. The objective is to compute the distance between some interior mesh nodes and the wall features to deduce whether a mesh element has to be flagged or not, relying on the previous null object detector, the VRs introduced in Section 5.1 and the contact resolution presented in Section 5.2. Each of these steps is detailed in the Appendix E. In the case of a static mesh, these steps are performed once during the solver initialization. Mean results obtained from simulations of 1 s of physical time of the Birmingham fluidized bed run on various numbers of processors show that 91% of the particles are eliminated by the spatial sorting test. Then, each near-wall particle intersects 2.5 BSF on average thanks to the null object detector, thus drastically reducing the number of costly VR tests to perform. Eventually, these gains in selectivity enable the slowest processor to spend approximately 4% of its computation time in the treatment of boundary contacts.

The definitive procedure for particle-boundary contact treatment, involving the flagging of the boundaries' closest elements, the null object detector, the use of Voronoi regions and the contact resolution, is displayed in Fig. 21.
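Reusing the helpers sketched above, a simplified per-particle driver for this procedure could look as follows. It is only a sketch: the overlap and normal are obtained here from the closest point on each candidate triangle, which for a sphere is equivalent to identifying the closest feature (F, E or V) through the VRs, and the force model is reduced to the linear spring term, without damping, friction or the handling of features shared by several faces.

```python
def boundary_contact_force(x_p, r_p, in_flagged_element, bfgs, k_n):
    """Normal boundary-contact force on one particle (sketch only);
    reuses null_object_detector and closest_point_on_triangle above."""
    f = np.zeros(3)
    if not in_flagged_element:          # spatial sorting: far-from-wall particle
        return f
    for face in null_object_detector(x_p, r_p, bfgs):
        q = closest_point_on_triangle(x_p, *face.vertices)   # closest wall feature
        d = x_p - q
        dist = np.linalg.norm(d)
        overlap = r_p - dist
        if overlap > 0.0 and dist > 0.0:                      # contact resolution
            f += k_n * overlap * (d / dist)                   # linear spring (repulsive)
    return f
```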

Note that the parts concerning the Voronoi region management and the contact resolution presented here could be considered as particular cases of the collision of two complex-shaped objects (two non-spherical particles for instance). In this latter case, Voronoi regions are required on both colliding objects to find the pair of closest features, then compute their overlapping distance and normal vector [14]. One of these objects being a sphere in our case, some simplifications arise. Many existing parts of the current algorithm could be useful and directly applicable in more complex cases, for instance when considering the bounding sphere of non-spherical particles for quick discarding tests.

As an illustration, numerical simulations were performed to measure the solid mass flow rate W across the orifice of diameter D_0 of an hourglass meshed with tetrahedra, for six values of particle diameter d_p ranging from 7.5% to 15% of D_0. No fluid phase was accounted for in these simulations. Results were compared to the empirical law of Beverloo [42], frequently encountered in silo or hopper discharge studies, which can be written:

$W = C\,\rho_p \sqrt{g}\,\left( D_0 - k\, d_p \right)^{5/2},$   (29)

where C and k are empirical discharge and shape coefficients, respectively. The comparison is shown in Fig. 22, for which the constant C was set to a classical value of 0.55 [42]. In order to extract a value for k in this configuration, the following form of the law of Beverloo is plotted:

$\dfrac{1}{D_0} \left( \dfrac{W}{C\,\rho_p \sqrt{g}} \right)^{2/5} = 1 - \dfrac{k\, d_p}{D_0},$   (30)

with which the numerical results exhibit a good agreement for k ≈ 1.18 (see Fig. 22).
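To illustrate how k is obtained from the slope, the short sketch below synthesizes discharge rates from Eq. (29) with a known shape coefficient and recovers it by a linear fit of Eq. (30). All numerical values (orifice diameter included) are placeholders chosen for the illustration, not the data of Fig. 22.

```python
import numpy as np

C, rho_p, g = 0.55, 740.0, 9.81                 # discharge coefficient, particle density, gravity
D0, k_true = 20e-3, 1.18                        # illustrative orifice diameter [m] and shape coeff.
dp = D0 * np.linspace(0.075, 0.15, 6)           # particle diameters, 7.5% to 15% of D0

W = C * rho_p * np.sqrt(g) * (D0 - k_true * dp)**2.5      # Beverloo mass flow rate, Eq. (29)
lhs = (W / (C * rho_p * np.sqrt(g)))**0.4 / D0            # left-hand side of Eq. (30)

slope, intercept = np.polyfit(dp / D0, lhs, 1)
print(f"k = {-slope:.2f} (intercept = {intercept:.2f})")  # -> k = 1.18, intercept = 1.00
```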


Fig. 21. Flow chart of the definitive procedure followed by each particle for boundary contact treatment, adapted for arbitrarily complex walls. The colored boxes of the chart distinguish the initializing part and output, the spatial sorting, the null object detector, the contact resolution parts and the VR tests.

6. The Birmingham gas-fluidized bed

6.1. Configuration

All the performance measurements assessed in the previous sections were performed for an isothermal dense gas-fluidized bed studied experimentally at the University of Birmingham, which was previously simulated using a TFM approach [20]. This pressurized lab-scale reactor is axisymmetric and composed of a cylindrical column of internal radius R = 77 mm widening to an internal radius of 127 mm. The vertical distance between the horizontal gas fluidization distributor plate and the top of the exhaust, corresponding to the computational domain, is 1.75 m. Nitrogen enters the distribution plate with a fluidization velocity u_inlet of 0.32 m/s and the pressure in the fluidized bed is 12 bar. The particle phase is almost monodisperse with a median diameter of 875 μm and a material density of 740 kg/m³. The reactor is filled with approximately 9.6M particles (2.5 kg of solid material). Details can be found in [20]. The experimental setup and the computational domain are sketched in Fig. 23. The employed mesh is composed of 3.7M tetrahedra and divided into a refined zone in the smaller section, with an average mesh element size of 1.85 mm, and a coarser zone in the upper part, with an average mesh element size of 3.9 mm. The tests were carried out on the Curie supercomputer of the TGCC center (Très Grand Centre de Calcul, France), featuring an InfiniBand QDR Full Fat Tree interconnect. The nodes used comprise two Intel Sandy Bridge octo-core processors running at 2.7 GHz with 64 GB RAM (about 4 GB per core). The numerical parameters used for the simulations are summarized in Table 2.

6.2. Statistics

The numerical simulations are performed during 20 s of physical time. A first period of 10 s is employed to establish a converged


Fig. 22. Non-dimensional solid mass flow rate obtained for six different values of particle diameter, compared with the Beverloo law with C = 0.55. The value of k ≈ 1.18 is extracted from the slope (see Eq. 30). On the left, the mesh is displayed and the cells used for the spatial sorting are colored in red. On the right, the particles are shown at t = 0 s and colored by the fluid fraction. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 23. Lab-scale Birmingham gas-fluidized bed. On the left, experimental setup extracted from [20] . In the center, front view of the computational domain meshed with 3.7 M tetrahedra, accounting for the parts beyond the gas distribution plate, for a total height of 1.75 m. On the top, top-view featuring the chimney. On the bottom, distribution plate. On the right, domain decomposition into tori for statistics computation.

Table 2
Numerical parameters of the particle-particle and particle-wall soft-sphere collision model.

PARTICLE PHASE
Spring stiffness                  k_n     [N/m]   75
Normal restitution coefficient    e_n     [-]     0.9
Dynamic friction coefficient      μ_tan   [-]     0.3

state regarding bed expansion (see Fig. 24) and pressure loss (see Fig. 25) through the bed; time-averaged statistics are then computed during the remaining 10 s. It should be noted that the original simulations involving TFM [20] were carried out during 360 s, the last 240 s being used to compute statistics. Even these durations could not ensure complete statistical convergence.

The profile of the time-averaged pressure across the reactor is visible in Fig. 26. As expected, the profile displays two distinct slopes: for lower altitudes the slope corresponds to the pressure evolution across a particle bed with a given fluid fraction, as described by Ergun for fixed particle beds [23], while only gas is present in the higher region.
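For reference, the fixed-bed pressure gradient referred to here is classically written as follows (standard form of the Ergun relation, recalled from the literature with generic notations, where $\varepsilon$ is the fluid fraction, $U$ the superficial gas velocity, and $\mu_g$, $\rho_g$ the gas viscosity and density):

$\dfrac{\Delta P}{L} = 150\, \dfrac{\mu_g (1-\varepsilon)^2}{\varepsilon^3 d_p^2}\, U \;+\; 1.75\, \dfrac{\rho_g (1-\varepsilon)}{\varepsilon^3 d_p}\, U^2 .$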

In order to assess possible comparisons with the Euler-Euler formalism, particle physical quantities have to be translated into solid
