*With increasing computational power, today’s simulations have also developed into a high performance computing (HPC) problem. Thus, the applied numerical method should allow the computational and memory resources provided by supercomputing systems to be exploited efficiently. In this context, high-order discontinuous Galerkin spectral element methods (DGSEM) are known for their high potential for massive parallelization. In the following, we study the HPC features of FLEXI and address the question of parallel efficiency in order to conduct HPC simulations in a responsible and sustainable way.*

**Key Facts:**

• Excellent scaling up to 131,072 processors @ Jugene (Juelich)

• Super-linear scaling due to caching effects

• For higher polynomial degrees N, super-linear scaling up to one element per core

The main advantage of the DGSEM scheme lies in its high performance computing (HPC) capabilities, which enable an efficient parallelization of FLEXI. The DGSEM algorithm with explicit time discretization is inherently parallel, since all elements communicate only with their direct neighbors. Beyond that, the tensor-product ansatz of the DGSEM allows the three-dimensional integrals to be converted into a series of one-dimensional computations. Thus, a local one-dimensional DGSEM operator can be constructed and applied in each coordinate direction, an important efficiency feature of the scheme. Furthermore, the DGSEM operator itself can be split into two parts: the volume part, which relies solely on local data, and the surface part, which requires neighbor information. This property of the DG method can be exploited to hide communication latencies and reduce the negative impact of data transfer to a minimum: surface data can be sent while volume operations are performed simultaneously. Hence, the DGSEM facilitates a lean parallelization strategy that introduces no operations beyond direct neighbor communication, which is important for an efficiently scalable algorithm.
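The tensor-product structure described above can be illustrated with a minimal numpy sketch: the same 1D operator matrix is applied along each coordinate direction of an element's nodal data. This is only a schematic stand-in (a random matrix instead of the actual Lagrange differentiation matrix on Gauss nodes), not FLEXI's Fortran implementation.

```python
import numpy as np

N = 5                      # polynomial degree; N+1 nodes per direction
n = N + 1
rng = np.random.default_rng(0)

# 1D operator matrix (random stand-in here; in DGSEM this would be the
# Lagrange derivative matrix evaluated at the interpolation nodes)
D = rng.standard_normal((n, n))

# nodal solution values in one hexahedral element: u[i, j, k]
u = rng.standard_normal((n, n, n))

# apply the same 1D operator along each coordinate direction in turn,
# replacing one 3D operation by three 1D sweeps
du_xi   = np.einsum('li,ijk->ljk', D, u)   # xi-direction
du_eta  = np.einsum('mj,ijk->imk', D, u)   # eta-direction
du_zeta = np.einsum('nk,ijk->ijn', D, u)   # zeta-direction
```

Each sweep costs O((N+1)^4) operations per element instead of the O((N+1)^6) a naive full 3D operator would require.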

To demonstrate the high parallel efficiency, Figure 1 contains strong scaling tests of FLEXI, which have been conducted on the HLRS Cray XC40 cluster using up to 12,288 physical cores. In Figure 1, two setups are depicted: the first assesses the scaling efficiency at a fixed polynomial degree N=5 using three different meshes with 768–12,288 elements. The second analyzes the performance of varying polynomial degrees N=3–9 on a fixed mesh with 12,288 elements. In both cases, we doubled the number of cores in each step until the limit of one element per core was reached (represented by the last symbol on the plots). For all cases, we achieve so-called super-linear scaling over a wide range of process counts, i.e. a scaling efficiency above 100%, owing to caching effects enabled by the low memory consumption. The scaling efficiency decreases only towards the one-element-per-core case. For the higher polynomial degrees N=7 and N=9, however, we obtain super-linear scaling even in the one-element-per-core case. The strong scaling results thus demonstrate that FLEXI achieves excellent parallel efficiency and is very well suited for demanding HPC applications.
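The strong scaling efficiency used above can be computed as follows; the function name and signature are ours, chosen for illustration. Efficiency relative to a baseline run is the measured speedup divided by the ideal speedup, and values above 1 indicate super-linear scaling.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Parallel efficiency of a run on p cores (wall time t_p)
    relative to a baseline run on p_base cores (wall time t_base):
    measured speedup (t_base / t_p) over ideal speedup (p / p_base)."""
    return (t_base / t_p) / (p / p_base)

# e.g. doubling the cores from 64 to 128 while the runtime drops
# from 100 s to 45 s gives an efficiency of about 1.11 (super-linear)
eff = strong_scaling_efficiency(100.0, 64, 45.0, 128)
```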

Figure 1: Strong scaling results for varying number of elements and constant polynomial degree N=5 (left), and constant number of elements and varying polynomial degrees N=3–9 (right).

Further, in Figure 2 we investigate the performance index (PID) over the load, i.e. the number of degrees of freedom per core. The PID is a convenient measure of computational efficiency: it expresses the computational time needed to update one degree of freedom for one time step. In Figure 2, we can distinguish three regions: the leftmost region is the latency-dominated region, characterized by a very high PID at low loads. In this region, hiding communication latency behind volume operations has no effect, because the load per core is too low. The rightmost region, in turn, is characterized by a high PID at high loads and is dominated by the memory bandwidth of the nodes. The central region represents the *sweet spot*, where the PID curve reaches its minimum. Here, the load is just small enough to fit into the CPU cache and to exploit the latency hiding feature of the scheme. We note that in our study multi-threading had no beneficial effect on the performance of the code.
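From its definition, the PID can be computed from a run's wall time, core count, problem size, and number of time steps. The following sketch uses our own variable names and made-up example numbers purely for illustration.

```python
def performance_index(wall_time, n_cores, n_dof, n_timesteps):
    """PID: core time spent per degree-of-freedom update, in seconds.

    wall_time   -- measured wall-clock time of the run [s]
    n_cores     -- number of physical cores used
    n_dof       -- total number of degrees of freedom
    n_timesteps -- number of explicit time steps performed
    """
    return wall_time * n_cores / (n_dof * n_timesteps)

# hypothetical example: 2 s wall time on 4 cores,
# 1000 DOF advanced over 100 time steps
pid = performance_index(2.0, 4, 1000, 100)
```

A lower PID means each degree of freedom is updated more cheaply; plotting the PID against the load per core reveals the three regions discussed above.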

**Contact:**

Muhammed Atak, atak@iag.uni-stuttgart.de

Thomas Bolemann, bolemann@iag.uni-stuttgart.de

Claus-Dieter Munz, munz@iag.uni-stuttgart.de