3.2. StarPU-based multifrontal method
The experiments presented in this section, as well as in Chapter 4, have been run with an interleaved memory allocation policy, which essentially makes the locality-aware scheduling policy implemented in qr_mumps ineffective; for this reason the two codes are also expected to have the same task efficiency. As the results in Figure 3.6 show, the performance difference between the two codes is to be found in the pipeline and runtime efficiencies. The slightly worse pipeline efficiency is due to the different choice of task priorities made in qrm_starpu in order to contain the memory consumption (more comments on this below) and to the fact that assembly operations are not fully parallelized, as explained above. The lower runtime efficiency, instead, can partly be explained by the fact that in qrm_starpu all the ready tasks are stored in a single, sorted central queue. This clearly incurs a relatively high cost, due to the contention on the locks used to prevent threads from accessing this data structure concurrently and due to the sorting of tasks by priority; in Section 4.4 we will present a novel scheduler that overcomes most of these issues and consistently delivers better performance. As a result we can conclude that the performance difference between qrm_starpu and qr_mumps is merely due to minor, technical issues; the results presented in Chapter 4, with a much more refined and optimized implementation, confirm this intuition. Memory consumption is an extremely critical point to address when designing a sparse direct solver. As the building blocks for designing a scheduling strategy on top of StarPU differ from (and are more advanced than) what is available in qr_mumps (which relies on an ad hoc lightweight scheduler), we could not reproduce exactly the same scheduling strategy. Therefore we decided to give higher priority to reducing the memory consumption in qrm_starpu.
This cannot easily be achieved in qr_mumps because its native scheduler can only handle two levels of task priority; as a result, fronts are activated earlier in qr_mumps, almost consistently leading to a higher memory footprint, as shown in Figure 3.7. The figure also shows that both qrm_starpu and qr_mumps achieve on average the same memory consumption as spqr. In three cases out of eleven spqr achieves a significantly lower memory footprint; in Chapter 5 we will show that it is possible to reliably control and reduce the memory consumption of qrm_starpu and still achieve extremely high performance (roughly the same as in the unconstrained case).


Research Report n° 8913 — May 2016 — 13 pages
Abstract: Heterogeneity is emerging as one of the most challenging characteristics of today's parallel environments. However, not many fully-featured advanced numerical, scientific libraries have been ported on such architectures. In this paper, we propose to extend a sparse hybrid solver for handling distributed memory heterogeneous platforms. As in the original solver, we perform a domain decomposition and associate one subdomain with one MPI process. However, while each subdomain was processed sequentially (bound to a single CPU core) in the original solver, the new solver instead relies on task-based local solvers, delegating tasks to available computing units. We show that this "MPI+task" design conveniently allows for exploiting distributed memory heterogeneous machines. Indeed, a subdomain can now be processed on multiple CPU cores (such as a whole multicore processor or a subset of the available cores), possibly enhanced with GPUs. We illustrate our discussion with the MaPHyS sparse hybrid solver relying on the PaStiX and Chameleon sparse direct and dense libraries, respectively. Interestingly, this two-level MPI+task design furthermore provides extra flexibility for controlling the number of subdomains, enhancing the numerical stability of the considered hybrid method. While the rise of heterogeneous computing has been strongly carried out by the theoretical community, this study aims at showing that it is now also possible to build complex software layers on top of runtime systems to exploit heterogeneous architectures.

Different paradigms exist for programming such a DAG of tasks. Among them, the STF model is becoming increasingly popular because of the high productivity it allows for the programmer. Indeed, this paradigm simply consists in submitting a sequence of tasks through a non-blocking function call that delegates the execution of the task to the runtime system. Upon submission, the runtime system adds the task to the current DAG along with its dependencies, which are automatically computed through data dependency analysis. The actual execution of the task is then postponed to the moment when its dependencies are satisfied. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. Figure 1 (right) shows the 1D STF version from [17, 18] of the multifrontal QR factorization described above. Instead of making direct function calls (activate, assemble, deactivate, panel, update), the equivalent STF code submits the corresponding tasks. Since the data onto which these functions operate, as well as their access mode (Read, Write or Read/Write), are also specified, the runtime system can perform the superscalar analysis while the submission of tasks is progressing. For instance, because an assemble task accesses a block-column f(i) before a panel task accesses the same block-column in Write mode, a dependency between those two tasks is inferred. Our previous work [17] showed that the STF programming model allows for designing a code that achieves great performance and scalability as well as excellent robustness when it comes to memory consumption.
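The dependency-inference mechanism described above can be sketched in a few lines. The following is a toy illustration of the STF principle, not the StarPU or qrm_starpu API: tasks are submitted sequentially with declared data accesses, and a dependency is inferred whenever a task touches data previously written by another (all names here are invented for the example).

```python
# Minimal sketch of Sequential Task Flow (STF) dependency inference:
# tasks declare which data they read/write; the "runtime" records the
# last writer of each datum and infers RAW/WAW dependencies from it.
class STFRuntime:
    def __init__(self):
        self.last_writer = {}   # datum -> name of the task that last wrote it
        self.tasks = []

    def submit(self, name, func, reads=(), writes=()):
        deps = set()
        for d in tuple(reads) + tuple(writes):
            if d in self.last_writer:          # depends on the last writer
                deps.add(self.last_writer[d])
        task = {"name": name, "func": func, "deps": deps}
        for d in writes:
            self.last_writer[d] = name
        self.tasks.append(task)
        return task

    def run(self):
        # Sequential execution here; a real runtime would launch any task
        # whose dependencies are satisfied, possibly in parallel.
        done = set()
        for t in self.tasks:
            assert t["deps"] <= done, "dependency violated"
            t["func"]()
            done.add(t["name"])

rt = STFRuntime()
log = []
rt.submit("assemble(f1)", lambda: log.append("assemble"), writes=["f1"])
rt.submit("panel(f1)",    lambda: log.append("panel"),
          reads=["f1"], writes=["f1"])
rt.run()
```

As in the text, the panel task on block-column f1 is automatically made dependent on the assemble task that wrote f1 before it.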

The O(N^2) complexity of a standard, full-rank solution of a 3D problem (of N unknowns) from the Laplacian operator discretized with a 3D 7-point stencil is reduced to O(N^(5/3)) when using the BLR format (Amestoy et al., 2015b). Although compression rates may not be as good as those achieved with hierarchical formats, BLR offers good flexibility thanks to its simple, flat structure. This makes BLR easy to adapt to any multifrontal solver without a complete rethinking of the code. Next, we describe the generalization of BLR to a parallel environment. The row-wise partitioning imposed by the distribution of the front onto several processes constrains the clustering of the unknowns. However, in practice, we manage to maintain nearly the same compression rates when the number of processes grows (see Figure 3). Both LU and CB compression can contribute to reducing the volume of communication by a substantial factor and improving the parallel efficiency of the solver. In our implementation, we do not compress the CB. To fully exploit multicore architectures, MPI parallelism is hybridized with thread parallelism by multithreading the tasks of Algorithm 1. In full-rank, we exploit multithreaded BLAS kernels. In low-rank, these tasks have a finer granularity and thus a lower efficiency (flop rate). Thus, with multithreaded BLAS we are not able to efficiently transform the compression of flops into a reduction in time. To overcome this obstacle, Algorithm 1 can be modified to exploit OpenMP-based multithreading instead, which allows for a larger granularity of computations. The Update task at line 9 is applied on a set of independent blocks A_{i,j}: therefore, the loop at line 8 can
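The core idea of BLR compression, replacing an off-diagonal block by a low-rank product truncated at a numerical threshold, can be illustrated with a truncated SVD. This is a sketch of the principle only, not the MUMPS BLR kernels (which use rank-revealing QR rather than SVD); the kernel matrix below is an invented example chosen to be numerically low-rank.

```python
# Illustrative BLR compression of one block: keep only the singular
# values above a relative threshold eps (the "BLR threshold"), so that
# block ~ X @ Y with X (m x k) and Y (k x n), k << min(m, n).
import numpy as np

def blr_compress(block, eps):
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    k = int(np.sum(s > eps * s[0]))      # numerical rank at threshold eps
    return U[:, :k] * s[:k], Vt[:k, :]

n = 64
x = np.linspace(1.0, 2.0, n)
block = 1.0 / (x[:, None] + x[None, :])  # smooth Cauchy-like kernel: low rank
X, Y = blr_compress(block, 1e-8)
err = np.linalg.norm(block - X @ Y) / np.linalg.norm(block)
# storage drops from n*n entries to 2*n*k entries
```

The trade-off the text describes is visible directly: a smaller eps gives a larger rank k (less compression, more accuracy), and vice versa.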

2 Related work
2.1 Parallel programming models for task-based algorithms
The most common strategy for the parallelization of task-based algorithms consists in traversing the DAG sequentially and submitting the tasks, as they are discovered, to the runtime system using a non-blocking function call. The dependencies between tasks are automatically inferred by the runtime system through a data dependency analysis [4] and the actual execution of the task is then postponed to the moment when all its dependencies are satisfied. This programming model is known as the Sequential Task Flow (STF) model as it fully relies on sequential consistency for the dependency detection. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. As mentioned above, the popularity of this model encouraged the OpenMP board to include it in the 4.0 standard. The simplicity of the STF model facilitates the design of numerical algorithms in a concise manner and can be exploited to efficiently target multicore architectures [2].

(3) It proposes a memory-aware approach for controlling the memory consumption of the parallel multifrontal method, which allows the user to achieve the highest possible performance within a prescribed memory footprint. This technique can be seen as employing a sliding window that sweeps the whole elimination tree and whose size is dynamically adjusted in the course of the factorization in order to accommodate as many fronts as possible within the imposed memory envelope. These three contributions are deeply related. The approach based on 2D frontal matrix factorizations leads to extremely large DAGs with heterogeneous tasks and very complex dependencies; implementing such a complex algorithm in the STF model is conceptually equivalent to writing the corresponding sequential code because parallelism is handled by the runtime system. The effectiveness of this approach is supported by a finely detailed analysis based on a novel method which allows for a very accurate profiling of the performance of the resulting code by quantifying separately all the factors that play a role in the scalability and efficiency of a shared-memory code (granularity, locality, overhead, scheduling). This method can be readily extended to the case of distributed-memory parallelism in order to account for other performance-critical factors such as communications. Our analysis shows that the runtime system can handle such complex DAGs very efficiently and that the relative runtime overhead is only marginally increased by the increased DAG size and complexity; on the other hand, the experimental results also show that the higher degree of concurrency provided by 2D methods can considerably improve the scalability of a sparse matrix QR factorization.
Because of the very dynamic execution model of the resulting code, concerns may arise about the memory consumption; we show that, instead, it is possible to control the memory consumption by simply controlling the flow of tasks in the runtime system. Experimental results demonstrate that, because of the high concurrency achieved by the 2D parallelization and because of the efficiency of the underlying runtime, the very high performance of our code is barely affected even when the parallel code is constrained to run within the same memory footprint as a sequential execution.
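The "sliding window" idea, controlling memory by throttling the flow of task submissions rather than by changing the algorithm, can be sketched as follows. This is a simplified model (front sizes and the FIFO completion order are illustrative assumptions, not the actual qrm_starpu mechanism): submission of the next front is suspended whenever activating it would exceed the budget, and resumes as active fronts are deactivated.

```python
# Sketch of memory-aware task-flow throttling: process fronts in tree
# order, never letting the live memory exceed a prescribed budget.
def factorize(front_sizes, budget):
    live, peak = 0, 0
    active = []                      # sizes of currently active fronts
    for size in front_sizes:
        while active and live + size > budget:
            live -= active.pop(0)    # wait for the oldest front to complete
        live += size
        peak = max(peak, live)
        active.append(size)
    return peak

# Unconstrained, all four fronts may be live at once (peak 140);
# with a budget of 100, submission stalls and the peak stays within it.
peak = factorize([40, 30, 50, 20], budget=100)
```

The point of the passage above is precisely that this throttling happens in the submission loop, so the numerical code itself is unchanged.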

In this paper, we tackle another class of algorithms, the Krylov subspace methods, which aim at solving large sparse linear systems of equations of the form Ax = b, where A is a sparse matrix. These methods are based on the calculation of approximate solutions in a sequence of embedded spaces, which is an intrinsically sequential numerical scheme. Moreover, their unpreconditioned versions rely exclusively on kernels that are not compute-intensive and have irregular memory access patterns, Sparse Matrix-Vector products (SpMV) and level-1 BLAS, which need very large grain tasks to benefit from GPU acceleration. For these reasons, designing and scheduling Krylov subspace methods on a multi-GPU platform is extremely challenging, especially when relying on a task-based abstraction, which requires delegating part of the control to a runtime system. We discuss this methodological approach in the context of the Conjugate Gradient (CG) algorithm on a shared-memory machine accelerated with multiple GPUs, using the StarPU runtime system [5] to process the designed task graph. The CG solver is a widely used Krylov subspace method and the numerical algorithm of choice for the solution of large linear systems with symmetric positive definite matrices [13].
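To make the kernel structure concrete, here is a plain unpreconditioned CG iteration (the textbook algorithm, not the StarPU task graph of the paper): each iteration is dominated by one SpMV and a handful of level-1 BLAS operations (axpy, dot products), which is exactly why the method offers so little large-grain work for GPUs.

```python
# Unpreconditioned Conjugate Gradient: one SpMV plus level-1 BLAS per
# iteration, terminating when the residual norm drops below tol.
import numpy as np

def cg(A, b, tol=1e-10, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = A @ p                   # the SpMV kernel
        alpha = rs / (p @ Ap)
        x += alpha * p               # axpy (level-1 BLAS)
        r -= alpha * Ap
        rs_new = r @ r               # dot product (level-1 BLAS)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)        # symmetric positive definite test matrix
b = rng.standard_normal(50)
x = cg(A, b)
```

The sequential chain of dot products (alpha and beta depend on global reductions) is what makes the scheme "intrinsically sequential" in the sense of the passage above.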

condition number with dgecon by means of those two results. The main challenge here resides in the LU factorization with partial pivoting, which is difficult to implement using task-based programming models. Indeed, searches for pivot candidates and row swapping generate many global synchronization points within the panel factorization and its resulting updates. Some solutions have been proposed on shared-memory systems [45] but there are no existing solutions that are oblivious of heterogeneous architectures. We thus propose a QR-based solution which consists in estimating the norm of A^-1 by computing the norm of R^-1 with A = QR. This solution, which turns out to be less costly, alleviates the pivoting issue altogether, uses only regular tile algorithms and allows code portability across various architectures, thanks to the underlying runtime system. The third section of the algorithm, rows 21 to 48, is the main loop of the algorithm, which iterates on U_k and converges to the polar factors. This section of the
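The QR-based estimate rests on a simple identity: since Q is orthogonal in A = QR, the 2-norm of A^-1 equals that of R^-1, so the condition number can be obtained from the triangular factor alone, with no pivoting. A minimal numerical check of that identity (using dense NumPy routines, not the tile algorithms of the paper, and forming R^-1 explicitly where a real code would use norm estimation on triangular solves):

```python
# Condition-number estimate via QR: ||A^-1||_2 = ||R^-1||_2 when A = QR,
# so cond_2(A) = ||A||_2 * ||R^-1||_2 without any LU pivoting.
import numpy as np

def qr_cond_estimate(A):
    R = np.linalg.qr(A, mode="r")                   # triangular factor only
    Rinv = np.linalg.solve(R, np.eye(R.shape[0]))   # triangular solves in practice
    return np.linalg.norm(A, 2) * np.linalg.norm(Rinv, 2)

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 30))
est = qr_cond_estimate(A)
exact = np.linalg.cond(A, 2)
# est matches the exact 2-norm condition number up to rounding,
# because multiplication by the orthogonal Q preserves the 2-norm.
```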

The objective of this paper is thus to study how, starting from an existing parallel solver using MPI, it is possible to adapt it and improve its performance on multicore architectures. Although the resulting code is able to run on hybrid-memory architectures, we consider here a pure shared-memory environment. We use the example of the MUMPS solver [3, 5], but the methodology and approaches described in this paper are more general. We study and combine the use (i) of multithreaded libraries (in particular BLAS, the Basic Linear Algebra Subprograms [25]); (ii) of loop-based fine-grain parallelism based on OpenMP [1] directives; and (iii) of coarse-grain OpenMP parallelism between independent tasks. On NUMA architectures, we show that the memory allocation policy and the resulting memory affinity have a strong impact on performance and should be chosen with care, depending on the set of threads working on each task. Furthermore, when treating independent tasks in a multithreaded environment, if no ready task is available, a thread that has finished its share of the work will wait for the other threads to finish and become idle. In an OpenMP environment, we show how, technically, it is in that case possible to re-assign idle threads to active tasks, dynamically increasing the amount of parallelism exploited and speeding up the corresponding computations.

Alternatively, a modular approach can be employed. First, the numerical algorithm is written at a high level, independently of the hardware architecture, as a Directed Acyclic Graph (DAG) of tasks where a vertex represents a task and an edge represents a dependency between tasks. A second layer is in charge of the scheduling. Based on the scheduling decisions, a third layer, the runtime system, takes care of performing the actual execution of the tasks, both ensuring that dependencies are satisfied at execution time and maintaining data consistency. The fourth layer consists of the task code optimized for the underlying architectures. In most cases, the last three layers need not be written by the application developer. Indeed, there usually exists a very competitive state-of-the-art generic scheduling algorithm (such as work-stealing [4] or Minimum Completion Time [19]) matching the algorithmic needs to efficiently exploit the targeted architecture (otherwise, a new scheduling algorithm may be designed, which will in turn be likely to apply to a whole class of algorithms). The runtime system only needs to be extended once for each new architecture. Finally, most of the time, the high-level algorithm can be cast in terms of standard operations (such as BLAS in dense linear algebra) for which vendors provide optimized codes. All in all, with such a modular approach, only the high-level algorithm has to be specifically designed, which ensures high productivity. Maintainability is also guaranteed since the support of new hardware only requires (in principle) third-party effort.

7 IRIT Laboratory, University of Toulouse and UPS, F-31071 Toulouse, France
SUMMARY
We put forward the idea of using a Block Low-Rank (BLR) multifrontal direct solver to efficiently solve the linear systems of equations arising from a finite-difference discretization of the frequency-domain Maxwell equations for 3-D electromagnetic (EM) problems. The solver uses a low-rank representation for the off-diagonal blocks of the intermediate dense matrices arising in the multifrontal method to reduce the computational load. A numerical threshold, the so-called BLR threshold, controlling the accuracy of the low-rank representations was optimized by balancing errors in the computed EM fields against savings in floating-point operations (flops). Simulations were carried out over large-scale 3-D resistivity models representing typical scenarios for marine controlled-source EM surveys, in particular the SEG SEAM model, which contains an irregular salt body. The flop count, size of factor matrices and elapsed run time for matrix factorization are reduced dramatically by using BLR representations and can go down to, respectively, 10, 30 and 40 per cent of their full-rank values for our largest system with N = 20.6 million unknowns. The reductions are almost independent of the number of MPI tasks and threads, at least up to 90 × 10 = 900 cores. The BLR savings increase for larger systems, which reduces the factorization flop complexity from O(N^2) for the full-rank solver to O(N^m) with m = 1.4–1.6. The BLR savings are

Although compression rates may not be as good as those achieved with hierarchical formats, BLR offers good flexibility thanks to its simple, flat structure. Many variants of Algorithm 1 can easily be defined, depending on the position of the 'C' phase. For instance, it can be moved to the last position if one needs an accurate factorization and an approximated, faster solution phase. This might be a strategy of choice for the FWI application, where a large number of right-hand sides must be processed during the solution phase. In a parallel environment, the BLR format allows for an easier distribution and handling of the frontal matrices. Pivoting in a BLR matrix can be done more easily without significantly perturbing the structure. Lastly, converting a matrix from the standard representation to BLR, and vice versa, is much cheaper than in the case of hierarchical matrices (see Table 1 for the low global cost of compressing fronts into the BLR format). This allows switching back and forth from one format to the other whenever needed at a reasonable cost; this is, for example, done to simplify the assembly operations, which are extremely complicated to perform in any low-rank format. As shown in Fig. 3, the O(N^2) complexity of a standard, full-rank solution of a 3D problem (of N unknowns) from the Laplacian operator discretized with a 3D 11-point stencil is reduced to O(N^(4/3)) when using the BLR format. All these points make BLR easy to adapt to any multifrontal solver without a complete rethinking of the code.

11: Identify invalid tasks: J_IV := {j ∈ J_RD | (z_ij1 = NULL) XOR (z_ij2 = NULL)}
12: end while
or, (c) only one subtask is assigned. For case (a), one can claim that the RDT j is assigned to an appropriate pair of agents, and for case (b), one can claim that the RDT j is not assigned because it is not attractive to the team. Case (c) can be problematic, because one agent is committed to taking its part in the RDT based on a wrongly optimistic view, and actually obtains no reward from it. The performance degradation due to this invalid assignment can be larger than the loss of the assumed score for the subtask the agent has committed to do. The reason for this additional performance degradation is the opportunity cost: the agent may have been able to do other tasks had it been aware that this RDT would only be partially assigned, so that no actual reward could be obtained from it. Had the agent been aware of that fact, it would not have bid on the task in the first place. One approach to improve performance is to eliminate these invalid tasks and re-distribute the tasks accordingly.
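The invalid-task test of line 11 above is just an XOR over the two subtask assignments of each RDT. A minimal sketch with invented task and agent identifiers:

```python
# Line 11 above as code: an RDT j is invalid when exactly one of its two
# subtask assignments (z_ij1, z_ij2) is NULL, i.e. the XOR of the tests.
def invalid_tasks(assignments):
    """assignments: {task_id: (agent_for_subtask_1, agent_for_subtask_2)},
    with None standing in for NULL (subtask unassigned)."""
    return {j for j, (z1, z2) in assignments.items()
            if (z1 is None) != (z2 is None)}       # XOR

assignments = {
    "j1": ("a1", "a2"),   # case (a): both subtasks assigned -> valid
    "j2": (None, None),   # case (b): unassigned -> valid, just unattractive
    "j3": ("a3", None),   # case (c): half assigned -> invalid
}
iv = invalid_tasks(assignments)
```

Only j3, the case (c) task, ends up in the invalid set to be eliminated and re-distributed.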

Han Lin et al. present a way of generating DAGs for heterogeneous scheduling evaluation purposes [10]. An implementation of this algorithm has been created by Frederic Suter [18]. It has 5 hyperparameters (n, fat, regularity, jump, and CCR). This graph generation method was our first idea for generating our dataset, but it has several downsides. It does not handle task types: each task is therefore treated individually. In a real application, however, there are usually different task types. Besides, there are often correlations between a task's dependencies and its type. For example, an application can have a regular pattern where a task of type A performs a matrix inversion and then 16 tasks of type B read the matrix. In this case, A tasks would systematically have 16 successors of type B, and conversely for B tasks. This kind of behavior is not represented at all by the presented DAG generation method.
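The A-to-16-B pattern described above is straightforward to generate once the generator is type-aware. A toy sketch (the stage structure and tuple encoding are illustrative, not the generator of [10, 18]):

```python
# Type-aware DAG generation: each stage has one type-A task (e.g. a
# matrix inversion) fanning out to `fanout` type-B readers, the regular
# pattern that a type-agnostic generator cannot express.
def typed_dag(n_stages, fanout=16):
    tasks, edges = [], []
    for s in range(n_stages):
        a = ("A", s, 0)                 # the producer task of this stage
        tasks.append(a)
        for i in range(fanout):         # its type-B successors
            b = ("B", s, i)
            tasks.append(b)
            edges.append((a, b))
    return tasks, edges

tasks, edges = typed_dag(2)
```

Every edge goes from a type-A vertex to a type-B vertex, so the type/dependency correlation is built into the graph by construction.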

This theory was applied to different domains, including vehicular traffic [14], supply chains [1], irrigation channels [3] and others. For a complete account of recent results and references we refer the reader to the survey [6].
For vehicular traffic, authors have considered many different traffic situations to be modeled, thus proposing a rich set of alternative junction models even for the scalar case, see [8, 9, 12, 13, 14, 18, 22, 24]. Here, we first propose a new model which considers priorities among the incoming roads as the first criterion and maximization of flux as the second. The main idea is that the road with the highest priority will use the maximal flow, taking into account also the outgoing road constraints. If some room is left for additional flow, then the road with the second highest priority will use the remaining space, and so on. A precise definition of the new Riemann solver, called the Priority Riemann Solver, is based on a traffic distribution matrix A (Definition 11), a priority vector P = (p_1, ..., p_n) (with p_i ≥ 0 and Σ_i p_i = 1) and requires a recursion
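The greedy allocation idea behind this construction can be sketched in scalar form: incoming roads take their maximal feasible flux in decreasing priority order, each limited by the capacity left on the outgoing roads it feeds through the distribution matrix A. This is a simplified illustration of the priority principle only, not the actual Riemann solver of Definition 11 (which involves a recursion over active constraints).

```python
# Greedy priority-based flux allocation at a junction: road i may send
# at most demands[i]; a fraction A[i][j] of its flux goes to outgoing
# road j, which has remaining capacity out_caps[j].
def priority_allocate(demands, priorities, A, out_caps):
    caps = list(out_caps)
    flux = [0.0] * len(demands)
    order = sorted(range(len(demands)), key=lambda i: -priorities[i])
    for i in order:
        # largest f <= demands[i] with f * A[i][j] <= caps[j] for all j
        f = demands[i]
        for j, a in enumerate(A[i]):
            if a > 0:
                f = min(f, caps[j] / a)
        flux[i] = f
        for j, a in enumerate(A[i]):
            caps[j] -= f * a
    return flux

# Two incoming roads share one outgoing road of capacity 1.5: the
# higher-priority road passes its full demand, the other takes the rest.
flux = priority_allocate([1.0, 1.0], [0.7, 0.3], [[1.0], [1.0]], [1.5])
```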

Ref. John A. García H. et al. Energetically Efficient Acceleration EEA-Aware. Degree work to obtain the title of Master of Science in Systems Engineering and Informatics at UIS, 2016. Ref. Víctor Martinez et al. Towards Seismic Wave Modeling on Heterogeneous Many-Core Architectures Using Task-Based Runtime System. SBAC-PAD 2015.
Seismic Wave Equation for a solid medium.
Constitutive relation in the case of an isotropic medium.

HLS is the transformation of a functional description, provided as C code in our framework, into a Register Transfer Level (RTL) description. The generated architecture described at the RTL level is compliant with an architecture model that depends on the HLS tool. The HLS tool currently used in our framework is GAUT [28], which is linear in complexity. It is a free HLS tool based on a Data Flow Graph model and therefore dedicated to data-dominated algorithms. From a C/C++ specification and a set of design constraints, GAUT automatically generates a potentially pipelined RTL architecture in two formalisms: in VHDL for synthesis, and in SystemC for virtual prototyping. In our flow, the set of design constraints is provided by the HLS/DSE controller (cf. Fig. 7), in order to generate and evaluate a set of coprocessors with several performance/resource tradeoffs. The HLS constraints we consider for automated exploration are the latency and the communication model (FIFO or shared memory). The generated architecture is composed of (i) a processing unit composed of the data-path (logic/arithmetic operators + storage elements + steering logic) and an FSM controller; (ii) a memory unit composed of memory banks and associated controllers; and (iii) a communication interface which can be implemented as a FIFO, a shared memory (optionally with ping-pong mode), or a 4-phase handshake. Given that the underlying architecture model of GAUT is clearly defined, it is possible to perform a resource usage estimation. Thus, after behavioral synthesis, the following features are known exactly:

and 30 per cent of the size for the matrices of factors compared to the conventional full-rank (FR) factorization method.
There have so far been no reports on the application of multifrontal solvers with low-rank approximations to 3-D EM problems. EM fields in geophysical applications usually have a diffusive nature, which makes the underlying equations fundamentally different from those describing seismic waves. They are also very different from the thermal diffusion equations, since EM fields are vectors. Most importantly, the scatter of material properties in EM problems is exceptionally large; for example, for marine CSEM applications the resistivities of seawater and resistive rocks often differ by four orders of magnitude or more. On top of that, the air layer has an essentially infinite resistivity and should be included in the computational domain unless the water is deep. Thus, elements of the system matrix may vary by many orders of magnitude, which can affect the performance of low-rank approaches to matrix factorization.

The makespan results (the average of 5 executions) for the Montage and CyberShake DAGs are shown in Fig. 6, while Fig. 7 shows the comparison of the degree of imbalance between the E-HEFT, HEFT and MinMin-TSH algorithms. According to the experimental results in Fig. 6, it can be seen that the total running time of the scientific workflows lengthens as the number of submitted tasks increases. It is clearly evident from the graph that E-HEFT is more efficient than the other two algorithms on both workflows. On the Montage workflow, our proposed algorithm (E-HEFT) outperforms the HEFT and MinMin-TSH algorithms in terms of the average makespan of the submitted tasks by 21.37% and 28.98%, respectively, as shown in Fig. 6a. On the other hand, E-HEFT showed an average improvement of 26.07% relative to HEFT and 21.93% relative to MinMin-TSH on the CyberShake workflow, as shown in Fig. 6b. It is worth mentioning that the makespan is closely related to the volume of data movement for data-intensive workflows, so we can state that these variations in improvement are mainly due to the volume of data transferred among activities in each workflow. Compared with the two other algorithms, our algorithm
