Multicore Architectures

Reducing timing interferences in real-time applications running on multicore architectures

We introduce a unified WCET analysis and scheduling framework for real-time applications deployed on multicore architectures. Our method does not impose a particular programming model, meaning that any piece of existing code (in particular legacy code) can be reused, and it aims at automatically reducing the worst-case number of timing interferences between tasks. The method is based on the notion of Time Interest Points (TIPs), which are instructions that can generate and/or suffer from timing interferences. We show how such points can be extracted from the binary code of applications and selected prior to performing the WCET analysis. We then represent real-time tasks as sequences of time intervals separated by TIPs, and schedule those tasks so that the overall makespan (including the potential timing penalties incurred by interferences) is minimized. This scheduling phase is performed with an Integer Linear Programming (ILP) solver. Preliminary experiments on state-of-the-art benchmarks show promising results and pave the way for future extensions of the model and further optimizations.
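
To make the scheduling idea concrete, here is a minimal sketch, not the paper's actual model, of an interference-aware ILP using the PuLP library: two tasks either get serialized or are allowed to overlap, and overlapping inflates their finish times by a fixed worst-case interference penalty. The durations, the penalty value and the encoding itself are illustrative assumptions.

```python
# Hypothetical interference-aware scheduling ILP (illustrative only).
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, PULP_CBC_CMD

durations = {"A": 4, "B": 3}   # made-up WCETs of two tasks in isolation
penalty = 2                    # made-up worst-case cost of one interference

prob = LpProblem("interference_aware_schedule", LpMinimize)
start = {t: LpVariable(f"start_{t}", lowBound=0) for t in durations}
overlap = LpVariable("overlap", cat=LpBinary)    # 1 if A and B may interfere
before = LpVariable("A_before_B", cat=LpBinary)  # order when serialized
makespan = LpVariable("makespan", lowBound=0)

M = sum(durations.values()) + penalty  # big-M constant
# If overlap == 0, one task must finish before the other starts.
prob += start["A"] + durations["A"] <= start["B"] + M * (overlap + 1 - before)
prob += start["B"] + durations["B"] <= start["A"] + M * (overlap + before)
# Finish times are inflated by the penalty whenever interference is possible.
for t in durations:
    prob += makespan >= start[t] + durations[t] + penalty * overlap
prob += makespan  # objective: minimize the interference-aware makespan

prob.solve(PULP_CBC_CMD(msg=False))
# Here running the tasks concurrently (makespan 6) beats serializing them (7),
# so the solver accepts the interference penalty.
print(makespan.value(), overlap.value())
```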

Multifrontal QR Factorization for Multicore Architectures over Runtime Systems

This paper evaluates the usability of runtime systems, and of the associated modular approach, in the context of a complex application: the multifrontal QR factorization of sparse matrices [3], which yields extremely irregular workloads with tasks of different granularities and characteristics as well as variable memory consumption. To that end, we consider a heavily hand-tuned, state-of-the-art solver for multicore architectures, qr_mumps [9]; we propose an alternative modular design of the solver on top of the StarPU runtime system [5]; and we present a thorough performance comparison of both approaches on the architecture for which the original solver was tuned. The penalty of delegating part of the task management system to third-party software, the runtime system, should be weighed against the impact of the numerical algorithmic choices; for that purpose, we also discuss the relative performance with respect to another state-of-the-art multifrontal QR solver for multicore architectures, the SuiteSparseQR package [11], referred to as SPQR.

Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures

Multicore architectures featuring specialized accelerators are receiving increasing attention, and this success will probably influence the design of future High Performance Computing hardware. Unfortunately, programmers currently have a hard time exploiting all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations onto the available accelerators. Recently, runtime systems have been designed that exploit the idea of scheduling, as opposed to offloading, parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most suitable computing unit [2]. Deep knowledge of the application is usually required to build per-task performance models based on the algorithmic complexity of the underlying numerical kernel.
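
As an illustration of what such automatic calibration can look like, the sketch below, with made-up timings and no connection to the paper's actual system, fits a power-law performance model t ≈ c·n^α per processing unit from execution history and uses it to pick the fastest unit for a new size.

```python
# History-based performance model calibration (illustrative data and model).
import numpy as np

history = {  # hypothetical measured (size, seconds) samples per device
    "cpu": [(256, 0.9e-3), (512, 7.1e-3), (1024, 56e-3)],
    "gpu": [(256, 1.5e-3), (512, 4.0e-3), (1024, 12e-3)],
}

models = {}
for dev, samples in history.items():
    n, t = np.array(samples).T
    alpha, log_c = np.polyfit(np.log(n), np.log(t), 1)  # linear fit in log-log space
    models[dev] = (np.exp(log_c), alpha)

def predict(dev, n):
    c, alpha = models[dev]
    return c * n ** alpha

n = 2048
best = min(models, key=lambda d: predict(d, n))
print({d: round(predict(d, n), 4) for d in models}, "->", best)
```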

X-Kaapi: a Multi Paradigm Runtime for Multicore Architectures

Industrial codes usually require mixing different parallelization paradigms to achieve interesting speedups. The challenge is to develop programming and runtime environments that efficiently support this multiplicity of paradigms. We introduce X-Kaapi, a runtime for multicore architectures designed to support multiple parallelization paradigms with high performance thanks to a low-overhead scheduler. The proposed case study is the industrial numerical simulation code for fast transient dynamics called EUROPLEXUS (abbreviated EPX in the following paragraphs). EPX is dedicated to complex simulations in an industrial setting, with a large source code of 600,000 lines of Fortran. It supports 1-D, 2-D and 3-D models, based on either continuous or discrete approaches, to simulate structures and fluids in interaction. EPX supports non-linear physics for both geometrical (finite displacements, rotations and strains) and material (plasticity, damage, etc.) properties. A typical simulation spends more than 70% of the execution time in:

Fast and Portable Locking for Multicore Architectures

This article has presented RCL, a novel locking technique that focuses both on reducing lock acquisition time and on improving the execution speed of critical sections through increased data locality. The key idea is to go one step further than combining locks and to dedicate hardware threads to the execution of critical sections: since current multicore architectures have dozens of hardware threads at their disposal that typically cannot be fully exploited because applications lack scalability, dedicating some of these hardware threads to a specific task such as serving critical sections can only improve the application's performance. RCL takes the form of a runtime library for Linux and Solaris that supports x86 and SPARC architectures. In order to ease the reengineering of legacy applications, RCL provides a profiler as well as a methodology for detecting highly contended locks and locks whose critical sections suffer from poor data locality, since these two kinds of locks generally benefit from RCL. Once these locks have been identified, adopting RCL is facilitated by a provided reengineering tool that encapsulates critical sections into functions. We show that RCL outperforms other existing lock algorithms on a microbenchmark and on several applications where locks are a bottleneck.
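
The core mechanism can be sketched in a few lines. The toy below is in the spirit of RCL but is not its actual C implementation: client threads ship critical sections, as closures, to one dedicated server thread that executes them sequentially, so the protected data stays local to that thread.

```python
# Toy remote-core-locking server: not the real RCL library, just the idea.
import queue
import threading

class CriticalSectionServer:
    """One dedicated thread executes all critical sections sequentially."""
    def __init__(self):
        self.requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            fn, done = self.requests.get()
            done.result = fn()   # data touched by fn stays on this thread/core
            done.set()

    def execute(self, critical_section):
        done = threading.Event()
        self.requests.put((critical_section, done))
        done.wait()              # block, as a lock acquisition would
        return done.result

counter = 0
server = CriticalSectionServer()

def increment():                 # the critical section, shipped as a closure
    global counter
    counter += 1

workers = [threading.Thread(target=lambda: [server.execute(increment) for _ in range(1000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(counter)                   # 4000: all increments were serialized
```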

Task-based multifrontal QR solver for GPU-accelerated multicore architectures

In our study we consider the three front partitioning strategies illustrated in Figure 3: fine-grain, coarse-grain and hierarchical. The fine-grain partitioning (Figure 3(a)), which is the method of choice in [18], applies a regular 1D block partitioning to fronts and is therefore mainly suited to homogeneous architectures (e.g., multicore systems). The coarse-grain partitioning (Figure 3(b)), where fine-grained panel tasks are executed on the CPU and large-grain (as large as possible) update tasks are performed on the GPU, corresponds to the algorithm developed in the MAGMA package [9] and aims at obtaining the best acceleration factor for computationally intensive tasks on the GPU. In order to keep the GPU constantly busy, a static schedule is used that overlaps GPU and CPU computation thanks to a depth-1 lookahead technique; this is achieved by splitting the trailing submatrix update into two separate tasks of, respectively, fine and coarse granularity.
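
The difference between the first two strategies can be sketched as follows; the block sizes and helper names are hypothetical, not those of the actual solver.

```python
# Illustrative front partitioning helpers (hypothetical sizes and names).
def fine_grain_blocks(ncols, block=128):
    """Regular 1D block-column partition, suited to homogeneous multicores."""
    return [(j, min(j + block, ncols)) for j in range(0, ncols, block)]

def coarse_grain_split(panel_end, ncols, panel=128):
    """Split the trailing update: a fine panel-sized task (CPU, enables the
    depth-1 lookahead) and one coarse task as large as possible (GPU)."""
    next_panel = (panel_end, min(panel_end + panel, ncols))
    remainder = (next_panel[1], ncols)
    return next_panel, remainder

print(fine_grain_blocks(512))         # [(0, 128), (128, 256), (256, 384), (384, 512)]
print(coarse_grain_split(128, 1024))  # ((128, 256), (256, 1024))
```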

Compiler techniques for scalable performance of stream programs on multicore architectures

The combined techniques effectively (i) leverage data parallelism, (ii) remove the inter-core communication requirement of data parallelism when data distribution is pre[r]

Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems

Alternatively, a modular approach can be employed. First, the numerical algorithm is written at a high level, independently of the hardware architecture, as a Directed Acyclic Graph (DAG) of tasks in which a vertex represents a task and an edge represents a dependency between tasks. A second layer is in charge of the scheduling: this layer decides when and where (on which processing unit) to execute a task. Based on the scheduling decisions, a runtime engine, in the third layer, retrieves the data necessary for the execution of a task (taking care of ensuring coherency among multiple possible copies), triggers its execution and updates the state of the DAG upon its completion. The border between the scheduling and runtime layers is somewhat fluid and ultimately depends on the design choices of the runtime system developers. The fourth layer consists of the task code optimized for the underlying architectures. In most cases, the bottom three layers need not be written by the application developer. Indeed, it is usually easy to find off the shelf a very competitive, state-of-the-art, generic scheduling algorithm (such as work-stealing [Arora et al. 2001] or Minimum Completion Time [Topcuoglu et al. 2002]) that matches the algorithmic needs well enough to exploit the targeted architecture efficiently. Otherwise, if needed, as we do in this study to handle tasks of small granularity efficiently, a new scheduling algorithm may be designed (and shared with the community, to be in turn applied to a whole class of algorithms). The runtime engine only needs to be extended once for each new architecture. Finally, in many cases, the high-level algorithm can be cast in terms of standard operations for which vendors provide optimized codes. In other common cases, available kernels only need to be slightly adapted to take into account the specificities of the algorithm; this is, for example, the case for our method, where the most computationally intensive tasks are minor variants of LAPACK routines (as described in Section 4.3).

All in all, with such a modular approach, only the high-level algorithm has to be specifically designed, which ensures high productivity. Maintainability is also ensured, since supporting new hardware only requires (in principle) third-party effort. Modern tools exist that implement the two middle layers (the scheduler and the runtime engine) and provide a programming model that allows the programmer to conveniently express a workload in the form of a DAG of tasks; we refer to these tools as runtime systems or, simply, runtimes.
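
The sketch below illustrates the programming model such runtimes expose, in the style of a Sequential Task Flow; it is a toy, not StarPU's API: tasks are submitted in program order with declared read/write sets, and the runtime infers the DAG edges from the data accesses.

```python
# Toy sequential-task-flow runtime (illustrative, not a real runtime system).
from concurrent.futures import ThreadPoolExecutor

class ToySTF:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.last_writer = {}  # data handle -> future of the last writing task
        self.readers = {}      # data handle -> futures of tasks reading it

    def submit(self, fn, reads=(), writes=()):
        # A task depends on prior writers of its inputs and outputs, and on
        # prior readers of its outputs (anti-dependencies).
        deps = [self.last_writer[h] for h in (*reads, *writes) if h in self.last_writer]
        deps += [f for h in writes for f in self.readers.get(h, [])]

        def run():
            for d in deps:
                d.result()     # wait until every dependency has completed
            return fn()

        fut = self.pool.submit(run)
        for h in writes:
            self.last_writer[h] = fut
            self.readers[h] = []
        for h in reads:
            self.readers.setdefault(h, []).append(fut)
        return fut

# The tile updates below are submitted sequentially; the runtime discovers
# that the A12 and A21 updates can run in parallel.
rt = ToySTF()
rt.submit(lambda: print("factor A11"), writes=["A11"])
rt.submit(lambda: print("update A12"), reads=["A11"], writes=["A12"])
rt.submit(lambda: print("update A21"), reads=["A11"], writes=["A21"])
rt.submit(lambda: print("update A22"), reads=["A12", "A21"], writes=["A22"])
rt.pool.shutdown(wait=True)
```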

Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures

to break existing clusters. This process is repeated until each cluster contains only one eigenvalue; then, eigenvectors can be computed. With such an approach, a good level of parallelism is exposed: each eigenvector computation is independent from the rest. However, the LAPACK version does not rely on level-3 BLAS operations, so it is more difficult to exploit parallelism. ScaLAPACK provides a parallel implementation for distributed architectures, but the routine available in the API targets the complete symmetric eigenproblem. The fastest implementation for shared-memory systems, MR3-SMP [15], was developed by the Aachen Institute for Advanced Study in Computational Engineering Science. The algorithm is expressed as a flow of sequential tasks and relies on POSIX threads; its internal runtime can schedule tasks statically or dynamically. Their experiments showed how well it outperforms the original MRRR algorithm with naive fork/join parallelization. They also explain why the implementation is scalable and how it exploits as much parallelism as possible. In addition, computational timings show that MR3-SMP is often better than

Block Wiedemann algorithm on multicore architectures

DSL Stream Programming on Multicore Architectures

StreamIt adapts the granularity and communication patterns of programs through graph transformations [17] belonging to one of three types: fusion transformations cluster adjacent filters, coarsening their granularity; fission transformations parallelize stateless filters, decreasing their granularity; reordering transformations operate on splits and joins to facilitate fission and fusion transformations. Complementary transformations have also been proposed. For example, the optimizing transformations proposed in [1] take advantage of algebraic simplifications between consecutive linear filters. On cache architectures, the fusion transformations proposed in [26] adapt filters to instruction and data cache sizes.
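
A toy rendition of the first two transformation types, on a hypothetical filter representation rather than StreamIt's internal one, may help (reordering of splits and joins is omitted):

```python
# Toy filter-graph transformations (illustrative structures, not StreamIt's).
from dataclasses import dataclass

@dataclass
class Filter:
    name: str
    stateless: bool = True

def fuse(a, b):
    """Fusion: cluster two adjacent filters, coarsening granularity."""
    return Filter(a.name + "+" + b.name, a.stateless and b.stateless)

def fission(f, ways):
    """Fission: replicate a stateless filter into data-parallel copies."""
    assert f.stateless, "only stateless filters can be fissioned"
    return [Filter(f"{f.name}#{i}") for i in range(ways)]

pipeline = [Filter("decode"), Filter("fir"), Filter("quantize", stateless=False)]
fused = fuse(pipeline[0], pipeline[1])   # coarsen: decode+fir become one filter
replicas = fission(fused, 4)             # parallelize the fused stateless stage
print([f.name for f in replicas])        # ['decode+fir#0', ..., 'decode+fir#3']
```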

Scheduling Dynamic OpenMP Applications over Multicore Architectures

over hierarchical architectures, which lowers the overall performance of parallel applications. Alternatively, a distribution that takes those affinity relations into account makes better use of cache memory and improves local memory accesses. The Affinity bubble scheduler is specifically designed to tackle irregular applications based on a divide-and-conquer scheme. To this end, we consider that each bubble contains threads and sub-bubbles that are heavily related, most of the time through data sharing. We assume that the best thread distribution is obtained by scheduling all entities contained in a bubble on the same processor, sometimes breaking the load-balancing scheme, even if a local redistribution is needed once in a while. This scheduler provides two main algorithms: one to distribute thread and bubble entities over the different processors initially, and one to rebalance work if a processor becomes idle.
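
A minimal sketch of the affinity-driven initial distribution, with made-up loads and structures rather than the scheduler's real ones: each bubble is placed as a whole on one processor, preserving affinities even when the load balance is imperfect.

```python
# Toy affinity distribution: whole bubbles are kept on one processor.
from dataclasses import dataclass, field

@dataclass
class Bubble:
    load: int                  # total work carried by the bubble's threads
    children: list = field(default_factory=list)  # sub-bubbles, kept together

def distribute(bubbles, nprocs):
    """Greedy placement: each bubble goes, whole, to the least-loaded processor."""
    placement = [[] for _ in range(nprocs)]
    loads = [0] * nprocs
    for b in sorted(bubbles, key=lambda b: -b.load):
        i = loads.index(min(loads))
        placement[i].append(b)
        loads[i] += b.load
    return placement, loads

bubbles = [Bubble(8), Bubble(5), Bubble(5), Bubble(2)]
placement, loads = distribute(bubbles, 2)
print(loads)  # [10, 10]: balanced here, and no bubble was split across processors
```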

Adapting Active Objects to Multicore Architectures

Active objects, just like actors, are mono-threaded entities and therefore suffer from the same limitations regarding local parallelism. Considering the widespread popularity of multicore processors and the trend toward increasing numbers of cores, any framework that does not fully utilize the multithreading capacities of multicore architectures will seem outdated. On the other hand, it is rather hard to deal with the application logic and, at the same time, with the clearly orthogonal task of concurrency synchronization. Even though concurrent code is supposed to improve the performance of an application, unwisely written it can introduce race conditions, which make development and testing difficult [10], [11].
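
The limitation is easy to picture with a toy mono-threaded active object (a generic sketch, not the paper's framework): all requests go through one mailbox served by a single thread, so even on a many-core machine they execute one at a time.

```python
# Toy mono-threaded active object: requests are serialized by design.
import queue
import threading

class ActiveObject:
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            method, args = self.mailbox.get()
            method(*args)             # one request at a time: no local parallelism
            self.mailbox.task_done()

    def send(self, method, *args):    # asynchronous method invocation
        self.mailbox.put((method, args))

obj = ActiveObject()
obj.send(print, "request 1")          # both requests are handled sequentially
obj.send(print, "request 2")          # by the single service thread
obj.mailbox.join()                    # wait for the mailbox to drain
```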

Automatic Mapping of Stream Programs on Multicore Architectures

while data flows between cores of the same processor correspond to data transfers in the memory hierarchy. In this paper, we present a novel approach to optimizing stream programs for hybrid architectures composed of a GPU and multicore CPUs. The approach focuses on the memory and communication performance bottlenecks of this kind of architecture. The initial stream graph is first transformed by a sequence of elementary restructurings; the guided beam-search method applied aims to reduce fork-join synchronization costs. We show that the heuristic proposed to drive these restructurings obtains results similar to those obtained by an exhaustive search. The tasks are then partitioned between CPU cores and the GPU using an existing partitioner, taking profiling information into account for a load-balanced partitioning. A new scheduling technique is finally proposed to coarsen the tasks of each partition in order to adapt to the cache sizes and communication constraints of the CPUs and the GPU. Our experiments show the importance for performance of both the synchronization cost reduction and the coarsening step, adapting the grain of parallelism to the CPUs and to the GPU.
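
For the partitioning step, a minimal sketch with invented profile numbers (the actual approach relies on an existing partitioner) shows how profiling information can drive a load-balanced CPU/GPU assignment:

```python
# Toy profiling-driven CPU/GPU partitioning (made-up timings).
profile = {  # task -> (cpu_time, gpu_time), as measured by a profiling run
    "split": (2.0, 3.0),
    "fir1": (8.0, 1.0),
    "fir2": (8.0, 1.0),
    "join": (1.5, 2.5),
}

cpu_load = gpu_load = 0.0
placement = {}
for task, (tc, tg) in sorted(profile.items(), key=lambda kv: -max(kv[1])):
    # Put the task where it yields the smaller accumulated finish time.
    if cpu_load + tc <= gpu_load + tg:
        placement[task], cpu_load = "cpu", cpu_load + tc
    else:
        placement[task], gpu_load = "gpu", gpu_load + tg

print(placement)            # FIR stages land on the GPU, glue tasks on the CPU
print(cpu_load, gpu_load)   # 3.5 2.0
```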

Dynamic Scheduling of Real-Time Tasks on Multicore Architectures

We present a new dynamic scheduler for multicore architectures. It improves on the Optimal Finish Time (OFT) scheduler introduced by Lemerre [7] by reducing preemptions. We compare our approach with other schedulers and show that our algorithm can handle more general scheduling problems.

Two approximation algorithms for bipartite matching on multicore architectures

We propose two heuristics for the bipartite matching problem that are amenable to shared-memory parallelization. The first heuristic is very intriguing from a parallelization perspective: it has no significant algorithmic synchronization overhead, and no conflict resolution is needed across threads. We show that this heuristic has an approximation ratio of around 0.632 under some common conditions. The second heuristic is designed to obtain a larger matching by employing the well-known Karp-Sipser heuristic on a judiciously chosen subgraph of the original graph. We show that the Karp-Sipser heuristic always finds a maximum-cardinality matching in the chosen subgraph. Although the Karp-Sipser heuristic is hard to parallelize for general graphs, we exploit the structure of the selected subgraphs to propose a specialized implementation that demonstrates very good scalability. We prove that this second heuristic has an approximation guarantee of around 0.866 under the same conditions as the first algorithm. We discuss parallel implementations of the proposed heuristics on a multicore architecture. Experimental results demonstrating speed-ups and verifying the theoretical results in practice are provided.
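
For reference, here is a compact sequential rendition of the well-known Karp-Sipser heuristic itself (the paper's contribution, a scalable parallel variant on judiciously chosen subgraphs, is not reproduced here): degree-1 vertices are matched first, because such matches never compromise optimality.

```python
# Sequential Karp-Sipser sketch (illustrative; not the parallel variant).
import random

def karp_sipser(adj):
    """adj: vertex -> set of neighbours, for both sides of a bipartite graph."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    matching = {}
    while any(adj.values()):
        # Prefer a degree-1 vertex: matching its unique edge is always safe.
        ones = [u for u, vs in adj.items() if len(vs) == 1]
        u = ones[0] if ones else random.choice([u for u, vs in adj.items() if vs])
        v = next(iter(adj[u]))
        matching[u], matching[v] = v, u
        for w in (u, v):                          # drop both matched vertices
            for x in adj.pop(w, set()):
                adj[x].discard(w)
    return matching

g = {"r1": {"c1"}, "r2": {"c1", "c2"}, "c1": {"r1", "r2"}, "c2": {"r2"}}
print(karp_sipser(g))   # r1-c1 and r2-c2: a maximum matching on this graph
```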

A Benchmark for Multicore Machines

Web servers (e.g., Apache), in which requests are implemented as threads, are one example of multithreaded applications. However, in servers, threads basically do not communicate and are quite autonomous and independent computing entities. In fact, servers do not really exploit the shared memory that is at the basis of multicore architectures. Servers are thus only partial benchmarks for multicore machines.

Real-time scheduling of transactions in multicore systems

Addressing these main questions could help us formalize the introduction of real-time scheduling of transactions within transactional memory. Furthermore, with the advent of transactional multicores, the real-time scheduler of transactions could be integrated in hardware. It is then important to study the interactions between the schedulers of tasks and of transactions. Experiments similar to those presented in [5, 4] should be conducted in order to determine which real-time policy among (G-P)-EDF and PD2

Hierarchical hybrid sparse linear solver for multicore platforms

Whereas PDSLin and ShyLU can be virtually turned into a pure direct method if no dropping is performed, and whereas HIPS may achieve robustness by relying on a multilevel scheme, additive Schwarz preconditioners are extremely local. As a result, their computation is potentially much more parallel (and scalable), but their application may lead to a dramatic increase in the number of iterations (or even to non-convergence) if the number of subdomains becomes too large. The objective of the present study is to assess whether a two-level parallel approach allows additive Schwarz preconditioning for Schur complement methods to achieve an efficient trade-off between numerical robustness and performance on modern hierarchical multicore platforms. Following the parallelization scheme adopted in [25], we have designed an MPI+thread approach (Section 3) to cope with these hardware trends. Contrary to the MPI+MPI approach investigated in [14],
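
The trade-off described above can be seen on a toy problem. The sketch below (a 1-D Laplacian with non-overlapping blocks standing in for subdomains; none of this reflects the actual solver) shows why additive Schwarz is so parallel, every local solve being independent, and why its quality tends to degrade as subdomains multiply.

```python
# Toy additive Schwarz preconditioner on a 1-D Laplacian (illustrative only).
import numpy as np

n = 12
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1-D Laplacian

def schwarz_apply(r, nsub):
    """Each subdomain solve is independent: ideal for parallel execution."""
    z = np.zeros_like(r)
    for idx in np.array_split(np.arange(n), nsub):
        Aii = A[np.ix_(idx, idx)]          # local subdomain matrix
        z[idx] += np.linalg.solve(Aii, r[idx])
    return z

rng = np.random.default_rng(0)
r = rng.random(n)
for nsub in (1, 3, 6):   # more subdomains: more parallelism, weaker coupling
    residual = np.linalg.norm(r - A @ schwarz_apply(r, nsub))
    print(nsub, residual)  # residual typically grows with nsub: more iterations
```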
