2 EUCLIDEAN DISTANCE MATRIX
Standard implementations of the Euclidean distance matrix split the original matrix into tiles (or chunks/blocks) of columns (or rows). Two nested loops are then applied to fill the distance matrix. Since the distance matrix is symmetric with a zero diagonal, only the lower or upper triangular part is computed. We propose two algorithms for the distance matrix computation on huge datasets: one for shared-memory/GPU computers and another for distributed-memory computers. We use the properties of triangular numbers to collapse the two nested loops over the blocks, in contrast to [7], who use an external pattern (Map-Reduce) to avoid nested loops. Our strategy leads to efficient implementations for shared-memory computers and GPUs. The distributed-memory algorithm is based on circular shifts using a 1D periodic topology. The distance matrix is computed iteratively, each processor filling one block per iteration. We show that the number of iterations required is almost the number of tiles. The proposed paper is an extension of [1].
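The loop-collapsing idea can be sketched as follows: the k-th lower-triangular tile is recovered from a single flat index via the inverse triangular number, so one flat loop replaces the two nested loops over tile blocks. This is a minimal illustration, not the paper's implementation:

```python
import math

def tile_pair(k: int) -> tuple[int, int]:
    """Map a flat index k to the (row, col) of the k-th tile in the
    lower triangle (col <= row), using the inverse triangular number.
    This collapses the two nested loops over tile blocks into one flat
    loop, which is easier to split evenly among threads or GPU blocks."""
    row = (math.isqrt(8 * k + 1) - 1) // 2
    col = k - row * (row + 1) // 2
    return row, col

# With t tile rows there are t*(t+1)//2 lower-triangular tiles;
# a single parallel loop over k covers each of them exactly once.
t = 4
pairs = [tile_pair(k) for k in range(t * (t + 1) // 2)]
```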

Table 1: Comparison of parallel computational models
The synchrony assumption of the model is indicated in the row labeled synch. Lock-step indicates that the processors are fully synchronized at each step, without accounting for synchronization. Bulk-synchrony indicates that there can be asynchronous operations between synchronization barriers. The row labeled memory shows how the model views the memory of the parallel computer: sh. indicates globally accessible shared memory, dist. stands for distributed memory, and priv. is an abstraction for the case where the only assumption is that each processor has access to private (local) memory. In the last variant the whole memory could be either distributed or shared. The row labeled commun. shows the type of interprocessor communication assumed by the model. Shared memory (SM) indicates that communication is effected by reading from and writing to […]

For the performance analysis we used the Multi-Processing Environment (MPE) library and the graphical visualization tool Jumpshot-4 [16]. The experiments were performed on a cluster of sixteen SMP servers (two 64-bit Xeon 3.2 GHz CPUs, 2 MB L2 cache, 2 GB of RAM) running Linux 2.6 and connected by a 10 Gb/s InfiniBand network. The GCC 3.4.6 compiler was used.
Fig. 5(a) presents the obtained mean speedup. It is clearly visible that the parallel algorithm significantly decreases the computation time. In practice, this means that the typical time needed to simulate an organ with 50,000 MFUs and consequently 300,000 vessel segments can be reduced from 21 hours (with a single-processor machine) to 2 hours (with 16 processors).
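As a quick sanity check on these figures, the implied speedup and parallel efficiency follow from the standard definitions (speedup = T1/Tp, efficiency = speedup/p):

```python
serial_hours = 21.0   # reported single-processor time
parallel_hours = 2.0  # reported time on 16 processors
p = 16

speedup = serial_hours / parallel_hours  # T1 / Tp
efficiency = speedup / p                 # speedup per processor

print(f"speedup ~ {speedup:.1f}x, efficiency ~ {efficiency:.0%}")
```

So the reported times correspond to roughly a 10.5x speedup, i.e., about 66% parallel efficiency on 16 processors.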

As explained in the introduction, we have previously proposed a way to restrict the potentially large memory needed for the traversal of a task graph by adding edges that correspond to fictitious dependencies [29, 30]. Our method consists in first computing the worst achievable memory of any parallel traversal, using either a linear program or a min-flow algorithm. Then, if this computation detects a potential situation where the memory exceeds what is available on the platform, we add a fictitious edge in order to make this situation impossible to reach in the new graph. This study is inspired by the work of Sbîrlea et al. [35]. In that study, the authors focus on a different model, in which all data have the same size (as for register allocation). They target smaller-grain tasks in the Concurrent Collections (CnC) programming model [9], a stream/dataflow programming language. Their objective is, just as ours, to schedule a DAG of tasks using a limited memory. To this purpose, they associate a color with each memory slot and then build a coloring of the data, in which two data items with the same color cannot coexist. If the number of colors is not sufficient, additional dependence edges are introduced to prevent too many data items from coexisting. These additional edges respect a pre-computed sequential schedule to ensure acyclicity. An extension to support data of different sizes is proposed, which conceptually allocates several colors to a single data item, but is only suitable for a few distinct sizes.
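The underlying memory model can be illustrated with a much simpler sequential version: each task allocates its output, and an input is freed once its last consumer has run. The names and the example graph below are hypothetical; the paper's actual analysis bounds the worst case over parallel traversals via a linear program or a min-flow:

```python
def peak_memory(order, out_size, consumers):
    """Peak memory of a sequential traversal of a task DAG: each task
    allocates its output; a task's output is freed once its last
    consumer has executed. Illustrative model only."""
    remaining = {t: len(cs) for t, cs in consumers.items()}
    live = 0
    peak = 0
    for task in order:
        live += out_size[task]
        peak = max(peak, live)
        # free outputs whose last consumer just ran
        for producer, cs in consumers.items():
            if task in cs:
                remaining[producer] -= 1
                if remaining[producer] == 0:
                    live -= out_size[producer]
    return peak

# Fork-join example: a -> b, a -> c, b -> d, c -> d
out_size = {"a": 4, "b": 2, "c": 2, "d": 1}
consumers = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(peak_memory(["a", "b", "c", "d"], out_size, consumers))
```

If this peak exceeds the available memory, a fictitious edge (e.g., forcing c to wait for d in some other graph) would remove the offending schedule from the set of reachable traversals.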

We propose efficient parallel algorithms and implementations of LU factorization over a finite field on shared-memory architectures. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity can be dominated by modular reductions. It is therefore mandatory to delay these reductions as much as possible while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. There, trade-offs must be made between block sizes well suited to those fast variants and load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking, and Crout variants, and slab and tile recursive.
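The first point (delaying modular reductions) can be sketched on a dot product: instead of reducing after every multiply-accumulate, accumulate in full precision and reduce once. The prime below is an arbitrary choice for illustration; in a real fixed-width kernel the accumulator's overflow bound limits how many terms may be accumulated between reductions:

```python
p = 131071  # example prime (2^17 - 1), an assumption for illustration

def dot_mod_naive(u, v):
    """Reduce after every multiply-accumulate: one % per term."""
    s = 0
    for a, b in zip(u, v):
        s = (s + a * b) % p
    return s

def dot_mod_delayed(u, v):
    """Accumulate in full integer precision and reduce once at the end.
    With fixed-width integers, the number of accumulated terms must be
    bounded so the accumulator cannot overflow before the reduction."""
    return sum(a * b for a, b in zip(u, v)) % p

u = [3, 5, 7, 11]
v = [2, 4, 6, 8]
assert dot_mod_naive(u, v) == dot_mod_delayed(u, v)
```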

independence of the right-hand sides.
In a distributed-memory environment, one possible workaround would be to run multiple instances of the parallel linear solver concurrently, each instance using the whole set of processes to solve for a block of right-hand sides (because all of the distributed factors might have to be accessed). This is not possible using MPI, given that the factors are distributed. Another potential workaround would be to replicate the factors on all processes, or to write the factors "shared" on disk(s), and to simulate a shared-memory paradigm by launching sequential instances in parallel, each of them accessing the distributed factors. Unfortunately, this is not feasible for any distributed sparse solver that we know of. Furthermore, such a solution would lose the benefit of our parallel factorization, and the cost of accessing the distributed factors (which might be stored on local disks) would be prohibitive. The blocks of entries to be computed therefore have to be processed one at a time.

7 Conclusion
In this paper, we have presented a methodology to adapt an existing, fully-featured, distributed-memory code to shared-memory architectures. We studied the gains of using efficient multithreaded libraries (in our case, optimized BLAS) in combination with message passing, then introduced multithreading in our main computation kernels thanks to OpenMP. We then exploited higher-level parallelism, taking advantage of the characteristics of the task graph arising from the problems. Because the task graph is a tree, serial kernels are first used to process independent subtrees in parallel, and multithreaded kernels are then applied to process the nodes at the top of the tree. We proposed an efficient algorithm based on a performance model of individual tasks to determine when to switch from tree parallelism to node parallelism. Because this switch implies a synchronization, we showed how it is possible to reduce the associated cost by dynamically re-assigning idle CPU cores to active tasks. We note that this last approach depends on the interaction between the underlying compilers and BLAS libraries and requires careful configuration. We also considered NUMA environments, showed how memory allocation policies affect performance, and how memory affinity can be efficiently exploited in practice.
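The tree-then-node strategy described above can be sketched schematically; all names and kernels below are hypothetical stand-ins, not the authors' code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task tree: the root's value is computed from its subtrees.
tree = {"root": ["s1", "s2", "s3"], "s1": [], "s2": [], "s3": []}

def serial_kernel(node):
    """Stand-in for a sequential factorization kernel on one subtree."""
    return len(node)  # dummy work

def threaded_kernel(children_results):
    """Stand-in for a multithreaded (e.g., BLAS-parallel) kernel applied
    to nodes above the tree/node-parallelism switch point."""
    return sum(children_results)

# Phase 1: tree parallelism -- independent subtrees, serial kernels.
with ThreadPoolExecutor(max_workers=4) as pool:
    subtree_results = list(pool.map(serial_kernel, tree["root"]))

# Phase 2 (after the synchronization implied by the switch):
# node parallelism -- one multithreaded kernel for the top of the tree.
result = threaded_kernel(subtree_results)
```

The synchronization cost mentioned in the text is the barrier between the two phases: every subtree worker must finish before the multithreaded kernel starts.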

jeanlouis.leroux@orange-ftgroup.com
Abstract
To cope quickly with all types of failure risks (link, node, and Shared Risk Link Group (SRLG)), each router detecting a failure on an outgoing interface locally activates all the backup paths protecting the primary paths which traverse the failed interface. Observing that upon an SRLG failure some active backup paths are inoperative and do not really participate in the recovery (since they do not receive any traffic flow), we propose a new algorithm (the SRLG structure exploitation algorithm, or SSEA) that exploits the SRLG structures to enhance admission control and improve the protection rate.

CoMMTM requires simple program changes to exploit commutativity: defining a reducible state to avoid conflicts among commutative operations, using labeled memory accesses to p[…]

as MPI/OpenMP (message passing between nodes and parallel programming within nodes) and task-based models such as OpenCL, StarPU and OmpSs that encapsulate the user code into a specific framework (kernels, tasks, dataflow). These systems have been ported to different processor architectures, even to FPGAs for OpenCL, addressing the heterogeneity of the platforms. Unified distributed-memory systems can be built on top of heterogeneous platforms using, for example, cluster implementations of OpenMP and PGAS implementations (provided they do not rely on hardware mechanisms such as RDMA). In this work, we explore the possibility of deploying a full software-distributed shared-memory system to allow MPMD programming on micro-servers (a distributed architecture with heterogeneous nodes). This is quite new for such systems, for two reasons. First, there is a lack of specification and formalization with respect to hardware shared memory, as well as a potential scaling problem. Second, software shared memory, while popular with computing grids and peer-to-peer systems, is seen as a performance killer at the processor scale. We think that micro-servers stand somewhere in between: from multi-processors they inherit the fast communication links, and from computing grids they inherit the heterogeneity, the dynamicity of resources, and some scaling issues. In this work, we propose a hybrid approach where data coherency is managed by software between nodes and by regular hardware within the nodes. We have designed and implemented a full software-distributed shared memory (S-DSM) on top of a message-passing runtime. This S-DSM has been deployed over the RECS3 heterogeneous micro-server, running a parallel image processing application. Results show the interplay between the design of the user application, the data coherence protocol, and the S-DSM topology and mapping.
The paper is organized as follows: Section 2 describes some micro-server architectures and the way they are used. Section 3 presents the S-DSM. Section 4 describes the experiments on both homogeneous and heterogeneous architectures. Section 5 gives some references to previous works. Finally, Section 6 concludes this paper and brings new perspectives.

Is this common physiological aptitude sufficient to give rise to and justify the feeling of a shared memory? Obviously not. First, this requires that each of us be aware of his/her sensation. We manage, on a permanent basis, in the shadows of our memory, and even in its darkest part, countless messages (e.g., interoceptive or proprioceptive) which, as long as they remain unconscious, have no chance of generating a sense of sharing. Without this consciousness which, during the evolutionary process, might have first emerged from primordial emotions (Denton, 2005) or from the ability of our brain to build first rough, and then more and more complex, mental scenes (Edelman and Tononi, 2000), there can be no shared mental representation. Secondly, just like the gravedigger, I must be conscious of my consciousness in order to be able, possibly, to talk about it, a skill which "signs" the identity of our species (Candau, 2004b). Thirdly and fourthly, each of us must be aware that: i) the other is aware of her/his sensation, and ii) this other, the sensory partner, is himself aware that each of us is aware that the other is aware of his sensation. I shall stop here, so as not to fall into a pointless mise en abîme of truly mutual knowledge, and I shall limit myself to stressing the importance of this metarepresentational level of the feelings of sharing. This level also conditions the possibility of claiming a shared memory. In this case, this metarepresentational level is that of metamemory.

– Requesting more privileged access to the memory of both VMs to perform the message transfers. This avoids unnecessary copies to a shared buffer but requires a costly transition to the hypervisor.
This is once again a tradeoff between latency and bandwidth. To provide optimal performance, the most appropriate technique has to be picked dynamically depending on the size of the message to transmit. Additionally, VM isolation must be taken into consideration. A VM should not be able to read or corrupt another VM's memory through our communication interface, except for the specific communication buffers to which it has been granted access.
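The dynamic pick can be expressed as a simple size-based dispatch. The threshold and mechanism names below are assumptions for illustration; in practice the crossover point would be measured on the target platform:

```python
# Hypothetical crossover: below it, copying through a shared buffer
# (low per-message latency) wins; above it, a hypervisor-assisted
# direct transfer (higher bandwidth, costly mode transition) wins.
DIRECT_COPY_THRESHOLD = 64 * 1024  # bytes, an assumed value

def pick_transport(message_size: int) -> str:
    """Select the inter-VM transfer mechanism based on message size."""
    if message_size < DIRECT_COPY_THRESHOLD:
        return "shared-buffer copy"
    return "hypervisor-mediated direct copy"

assert pick_transport(512) == "shared-buffer copy"
assert pick_transport(1 << 20) == "hypervisor-mediated direct copy"
```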

distributed computing shared memory models
Sergio Rajsbaum* Michel Raynal**
Abstract: Due to the advent of multicore machines, shared-memory distributed computing models taking into account asynchrony and process crashes are becoming more and more important. This paper visits some of the models for these systems and analyses their properties from a computability point of view. Among them, the snapshot model and the iterated model are particularly investigated. The paper also visits several approaches that have been proposed to model crash failures. Among them, the wait-free case, where any number of processes can crash, is fundamental. The paper also considers models where up to t processes can crash, and where the crashes are not independent. The aim of this survey is to help the reader better understand recent advances on what is known about the power and limits of distributed computing shared-memory models and their underlying mathematics.


The papers [12, 32, 54] show, roughly speaking, that the executions of any wait-free algorithm in the asynchronous read/write shared-memory model with n processes, starting from one input configuration, can be represented by an (n − 1)-dimensional solid object with no holes. Furthermore, this representation implies a topological characterization of the problems that can be solved in a wait-free manner [32]. By now there are many papers extending this characterization to other models, or using it to derive algorithms and to prove impossibility results. There are also a few tutorials, such as [29, 31, 47], that can help getting into the area. We recall some basic notions next (see the tutorials for a more detailed and precise exposition). Simplexes and complexes. A discrete geometric object can be represented by a generalization of a graph, known in topology as a complex. Recall that a graph consists of a base set of elements called vertices, and two-element sets of vertices called edges. More generally, a k-simplex is a set of vertices of size k + 1. Thus, we may think of a vertex as a 0-simplex and an edge as a 1-simplex. A complex is a set of simplexes closed under containment. As with a graph, it is often convenient to embed a complex in Euclidean space. Then, 1-dimensional simplexes are represented as lines, 2-dimensional simplexes as solid triangles, and 3-dimensional simplexes as solid tetrahedra. Figure 16 depicts a 1-dimensional complex and a 2-dimensional complex.
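The combinatorial definitions above translate directly into code: a k-simplex is a set of k + 1 vertices, and a complex is closed under containment. A minimal sketch:

```python
from itertools import combinations

def closure(maximal_simplexes):
    """Build a simplicial complex from its maximal simplexes: the set
    of simplexes closed under containment (every non-empty subset of a
    simplex is included). A k-simplex is a frozenset of k + 1 vertices."""
    complex_ = set()
    for s in maximal_simplexes:
        s = frozenset(s)
        for k in range(1, len(s) + 1):
            complex_.update(frozenset(c) for c in combinations(s, k))
    return complex_

# One solid triangle (2-simplex) {a, b, c} yields
# 3 vertices + 3 edges + 1 triangle = 7 simplexes.
K = closure([{"a", "b", "c"}])
assert len(K) == 7
assert frozenset({"a", "b"}) in K  # each edge (1-simplex) is present
```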

I. INTRODUCTION
Parallel workloads are often described by Directed Acyclic task Graphs, where nodes represent tasks and edges represent dependencies between tasks. The interest of this formalism is twofold: it has been widely studied in the theoretical scheduling literature [11], and dynamic runtime schedulers (e.g., StarPU [2], XKAAPI [12], StarSs [19], and PaRSEC [5]) are increasingly popular for scheduling such graphs on modern computing platforms, as they alleviate the difficulty of using heterogeneous computing platforms. Concerning task graph scheduling, one of the main objectives considered in the literature consists in minimizing the makespan, or total completion time. However, with the increase in the size of the data to be processed, the memory footprint of the application can have a dramatic impact on the algorithm execution time and thus needs to be optimized [20], [1]. This is best exemplified by an application which, depending on the way it is scheduled, will either fit in memory or require the use of swap mechanisms or out-of-core execution. There are few existing studies that take memory footprint into account when scheduling task graphs, as detailed below in the related work section.

A stack-based EM2 architecture can choose to migrate only a portion of the stack cache, with enough data to continue execution on the remote core while data accesses are being made there, and enough space to carry back any results without overflows, and flush the rest to the stack memory prior to migration. Since the migrated depth can be different for every access, determining the best per-migration depth requires a decision algorithm. Indeed, to evaluate such schemes, we can use the same analytical model described for the EM2-RA case and a similar optimization formulation to compute the optimal stack depths (instead of the binary migrate-vs.-RA decision, the algorithm considers the various stack depths) and compare them against a given depth-decision scheme.
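The shape of such a depth-decision algorithm can be sketched as an argmin over candidate depths under a cost model. The cost function below is entirely hypothetical; the paper uses its own analytical model of migration and remote-access costs:

```python
def migration_cost(depth: int, accesses: int) -> float:
    """Toy cost model (assumed, for illustration): moving more stack
    entries costs transfer time, but uncovered accesses pay a higher
    remote-access penalty."""
    transfer = 2.0 * depth               # entries moved there and back
    uncovered = max(0, accesses - depth)  # accesses the cache misses
    return transfer + 5.0 * uncovered

def best_depth(accesses: int, max_depth: int) -> int:
    """Pick the per-migration stack depth minimizing the modeled cost."""
    return min(range(max_depth + 1),
               key=lambda d: migration_cost(d, accesses))

# With 8 upcoming remote accesses and room for 16 entries, migrating
# just enough of the stack is optimal under this toy model.
assert best_depth(8, 16) == 8
```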

Broadly, understanding the communication and storage costs of consistent distributed shared-memory systems is an open and ripe research area. In this paper, we considered a point-to-point reliable message-passing system where the cost of a message is equal to the size of the values it contains. In our model, an open question is whether the shared-memory algorithms of this paper are optimal; such a study will need the construction of lower bounds on the costs incurred. More generally, it is of relevance, both to theory and practice, to explore the performance of other models for message-passing architectures such as wireless systems, packet erasure channels, and wireline communication networks.

Fig. 1: Event coding in shared memory. Application to the parallel processing of tiles.
As a motivating example, we consider a parallel HPC application as illustrated in Figure 1. The purpose is to calculate the sea level for each time step by applying a wave propagation model. After each iteration, some specific places on the map are monitored to detect whether there is a threat to the population. The base map is represented as a set of chunks in the shared memory, each chunk covering a square surface (a tile). Several threads navigate the chunks to update values. Some other threads monitor the critical chunks (represented in red). One realistic constraint is that it is not possible to modify the HPC code to manage the critical aspect of the calculation. Instead, we expect a smooth and non-intrusive integration of the critical code with respect to the HPC code.
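The setting can be sketched in a few lines; every name, threshold, and tile coordinate below is a hypothetical stand-in, and the key point is that the monitor only reads shared state and never touches the update code:

```python
from threading import Lock

GRID = 4
chunks = [[0.0] * GRID for _ in range(GRID)]  # the base map, as tiles
lock = Lock()
CRITICAL = {(0, 3), (2, 1)}  # monitored tiles (the "red" chunks)
THREAT_LEVEL = 10.0          # assumed alert threshold

def update_chunk(i, j, sea_level):
    """HPC side: one wave-propagation step writes a tile's sea level."""
    with lock:
        chunks[i][j] = sea_level

def check_critical():
    """Monitor side: read-only scan of the critical tiles only."""
    with lock:
        return [(i, j) for (i, j) in CRITICAL if chunks[i][j] > THREAT_LEVEL]

update_chunk(0, 3, 12.5)  # critical tile exceeds the threshold
update_chunk(1, 1, 12.5)  # not monitored, so no alert
assert check_critical() == [(0, 3)]
```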

1 Introduction
Since the late 1970s, emulation of shared-memory systems in distributed message-passing environments has been an active area of research [2–8, 12–18, 24, 29, 30]. The traditional approach to building redundancy for distributed systems in the context of shared-memory emulation is replication. In their seminal paper [7], Attiya, Bar-Noy, and Dolev presented a replication-based algorithm for emulating shared memory that achieves atomic consistency [19, 20]. In this paper we consider a simple multi-writer generalization of their algorithm, which we call the ABD algorithm. This algorithm uses a quorum-based replication scheme [31], combined with read and write protocols, to ensure that the emulated object is atomic [20] (linearizable [19]) and to ensure liveness, specifically, that each operation terminates provided that at most ⌈(N − 1)/2⌉ server nodes fail. A critical step in ensuring atomicity in ABD is the propagate phase of the read protocol, where the readers write back the value they read to a subset of the server nodes. Since the read and write protocols require multiple communication phases in which entire replicas are sent, this algorithm has a high communication cost. In [14], Fan and Lynch introduced a directory-based replication algorithm known as the LDR algorithm that, like [7], emulates atomic shared memory in the message-passing model; however, unlike [7], its read protocol is required to write only some metadata information to the directory, rather than the value read. In applications where the data being replicated is much larger than the metadata, LDR is less costly than ABD in terms of communication costs.
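The quorum intersection and propagate phase can be sketched schematically. This is a simplified single-writer toy (no concurrency, no real message passing), not the ABD protocol itself: a write tags the value with a timestamp and stores it on a majority; a read collects a majority, picks the highest timestamp, and writes it back so later reads cannot observe an older value:

```python
N = 5
servers = [{"ts": 0, "val": None} for _ in range(N)]
MAJORITY = N // 2 + 1

def write(ts, val, reachable):
    """Store the timestamped value on a majority of reachable servers."""
    for i in reachable[:MAJORITY]:
        servers[i] = {"ts": ts, "val": val}

def read(reachable):
    """Collect a majority, take the freshest reply, then propagate it."""
    replies = [servers[i] for i in reachable[:MAJORITY]]
    latest = max(replies, key=lambda r: r["ts"])
    # propagate phase: write the value read back to a majority
    write(latest["ts"], latest["val"], reachable)
    return latest["val"]

write(1, "x=42", reachable=[0, 1, 2])
# A read through a different majority still sees the latest write,
# because any two majorities of the N servers intersect.
assert read(reachable=[2, 3, 4]) == "x=42"
```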
