2 EUCLIDEAN DISTANCE MATRIX
Standard implementations of the Euclidean distance matrix split the original matrix into tiles (or chunks/blocks) of columns (or rows). Two nested loops are then applied to fill the distance matrix. Since the distance matrix is symmetric with a zero diagonal, only the lower or upper triangular part is computed. We propose two algorithms for the distance matrix computation on huge datasets: one for shared-memory/GPU computers and another for distributed-memory computers. We use the properties of triangular numbers to collapse the two nested loops over the blocks, in contrast to , who use an external pattern (Map-Reduce) to avoid nested loops. Our strategy leads to efficient implementations for shared-memory computers and GPUs. The distributed-memory algorithm is based on circular shifts using a 1D periodic topology. The distance matrix is computed iteratively, each processor filling one block per iteration. We show that the number of iterations required is roughly the number of tiles. The proposed paper is an extension to .
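As an illustration of how triangular-number properties can collapse the two nested tile loops into a single flat loop, here is a minimal Python sketch (function names are ours, not from the paper):

```python
from math import isqrt

def k_to_ij(k):
    """Map a flat index k to tile coordinates (i, j) with j <= i,
    inverting the triangular number T(i) = i*(i+1)//2."""
    i = (isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j

def collapsed_tile_pairs(n_tiles):
    """Single loop over all lower-triangular tile pairs (diagonal included)."""
    total = n_tiles * (n_tiles + 1) // 2
    return [k_to_ij(k) for k in range(total)]
```

The single range `0 .. n(n+1)/2 - 1` can then be split evenly among threads or GPU blocks, which is what makes the collapsed loop easier to balance than two nested triangular loops.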
Table 1: Comparison of parallel computational models
The synchrony assumption of the model is indicated in the row labeled synch. Lock-step indicates that the processors are fully synchronized at each step, without accounting for synchronization costs. Bulk-synchrony indicates that there can be asynchronous operations between synchronization barriers. The row labeled memory shows how the model views the memory of the parallel computer: sh. indicates globally accessible shared memory, dist. stands for distributed memory, and priv. is an abstraction for the case where the only assumption is that each processor has access to private (local) memory. In the last variant the whole memory could be either distributed or shared. The row labeled commun. shows the type of interprocessor communication assumed by the model. Shared memory (SM) indicates that communication is effected by writing to and reading from the shared memory.
For the performance analysis we used the Multi-Processing Environment (MPE) library and the graphical visualization tool Jumpshot-4 [16]. The experiments were performed on a cluster of sixteen SMP servers (two 64-bit Xeon 3.2GHz CPUs, 2MB L2 cache, 2GB of RAM) running Linux 2.6 and connected by a 10Gb/s InfiniBand network. The GCC 3.4.6 compiler was used.
Fig. 5(a) presents the obtained mean speedup. The parallel algorithm clearly and significantly decreases the computation time. In practice, this means that the typical time needed to simulate an organ with 50,000 MFUs, and consequently 300,000 vessel segments, can be reduced from 21 hours (on a single-processor machine) to 2 hours (with 16 processors).
As explained in the introduction, we have previously proposed a way to restrict the potentially large memory needed for the traversal of a task graph by adding edges that correspond to fictitious dependencies [29, 30]. Our method consists in first computing the worst achievable memory of any parallel traversal, using either a linear program or a min-flow algorithm. Then, if this computation detects a potential situation in which the memory exceeds what is available on the platform, we add a fictitious edge in order to make this situation unreachable in the new graph. This study is inspired by the work of Sbîrlea et al. In that study, the authors focus on a different model, in which all data have the same size (as for register allocation). They target smaller-grain tasks in the Concurrent Collections (CnC) programming model, a stream/dataflow programming language. Their objective is, just as ours, to schedule a DAG of tasks using a limited memory. To this purpose, they associate a color with each memory slot and then build a coloring of the data, in which two data items with the same color cannot coexist. If the number of colors is not sufficient, additional dependence edges are introduced to prevent too many data items from coexisting. These additional edges respect a pre-computed sequential schedule to ensure acyclicity. An extension to support data of different sizes is proposed, which conceptually allocates several colors to a single data item, but it is only suitable for a few distinct sizes.
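The quantity such methods reason about can be illustrated by a much simplified sketch: the peak memory footprint of one given traversal (the method above bounds the worst memory over all parallel traversals; here we only replay a single sequential order, with hypothetical task and size names):

```python
def peak_memory(order, out_size, inputs):
    """Peak live-data footprint of a sequential traversal `order`.
    out_size[t] is the size of t's output; inputs[t] lists the tasks
    whose outputs t reads. An output is freed as soon as all of its
    consumers have executed (illustrative model)."""
    # Count the consumers of every task's output.
    remaining = {t: 0 for t in out_size}
    for t in order:
        for p in inputs[t]:
            remaining[p] += 1
    live = peak = 0
    for t in order:
        live += out_size[t]          # allocate t's output
        peak = max(peak, live)
        for p in inputs[t]:          # release inputs no longer needed
            remaining[p] -= 1
            if remaining[p] == 0:
                live -= out_size[p]
    return peak
```

If `peak_memory` of the worst reachable order exceeds the platform memory, a fictitious edge serializing two memory-hungry tasks lowers that worst case, at the price of reduced parallelism.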
We propose efficient parallel algorithms and implementations of LU factorization over a finite field on shared-memory architectures. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity can be dominated by modular reductions. It is therefore mandatory to delay these reductions as much as possible while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. Here, trade-offs must be made between block sizes well suited to these fast variants and block sizes well suited to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive.
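The first point, delaying modular reductions, can be sketched as follows (illustrative Python with a toy prime; in a C implementation one would bound the accumulator by len(u)·(p−1)² against machine-word overflow):

```python
P = 131071  # a word-size prime (2**17 - 1), chosen purely for illustration

def dot_mod_naive(u, v):
    """Reduce after every product: one modular reduction per term."""
    s = 0
    for a, b in zip(u, v):
        s = (s + a * b) % P
    return s

def dot_mod_delayed(u, v):
    """Accumulate exactly, reduce once at the end. Correct as long as the
    accumulator cannot overflow; Python's big integers make this always
    true, which lets the sketch focus on the reduction count."""
    s = 0
    for a, b in zip(u, v):
        s += a * b
    return s % P
```

Both return the same residue, but the delayed variant performs a single reduction per dot product instead of one per multiplication, which is where the arithmetic savings come from.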
independence of the right-hand sides.
In a distributed-memory environment, one possible workaround would be to run several instances of the linear solver in parallel, each instance using the whole set of processes to solve for a block of right-hand sides (because all of the distributed factors might have to be accessed). This is not possible using MPI, given that the factors are distributed. Another potential workaround would be to replicate the factors on all processes, or to write the factors “shared” on disk(s) and simulate a shared-memory paradigm by launching sequential instances in parallel, each of them accessing the distributed factors. Unfortunately, this is not feasible for any distributed sparse solver that we know of. Furthermore, such a solution would lose the benefit of our parallel factorization, and the cost of accessing the distributed factors (which might be stored on local disks) would be prohibitive. The blocks of entries to be computed thus have to be processed one at a time.
In this paper, we have presented a methodology to adapt an existing, fully-featured, distributed-memory code to shared-memory architectures. We studied the gains of using efficient multithreaded libraries (in our case, optimized BLAS) in combination with message-passing, then introduced multithreading in our main computation kernels thanks to OpenMP. We then exploited higher-level parallelism, taking advantage of the characteristics of the task graph arising from the problems. Because the task graph is a tree, serial kernels are first used to process independent subtrees in parallel, and multithreaded kernels are then applied to process the nodes at the top of the tree. We proposed an efficient algorithm based on a performance model of individual tasks to determine when to switch from tree parallelism to node parallelism. Because this switch implies a synchronization, we showed how to reduce the associated cost by dynamically re-assigning idle CPU cores to active tasks. We note that this last approach depends on the interaction between the underlying compilers and BLAS libraries and requires careful configuration. We also considered NUMA environments, showed how memory allocation policies affect performance, and how memory affinity can be efficiently exploited in practice.
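The two-phase scheme, tree parallelism below a cut followed by node parallelism above it, can be sketched as follows (an illustrative Python model, not the actual code; the kernels and the cut set are supplied by the caller):

```python
from concurrent.futures import ThreadPoolExecutor

def process_tree(children, root, cut, serial_kernel, mt_kernel):
    """Two-phase traversal: subtrees rooted at nodes in `cut` are
    processed concurrently, each with a serial kernel; the remaining
    top-of-tree nodes are then processed one at a time with a
    (conceptually multithreaded) kernel, in post-order."""
    def subtree(n):                       # serial post-order walk
        for c in children.get(n, []):
            subtree(c)
        serial_kernel(n)

    with ThreadPoolExecutor() as pool:    # phase 1: tree parallelism
        list(pool.map(subtree, cut))

    def top(n):                           # phase 2: nodes above the cut
        if n in cut:
            return
        for c in children.get(n, []):
            top(c)
        mt_kernel(n)

    top(root)
```

The implicit barrier between the two phases is exactly the synchronization whose cost the paper reduces by re-assigning idle cores to still-active subtree tasks.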
To cope quickly with all types of failure risks (link, node, and Shared Risk Link Group (SRLG)), each router detecting a failure on an outgoing interface locally activates all the backup paths protecting the primary paths that traverse the failed interface. Observing that upon an SRLG failure some active backup paths are inoperative and do not really participate in the recovery (since they do not receive any traffic flow), we propose a new algorithm (the SRLG Structure Exploitation Algorithm, or SSEA) that exploits the SRLG structures to enhance admission control and improve the protection rate.
as MPI/OpenMP (message passing between nodes and parallel programming within nodes) and task-based models such as OpenCL, StarPU and OmpSs that encapsulate the user code into a specific framework (kernels, tasks, dataflow). These systems have been ported to different processor architectures, even to FPGAs in the case of OpenCL, addressing the heterogeneity of the platforms. Unified distributed-memory systems can be built on top of heterogeneous platforms using, for example, cluster implementations of OpenMP and PGAS implementations (provided they do not rely on hardware mechanisms such as RDMA). In this work, we explore the possibility of deploying a fully software-distributed shared-memory system to allow MPMD programming on micro-servers (a distributed architecture with heterogeneous nodes). This is quite new for such systems, for two reasons. First, there is a lack of specification and formalization compared with hardware shared memory, as well as a potential scaling problem. Second, software shared memory, while well known in computing grids and peer-to-peer systems, is seen as a performance killer at the processor scale. We think that micro-servers stand somewhere in between: from multi-processors they inherit fast communication links, and from computing grids they inherit heterogeneity, dynamicity of resources, and some scaling issues. In this work, we propose a hybrid approach in which data coherence is managed by software between nodes and by regular hardware within the nodes. We have designed and implemented a full software-distributed shared memory (S-DSM) on top of a message-passing runtime. This S-DSM has been deployed on the RECS3 heterogeneous micro-server, running a parallel image-processing application. Results show the intricate interplay between the design of the user application, the data coherence protocol, and the S-DSM topology and mapping.
The paper is organized as follows: Section 2 describes some micro-server architectures and the way they are used. Section 3 presents the S-DSM. Section 4 describes the experiments on both homogeneous and heterogeneous architectures. Section 5 gives some references to previous work. Finally, Section 6 concludes this paper and brings new perspectives.
Is this common physiological aptitude sufficient to give rise to and justify the feeling of a shared memory? Obviously not. First, this requires that each of us be aware of his/her sensation. We manage, on a permanent basis, in the shadows of our memory, and even in its darkest part, countless messages (e.g., interoceptive or proprioceptive) which, as long as they remain unconscious, have no chance of generating a sense of sharing. Without this consciousness which, during the evolutionary process, might have first emerged from primordial emotions (Denton, 2005) or from the ability of our brain to build first rough, and then more and more complex, mental scenes (Edelman and Tononi, 2000), there can be no shared mental representation. Secondly, just like the gravedigger, I must be conscious of my consciousness in order to be able, possibly, to talk about it, a skill which “signs” the identity of our species (Candau, 2004b). Thirdly and fourthly, each of us must be aware that: i) the other is aware of her/his sensation, and ii) this other, the sensory partner, is himself aware that each of us is aware that the other is aware of his sensation. I shall stop here, so as not to fall into a pointless mise en abîme of truly mutual knowledge, and I shall limit myself to stressing the importance of this metarepresentational level of the feelings of sharing. This level also conditions the possibility of claiming a shared memory. In this case, this metarepresentational level is that of metamemory.
– Requesting more privileged access to the memory of both VMs to perform the message transfers. This avoids unnecessary copies to a shared buffer but requires a costly transition to the hypervisor.
This is once again a tradeoff between latency and bandwidth. To provide optimal performance, the most appropriate technique has to be picked dynamically depending on the size of the message to transmit. Additionally, VM isolation must be taken into consideration: a VM should not be able to read or corrupt another VM’s memory through our communication interface, except for the specific communication buffers to which it has been granted access.
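A minimal sketch of such a dynamic choice, assuming a single size threshold (the threshold value and the technique names are hypothetical, for illustration only):

```python
# Illustrative cutoff: small messages go through a pre-shared buffer
# (low latency), large ones through a hypervisor-mediated mapping of the
# peer's memory (high bandwidth, but a costly hypervisor transition).
SHARED_BUFFER_MAX = 16 * 1024  # bytes; hypothetical tuning value

def pick_transport(msg_size):
    """Choose the inter-VM transfer technique from the message size."""
    if msg_size <= SHARED_BUFFER_MAX:
        return "shared-buffer copy"
    return "hypervisor-assisted mapping"
```

In practice the cutoff would be calibrated per platform, since the break-even point between the copy cost and the hypervisor transition cost depends on the hardware.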
Distributed computing shared memory models
Sergio Rajsbaum* Michel Raynal**
Abstract: Due to the advent of multicore machines, shared memory distributed computing models taking into account asynchrony and process crashes are becoming more and more important. This paper visits some of the models for these systems and analyses their properties from a computability point of view. Among them, the snapshot model and the iterated model are particularly investigated. The paper also visits several approaches that have been proposed to model crash failures. Among them, the wait-free case, where any number of processes can crash, is fundamental. The paper also considers models where up to t processes can crash, and where the crashes are not independent. The aim of this survey is to help the reader better understand recent advances on what is known about the power and limits of distributed computing shared memory models and their underlying mathematics.
The papers [12, 32, 54] show, roughly speaking, that the executions of any wait-free algorithm in the asynchronous read/write shared memory model with n processes, starting from one input configuration, can be represented by an (n − 1)-dimensional solid object with no holes. Furthermore, this representation implies a topological characterization of the problems that can be solved in a wait-free manner. By now there are many papers extending this characterization to other models, or using it to derive algorithms and to prove impossibility results. There are also a few tutorials, such as [29, 31, 47], that can help getting into the area. We recall some basic notions next (see the tutorials for a more detailed and precise exposition). Simplexes and complexes. A discrete geometric object can be represented by a generalization of a graph, known in topology as a complex. Recall that a graph consists of a base set of elements called vertices, and two-element sets of vertices called edges. More generally, a k-simplex is a set of vertices of size k + 1. Thus, we may think of a vertex as a 0-simplex, and of an edge as a 1-simplex. A complex is a set of simplexes closed under containment. As with a graph, it is often convenient to embed a complex in Euclidean space. Then, 1-dimensional simplexes are represented as lines, 2-dimensional simplexes as solid triangles, and 3-dimensional simplexes as solid tetrahedra. Figure 16 depicts a 1-dimensional complex and a 2-dimensional complex.
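These definitions translate directly into code. A small Python sketch (names ours) builds the complex generated by a set of simplexes, i.e., its closure under containment:

```python
from itertools import combinations

def closure(simplexes):
    """The complex generated by `simplexes`: every non-empty subset
    (face) of each simplex is included, so the result is closed under
    containment, as the definition of a complex requires."""
    faces = set()
    for s in simplexes:
        s = frozenset(s)
        for k in range(1, len(s) + 1):
            faces.update(map(frozenset, combinations(s, k)))
    return faces

def dimension(cplx):
    """A k-simplex has k + 1 vertices, so dim = max |s| - 1."""
    return max(len(s) for s in cplx) - 1
```

For example, the closure of a single triangle {a, b, c} contains 7 faces: three vertices (0-simplexes), three edges (1-simplexes), and the triangle itself (a 2-simplex).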
I. INTRODUCTION
Parallel workloads are often described by directed acyclic task graphs (DAGs), where nodes represent tasks and edges represent dependencies between tasks. The interest of this formalism is twofold: it has been widely studied in the theoretical scheduling literature, and dynamic runtime schedulers (e.g., StarPU, XKAAPI, StarSs, and PaRSEC) are increasingly popular for scheduling such graphs on modern computing platforms, as they alleviate the difficulty of using heterogeneous computing platforms. Concerning task graph scheduling, one of the main objectives considered in the literature consists in minimizing the makespan, or total completion time. However, with the increase in the size of the data to be processed, the memory footprint of the application can have a dramatic impact on the algorithm execution time, and thus needs to be optimized. This is best exemplified by an application which, depending on the way it is scheduled, will either fit in the memory or will require the use of swap mechanisms or out-of-core execution. There are few existing studies that take memory footprint into account when scheduling task graphs, as detailed below in the related work section.
A stack-based EM² architecture can choose to migrate only a portion of the stack cache—with enough data to continue execution on the remote core while data accesses are being made there, and enough space to carry back any results without overflows—and flush the rest to the stack memory prior to migration. Since the migrated depth can be different for every access, determining the best per-migration depth requires a decision algorithm. Indeed, to evaluate such schemes, we can use the same analytical model described for the EM²-RA case and a similar optimization formulation to compute the optimal stack depths (instead of the binary migrate-vs.-RA decision, the algorithm considers the various stack depths) and compare them against a given depth-decision scheme.
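A sketch of such a depth-decision computation, with a deliberately toy cost model (the constants and the shape of the cost are illustrative assumptions, not the paper's analytical model):

```python
def migration_cost(depth, expected_accesses, move_cost=1.0, miss_cost=8.0):
    """Toy per-migration cost: moving `depth` stack entries costs
    move_cost each; each expected access beyond the migrated depth pays
    a remote round trip at miss_cost. All constants are hypothetical."""
    misses = max(0, expected_accesses - depth)
    return move_cost * depth + miss_cost * misses

def best_depth(max_depth, expected_accesses):
    """Enumerate the candidate depths and keep the cheapest, mirroring
    the optimization over stack depths described in the text."""
    return min(range(max_depth + 1),
               key=lambda d: migration_cost(d, expected_accesses))
```

With this cost shape the optimum lands at the expected number of remote accesses (or saturates at the maximum migratable depth), which matches the intuition that one migrates just enough of the stack to cover the remote work.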
Broadly, understanding the communication and storage costs of consistent distributed shared-memory systems is an open and ripe research area. In this paper, we considered a point-to-point reliable message-passing system where the cost of a message is equal to the size of the values it contains. In our model, an open question is whether the shared-memory algorithms of this paper are optimal; such a study will require the construction of lower bounds on the costs incurred. More generally, it is of relevance, both to theory and practice, to explore the performance of other models for message-passing architectures such as wireless systems, packet erasure channels, and wireline communication networks.
Fig. 1: Event coding in shared memory. Application to the parallel processing of tiles.
As a motivating example, we consider a parallel HPC application as illustrated in Figure 1. The purpose is to calculate the sea level for each time step by applying a wave propagation model. After each iteration, some specific places on the map are monitored to detect whether there is a threat to the population. The base map is represented as a set of chunks in the shared memory, each chunk covering a square surface (a tile). Several threads navigate the chunks to update values. Some other threads monitor the critical chunks (represented in red). One realistic constraint is that it is not possible to modify the HPC code to manage the critical aspect of the calculation. Instead, we expect a smooth and non-intrusive integration of the critical code with respect to the HPC code.
Since the late 1970s, emulation of shared-memory systems in distributed message-passing environments has been an active area of research [2–8, 12–18, 24, 29, 30]. The traditional approach to building redundancy for distributed systems in the context of shared-memory emulation is replication. In their seminal paper, Attiya, Bar-Noy, and Dolev presented a replication-based algorithm for emulating shared memory that achieves atomic consistency [19, 20]. In this paper we consider a simple multi-writer generalization of their algorithm, which we call the ABD algorithm. This algorithm uses a quorum-based replication scheme, combined with read and write protocols, to ensure that the emulated object is atomic (linearizable), and to ensure liveness, specifically, that each operation terminates provided that at most ⌈(N − 1)/2⌉ server nodes fail. A critical step in ensuring atomicity in ABD is the propagate phase of the read protocol, where the readers write back the value they read to a subset of the server nodes. Since the read and write protocols require multiple communication phases where entire replicas are sent, this algorithm has a high communication cost. Fan and Lynch introduced a directory-based replication algorithm known as the LDR algorithm that, like ABD, emulates atomic shared memory in the message-passing model; however, unlike ABD, its read protocol is required to write only some metadata information to the directory, rather than the value read. In applications where the data being replicated is much larger than the metadata, LDR is less costly than ABD in terms of communication costs.
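The structure of the ABD read protocol, including the propagate phase, can be sketched as a local simulation (no real networking, failures, or tag generation; class and function names are ours):

```python
import random

class Replica:
    """One server node holding a (tag, value) pair."""
    def __init__(self):
        self.tag, self.value = 0, None
    def get(self):
        return self.tag, self.value
    def put(self, tag, value):
        if tag > self.tag:                 # keep only the freshest pair
            self.tag, self.value = tag, value

def _majority(replicas):
    """Any majority will do; any two majorities intersect."""
    return random.sample(replicas, len(replicas) // 2 + 1)

def abd_write(replicas, tag, value):
    """Write phase: install (tag, value) at a majority of replicas."""
    for r in _majority(replicas):
        r.put(tag, value)

def abd_read(replicas):
    """Phase 1: query a majority for the freshest (tag, value).
    Phase 2 (propagate): write that pair back to a majority before
    returning it; this write-back is the step that makes reads atomic."""
    tag, value = max((r.get() for r in _majority(replicas)),
                     key=lambda tv: tv[0])
    for r in _majority(replicas):          # propagate phase
        r.put(tag, value)
    return value
```

Because any two majorities intersect, a read is guaranteed to see the freshest completed write, and the propagate phase ensures that a later read cannot return an older value than an earlier one.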