Task-based multifrontal QR solver for heterogeneous architectures

3.2. StarPU-based multifrontal method

The experiments in this section, as well as in Chapter 4, have been run with an interleaved memory allocation policy, which essentially makes the locality-aware scheduling policy implemented in qr_mumps worthless; for this reason the two codes are also expected to have the same task efficiency. As the results in Figure 3.6 show, the performance difference between the two codes must therefore lie in the pipeline and runtime efficiencies. The slightly worse pipeline efficiency is due to the different choice of task priorities made in qrm_starpu in order to contain the memory consumption (more comments on this below), and to the fact that assembly operations are not fully parallelized, as explained above. The lower runtime efficiency, instead, can partly be explained by the fact that in qrm_starpu all the ready tasks are stored in a single, sorted central queue. This clearly incurs a relatively high cost, due to the contention on the locks that prevent threads from accessing this data structure concurrently and to the sorting of tasks by priority; in Section 4.4 we will present a novel scheduler that overcomes most of these issues and consistently delivers better performance. As a result we can conclude that the performance difference between qrm_starpu and qr_mumps is merely due to minor technical issues; the results presented in Chapter 4, obtained with a much more refined and optimized implementation, confirm this intuition.

Memory consumption is an extremely critical point to address when designing a sparse direct solver. As the building blocks for designing a scheduling strategy on top of StarPU differ from (and are more advanced than) what is available in qr_mumps (which relies on an ad hoc lightweight scheduler), we could not reproduce exactly the same scheduling strategy. Therefore we decided to give higher priority to reducing the memory consumption in qrm_starpu. This cannot easily be achieved in qr_mumps because its native scheduler can only handle two levels of task priority; as a result, fronts are activated earlier in qr_mumps, almost consistently leading to a higher memory footprint, as shown in Figure 3.7. The figure also shows that both qrm_starpu and qr_mumps achieve on average the same memory consumption as spqr. In three cases out of eleven spqr achieves a significantly lower memory footprint; in Chapter 5 we will show that it is possible to reliably control and reduce the memory consumption of qrm_starpu while still achieving extremely high performance (roughly the same as in the unconstrained case).

Task-based hybrid linear solver for distributed memory heterogeneous architectures

Research Report n° 8913, May 2016, 13 pages. Abstract: Heterogeneity is emerging as one of the most challenging characteristics of today's parallel environments. However, few fully-featured, advanced numerical scientific libraries have been ported to such architectures. In this paper, we propose to extend a sparse hybrid solver to handle distributed memory heterogeneous platforms. As in the original solver, we perform a domain decomposition and associate one subdomain with one MPI process. However, while each subdomain was processed sequentially (bound to a single CPU core) in the original solver, the new solver instead relies on task-based local solvers, delegating tasks to the available computing units. We show that this "MPI+task" design conveniently allows for exploiting distributed memory heterogeneous machines. Indeed, a subdomain can now be processed on multiple CPU cores (such as a whole multicore processor or a subset of the available cores), possibly enhanced with GPUs. We illustrate our discussion with the MaPHyS sparse hybrid solver relying on the PaStiX and Chameleon libraries for the sparse direct and dense operations, respectively. Interestingly, this two-level MPI+task design furthermore provides extra flexibility for controlling the number of subdomains, enhancing the numerical stability of the considered hybrid method. While the rise of heterogeneous computing has been strongly driven by the theoretical community, this study aims at showing that it is now also possible to build complex software layers on top of runtime systems to exploit heterogeneous architectures.

Task-based multifrontal QR solver for GPU-accelerated multicore architectures

Different paradigms exist for programming such a DAG of tasks. Among them, the STF model is becoming increasingly popular because of the high productivity it allows for the programmer. Indeed, this paradigm simply consists in submitting a sequence of tasks through a non-blocking function call that delegates the execution of the task to the runtime system. Upon submission, the runtime system adds the task to the current DAG along with its dependencies, which are automatically computed through data dependency analysis. The actual execution of the task is then postponed to the moment when its dependencies are satisfied. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. Figure 1 (right) shows the 1D STF version from [17, 18] of the multifrontal QR factorization described above. Instead of making direct function calls (activate, assemble, deactivate, panel, update), the equivalent STF code submits the corresponding tasks. Since the data onto which these functions operate, as well as their access modes (Read, Write or Read/Write), are also specified, the runtime system can perform the superscalar analysis while the submission of tasks is progressing. For instance, because an assemble task accesses a block-column f(i) before a panel task accesses the same block-column in Write mode, a dependency between those two tasks is inferred. Our previous work [17] showed that the STF programming model allows for designing a code that achieves great performance and scalability as well as excellent robustness when it comes to memory consumption.
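The superscalar analysis described above can be sketched in a few lines. The snippet below is an illustrative model, not the solver's code: it replays a submission sequence and infers read-after-write, write-after-write and write-after-read dependencies from the declared access modes, as an STF runtime such as StarPU would; the task and data names are hypothetical.

```python
from collections import defaultdict

READ, WRITE, RW = "R", "W", "RW"

def infer_dependencies(tasks):
    """tasks: list of (task_name, [(data, mode), ...]) in submission order.
    Returns the set of (predecessor, successor) edges an STF runtime would
    infer from the declared data access modes."""
    last_writer = {}               # data -> task that last wrote it
    readers = defaultdict(list)    # data -> tasks reading it since last write
    edges = set()
    for name, accesses in tasks:
        for data, mode in accesses:
            if mode in (READ, RW) and data in last_writer:
                edges.add((last_writer[data], name))     # read-after-write
            if mode in (WRITE, RW):
                if data in last_writer:
                    edges.add((last_writer[data], name)) # write-after-write
                for r in readers[data]:
                    if r != name:
                        edges.add((r, name))             # write-after-read
                last_writer[data] = name
                readers[data] = []
            if mode == READ:
                readers[data].append(name)
    return edges

# assemble writes f(i) before panel accesses it in RW: a dependency is inferred
tasks = [("assemble", [("f_i", "W")]),
         ("panel",    [("f_i", "RW")]),
         ("update",   [("f_i", "R"), ("f_j", "W")])]
print(infer_dependencies(tasks))   # {('assemble', 'panel'), ('panel', 'update')}
```

This mirrors how sequential consistency alone determines the DAG: the programmer only states what each task touches and how, never the edges themselves.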

3D frequency-domain seismic modeling with a Parallel BLR multifrontal direct solver

The O(N^2) complexity of a standard, full-rank solution of a 3D problem (of N unknowns) from the Laplacian operator discretized with a 3D 7-point stencil is reduced to O(N^(5/3)) when using the BLR format (Amestoy et al., 2015b). Although compression rates may not be as good as those achieved with hierarchical formats, BLR offers good flexibility thanks to its simple, flat structure. This makes BLR easy to adapt to any multifrontal solver without a complete rethinking of the code. Next, we describe the generalization of BLR to a parallel environment. The row-wise partitioning imposed by the distribution of the front onto several processes constrains the clustering of the unknowns. However, in practice, we manage to maintain nearly the same compression rates when the number of processes grows (see Figure 3). Both LU and CB compression can contribute to reducing the volume of communication by a substantial factor and to improving the parallel efficiency of the solver. In our implementation, we do not compress the CB. To fully exploit multicore architectures, MPI parallelism is hybridized with thread parallelism by multithreading the tasks of Algorithm 1. In full-rank, we exploit multithreaded BLAS kernels. In low-rank, these tasks have a finer granularity and thus a lower efficiency (flop rate); with multithreaded BLAS we are therefore not able to efficiently translate the reduction in flops into a reduction in time. To overcome this obstacle, Algorithm 1 can be modified to exploit OpenMP-based multithreading instead, which allows for a larger granularity of computations. The Update task at line 9 is applied on a set of independent blocks A_{i,j}: therefore, the loop at line 8 can be parallelized.
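The compression step at the heart of the BLR format can be illustrated with a small sketch. The code below is not the solver's implementation: it is a minimal model assuming a truncated SVD as the compression kernel (production codes often use rank-revealing QR instead) and a relative threshold on the singular values, compressing a block only when the low-rank form actually saves storage.

```python
import numpy as np

def blr_compress(block, eps):
    """Compress a dense block into low-rank form (U, V) with block ~= U @ V,
    truncating singular values below eps relative to the largest one;
    keep the block full-rank if compression would not save storage."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    rank = int(np.sum(s > eps * s[0]))   # numerical rank at threshold eps
    m, n = block.shape
    if rank * (m + n) < m * n:           # low-rank storage beats dense storage
        return U[:, :rank] * s[:rank], Vt[:rank]
    return block

rng = np.random.default_rng(0)
# an admissible off-diagonal block: numerically low rank by construction
A = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
U, V = blr_compress(A, 1e-12)
print(U.shape[1])   # numerical rank of the block
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A) < 1e-10)
```

Storage drops from m*n entries to rank*(m+n); the threshold eps plays the role of the accuracy parameter balancing compression against error.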

Exploiting a Parametrized Task Graph model for the parallelization of a sparse direct multifrontal solver

2 Related work 2.1 Parallel programming models for task-based algorithms. The most common strategy for the parallelization of task-based algorithms consists in traversing the DAG sequentially and submitting the tasks, as they are discovered, to the runtime system using a non-blocking function call. The dependencies between tasks are automatically inferred by the runtime system through a data dependency analysis [4], and the actual execution of each task is then postponed to the moment when all its dependencies are satisfied. This programming model is known as the Sequential Task Flow (STF) model, as it fully relies on sequential consistency for the dependency detection. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. As mentioned above, the popularity of this model encouraged the OpenMP board to include it in the 4.0 standard. The simplicity of the STF model facilitates the concise design of numerical algorithms and can be exploited to efficiently target multicore architectures [2].

Implementing multifrontal sparse solvers for multicore architectures with Sequential Task Flow runtime systems

(3) It proposes a memory-aware approach for controlling the memory consumption of the parallel multifrontal method, which allows the user to achieve the highest possible performance within a prescribed memory footprint. This technique can be seen as employing a sliding window that sweeps the whole elimination tree and whose size is dynamically adjusted in the course of the factorization in order to accommodate as many fronts as possible within the imposed memory envelope. These three contributions are deeply related. The approach based on 2D frontal matrix factorizations leads to extremely large DAGs with heterogeneous tasks and very complex dependencies; implementing such a complex algorithm in the STF model is conceptually equivalent to writing the corresponding sequential code, because parallelism is handled by the runtime system. The effectiveness of this approach is supported by a finely detailed analysis based on a novel method which allows for very accurate profiling of the performance of the resulting code, by separately quantifying all the factors that play a role in the scalability and efficiency of a shared-memory code (granularity, locality, overhead, scheduling). This method can readily be extended to the case of distributed-memory parallelism in order to account for other performance-critical factors such as communications. Our analysis shows that the runtime system can handle such complex DAGs very efficiently and that the relative runtime overhead is only marginally increased by the increased DAG size and complexity; on the other hand, the experimental results also show that the higher degree of concurrency provided by 2D methods can considerably improve the scalability of a sparse matrix QR factorization. Because of the very dynamic execution model of the resulting code, concerns may arise about the memory consumption; we show that, instead, it is possible to control the memory consumption by simply controlling the flow of tasks in the runtime system. Experimental results demonstrate that, because of the high concurrency achieved by the 2D parallelization and because of the efficiency of the underlying runtime, the very high performance of our code is barely affected even when the parallel code is constrained to run within the same memory footprint as a sequential execution.
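The idea of controlling memory by controlling the flow of tasks can be sketched as follows. This is a toy model under strong assumptions (each front's memory is a single number, active fronts are freed in activation order), not the paper's actual memory-aware algorithm: submission simply stalls whenever activating the next front would exceed the prescribed budget.

```python
def memory_aware_schedule(fronts, budget):
    """Activate fronts from a postorder sequence, suspending task submission
    whenever the next activation would exceed the memory budget, and resuming
    once enough active fronts have been factorized and freed.
    fronts: list of (name, memory_size). Returns (peak_memory, event_log)."""
    active, used, peak, log = [], 0, 0, []
    for name, size in fronts:
        while used + size > budget and active:
            done, dsize = active.pop(0)      # oldest active front completes
            used -= dsize                    # its memory is released
            log.append(("deactivate", done))
        used += size
        peak = max(peak, used)
        active.append((name, size))
        log.append(("activate", name))
    return peak, log

# with a budget of 100, f3 must wait until f1 has been freed
peak, log = memory_aware_schedule([("f1", 40), ("f2", 40), ("f3", 50)], budget=100)
print(peak)   # 90
```

Without the throttle, eager activation of all three fronts would peak at 130; the sliding window caps the footprint at 90 while still overlapping work on f2 and f3.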

Task-based Conjugate Gradient: from multi-GPU towards heterogeneous architectures

In this paper, we tackle another class of algorithms, the Krylov subspace methods, which aim at solving large sparse linear systems of equations of the form Ax = b, where A is a sparse matrix. First, those methods are based on the calculation of approximate solutions in a sequence of embedded spaces, which is intrinsically a sequential numerical scheme. Second, their unpreconditioned versions rely exclusively on kernels with low computational intensity and irregular memory access patterns, Sparse Matrix-Vector products (SpMV) and level-1 BLAS, which need very large-grain tasks to benefit from GPU acceleration. For these reasons, designing and scheduling Krylov subspace methods on a multi-GPU platform is extremely challenging, especially when relying on a task-based abstraction, which requires delegating part of the control to a runtime system. We discuss this methodological approach in the context of the Conjugate Gradient (CG) algorithm on a shared-memory machine accelerated with multiple GPUs, using the StarPU runtime system [5] to process the designed task graph. The CG solver is a widely used Krylov subspace method and the numerical algorithm of choice for the solution of large linear systems with symmetric positive definite matrices [13].
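The kernel structure discussed above is visible in a textbook CG implementation: each iteration performs exactly one SpMV and a handful of level-1 BLAS operations (dot products and axpys). This is a generic numpy sketch, not the task-based StarPU version; the operator is passed as a callable so A can be any sparse matrix-vector product.

```python
import numpy as np

def conjugate_gradient(A_mul, b, tol=1e-10, maxit=1000):
    """Unpreconditioned CG for SPD systems. A_mul: callable v -> A @ v."""
    x = np.zeros_like(b)
    r = b - A_mul(x)            # initial residual
    p = r.copy()                # search direction
    rs = r @ r
    for _ in range(maxit):
        Ap = A_mul(p)           # SpMV: the only compute-heavy kernel
        alpha = rs / (p @ Ap)   # dot product (level-1 BLAS)
        x += alpha * p          # axpy
        r -= alpha * Ap         # axpy
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# SPD test problem: 1D Laplacian
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = conjugate_gradient(lambda v: A @ v, b)
print(np.linalg.norm(A @ x - b) < 1e-8)   # True
```

The chain of dot products (alpha, then rs_new) is what makes the scheme intrinsically sequential: each iteration's scalars must be reduced before the next axpy can proceed, which is exactly the dependency pattern a runtime system has to cope with.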

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

The condition number is then estimated with dgecon by means of those two results. The main challenge here resides in the LU factorization with partial pivoting, which is difficult to implement using task-based programming models. Indeed, searches for pivot candidates and row swapping generate many global synchronization points within the panel factorization and its resulting updates. Some solutions have been proposed on shared-memory systems [45], but there are no existing solutions that are oblivious to heterogeneous architectures. We thus propose a QR-based solution, which consists in estimating the norm of A^-1 by computing the norm of R^-1, with A = QR. This solution, which turns out to be less costly, alleviates the pivoting issue altogether, uses only regular tile algorithms, and allows code portability across various architectures thanks to the underlying runtime system. The third section of the algorithm, rows 21 to 48, is the main loop, which iterates on U_k and converges to the polar factors.
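The QR-based estimate can be sketched with dense numpy kernels. Since Q is orthogonal, ||A^-1||_2 = ||R^-1 Q^T||_2 = ||R^-1||_2, so the condition number is obtained without any pivoting. This is an illustrative sketch, not the paper's tile implementation: the explicit inverse of R stands in for the triangular-solve-based norm estimation a real code would use.

```python
import numpy as np

def qr_condition_estimate(A):
    """Estimate cond_2(A) via QR instead of LU with partial pivoting.
    With A = QR and Q orthogonal, ||A^-1||_2 equals ||R^-1||_2 exactly."""
    Q, R = np.linalg.qr(A)
    inv_norm = np.linalg.norm(np.linalg.inv(R), 2)  # ||R^-1||_2 = ||A^-1||_2
    return np.linalg.norm(A, 2) * inv_norm

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
est = qr_condition_estimate(A)
print(abs(est - np.linalg.cond(A, 2)) / np.linalg.cond(A, 2) < 1e-8)   # True
```

In the 2-norm the estimate is exact up to roundoff, which is why trading LU for QR removes the synchronization-heavy pivot search at no accuracy cost for this purpose.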

Introduction of shared-memory parallelism in a distributed-memory multifrontal solver

The objective of this paper is thus to study how, starting from an existing parallel solver using MPI, it is possible to adapt it and improve its performance on multicore architectures. Although the resulting code is able to run on hybrid-memory architectures, we consider here a pure shared-memory environment. We use the example of the MUMPS solver [3, 5], but the methodology and approaches described in this paper are more general. We study and combine the use (i) of multithreaded libraries (in particular BLAS, the Basic Linear Algebra Subprograms [25]); (ii) of loop-based fine-grain parallelism based on OpenMP [1] directives; and (iii) of coarse-grain OpenMP parallelism between independent tasks. On NUMA architectures, we show that the memory allocation policy and the resulting memory affinity have a strong impact on performance and should be chosen with care, depending on the set of threads working on each task. Furthermore, when treating independent tasks in a multithreaded environment, if no ready task is available, a thread that has finished its share of the work will wait for the other threads to finish and become idle. In an OpenMP environment, we show how, technically, it is in that case possible to re-assign idle threads to active tasks, dynamically increasing the amount of parallelism exploited and speeding up the corresponding computations.

Multifrontal QR Factorization for Multicore Architectures over Runtime Systems

Alternatively, a modular approach can be employed. First, the numerical algorithm is written at a high level, independently of the hardware architecture, as a Directed Acyclic Graph (DAG) of tasks, where a vertex represents a task and an edge represents a dependency between tasks. A second layer is in charge of the scheduling. Based on the scheduling decisions, a third layer, the runtime system, takes care of the actual execution of the tasks, both ensuring that dependencies are satisfied at execution time and maintaining data consistency. The fourth layer consists of the task code optimized for the underlying architectures. In most cases, the last three layers need not be written by the application developer. Indeed, there usually exists a very competitive state-of-the-art generic scheduling algorithm (such as work-stealing [4] or Minimum Completion Time [19]) matching the algorithmic needs and efficiently exploiting the targeted architecture (otherwise, a new scheduling algorithm may be designed, which will in turn be likely to apply to a whole class of algorithms). The runtime system only needs to be extended once for each new architecture. Finally, most of the time, the high-level algorithm can be cast in terms of standard operations (such as BLAS in dense linear algebra) for which vendors provide optimized codes. All in all, with such a modular approach, only the high-level algorithm has to be specifically designed, which ensures high productivity. Maintainability is also guaranteed, since supporting new hardware only requires (in principle) third-party effort.

Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver

IRIT Laboratory, University of Toulouse and UPS, F-31071 Toulouse, France. SUMMARY: We put forward the idea of using a Block Low-Rank (BLR) multifrontal direct solver to efficiently solve the linear systems of equations arising from a finite-difference discretization of the frequency-domain Maxwell equations for 3-D electromagnetic (EM) problems. The solver uses a low-rank representation for the off-diagonal blocks of the intermediate dense matrices arising in the multifrontal method to reduce the computational load. A numerical threshold, the so-called BLR threshold, controlling the accuracy of the low-rank representations was optimized by balancing errors in the computed EM fields against savings in floating-point operations (flops). Simulations were carried out over large-scale 3-D resistivity models representing typical scenarios for marine controlled-source EM surveys, in particular the SEG SEAM model, which contains an irregular salt body. The flop count, size of factor matrices and elapsed run time for matrix factorization are reduced dramatically by using BLR representations and can go down to, respectively, 10, 30 and 40 per cent of their full-rank values for our largest system with N = 20.6 million unknowns. The reductions are almost independent of the number of MPI tasks and threads, at least up to 90 x 10 = 900 cores. The BLR savings increase for larger systems, which reduces the factorization flop complexity from O(N^2) for the full-rank solver to O(N^m) with m = 1.4-1.6.

3D frequency-domain seismic modeling with a Block Low-Rank algebraic multifrontal direct solver

Although compression rates may not be as good as those achieved with hierarchical formats, BLR offers good flexibility thanks to its simple, flat structure. Many variants of Algorithm 1 can easily be defined, depending on the position of the 'C' phase. For instance, it can be moved to the last position if one needs an accurate factorization and an approximate, faster solution phase. This might be a strategy of choice for the FWI application, where a large number of right-hand sides must be processed during the solution phase. In a parallel environment, the BLR format allows for an easier distribution and handling of the frontal matrices. Pivoting in a BLR matrix can be done more easily without perturbing the structure much. Lastly, converting a matrix from the standard representation to BLR, and vice versa, is much cheaper than in the case of hierarchical matrices (see Table 1 for the low global cost of compressing fronts into the BLR format). This makes it possible to switch back and forth between the two formats whenever needed at a reasonable cost; this is, for example, done to simplify the assembly operations, which are extremely complicated to perform in any low-rank format. As shown in Fig. 3, the O(N^2) complexity of a standard, full-rank solution of a 3D problem (of N unknowns) from the Laplacian operator discretized with a 3D 11-point stencil is reduced to O(N^(4/3)) when using the BLR format. All these points make BLR easy to adapt to any multifrontal solver without a complete rethinking of the code.

Decentralized task allocation for heterogeneous teams with cooperation constraints

11: Identify invalid tasks: J_IV := { j in J_RD | (z_ij,1 = NULL) XOR (z_ij,2 = NULL) } 12: end while. Or, (c) only one subtask is assigned. For case (a), one can claim that the RDT j is assigned to an appropriate pair of agents, and for case (b), one can claim that the RDT j is not assigned because it is not attractive to the team. Case (c) can be problematic, because one agent is committed to take its part in the RDT based on an overly optimistic view, and actually obtains no reward from it. The performance degradation due to this invalid assignment can be larger than the loss of the assumed score for the subtask the agent has committed to do. The reason for this additional performance degradation is the opportunity cost: the agent may have been able to do other tasks had it been aware that this RDT would only be partially assigned, and thus that no actual reward could be obtained from it. Had the agent been aware of that fact, it would not have bid on the task in the first place. One approach to improve performance is to eliminate these invalid tasks and re-distribute the tasks accordingly.
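The invalid-task test on line 11 reduces to an exclusive-or over the two subtask assignments. A minimal sketch with a hypothetical assignment encoding (None for an unassigned subtask):

```python
def find_invalid_tasks(assignments):
    """Identify redundant-dependent tasks (RDTs) for which exactly one of the
    two subtasks is assigned, mirroring the XOR condition of line 11: these
    yield no reward yet tie up the committed agent.
    assignments: {task_id: (agent_for_subtask_1 or None,
                            agent_for_subtask_2 or None)}"""
    return {j for j, (z1, z2) in assignments.items()
            if (z1 is None) != (z2 is None)}    # XOR on assignment status

assignments = {1: ("a1", "a2"),   # case (a): properly paired, valid
               2: (None, None),   # case (b): unassigned, not attractive
               3: ("a3", None)}   # case (c): invalid, to be released
print(find_invalid_tasks(assignments))   # {3}
```

Releasing the tasks in the returned set lets the committed agent re-bid elsewhere, recovering the opportunity cost described above.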

Finding new heuristics for automated task prioritizing in heterogeneous computing

Han Lin et al. present a way of generating DAGs for heterogeneous scheduling evaluation purposes [10]. An implementation of this algorithm was created by Frederic Suter [18]. It has five hyperparameters (n, fat, regularity, jump, and CCR). This graph generation method was our first idea for generating our dataset, but it has several downsides. It does not handle task types: each task is treated individually. In a real application, however, there are usually different task types, and there are often correlations between a task's dependencies and its type. For example, an application can have a regular pattern where a task of type A performs a matrix inversion and then 16 tasks of type B read the matrix. In this case, A tasks would systematically have 16 successors of type B, and conversely each B task would have an A predecessor. This kind of behavior is not represented at all by the presented DAG generation method.
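The typed pattern described above (one A task systematically followed by 16 B readers) is easy to express once the generator is aware of task types. A minimal sketch with hypothetical names, not the generator of [10, 18]:

```python
def generate_typed_dag(n_patterns, fanout=16):
    """Generate a DAG with typed tasks: each type-A task (e.g. a matrix
    inversion) is followed by `fanout` type-B tasks that read its result.
    Returns (tasks, edges) with tasks as {task_id: type}."""
    tasks, edges, next_id = {}, [], 0
    for _ in range(n_patterns):
        a = next_id; next_id += 1
        tasks[a] = "A"
        for _ in range(fanout):
            b = next_id; next_id += 1
            tasks[b] = "B"
            edges.append((a, b))   # B reads the matrix produced by A
    return tasks, edges

tasks, edges = generate_typed_dag(2)
print(sum(1 for t in tasks.values() if t == "A"))   # 2
print(len(edges))                                    # 32
```

A dataset built this way preserves the type/dependency correlations that the hyperparameter-based generator cannot express, which is the property the heuristics are meant to learn from.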

Priority-based Riemann solver for traffic flow on networks

This theory was applied to different domains, including vehicular traffic [14], supply chains [1], irrigation channels [3] and others. For a complete account of recent results and references we refer the reader to the survey [6]. For vehicular traffic, authors have considered many different traffic situations to be modeled, thus proposing a rich set of alternative junction models even in the scalar case; see [8, 9, 12, 13, 14, 18, 22, 24]. Here, we first propose a new model which takes priorities among the incoming roads as the first criterion and maximization of flux as the second. The main idea is that the road with the highest priority will use the maximal flow, taking into account the constraints of the outgoing roads as well. If some room is left for additional flow, then the road with the second highest priority will use the remaining space, and so on. A precise definition of the new Riemann solver, called the Priority Riemann Solver, is based on a traffic distribution matrix A (Definition 11) and a priority vector P = (p_1, ..., p_n) (with p_i >= 0 and sum_i p_i = 1), and requires a recursion.
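The first criterion (priority ordering) can be illustrated by a greedy sketch. This deliberately ignores the distribution matrix A and the recursion of the actual Priority Riemann Solver; it only shows how incoming roads consume the available outgoing capacity in priority order, with hypothetical demand and capacity values.

```python
def priority_flux(demands, capacity, priorities):
    """Serve incoming roads in decreasing priority: each road sends as much
    flow as the remaining outgoing capacity allows, then the next road uses
    what is left, and so on. Toy illustration of the priority rule only."""
    flows = [0] * len(demands)
    remaining = capacity
    for i in sorted(range(len(demands)), key=lambda i: -priorities[i]):
        flows[i] = min(demands[i], remaining)
        remaining -= flows[i]
    return flows

# two incoming roads, p_1 > p_2: road 1 is served first, road 2 takes the rest
print(priority_flux([600, 500], capacity=800, priorities=[0.7, 0.3]))
# → [600, 200]
```

Note how this differs from a pure flux-maximization rule, which could instead split the 800 units to maximize total throughput regardless of priority.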

EEA-Aware For Large-Scale Scientific Applications On Heterogeneous Architectures

Ref. John A. García H. et al., Energetically Efficient Acceleration (EEA-Aware), degree work for the title of Master of Science in Systems Engineering and Informatics, UIS, 2016. Ref. Víctor Martinez et al., Towards Seismic Wave Modeling on Heterogeneous Many-Core Architectures Using Task-Based Runtime System, SBAC-PAD 2015. Seismic wave equation for a solid medium; constitutive relation in the case of an isotropic medium.


TBES: Template-Based Exploration and Synthesis of Heterogeneous Multiprocessor Architectures on FPGA

HLS is the transformation of a functional description (provided as C code in our framework) into a Register Transfer Level (RTL) description. The generated architecture described at the RTL level is compliant with an architecture model that depends on the HLS tool. The HLS tool currently used in our framework is GAUT [28], which is linear in complexity. It is a free HLS tool based on a Data Flow Graph model and is therefore dedicated to data-dominated algorithms. From a C/C++ specification and a set of design constraints, GAUT automatically generates a potentially pipelined RTL architecture in two formalisms: VHDL for synthesis and SystemC for virtual prototyping. In our flow, the set of design constraints is provided by the HLS/DSE controller (cf. Fig. 7), in order to generate and evaluate a set of coprocessors with several performance/resource tradeoffs. The HLS constraints we consider for automated exploration are the latency and the communication model (FIFO or shared memory). The generated architecture is composed of i) a processing unit comprising the data-path (logic/arithmetic operators, storage elements and steering logic) and an FSM controller; ii) a memory unit composed of memory banks and their associated controllers; and iii) a communication interface, which can be implemented as a FIFO, a shared memory (optionally with ping-pong mode), or a 4-phase handshake. Given that the underlying architecture model of GAUT is clearly defined, it is possible to perform a resource usage estimation. Thus, after behavioral synthesis, the following features are known exactly:

Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver

and 30 per cent of the size for the matrices of factors compared to the conventional full-rank (FR) factorization method. There have so far been no reports on the application of multifrontal solvers with low-rank approximations to 3-D EM problems. EM fields in geophysical applications usually have a diffusive nature, which makes the underlying equations fundamentally different from those describing seismic waves. They are also very different from the thermal diffusion equations, since EM fields are vectors. Most importantly, the scatter of material properties in EM problems is exceptionally large; for example, for marine CSEM applications, the resistivities of seawater and resistive rocks often differ by four orders of magnitude or more. On top of that, the air layer has an essentially infinite resistivity and should be included in the computational domain unless the water is deep. Thus, elements of the system matrix may vary by many orders of magnitude, which can affect the performance of low-rank approaches for matrix factorization.

E-HEFT: Enhancement Heterogeneous Earliest Finish Time algorithm for Task Scheduling based on Load Balancing in Cloud Computing

The makespan results (the average of 5 executions) for the Montage and Cybershake DAGs are shown in Fig. 6, while Fig. 7 shows the comparison of the degree of imbalance between the E-HEFT, HEFT and MinMin-TSH algorithms. According to the experimental results in Fig. 6, the total running time of the scientific workflows lengthens as the number of submitted tasks increases. It is evident from the graph that E-HEFT is more efficient than the other two algorithms on both workflows. On the Montage workflow, our proposed algorithm (E-HEFT) outperforms the HEFT and MinMin-TSH algorithms in terms of the average makespan of the submitted tasks by 21.37% and 28.98%, respectively, as shown in Fig. 6a. On the other hand, E-HEFT shows an average improvement of 26.07% relative to HEFT and 21.93% relative to MinMin-TSH on the Cybershake workflow, as shown in Fig. 6b. It is worth mentioning that the makespan is closely related to the volume of data movement for a data-intensive workflow, so we can state that these variations in improvement are mainly due to the volume of data transferred between activities in each workflow.
