

10.4.4 Event File

The event file E processes the provision of event p in a similar way as the M units do, as shown in the algorithm of Fig. 10.8.
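The algorithm of Fig. 10.8 is not reproduced in this excerpt. Purely as an illustration, the following minimal sketch shows what processing a provision could look like, under the assumption that the event file keeps one provided flag per event and one pending-requirement flag per context and event; the actual bookkeeping of Fig. 10.8 may differ.

#include <stdbool.h>

#define NUM_EVENTS   64               /* assumed event space, cf. Table 10.1 */
#define NUM_CONTEXTS 32               /* assumed number of hardware contexts */

typedef struct {
    bool provided[NUM_EVENTS];                /* has event p been provided?      */
    bool required[NUM_CONTEXTS][NUM_EVENTS];  /* does context c still require p? */
} event_file_t;

/* Called when a task issues PROVIDE(p): record the provision and clear the
 * corresponding requirement of every context; a context may resume once all
 * of its required events have been cleared. */
static void event_file_provide(event_file_t *E, int p)
{
    E->provided[p] = true;
    for (int c = 0; c < NUM_CONTEXTS; c++)
        E->required[c][p] = false;
}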

10.4.5 Migration

In order to exploit the performance acceleration of local coprocessors and improve the reaction times of a task, we need to let the control flow migrate to different processing cores. Since a multithreaded processor is a resource of a task, we employ the distributed algorithm proposed in this paper to schedule migrations over the processing cores.

Assuming that some cores Cm, …, Cn are associated with the colored events xm, …, xn, the following code implements a migration from the core Ci to the core Cj by first obtaining access to a processor's context and then transferring the context-specific contents of the M unit. This transfer could also have been accomplished by a dedicated bus structure rather than in software.

To simplify matters, in this code we have omitted three features, which we describe here in text; an illustrative sketch of the migration pattern follows the list:

Fig. 10.8 Provision to the event file


1. In line 7, the transfer is affected by a race condition because Ci.M.R(x) might get provided after having been read from Ci.M but before being written into Cj.M. This race condition can be avoided by any of several well-known techniques.

2. In line 9, on leaving the core Ci, the executing task must issue PROVIDE(xi) only if the task had previously migrated onto Ci. A new task, which gets initiated on Ci, has not migrated onto it and does not need to release it with a PROVIDE(); in fact, the corresponding variable xi would not appear in the task's DECLARE().

3. On terminating, if the task has migrated at all, it must issue a PROVIDE() to release the last core it has migrated onto.
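The original listing is not reproduced in this excerpt; the following is a minimal sketch of the migration pattern described above, under the assumption that REQUIRE()/PROVIDE() operate on colored events and that the M unit exposes its context-specific registers as Ci.M.R(x). The helper names and the register-file layout are illustrative stand-ins, not the authors' actual interface, and the line numbers in the items above refer to the original listing.

#define NUM_REGS 16                           /* assumed number of context registers */

typedef struct { unsigned R[NUM_REGS]; } m_unit_t;   /* per-context M-unit state */
typedef struct { m_unit_t M; } core_t;

extern void REQUIRE(int event);               /* block until the colored event is provided */
extern void PROVIDE(int event);               /* release the colored event */

/* Migrate the running task from core ci (colored event xi) to core cj
 * (colored event xj). */
static void migrate(core_t *ci, int xi, core_t *cj, int xj, int has_migrated)
{
    REQUIRE(xj);                              /* obtain a context on the target core */

    for (int x = 0; x < NUM_REGS; x++)        /* transfer the context-specific       */
        cj->M.R[x] = ci->M.R[x];              /* contents of the M unit (race with a */
                                              /* late provision ignored, see item 1) */
    if (has_migrated)                         /* release the source core only if the */
        PROVIDE(xi);                          /* task had migrated onto it (item 2)  */

    /* execution continues on core cj from here on */
}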

10.5 Experimental Setup

We have modeled the proposed algorithm for a generic multicore system of multithreaded processor cores, as shown in Fig. 10.9. A distributor agent dispatches tasks to a subset of "entry" processors and these tasks are then free to migrate through the remaining "data plane" cores.

The figures of interest are the following; a sketch of how they can be computed from per-task records is given after the list:

• makespan: the time required to complete all the tasks divided by the total number of scheduled tasks.

• sojourn time: the time elapsed between the start and termination of a task.

• execution time: the time necessary for executing a task, including the peripheral access times but excluding the scheduling delays caused by thread preemption.

• CPU time: the time in which the task keeps the CPU busy.
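As an aid to reading the tables that follow, the sketch below shows how the four figures could be derived from per-task timestamps. The record layout and the assumption that the batch starts at cycle 0 are illustrative; this is not the instrumentation used in the simulations.

#include <stdio.h>

typedef struct {
    unsigned long start;      /* cycle at which the task starts                 */
    unsigned long end;        /* cycle at which the task terminates             */
    unsigned long exec_time;  /* execution time: peripheral accesses included,  */
                              /* preemption delays excluded                     */
    unsigned long cpu_time;   /* cycles in which the task keeps the CPU busy    */
} task_record_t;

static void report(const task_record_t *t, int n)
{
    unsigned long last_end = 0, sojourn = 0, exec = 0, cpu = 0;
    for (int i = 0; i < n; i++) {
        if (t[i].end > last_end)
            last_end = t[i].end;              /* completion time of the whole set */
        sojourn += t[i].end - t[i].start;
        exec    += t[i].exec_time;
        cpu     += t[i].cpu_time;
    }
    printf("makespan (per task):    %.2f\n", (double)last_end / n);
    printf("average sojourn time:   %.2f\n", (double)sojourn  / n);
    printf("average execution time: %.2f\n", (double)exec     / n);
    printf("average CPU time:       %.2f\n", (double)cpu      / n);
}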

These figures have been measured for two topologies which we have modeled: parallel pipelines of processors and pipelines of parallel processors. Our goal has been to investigate how the scheduling of tasks over these processor clusters can be improved.

Fig. 10.9 Scheme of the generic multicore system employed in simulation


The basic topology consists of four lanes with eight stages each.

Every processing core bears four contexts in its basic configuration. In the case of parallel pipelines, tasks are not allowed to move from one lane to the next. In the case of pipelines of parallel processors, they may do so.

With respect to Fig. 10.1, in parallel pipelines of processors, every processor Pi,j can communicate only with Pi+1,j. In a pipeline of parallel processors, every processor Pi,j can communicate with any processor Pi+1,k, for all k ∈ {1, 2, 3, 4}.

We have randomized most characteristics to address generality. Both the migration points and the destination of each migration are random. The context switch policy is also completely randomized and reflects the generalized processor sharing discipline common in many applications which process streaming data. It has the effect of equalizing the sojourn times of the tasks within the system: if tasks T1 and T2 are started at the same time, instead of executing task T1 first until time D1 and subsequently task T2 until time D2, the execution of both tasks is distributed over the time max(D1, D2); consequently the average sojourn time will be max(D1, D2) instead of (D1 + D2)/2.
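A small sketch of this effect, with made-up cycle counts rather than the workload of Table 10.1: D1 and D2 denote the absolute completion times under a run-to-completion schedule, and a round-robin (processor-sharing) policy pushes both sojourn times toward max(D1, D2).

#include <stdio.h>

int main(void)
{
    const int w1 = 400, w2 = 600;        /* hypothetical CPU demands in cycles */

    /* Run-to-completion: T1 finishes at D1 = w1, T2 at D2 = w1 + w2. */
    double seq_avg = (w1 + (w1 + w2)) / 2.0;

    /* Round-robin with a one-cycle quantum on a single processor: both tasks
     * stay in the system until close to the makespan, so their sojourn times
     * equalize toward max(D1, D2). */
    int remaining[2] = { w1, w2 }, finish[2] = { 0, 0 }, t = 0;
    while (remaining[0] > 0 || remaining[1] > 0) {
        for (int i = 0; i < 2; i++) {
            if (remaining[i] == 0)
                continue;
            t++;                         /* global time advances one cycle per executed cycle */
            if (--remaining[i] == 0)
                finish[i] = t;
        }
    }
    double rr_avg = (finish[0] + finish[1]) / 2.0;

    printf("run-to-completion average sojourn: %.1f cycles\n", seq_avg);
    printf("round-robin average sojourn:       %.1f cycles\n", rr_avg);
    return 0;
}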

The figures for the tasks in isolation, reported in Table 10.1, correspond to the case in which 32 simultaneous tasks are executed on 32 parallel processors and show that 24% of the idle time in this workload is caused by dependencies between the tasks. Figure 10.10 shows how the sojourn time of 32 simultaneous tasks decreases and the processor idle time increases when moving from 32 contexts on a single processor to 32 single processors. It demonstrates that the idle time in the workload can be eliminated by multithreading.

10.6 Results

Our main results are summarized in Table 10.2. The scheduling performance achieved by the colored events is considerably higher than the one achieved by the standard events, i.e. pure dependency-based scheduling. Quadruplicating the width of the pipeline, and hence the number of processors, still does not cope completely with the task congestion.

Table 10.1 Workload characteristics

Instruction       %    Characteristics
Execution         88   Takes up to 2,000 instructions
Access            7    Random latency up to 8 cycles
Synchronization   3    REQUIRE up to 16 events out of 64; PROVIDE up to 32 events out of 64
Migration         1    Random migration points

Figures for the tasks in isolation:
Average execution time   1,026 cycles
Average CPU time         826 cycles
Average utilization      76%

The parallel pipelines deliver a better performance than their equivalent pipelines of parallel processors because there is less traffic. In the case of pipelines of parallel processors, tasks may need to wait longer because their destinations can be occupied by tasks from other lanes. This penalty is not compensated by the fact that some lanes increase their availability due to the tasks which leave them.

The reason why the colored events perform better is that they allow wormhole routing of tasks while retaining deadlock freedom. The problem is shown in Fig. 10.11: task A may overtake task B and fill up the free context in the stage below B. If A depends on B and B shall provide its dependency only after having moved to the subsequent stage, a deadlock happens because B cannot move to the next stage occupied by A and A cannot leave it without B having provided the dependency first.

Without carrying out a finer functional partition to solve the problem "manually", the overtaking of tasks must be disabled to avoid deadlocks.

Instead, the colored events sequence only dependent tasks over the available contexts; therefore they provide a less strict policy for deadlock-free routing than just disabling the overtaking.

Fig. 10.10 Workload sensitivity to multithreading

Table 10.2 Effect of task wormhole

Topology                              Makespan   Sojourn    Utilization (%)
Parallel pipelines (colored)          35.87      3,273.43   73
Parallel pipelines (standard)         90.05      1,697.96   29
Pipeline of parallels (colored)       40.94      3,689.08   63
Pipeline of parallels (standard)      187.32     2,477.33   14
Pipeline of parallels (double size)   91.70      2,742.70   14
Pipeline of parallels (quad size)     55.79      3,250.93   12


In Fig. 10.12 we report the effect of increasing the number of contexts in a cluster of parallel pipelines. The makespan can be greatly reduced by moving from one context to two, but it does not improve much when adding more than three contexts: further increases in the sojourn time of the tasks do not eliminate further idle time.

Subsequently, we have analyzed the effect of increasing the depth of several parallel pipelines and pipelines of parallel processors. Figure 10.13 represents the outcome of the measurements for a set of 64 processors bearing four contexts each.

The processors have been initially organized in 32 parallel groups of two stages each and subsequently in 16, 8, 4 and 2 groups of respectively 4, 8, 16 and 32 stages each. From the data in Fig. 10.13, we can estimate an increase of the makespan by about 5% for every halving of the number of parallel groups and doubling of the group depth (e.g. going from 32 groups of two stages to 2 groups of 32 stages involves four halvings, i.e. roughly 1.05^4 ≈ 1.22, or about a 22% longer makespan).

The additional flexibility of a pipeline of parallel processors costs from 13.5% (narrowest configuration: 2 groups of 32 stages each) to 25% (widest configuration: 32 groups of two stages each) in terms of makespan for a random workload. The sojourn time increases about 1% slower than the makespan because of the lower utilization achieved in the last stages of the narrower configurations.

Fig. 10.11 Deadlock in wormhole routing of tasks

Fig. 10.12 Effect of increasing the number of contexts on a cluster of four parallel pipelines of eight processors each


The results of Fig. 10.14 show the performance increase achieved by adding stages of four processors each to a four-processor-wide configuration. Every doubling of the pipeline depth leads to a performance increase of about 80%, with the pipeline of parallel processors delivering between 13.5 and 17.5% less performance than its equivalent parallel pipelines of processors.

Fig. 10.13 Effect of task congestion in a pipeline of parallel processors

Fig. 10.14 Delay caused by deeper parallel pipelines

10.7 Conclusions

We have presented a novel algorithm for scheduling tasks on multicore architectures. Its most striking feature is the hardware support for avoiding deadlocks and livelocks. In comparison to the fundamental algorithm by Tomasulo in [4, 5], we have introduced a separate declaration stage on a dedicated serial bus (Q-bus) and multiple requirement and provision stages. This generalization allows us to employ the algorithm for detecting and renaming data dependencies across multiple concurrent tasks, rather than across single instructions.

The approach of employing dependency renaming for scheduling tasks has been proposed in software by Perez et al. in [7], but it requires tasks of coarse granularity (10^5 cycles or more) to deliver good performance. Instead, our hardware approach can efficiently schedule tasks of much finer granularity (down to a few tens of cycles), which perform much better on embedded applications like the ones examined by Stensland et al. in [8].

Within our generalization, we have introduced the colored events for dealing with hardware resources supporting multiple concurrent accesses.

We have applied the colored events to the scheduling of tasks over pipelines of processors and we have shown that they allow deadlock-free wormhole scheduling of tasks across multithreaded processor networks. We have presented numerical evidence of how this scheduling can deliver more performance than a large increase in the number of processors.

This algorithm has been validated by intensive simulation. We have also carried out some hardware implementations, but they are not final and remain a subject for future work.

This approach provides a partial sequencing of tasks with regard to selected resources, but it does not clash with other existing scheduling techniques, e.g. those aimed at increasing performance. As the number of processing cores per chip keeps increasing, traditional synchronization techniques will not cope with this scaling; we believe that this approach provides a more advanced and distributed sequencing technique, enabling a smooth transition from existing legacy code.

Acknowledgments This work has been partially supported by the German Federal Ministry of Education and Research (BMBF) under the project RapidMPSoC, grant number BMBF-01M3085B.

References

1. Lee EA (2006) The problem with threads. Computer 39(5):33–42

2. Ayguadé E, Copty N, Duran A, Hoeflinger J, Lin Y, Massaioli F, Teruel X, Unnikrishnan P, Zhang G (2009) The design of OpenMP tasks. IEEE Trans Parallel Distributed Syst 20(3):404–418

3. Bellens P, Perez JM, Badia RM, Labarta J (2006) CellSs: a programming model for the Cell BE architecture. In: SC ‘06: Proceedings of the 2006 ACM/IEEE conference on supercomputing. ACM, New York

4. Tomasulo RM (1967) An efficient algorithm for exploiting multiple arithmetic units. IBM J Res Dev 11(1):25–33

5. Tomasulo RM, Anderson DW, Powers DM (1969) Execution unit with a common operand and resulting bussing system. United States Patent, August, number US3462744

6. Duran A, Pérez JM, Ayguadé E, Badia RM, Labarta J (2008) Extending the OpenMP tasking model to allow dependent tasks. In: International workshop on OpenMP ‘08, pp 111–122


7. Perez J, Badia R, Labarta J (2008) A dependency-aware task-based programming environment for multi-core architectures. In: IEEE international conference on cluster computing, October 2008, pp 142–151

8. Stensland HK, Griwodz C, Halvorsen P (2008) Evaluation of multicore scheduling mechanisms for heterogeneous processing architectures. In: NOSSDAV ‘08: Proceedings of the 18th international workshop on network and operating systems support for digital audio and video. ACM, New York, pp 33–38

9. Frigo M, Leiserson CE, Randall KH (1998) The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN '98 conference on programming language design and implementation, Montreal, Quebec, Canada, June 1998, pp 212–223 (proceedings published as ACM SIGPLAN Notices, vol 33(5), May 1998)

10. OpenMP Architecture Review Board (2008) OpenMP application program interface, version 3.0. Available online: http://www.openmp.org/mp-documents/spec30.pdf

11. Salverda P, Zilles C (2008) Fundamental performance constraints in horizontal fusion of in-order cores. In: 14th international symposium on high performance computer architecture (HPCA), pp 252–263

