
3  State of the Art Multi-Core Operating Systems

3.5 Current Main-Stream Operating Systems

3.5.1 Linux

Linux was developed by Linus Torvalds at the University of Helsinki and first released as an open source operating system in 1992. Since then it has undergone several major redesigns and today is one of the most widely used operating systems, especially in server installations and supercomputers (about 80% of the top 500 supercomputers run Linux as the operating system [11]).

Linux is an open source SMP system with a monolithic kernel, a feature famously highlighted in the debate between Torvalds and Tanenbaum, the leader of the MINIX project (a micro-kernel based Unix variant). The kernel has gone through several major redesigns, most recently the replacement of the core scheduler in release 2.6.23 (at the time of writing, 2010, the kernel is in the 2.6.3x release series), so it is difficult to describe the features of Linux in general; the description given in this chapter is based on version 2.6.33.

3.5.1.1   Scheduling

Scheduling in Linux is structured on two levels: the core scheduler and the scheduler classes. The division of work between the two layers is well defined: the scheduler-class layer makes the scheduling decisions according to the various policies, i.e. it decides which task to run next, while the core scheduler takes care of general task management and task switching, independently of how, i.e. according to which policy, the next task to execute was selected. The overall architecture and the relationship between the layers is shown in Fig. 3.5.
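To make this division of labor more concrete, the following C sketch shows a minimal scheduler-class interface, loosely inspired by the kernel's struct sched_class; the type and field names are illustrative assumptions rather than the kernel's actual definitions.

/* Minimal sketch of the two-level split, loosely inspired by the kernel's
 * struct sched_class; names and fields are illustrative, not the real ones. */
struct task;          /* a schedulable entity */
struct runqueue;      /* per-CPU run queue    */

struct sched_class {
    const struct sched_class *next;   /* next class, in strict priority order */

    /* Policy decisions: which task runs next, bookkeeping on enqueue/dequeue. */
    void         (*enqueue_task)(struct runqueue *rq, struct task *t);
    void         (*dequeue_task)(struct runqueue *rq, struct task *t);
    struct task *(*pick_next_task)(struct runqueue *rq);

    /* Periodic hook, called from the core scheduler's tick handler. */
    void         (*task_tick)(struct runqueue *rq, struct task *t);
};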

The core scheduler is activated in two ways: either a task yields the CPU, or a mechanism that runs periodically decides whether a task switch is necessary. It consists of two main functions: the periodic scheduler and the main scheduler. The periodic scheduler performs, besides the collection of scheduling-specific statistics, two major tasks: it decides on and, if needed, performs load re-balancing between CPUs in an SMP system, and it activates the periodic scheduling method of the scheduling class to which the currently running task belongs. The class, in turn, may decide that a switch of active tasks is needed and indicates this by setting a kernel flag that signals the need to run the main scheduler.
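A rough sketch of these two entry points is shown below; all helper names are hypothetical stand-ins for the kernel's internal machinery, and the per-task flag test represents the kernel flag mentioned above.

/* Hypothetical sketch of the two entry points of the core scheduler;
 * every helper function is an illustrative stand-in, not a kernel symbol. */
struct runqueue;
struct task;

void update_stats(struct runqueue *rq);
int  load_balance_due(struct runqueue *rq);
void rebalance(struct runqueue *rq);
void class_task_tick(struct runqueue *rq, struct task *t); /* scheduler class hook */
int  need_resched(struct task *t);
struct task *pick_next(struct runqueue *rq);
void switch_to_task(struct runqueue *rq, struct task *next);

/* Periodic scheduler: statistics, re-balancing check, and the class tick hook. */
void scheduler_tick(struct runqueue *rq, struct task *curr)
{
    update_stats(rq);
    if (load_balance_due(rq))
        rebalance(rq);
    class_task_tick(rq, curr);     /* may set the per-task "reschedule" flag */
}

/* Main scheduler: runs at well-defined kernel points if the flag is set. */
void schedule_if_needed(struct runqueue *rq, struct task *curr)
{
    if (need_resched(curr))
        switch_to_task(rq, pick_next(rq));
}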

The main scheduler is invoked at many points throughout the kernel, for example after system calls, if the kernel flag indicating the need to reschedule is set; it is the entity that actually decides on and triggers a task switch. At this point it is necessary to clarify the relationship between the different scheduling classes (or scheduling policies): from the main scheduler's perspective they are organized in a strict hierarchy, and no task belonging to a lower priority scheduling class will be selected as long as there are runnable tasks belonging to higher priority scheduling classes. Each task, on the other hand, belongs to exactly one scheduling class. Due to this algorithm the current main scheduler is also called the priority scheduler, as it selects the next task based on the priority of the scheduling class to which it belongs; within the same priority level, however, the decision, i.e. the task selection policy, is delegated to the scheduler class.
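Reusing the illustrative sched_class structure from the earlier sketch, the priority-based selection loop can be pictured as follows (again an assumed sketch, not the kernel's actual code):

#include <stddef.h>

/* Illustrative priority scheduler: walk the classes in priority order and
 * take the first runnable task; lower-priority classes are never consulted
 * while a higher-priority class still has runnable tasks. */
struct task *pick_next_task_priority(struct runqueue *rq,
                                     const struct sched_class *highest)
{
    const struct sched_class *class;
    struct task *t;

    for (class = highest; class != NULL; class = class->next) {
        t = class->pick_next_task(rq);
        if (t != NULL)
            return t;
    }
    return NULL;   /* no runnable task; in practice the idle task runs */
}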

It is important to clarify what exactly is scheduled by the Linux kernel. We have so far used the term task; in reality, the kernel works with schedulable entities, where a schedulable entity may be a single thread but also a group of processes. From the point of view of the overall scheduling architecture this distinction is practically irrelevant, hence our choice of the term task.

Fig. 3.5  The architecture of the Linux scheduler: the core scheduler (the main, priority scheduler, one instance per CPU) sits on top of the scheduler classes (priority ordered, each class with one run queue per CPU).

Linux today supports, by default, two scheduler classes, each with different scheduling policies. The (soft) real-time scheduling class has the higher priority and supports the round-robin and FIFO scheduling policies; for a thorough discussion of this scheduling class we recommend one of the many books and sources available, such as Ref. [12]. Here we will focus on the second, more widely used general purpose scheduling class, which supports the normal, batch and idle scheduling policies.
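From user space, a process can request one of these policies through the standard sched_setscheduler() interface; the minimal example below switches the calling process to the real-time FIFO policy and back (error handling kept short; real-time policies require appropriate privileges).

/* Request the (soft) real-time FIFO policy for the calling process;
 * requires appropriate privileges (e.g. CAP_SYS_NICE / root). */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 10 };  /* RT priority 1..99 */

    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* Switch back to the general-purpose class (normal policy, priority 0). */
    param.sched_priority = 0;
    if (sched_setscheduler(0, SCHED_OTHER, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    return 0;
}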

The general Linux scheduling class used prior to kernel version 2.6.23 was based on the O(1) scheduler [13], capable of scheduling processes within a constant amount of time; it was replaced with a new, O(log n) scheduler called the Completely Fair Scheduler (CFS), based on the Rotating Staircase Deadline scheduler, also developed within the Linux kernel community. Consequently, this main scheduling class is today also called the completely fair scheduling class.

The basic principle of CFS is to give each task a share of computational power that is as close to ideal fairness as possible. In the simplified case of N tasks with the same priority, this would mean equal processor time for each. The core method to approximate such a situation is to keep an ordered list of tasks (implemented as a red–black tree), so that the tasks with the longest waiting times are at the head of the list and will execute next. In fact, the quantity by which tasks are ordered is adjusted to take into account the amount of time each task should receive, and hence simulates the ideal case more precisely; this adjustment is based on the virtual clock of the CFS, which weights the wall clock by the number of available tasks.
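The sketch below illustrates this idea in a much simplified form, with a linear scan instead of a red–black tree and a hypothetical per-task weight derived from priority; it is not the CFS implementation, only the ordering principle.

/* Very simplified CFS-style selection: tasks are ordered by a weighted
 * "virtual runtime"; the task that has received the least weighted CPU
 * time so far runs next.  Uses a linear scan instead of a red-black tree. */
#include <stddef.h>

struct vtask {
    unsigned long long vruntime;   /* weighted CPU time received so far */
    unsigned int       weight;     /* derived from the task's priority  */
};

/* Charge 'delta_ns' of real execution time to task t: a higher weight
 * (higher priority) makes vruntime grow more slowly, giving a larger share. */
static void account_runtime(struct vtask *t, unsigned long long delta_ns)
{
    t->vruntime += delta_ns * 1024ULL / t->weight;   /* 1024: reference weight */
}

/* Pick the task with the smallest vruntime (the head of the ordered structure). */
static struct vtask *pick_next_fair(struct vtask *tasks, size_t n)
{
    struct vtask *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}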

Once a task is selected for execution, its waiting time is periodically decreased by the amount of time it has been allowed to run; eventually it will no longer be at the head of the ordered list, triggering the selection of another task for execution, namely the task that has become first in the list.

In practice, this algorithm has to be fine-tuned to cater for several constraints: different priority levels shall be factored into the waiting time of tasks (tasks with higher priority shall have a larger 'fair share' of time than lower prioritized tasks), and fairness shall be weighed against the cost of switching tasks too often, as the overhead of frequent switching may outweigh its benefits. In fact, the Linux kernel has two built-in parameters that control the latency of scheduling: the first one indicates the time period within which all tasks must get the chance to execute at least once (the default value is 20 ms); the second one sets the maximum number of tasks that are supposed to be handled within this time period (if this configured value is exceeded, the interval is extended linearly).
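As an illustration of how these two parameters interact, the following sketch computes the effective scheduling period; the names are assumptions, and only the 20 ms default period is taken from the text above.

/* Illustrative computation of the effective scheduling period: if more tasks
 * are runnable than the configured maximum, the period is stretched linearly
 * so that each task still gets the same minimum slice.  Names are illustrative. */
#define SCHED_LATENCY_NS   20000000ULL   /* 20 ms target period (default) */

unsigned long long effective_period(unsigned int nr_running,
                                    unsigned int nr_latency_max)
{
    if (nr_running <= nr_latency_max)
        return SCHED_LATENCY_NS;
    /* Linear extension: keep the per-task slice constant. */
    return (SCHED_LATENCY_NS / nr_latency_max) * nr_running;
}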

3.5.1.2   Multi-Processor and Multi-Core Support

The multi-processor support of Linux is exclusively centered on symmetric multi-processing, with support for non-uniform topologies (such as NUMA systems), but assuming CPUs with equal capabilities.


SMP support in Linux is an extension of single-processor scheduling. Each CPU has its own scheduler, but this is coupled with a periodic re-balancing between CPUs. Essentially, at each periodic invocation (tick) of the scheduler, on each CPU, the need for rebalancing is checked—in practice, this means that if sufficient time has elapsed since the last rebalancing, a new rebalancing procedure is initiated.

The rebalancing is done per scheduling class and always within scheduling domains. A scheduling domain is a set of CPUs within which rebalancing can be performed (e.g., CPUs that share the same board, processor socket or NUMA node). In a perfectly flat SMP system there would naturally be only one scheduling domain.

The rebalancing is based on a task-stealing mechanism. When rebalancing is decided, the thief CPU identifies the CPU with the busiest run queue and, if the load on that CPU is higher than its own, attempts to move tasks from that CPU to itself (only tasks that are not currently executing may be moved). If the tasks cannot be moved for some reason, the CPU with the busiest queue is triggered to off-load some of its tasks itself; this is done by a special thread called the migration thread, associated with every CPU, and the procedure is called active balancing: tasks are moved forcefully if it is deemed necessary.
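The following sketch captures the task-stealing idea; every helper name is an illustrative stand-in, and the real implementation additionally involves scheduling domains, per-class balancing callbacks and careful locking.

#include <stddef.h>

/* Illustrative task-stealing rebalance on the "thief" CPU; all helpers are
 * stand-ins for the real per-class / per-domain machinery. */
struct cpu_rq;

struct cpu_rq *find_busiest_queue(struct cpu_rq *domain_cpus[], int n);
unsigned long  queue_load(struct cpu_rq *rq);
int            pull_tasks(struct cpu_rq *from, struct cpu_rq *to); /* skips running tasks */
void           wake_migration_thread(struct cpu_rq *overloaded);   /* active balancing    */

void rebalance_tick(struct cpu_rq *self, struct cpu_rq *domain_cpus[], int n)
{
    struct cpu_rq *busiest = find_busiest_queue(domain_cpus, n);

    if (busiest == NULL || busiest == self)
        return;
    if (queue_load(busiest) <= queue_load(self))
        return;                                 /* nothing to gain */

    if (pull_tasks(busiest, self) == 0)
        /* Could not move anything (e.g. all candidate tasks are running):
         * ask the busiest CPU's migration thread to push work away itself. */
        wake_migration_thread(busiest);
}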

The distributed nature of this rebalancing method is both its strength and its weakness. As it is not a synchronized activity, it can work autonomously, e.g. across multiple scheduling domains; for the same reason, however, it may be a source of contention between CPUs, as several may attempt to move tasks from the same busiest CPU. The run queue of any CPU may be inspected at any time by any other CPU, and attempts to modify it may happen at any time, concurrently with local scheduling decisions. While careful locking schemes can make the mechanism work smoothly for a reasonable number of CPUs, it may become impractical for large scale SMP systems, where it is often either disabled or confined to reasonably sized scheduling domains.

3.5.1.3   Memory Management

Memory management in Linux is based on the buddy allocator for general purpose memory management and the slab allocator for kernel-specific memory allocation.
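As an illustration of the kernel-side interface, the fragment below shows typical (assumed) usage of a dedicated slab cache; it is kernel code, not a user-space program, and the object type and cache name are invented for the example.

/* Kernel-code sketch: a dedicated slab cache for fixed-size objects. */
#include <linux/slab.h>

struct my_object {
    int  id;
    char payload[56];
};

static struct kmem_cache *my_cache;

static int my_cache_init(void)
{
    my_cache = kmem_cache_create("my_object_cache", sizeof(struct my_object),
                                 0 /* align */, 0 /* flags */, NULL /* ctor */);
    return my_cache ? 0 : -1;
}

static void my_cache_use(void)
{
    struct my_object *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (obj)
        kmem_cache_free(my_cache, obj);
}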

The address space is divided, usually in a 3:1 ratio, between applications and the kernel (on 32-bit systems, 3 GB for applications and 1 GB for the kernel).

From the point of view of multi-processor and multi-core support, Linux offers support for NUMA systems and memory page migration. For multi-processor systems in general and NUMA architectures in particular, memory is organized around the concepts of node and CPU set. Each node owns part of the physical memory and in general corresponds to one CPU (or to several processor cores that share part of the memory). A CPU set groups several CPUs (nodes) into one group that has uniform access to part of the memory.
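The user-visible interface to CPU sets is the cpuset pseudo-filesystem; the sketch below creates a set covering CPUs 0-3 and memory node 0 and moves the current process into it, assuming the filesystem is mounted at /dev/cpuset (the mount point, set name and CPU/node numbers are assumptions).

/* Sketch: confine the current process to CPUs 0-3 and memory node 0 via the
 * cpuset pseudo-filesystem; assumes it is mounted at /dev/cpuset. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s", val);
    return fclose(f);
}

int main(void)
{
    char pid[32];

    mkdir("/dev/cpuset/myset", 0755);                 /* create a new cpuset */
    write_str("/dev/cpuset/myset/cpus", "0-3");       /* CPUs in the set     */
    write_str("/dev/cpuset/myset/mems", "0");         /* memory node(s)      */

    snprintf(pid, sizeof(pid), "%d", getpid());
    write_str("/dev/cpuset/myset/tasks", pid);        /* move this process   */
    return 0;
}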

Applications may be assigned to certain CPU sets, and the OS supports automatic migration of memory pages in case the application is re-assigned; the same effect can be achieved by defining memory policies that force the allocation of memory for a specific application from a specific set of nodes. The memory policy manager can be configured to perform automatic page migration by redefining the set of nodes on which the pages of that specific application may be allocated, coupled with an explicit request to comply with the new policy (which in practice means page migration).
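Such a policy, including the page-migration request, can be expressed from user space with the mbind() interface provided by libnuma; the fragment below binds an anonymous buffer to node 0 and asks for already-allocated pages to be moved (a usage sketch; the node number and buffer size are arbitrary, and the program must be linked with -lnuma).

/* Sketch: bind an existing buffer to NUMA node 0 and migrate its pages. */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 4 * 1024 * 1024;
    unsigned long nodemask = 1UL << 0;       /* bit 0: NUMA node 0 */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* MPOL_BIND restricts the pages to node 0; MPOL_MF_MOVE asks the kernel
     * to migrate pages that were already allocated elsewhere. */
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE) != 0) {
        perror("mbind");
        return 1;
    }
    munmap(buf, len);
    return 0;
}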

Besides these two methods, assignment to CPU sets and memory management policies, the Linux kernel does not support any additional automatic page migration policies; the fundamental principle guiding the memory management design has always been to leave this control to the application, for lack of a good method to predict the best placement of memory pages.

3.5.1.4   Linux: Summary

Linux is a monolithic kernel, SMP-based, multi-processor and multi-core enabled operating system. It has been used for very large SMP deployments, but usually with quite strong constraints in place to prevent automatic task migration and cross-node memory allocations: experiments have shown that these are the areas where Linux faces the most severe scalability issues. Research has also shown [14] that Linux is quite intrusive with regard to cache behavior: system calls tend to evict application-specific cache content, impacting application performance.
