2.3 Parallel Computing

2.3.1 From Supercomputing to Soupercomputing

A supercomputer is the fastest computer of its time; today’s supercomputer is tomorrow’s desktop or laptop computer. One of the first supercomputers of historical significance was the Cray-1, which was used quite successfully in many applications involving large-scale simulation in the early 1980s. The Cray-1 was not a parallel computer, however; it employed a powerful (at the time) vector processor with many vector registers attached to the main memory (see figure 2.13). Today, all supercomputers are parallel computers. Some are based on specialized processors and networks, but the majority are based on commodity hardware and open-source operating systems and applications software. In this section, we will briefly review some of the history and the recent trends.

Figure 2.13: Schematic of the first Cray computer, the Cray-1.

Types of Parallel Computers

A popular taxonomy for parallel computers is the classification introduced by Michael Flynn in the mid-1960s [36], which describes the programming model as single instruction/multiple data stream (SIMD) or multiple instruction/multiple data stream (MIMD). In a SIMD computer, such as the Thinking Machines CM-2 or the NCUBE Inc. computers of the 1980s, each processor performs the same arithmetic operation (or stays idle) during each computer clock, as controlled by a central control unit (see figure 2.14). In this model (also referred to as a data parallel program), high-level languages (e.g., CM Fortran, C*, and *Lisp) are used, and computation and communication among processors are synchronized implicitly at every clock period.

On a MIMD computer (see figure 2.15), each of the parallel processing units executes operations independently of the others, subject to synchronization through appropriate message passing at specified time intervals. Both the parallel data distribution and the message passing and synchronization are under user control. Examples of MIMD systems include the Intel Gamma and Delta Touchstone computers and, with fewer but more powerful processors, the Cray C-90 and the first generation of the IBM SP2 (all made in the 1990s).

Figure 2.14: Schematic of SIMD parallel computer.

While it is often easier to design compilers and programs for SIMD multiprocessors because of the uniformity among processors, such systems may be subject to great computational inefficiencies. This is due to their inflexibility when stages of a computation are encountered in which there is not a large number of identical operations. There has been a natural evolution of multiprocessor systems towards the more flexible MIMD model, especially the merged programming model in which there is a single program (perhaps executing distinct instructions) on each node. This merged programming model is a hybrid between the data parallel model and the message passing model, and it was successfully exemplified in the Connection Machine CM-5. In this SPMD (single program multiple data) model, data parallel programs can enable or disable the message passing mode, and thus one can take advantage of the best features of both models.
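To make the SPMD model concrete, the following minimal C++/MPI sketch (not taken from the text; the workload and variable names are illustrative assumptions) shows a single program that every process executes: each process takes a different branch of the same code depending on its rank, works on its own block of data, and the partial results are combined through a message passing call.

#include <mpi.h>
#include <iostream>

// Minimal SPMD sketch: one program, P processes, rank-dependent work.
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes in total?

    // Every process runs the same program but on its own data:
    // here each rank sums a different block of the integers 1..N.
    const int N = 1000;
    int chunk = N / size;
    int lo = rank * chunk + 1;
    int hi = (rank == size - 1) ? N : lo + chunk - 1;

    double local_sum = 0.0;
    for (int i = lo; i <= hi; ++i)
        local_sum += i;

    // Message passing combines the partial results on rank 0.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)                          // only one branch prints
        std::cout << "Sum = " << global_sum << std::endl;

    MPI_Finalize();
    return 0;
}

The same executable is launched on every node (e.g., with mpirun -np 4 ./a.out); the data and the branches taken differ from process to process, but the program does not, which is precisely the single program multiple data idea.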

Figure 2.15: Schematic of MIMD parallel computer.

MIMD computers can have either shared memory, as in the SGI Origin 2000, or distributed memory, as in the IBM SP system. The issue of shared memory requires further clarification, as it is different from centralized memory. Shared memory means that a single address space can be accessed by every processor through a synchronized procedure. In non-shared memory systems, explicit communication procedures are required. The prevailing paradigm in parallel computing today is one where the physical memory is distributed but the address space is shared, as this is a more flexible and easier programming environment.
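The distinction matters to the programmer: in a distributed-memory system a processor cannot simply read a value that resides in another processor’s memory, so the transfer must be programmed explicitly. The following minimal C++/MPI sketch (not taken from the text; the value and the message tag are arbitrary illustrations) shows such an explicit communication procedure, which a shared address space would render unnecessary.

#include <mpi.h>
#include <iostream>

// Explicit communication: rank 0 owns a value that rank 1 needs.
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = 0.0;
    const int tag = 99;                     // arbitrary message tag

    if (rank == 0) {
        x = 3.14;                           // data lives in rank 0's local memory
        MPI_Send(&x, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Without a shared address space, rank 1 must receive the value explicitly.
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "Rank 1 received x = " << x << std::endl;
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g., mpirun -np 2). In a shared address space the second processor could read the value directly through a synchronized access; with distributed memory, the data movement is under explicit user control, as above.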

PC Clusters

The most popular and cost-effective approach to parallel computing is cluster computing, based, for example, on PCs running the Linux operating system (hereafter referred to simply as Linux). The effectiveness of this approach depends on the communication network connecting the PCs together, which may range from fast Ethernet to Myrinet networks that can broadcast messages at a rate of several Gigabits per second (Gb/s).

Figure 2.16: Schematic of Generic Parallel Computer (GPC).

Issues of computer design, such as the balance between memory, network speed, and processing speed, can be addressed by examining the Generic Parallel Computer (GPC) depicted in figure 2.16.

The key components of the GPC are an interconnected set of P processing elements (PE) with distributed local memories, a shared global memory, and a fast disk system (DS). The GPC serves as a prototype of most PC-based clusters, which have dominated supercomputing in the last decade on both the scientific and the commercial front.

The first PC cluster was designed in 1994 at NASA Goddard Space Flight Center to achieve one Gigaflop. Specifically, 16 PCs were connected together using a standard Ethernet network. Each PC had an Intel 486 microprocessor with sustained performance of about 70 Megaflops. This first PC cluster was built for only $40,000, compared to the $1 million that a commercially equivalent supercomputer cost at that time. It was named Beowulf after the lean hero of medieval times who defeated the giant Grendel. In 1997, researchers at Oak Ridge National Laboratory built a Beowulf cluster from many obsolete PCs of various types; for example, in one version it included 75 PCs with Intel 486 microprocessors, 53 Intel Pentium PCs, and five fast Alpha workstations. Dubbed the stone soupercomputer because it was built at almost no cost, this heterogeneous PC cluster was able to perform important simulations, producing detailed national maps of ecoregions based on almost 100 million degrees of freedom [54]. A picture of this first soupercomputer is shown in figure 2.17.

Figure 2.17: Soupercomputer of the Oak Ridge National Laboratory. (Courtesy of F. Hoffman)

Building upon the success of the first such system, the BEOWULF project [7, 81], several high-performance systems have been built that utilize commodity microprocessors with fast interconnects exceeding one Gigabit per second in bandwidth. Moore’s law (an empirical statement made in 1965 by the Intel co-founder Gordon Moore) suggests that the performance of a commodity microprocessor doubles every 18 months, which implies that, even without fundamental changes in the fabrication technology, processors with a speed of several tens of Gigaflops can become available. Nanotechnology can help in prolonging the validity of this statement, which has held true for at least four decades. New developments, such as the TeraHertz transistor and the packaging of more than one billion transistors on a single chip, will hopefully keep Moore’s law alive. (Intel’s Pentium-4, see figure 2.12, has about 42 million transistors.)

In addition to enhancements in the speed of individual processors, there have been several key developments that have enabled commodity supercomputing:

• The development and maturation of the free operating system Linux, which is now available for all computer platforms. The freely distributable system and the open source software movement have established Linux as the operating system of choice, so almost all PC clusters are Linux based.

• The MPI standard, which has made parallel coding portable and easy. There are several implementations, such as MPICH, SCore, etc., but they all share the same core commands, which we present in this book.

• The rapid advances in interconnects and fast switches with small latencies, which are now widely available, unlike in the early days of proprietary and expensive systems available only from a few big vendors.

Grid Supercomputing

The computational grid is a new distributed computing paradigm, similar in spirit to the electric power grid. It provides scalable high-performance mechanisms for discovering and negotiating access to geographically remote resources. It came about through advances in the internet and the world wide web, together with the fact that, similarly to Moore’s law for computer speed, the speed of networks doubles about every nine months. This is twice the rate of Moore’s law, and it implies that the performance of a wide area network (WAN) increases by two orders of magnitude every five years!
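As a quick check of this figure (the arithmetic below is ours and is not part of the original text): with the network speed doubling every nine months, five years contain 60/9 ≈ 6.7 doubling periods, so the speed grows by a factor of

\[
2^{60/9} \approx 2^{6.7} \approx 10^{2},
\]

i.e., roughly a factor of one hundred, which is the two orders of magnitude quoted above.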

Computing on remote platforms involves several steps: first identifying the available sites, then negotiating fast access to them, and finally configuring the local hardware and software to access them. The Grid provides the hardware and software infrastructure that allows us to do this.

The community-based, open-source Globus toolkit is the most popular software infrastructure [38]; see also

http://www.globus.org

It implements protocols for secure identification, allocation and release of resources from a globally federated pool of supercomputers, i.e., the Grid.

The Grid also allows the implementation of network-enabled solvers for scientific computing, such as the package NetSolve [14]. NetSolve searches for available computational resources within the Grid and chooses the best available resource based upon some sort of match-making procedure. It consists of three parts: a client, an agent, and a server. The client is the user issuing a request, which is received by the agent. The latter allocates the best server or servers, which perform the computation and return the results to the client. The server is a daemon process that is on the alert, awaiting requests from the client.

Performance Measurements and Top 500

As regards the performance of parallel computers, there is no universal yardstick to measure it, and in fact the use of a single number to characterize performance, such as the peak performance quoted by the manufacturer, is often misleading. It is common to evaluate performance in terms of benchmark runs consisting of kernels, algorithms, and applications so that different aspects of the computer system are measured. This approach, however, is still dependent on the quality of software rather than just hardware characteristics. The controversy over performance evaluation methods has been recognized by the computer science community, and there have been several recent attempts to provide more objective performance metrics for parallel computers [57]. A discussion of some of the most popular benchmarks, the BLAS routines, was presented in section 2.2.7, and more information can be found on the web at:

http://www.netlib.org/benchmark

A good basis for the performance evaluation of supercomputers is also provided by the Top500 list; see:

http://www.top500.org/

This list was created by Dongarra in the early 1990s and is updated twice a year. It reports the sites around the world with the 500 most powerful supercomputers. Performance on the LINPACK benchmark [28] is the measure used to rank the computers. This is a code that solves a system of linear equations (see chapter 9), using the best software for each platform. Based on the data collected so far and the current Teraflop sustained speeds achieved, it is predicted that the first Petaflop/s (10^15 floating point operations per second) supercomputer will be available around 2010 or perhaps sooner.
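For reference, the rate quoted in such rankings is obtained by dividing a nominal operation count by the measured solution time; for a dense n × n system solved by Gaussian elimination, the conventional LINPACK count (stated here as general background, not taken from the text) is

\[
\mathrm{flops}(n) \approx \tfrac{2}{3}\,n^{3} + 2\,n^{2},
\qquad
\mathrm{rate} = \frac{\mathrm{flops}(n)}{t_{\mathrm{solve}}}\ \ \mathrm{[flop/s]}.
\]

For example, sustaining one Petaflop/s on a problem of size n = 10^6 would require completing the solve in roughly (2/3) · 10^18 / 10^15 ≈ 670 seconds.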
