

Automatic Deployment Tool for DIET

A Thesis submitted in fulfillment of the requirements for Diplôme d'études Approfondies

in

Informatique Fondamentale by

Pushpinder Kaur Chouhan

Under the guidance of

Dr. Eddy Caron, Assistant Professor at ENS-Lyon

GRAAL, LIP

Report No. DEA-2003-01

Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon

July 2003


Acknowledgments

I would like to take this opportunity to thank Prof. Pierre Lescanne for giving me an opportunity to do my DEA at ENS-Lyon.

I am greatly indebted to my supervisor Prof. Eddy Caron of ENS-Lyon for his invaluable technical guidance and moral support during the internship. I am grateful to Prof. Frédéric Desprez for his suggestions during our project meetings.

I am grateful to Prof. Yves Robert, as his guidance has helped me to select a field, and thus to start a career. I am indebted to Prof. Jean-Michel Muller for giving me an opportunity to complete my DEA internship at LIP, ENS-Lyon.

I would like to thank Arnaud Legrand for his kind help in setting the initial stage for the project and his valuable suggestions throughout. His suggestions during the deadlocks helped give the project renewed momentum.

My sincere thanks to Philippe Combes for his opinions, suggestions and (most useful of all) DIET code, which were very useful for the simulations and experiments. I sincerely appreciate his interest in this project and his feedback during our informal discussions.

I would like to thank the members of the GRAAL team for their cooperation and support during the internship. Last but not least, I would also like to thank the staff of ENS-Lyon for their help.


Contents

1 Introduction 4

1.1 Background . . . 4

2 Related Work 6

2.1 Grid Computing . . . 6

2.2 Problem Solving Environment . . . 6

2.3 NES/GridRPC Systems . . . 7

2.3.1 NetSolve . . . 7

2.3.2 NINF . . . 7

2.3.3 NEOS . . . 8

2.3.4 Others . . . 8

3 DIET: Distributed Interactive Engineering Toolbox 9

3.1 Introduction . . . 9

3.2 Components . . . 9

3.3 Architecture . . . 10

3.4 Initialization . . . 10

3.5 Solving a problem . . . 11

4 The master-slave scheduling problem 12

5 Deployment 13

5.1 Architectural Model . . . 14

5.2 Steady State operation . . . 15

5.3 Homogeneous Network . . . 17

5.4 Heterogeneous Network . . . 19

6 Conclusion and Future Work 22


List of Figures

1 DIET Architecture . . . 9

2 Initialization of a DIET system . . . 10

3 Problem submission example . . . 11

4 Classification of the operating models. . . 12

5 Throughput of graphs in different operating models . . . 13

6 Architectural Model . . . 14

7 Different homogeneous networks . . . 17

8 Throughput of different graphs with 8 nodes . . . 18

9 Throughput of different graphs with 32 nodes . . . 18

10 Throughput of binary graph with different number of nodes . . . 19

11 Heterogeneous network . . . 20

12 Throughput of heterogeneous network as more LAs are added . . . 21


Abstract

With the advances in high speed networks, distributed heterogeneous computing has become an attractive computational paradigm, and many Network Enabled Servers (NES) have been developed to take advantage of it. In order to achieve high performance, the distributed NES components must be organized in some particular fashion, but so far no tool has been provided to organize these components well. We have developed a tool, the “Network Selection Tool”, for the automatic deployment of DIET on the Grid; it can, however, be used for any hierarchical NES. The tool can find the bottleneck in a network and, by breaking that bottleneck, improve the network's performance. With its help we can also determine which of a set of candidate structures performs best, and predict the effect on performance of specific changes to the network. We ran simulations to check the performance of this tool, and the results are encouraging.

1 Introduction

With the spread of Internet use and its increasing bandwidth and reliability, the possibility to interconnect geographically scattered groups of high speed, low cost parallel machines is becoming reality, and these machines can thus cooperate to solve a common large-scale problem. Most current applications on the Grid are numerical, and the Remote Procedure Call (RPC) [28] paradigm is mostly used to build Problem Solving Environments (PSEs) for these applications. Several Network Enabled Servers (NESs), such as DIET [12, 7], NetSolve [28, 23], NINF [28, 24], NEOS [22] and others, follow this approach. The number and placement of the different NES components enhance functionality and improve performance, but organizing the resources (machines) and scheduling the tasks in an architecture cannot be done by exhaustive search. To ameliorate this problem, a deployment tool is required that can identify the best architecture configuration on the basis of given criteria.

This report presents a tool, “Network Selection”, that tackles this problem; it is implemented as a Perl-based program. The main finding of this research is that the network selection tool is able to provide the best architecture to configure on the basis of given criteria (number of components, bandwidth of the links, resource computing power, size of the messages, etc.), so that the maximum number of requests can be served in a given time step. With the help of the network selection tool we can find the bottleneck in the network; by adding more agents we can break the bottleneck and thus improve the overall performance of the network. Simulations with homogeneous and heterogeneous networks provide evidence that if resources are arranged in a particular fashion, the throughput of the network can be increased.

1.1 Background

To build the necessary background, we read papers related to the scheduling and allocation of tasks on heterogeneous platforms. These papers gave us a good overview and a base for our work.

As computer networks and sequential computers advance, distributed computing systems, such as networks of heterogeneous workstations or PCs, become a more attractive alternative to expensive, massively parallel machines. To exploit effective parallelism on a distributed system, tasks must be properly allocated to the processors. This problem, task assignment, is well known to be NP-hard in most cases. In general, optimal solutions can be found through an exhaustive search, but because there are n^m ways in which m tasks can be assigned to n processors, an exhaustive search is often not possible. Many algorithms [17, 9] have been proposed to provide optimal solutions for task assignment in heterogeneous distributed computing systems, by reducing the search space and/or lowering the time complexity.
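To make the n^m blow-up concrete, here is a minimal brute-force assignment routine (an illustrative sketch, not taken from the cited papers; the cost-matrix encoding is an assumption of this example):

```python
from itertools import product

def best_assignment(cost, n_procs):
    """Exhaustive task assignment: cost[t][p] is the execution time of
    task t on processor p. Returns (minimal makespan, assignment),
    where assignment[t] is the processor chosen for task t.
    Enumerates all n_procs ** len(cost) possibilities, so it is only
    feasible for tiny instances."""
    best, best_map = float("inf"), None
    for assign in product(range(n_procs), repeat=len(cost)):
        load = [0.0] * n_procs
        for t, p in enumerate(assign):
            load[p] += cost[t][p]
        if max(load) < best:
            best, best_map = max(load), assign
    return best, best_map
```

Already for m = 20 tasks on n = 10 processors this loop would have to visit 10^20 assignments, which is why the algorithms of [17, 9] prune the search space or settle for approximate solutions.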

With the advances in high speed networks, distributed heterogeneous computing became an attractive computational paradigm, as Heterogeneous Networks of Workstations could be an efficient and cheap solution for data intensive computation. Many research centers have already taken advantage of their LAN facilities to build local low cost parallel machines for parallel processing. But in order to achieve high performance, it is necessary to merge efficient machine resource allocation and data distribution policies [1]. A typical grid-based distributed computing system consists of a collection of heterogeneous workstations, multiprocessors and mobile nodes. Such a distributed computing system is heterogeneous both in the computing nodes and in the communication network. Scheduling algorithms for heterogeneous systems and for collective communication patterns [6] have been developed for efficient collective communication. Several research projects, such as Globus [13] and Legion [19], are developing toolkits and infrastructure support to enable the use of these systems for high performance computing.

The Grid [4] is an emerging technology considered a next generation computing infrastructure. In the grid environment, users can use practically unlimited resources as a single entity. To achieve high performance, a dynamic topology selection (a kind of resource selection method) is proposed in [18]; it overcomes the problem caused by the communication bottleneck on wide-area links. An overlay network topology called Virtual and Dynamic Hierarchical Architecture, for discovering high performance grid services, is proposed in [15].

Many complexities arise while managing and using a large collection of heterogeneous computational resources, but grid computing is a convenient and powerful abstraction for dealing with them. Grid computing resources often comprise individual workstations and other computers accessed by their owners and other users directly, without the control of any scheduling software. Exploiting the performance potential of grid computing environments requires effective application scheduling: the selection and allocation of resources to the application. Many research projects already focus on the problems of scheduling in cluster computing. One of the most fundamental characteristics of a meta-computing system is the algorithm it uses for the placement of jobs on processing nodes. A comparison of different schedule placement algorithms, reporting their success and failure modes when used to distribute jobs (independent tasks), is described in [16]. Work in [30] shows that effective scheduling of meta-applications is possible if sufficient application and system resource cost information is provided to the system.

To integrate heterogeneous data sources dispersed over a computer network, database middleware systems, such as database gateways and mediator systems, are used. To achieve data integration, the middleware layer imposes a global data schema on top of the individual schemas used by each source. The translation of the data items to the global schema is performed by either a wrapper or a database gateway. Database middleware systems require the deployment of application-specific data types and query operators to the servers and clients in the system.

Since new applications and data sources are added to the system as time progresses, the global schema must be changed to reflect these additions. Earlier middleware solutions relied on developers and system administrators to port and manually install all this application-specific functionality at all the sites in the system, but there now exists a metadata-driven framework to automate the deployment of all application-specific functionality used by a middleware system [27].

Heterogeneous networks, such as non-globally-addressable networks, have seen an explosion of network address translators due to the lack of IP address space. Application level gateways have been proposed to solve this problem, but they are usually specific to a given distributed application. In [20] an application level addressing scheme and a routing mechanism are given, in order to overcome the non-global addressing limitation of heterogeneous networks.

In the current information technology era, any new service involves a large number of end users. A scalable application implementation must adapt to the intensity and relative distribution of the client load using a dynamic set of servers; a static approach, using a fixed number of servers and conventional traders, would lead to inefficient resource usage. Mobile agent systems can be a suitable technology to answer these requirements, if some aspects of the architecture are carefully designed. A co-operative mobile agent system with a very dynamic and scalable trading service is proposed in [5]. The system allows applications to deploy servers onto the network in response to demand, making them self-configurable.

2 Related Work

2.1 Grid Computing

The Grid [4] is a type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed “autonomous” resources dynamically at runtime, depending on their availability, capability, performance, cost, and the user's quality-of-service requirements. Grid applications are a special class of distributed applications that have high computing and resource requirements, and are often collaborative in nature.

Networks connect resources on the Grid, the most prevalent of which are computers with their associated data storage. Although the computational resources can be of any level of power and capability, some of the most interesting Grids for scientists involve nodes that are themselves high performance parallel machines or clusters. Such high performance Grid nodes provide major resources for simulation, analysis, data mining and other compute-intensive activities.

Grid Computing [10, 11] can be defined as applying resources from many computers in a network at the same time to a single problem; usually a problem that requires a large number of processing cycles or access to large amounts of data. At its core, Grid Computing enables devices, regardless of their operating characteristics, to be virtually shared, managed and accessed across an enterprise, industry or workgroup. This virtualization of resources places all of the necessary access, data and processing power at the fingertips of those who need to rapidly solve complex business problems, conduct compute-intensive research and data analysis, and engage in real-time collaboration.

In short, grid computing enables the virtualization of distributed computing resources such as processing, network bandwidth and storage capacity to create a single system image, granting users and applications seamless access to vast IT capabilities. Just as an Internet user views a unified instance of content via the Web, a Grid user essentially sees a single, large virtual computer.

2.2 Problem Solving Environment

Problem Solving Environments (PSEs) [25, 26] form another class of higher-level computing environments. A PSE is a computer system that provides all the computational facilities needed to solve a target class of problems. These facilities include advanced solution methods, automatic and semiautomatic selection of solution methods, and ways to easily incorporate novel solution methods. Moreover, PSEs use the language of the target class of problems, so users can run them without specialized knowledge of the underlying computer hardware or software. By exploiting modern technologies such as interactive color graphics, powerful processors, and networks of specialized services, PSEs can track extended problem solving tasks and allow users to review them easily. A PSE comprises a number of modular functions that can be composed into a more complex, composite application. Each of these modules or functions provides a service and hides the low level details involved in grid use. Overall, they create a framework that provides all services to all people: they solve simple or complex problems, support rapid prototyping or detailed analysis, and can be used in introductory education or at the frontiers of science.

2.3 NES/GridRPC Systems

The Network Enabled Server (NES) paradigm [29, 14], which enables Grid-based RPC [28] (GridRPC), is a good candidate for a viable Grid middleware that offers a simple yet powerful programming paradigm for the Grid. GridRPC systems offer features and capabilities that make it easy to program medium- to coarse-grained task parallel applications that involve hundreds to thousands or more high-performance nodes, either concentrated as a tightly coupled cluster or spread over a wide area network. GridRPC systems also have other features, such as dynamic resource discovery, dynamic load balancing, fault tolerance, security (multi-site authentication, delegation of authentication, adapting to multiple security policies, etc.), easy-to-use client/server management, firewall and private address considerations, and remote large file and I/O support. GridRPC systems either provide these features themselves, or build upon the features provided by lower level Grid substrates such as Condor, Globus, and Legion. They abstract away much of the grid infrastructure and the associated complexities, allowing users to program in a style they are accustomed to in order to exploit task-level parallelism, i.e., asynchronous parallel procedure invocation where arguments and return values are passed by value or by reference, as the user prefers. This paradigm is amenable to many large-scale applications and especially to scientific simulations. Systems that facilitate the whole or part of this paradigm include NetSolve, NINF, NEOS and DIET.
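The task-parallel, asynchronous invocation style described above can be illustrated with plain Python futures (this shows only the programming pattern, not a real GridRPC API; `solve` is a stand-in for a remote server routine):

```python
from concurrent.futures import ThreadPoolExecutor

def solve(problem):
    # stand-in for a remote, possibly long-running server computation
    return sum(problem)

# Asynchronous parallel procedure invocation: submit() returns a
# handle immediately, and result() blocks until that call finishes,
# so independent requests proceed in parallel.
with ThreadPoolExecutor() as pool:
    handles = [pool.submit(solve, p) for p in ([1, 2], [3, 4])]
    results = [h.result() for h in handles]
```

In an actual GridRPC system the handle would refer to a call executing on a remote server chosen by the middleware, but the submit/wait structure of the client code is the same.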

2.3.1 NetSolve

NetSolve [23, 28], developed at the University of Tennessee, Knoxville, and Oak Ridge National Laboratory, is a client-server system that enables users to solve complex scientific problems remotely. The system allows users to access both hardware and software computational resources distributed across the network. NetSolve searches for computational resources on a network, chooses the best one available, solves the problem (using retries for fault tolerance), and returns the answer to the user. An agent based design has been implemented to ensure efficient use of system resources. A load-balancing policy is used by the NetSolve system to ensure good performance by enabling the system to use the available computational resources as efficiently as possible.

2.3.2 NINF

NINF [24, 28], developed at the Electrotechnical Laboratory, Tsukuba, is software that allows users to access computational resources including hardware, software and scientific data distributed across a wide area network. In order to facilitate location transparency and network-wide parallelism, the NINF MetaServer maintains global resource information regarding computational servers and databases. It can therefore allocate and schedule coarse-grained computations to achieve good global load balancing. NINF is very similar to NetSolve in design and motivation; adapters have been implemented to enable each system to use numerical routines installed on the other.


2.3.3 NEOS

NEOS [22] is an environment for solving optimization problems over the Internet. NEOS provides the user with the input format and a list of solvers for the optimization problem. NEOS was designed so that solvers in a wide variety of optimization areas can be added easily. Given an optimization problem, NEOS solvers compute derivatives and sparsity patterns of nonlinear problems with automatic differentiation tools, link with the appropriate libraries, and execute the resulting binary. The user is provided with a solution and runtime statistics.

NEOS uses the Condor pool for solving complementarity problems. NEOS provides an interface that is problem oriented and independent of the computing resources. Users need to provide only a specification of the problem; all other information needed to solve the problem is determined by the NEOS solver. Condor provides the computational resources to solve the problem.

2.3.4 Others

• Condor [21] is a distributed resource management system, developed at the University of Wisconsin, that manages large heterogeneous clusters of workstations. Due to the ever decreasing cost of low-end workstations, such resources are becoming prevalent in many workplaces. A Condor pool consists of any number of machines, of possibly different architectures and operating systems, that are connected by a network. The Condor design was motivated by the needs of users who would like to take advantage of the under-utilized capacity of these clusters for their long-running, computationally intensive jobs. In order to generate vast amounts of computational resources, such a system must use any kind of resource whenever it is made available. Condor acts like a matchmaker, pairing these computational resources with jobs that require processing. Condor is flexible and fault-tolerant: the design features ensure the eventual completion of the job.

• Legion [19] is developed at the University of Virginia. Legion is a middleware that provides the illusion of a single virtual machine and the security to cope with its untrusted, distributed realization. Legion connects networks, workstations, supercomputers, and other computer resources together into a system that can encompass different architectures, operating systems, and physical locations. There is no central “big brother” that oversees and controls each Legion resource: instead, each resource is an independent element.

Legion provides a coherent framework in which these elements can be combined into a metasystem. Legion seamlessly schedules and distributes processes on available and appropriate hosts, then returns the results. Legion aims to take advantage of the growing bandwidth of wide- and local-area networks without compromising network security and functionality and without asking the user to handle the complex arrangements between incompatible platforms and architectures.

• The Globus [13] project is developing the fundamental technology needed to build computational grids: execution environments that enable an application to integrate geographically distributed instruments, displays, and computational and information resources. Such computations may link tens or hundreds of these resources. Typical research areas of the Globus project include resource management, data management and access, application development environments, information services, and security. The project's software development has resulted in the Globus Toolkit, a set of services and software libraries to support Grids and Grid applications. The Toolkit includes software for security, information infrastructure, resource management, data management, communication, fault detection and portability.


3 DIET: Distributed Interactive Engineering Toolbox

3.1 Introduction

NES environments with a centralized scheduler become a bottleneck when many clients try to access several servers. Moreover, as networks are highly hierarchical, the location of the scheduler has a great impact on the performance of the overall platform. A hierarchical set of components for building NES applications is therefore a better option.

The DIET project was started in 2000 to develop a hierarchical set of components for building Network Enabled Server (NES) applications. The aim of DIET [12, 7] is to provide transparent access to a pool of computational servers at a very large scale. DIET's target platform is the fast VTHD network connecting several INRIA research centers (and their clusters). The project involves several research teams in CS laboratories across France: GRAAL at LIP (Lyon), Résédas at LORIA (Nancy), and SDRP at LIFC (Besançon).

DIET is a hierarchical set of components to build NES applications in a Grid environment. This environment is built on top of different tools which are able to locate an appropriate server depending on the client's request, the data location (which can be anywhere on the system, because of previous computations) and the dynamic performance characteristics of the system.

[Figure: a tree of interconnected MAs, with hierarchies of LAs beneath them and SeD/CRD pairs at the leaves; a client connects to an MA.]

Figure 1: DIET Architecture

3.2 Components

• Client – A client is an application which uses DIET to solve problems. Many kinds of clients should be able to connect to DIET. A problem can be submitted from a web page, from a PSE such as Scilab, or from a compiled program.

• Master Agent (MA) – An MA receives computation requests from clients. These requests are generic descriptions of problems to be solved. The MA collects computation abilities from the servers and chooses the best one; the reference of the chosen server is returned to the client. A client can be connected to an MA through a specific name server or a web page which stores the various MA locations.


[Figure: an MA joined by LAs that subscribe beneath it, with SeDs at the leaves and a client; numbered arrows (1)-(5) mark the initialization steps described below.]

Figure 2: Initialization of a DIET system

• Local Agent (LA) – An LA aims at transmitting requests and information between MAs and servers. The information stored on each LA is the list of requests and, for each of its subtrees, the number of servers that can solve a given problem and information about the data distributed in this subtree. Depending on the underlying network architecture, a hierarchy of LAs may be deployed between an MA and the servers it manages. No scheduling decision is made by an LA.

• Server Daemon (SeD) – A SeD encapsulates a computational server. The information stored on a SeD is a list of the data available on its server (with their distribution and the way to access them), the list of problems that can be solved on it, and all information concerning its load (memory available, number of resources available, etc.). A SeD declares the problems it can solve to its parent LA and provides an interface to clients for submitting their requests. A SeD (with the use of FAST [8]) can give a performance prediction for a given problem.

• Computational Resources Daemon (CRD) – A computational resource represents a set of hardware and software components that can perform sequential or parallel computations on data sent by a client (or another server). For instance, a CRD can be the entry point to a parallel computer. It usually provides a set of libraries and is managed by a SeD.

3.3 Architecture

In DIET, a server is built upon a CRD and a SeD. A client that has a problem to solve should be able to obtain a reference to the server that is best suited to it. DIET has a hierarchical set of agents, including LAs and MAs. Requests for computation from a client are sent to the nearest MA. These requests are generic descriptions of problems to be solved. The MA collects computation services from the SeDs and chooses the best one. The reference of this server is returned to the client. An LA transmits requests and information between MAs and SeDs. Depending on the underlying network architecture, a hierarchy of LAs may be deployed between an MA and its SeDs. The DIET architecture is shown in Figure 1.

3.4 Initialization

Figure 2 shows each step of the initialization of a simple grid system. The architecture is built in hierarchical order, each component connecting to its parent. The MA is the first entity to be started (1). It waits for connections from LAs or requests from clients. Then, when an LA is launched, it subscribes to the MA (2). At this step of the system initialization, two kinds of components can connect to the LA: a SeD (3), which manages some computational resource, or another LA (4), which adds a hierarchical level in this branch. When a SeD registers to an LA, it publishes a list of the services it offers, which is forwarded up through the parent agents to the MA.

Finally, any client can access the registered resource through the platform. It can contact a MA (5) to get a reference to the best server available and then directly connect to the server to launch the computation.
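The registration steps above can be mimicked with a toy sketch (illustrative Python only; the real DIET components communicate through middleware, and the class and method names here are invented):

```python
class Node:
    """Toy DIET-like component: an MA, LA or SeD that connects to a
    parent at startup; published service lists propagate to the root."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.services = set()  # services known in this subtree

    def publish(self, services):
        # each agent records the services and forwards the list to
        # its parent, so it eventually reaches the MA at the root
        self.services |= services
        if self.parent is not None:
            self.parent.publish(services)

ma = Node("MA")                    # (1) the MA is started first
la = Node("LA1", parent=ma)        # (2) an LA subscribes to the MA
sed = Node("SeD1", parent=la)      # (3) a SeD registers to the LA
sed.publish({"matrix-multiply"})   # its service list reaches the MA
```

After the last line, the MA knows that "matrix-multiply" is available somewhere in its hierarchy, so it can answer a client contacting it in step (5).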

[Figure: two snapshots (A and B) of a DIET hierarchy (an MA; LA1, LA2, LA3; LA11, LA12; servers S111, S112, S121, S122, S123, S21, S31, S32) processing a client request. Legend: P() = problem submitted by the client; S = status of the servers that can satisfy the request; L = sorted list of servers; F = pool of fast servers; further arrows show the submission of the problem for execution and the reply to the problem.]

Figure 3: Problem submission example

3.5 Solving a problem

Assume that the DIET architecture includes several servers able to solve the same problem. The algorithm presented below allows an MA to choose one of the best servers to perform the computation. This decision is taken in four steps:

• the MA propagates the client request through its subtrees down to the capable servers.

• each server that can satisfy the request calls FAST to estimate the computation time necessary to process the request, and sends this estimation back to its parent.

• each LA of the tree that receives one or more positive answers from its children sorts the servers and forwards their answers to the MA, through the hierarchy.

• once the MA has collected all the answers from its direct children, it chooses a pool of fast servers and sends their references to the client.

In order to solve the problem, the client connects to one of the chosen servers. It sends its local data and specifies whether the results should be kept in place for further computation or brought back. The transfer of persistent data is performed at this time.

The example presented in Figure 3 considers the submission of the problem P() by the client. The MA propagates the client request through its subtrees down to the capable servers. The servers that can solve the problem (S121, S123, S31) send their status information to their LA. The LAs (LA12, LA3) sort the servers according to their status and forward the sorted list to their parent agent (either an MA or an LA). The MA forwards the pool of fast servers to the client, which then selects a server to get its work done.
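The four steps can be condensed into a small recursive sketch (illustrative Python; the tree encoding and the `estimate` callback, which stands in for FAST, are assumptions of this example):

```python
def select_servers(tree, root, problem, estimate, pool_size=2):
    """tree maps each agent to its list of children; a name absent
    from the dict is a server (SeD). Returns the pool of fastest
    capable servers for the given problem."""
    def walk(node):
        children = tree.get(node)
        if children is None:                # a SeD: step 2, it asks
            t = estimate(node, problem)     # FAST for a prediction
            return [] if t is None else [(t, node)]
        answers = []
        for child in children:              # step 1: propagate down
            answers.extend(walk(child))
        return sorted(answers)              # step 3: sort on the way up
    return [s for _, s in walk(root)[:pool_size]]  # step 4: the pool

tree = {"MA": ["LA1", "LA3"], "LA1": ["s121", "s123"], "LA3": ["s31"]}
fast = {"s121": 3.0, "s123": 1.0, "s31": 2.0}  # invented estimates
pool = select_servers(tree, "MA", "P()", lambda s, p: fast.get(s))
```

With these invented estimates, the returned pool contains the two fastest capable servers, and the client would contact one of them directly.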


[Figure: six block diagrams, one per model (M(r*||s*||w), M(r||s||w), M(s||r,w), M(r||s,w), M(w||r,s), M(r,s,w)), showing which of the receive (r), send (s) and work (w) operations may overlap in each.]

Figure 4: Classification of the operating models.

4 The master-slave scheduling problem

The master-slave paradigm finds important applications in parallel computer scheduling. In [3], the authors solve the master-slave scheduling problem for a tree-shaped heterogeneous platform. They considered independent, equal-size tasks to be allocated on a heterogeneous grid computing platform, where each processor gets an individual task. They showed how each node can locally attain the allocation of tasks to resources that maximizes the steady-state throughput. They proposed a bandwidth-centric strategy, which states that if enough bandwidth is available, then all nodes are kept busy; if bandwidth is limited, then tasks should be allocated only to the children which have sufficiently small communication times, regardless of their computation power. They described six architectural models, depending on the operating mode of the processors, shown in Figure 4, where “r” stands for receive, “s” stands for send and “w” stands for work, i.e. compute. In Figure 4, when two squares are placed next to each other horizontally, only one of the corresponding operations can be performed at a time, while vertical placement indicates that concurrent operation is possible. We also use “||” (respectively “,”) to indicate parallel (sequential) order of operations in the models.

M(r*||s*||w): Full overlap, multiple-port - In this first model, a processor node can simultaneously receive data from its parent, perform some (independent) computation, and send data to all of its children. This model is not realistic if the number of children is large.

M(r||s||w): Full overlap, single-port - In this second model, a processor node can simultaneously receive data from its parent, perform some (independent) computation, and send data to one of its children. At any given time-step, there are at most two communications taking place: one from the parent and/or one to a single child.

M(r||s, w): Receive-in-Parallel, single-port - In this third model, as in the next two, a processor node has a single level of parallelism: it can perform two actions simultaneously. In the M(r||s, w) model, a processor can simultaneously receive data from its parent and either perform some (independent) computation, or send data to one of its children. The only parallelism inside the node is the possibility to receive from the parent while doing something else (either computing or sending to one child).

M(s||r, w): Send-in-parallel, single-port - In this fourth model, a processor node can simultaneously send data to one of its children and either perform some (independent) computation, or receive data from its parent. The only parallelism inside the node is the possibility to send to one child while doing something else (either computing or receiving from the parent).

[Plots: number of requests responded per second for the Binary, Chain and Star graphs with eight processors, in the (r||s||w) and (r,s,w) models.]

Figure 5: Throughput of graphs in different operating models

M(w||r, s): Work-in-parallel, single-port - In this fifth model, a processor node can simultaneously compute and execute a single communication, either sending data to one of its children or receiving data from its parent.

M(r, s, w): No internal parallelism - In this sixth and last model, a processor node can only do one thing at a time: either receive data from its parent, compute, or send data to one of its children.

In [2] the authors solve the master-slave scheduling problem for an undirected graph. Their work is a follow-up of the previous work. Given an undirected graph rooted at the master, they aim at determining the optimal steady-state scheduling strategy. They use the same models and show an example implementation on one model (Full overlap, single-port).

So, first we calculated the performance of all the models by applying the constraints proposed in [2] (as linear programming equations). Then we modified the constraints according to our requirements. First, the limit on the depth of the graph is removed from the proposed constraints, because we want each node (server) at the last level of the graph to get a task to execute. Second, the time step is increased from 1 second to 10 seconds, because the throughput of the graph cannot be calculated in one second: the time taken to send a task to the bottom of the graph grows with the depth of the graph. We took different types of graphs (star graph, chain graph and binary graph) to evaluate the performance. In Figure 5, we can see the performance of two models for different graphs.

Using linear programming we could not fulfill our main condition, that every server should receive requests to execute. So we used Perl for implementing our work.

5 Deployment

The main goal is to find the architecture that gives the best throughput (i.e., the number of requests answered per second), depending upon the number of agents, the number of servers, the computing power of each node, and the bandwidth of the links.


We model a collection of homogeneous/heterogeneous resources (a processor, a cluster, or whatever) and the communication links between them as the nodes and edges of an undirected hierarchical (tree-shaped) graph. Each node is a computing resource capable of computing and communicating with its neighbors at the same/different rates. We assume that one specific node, referred to as the client, initially generates requests. The client floods the requests to its neighbor nodes (MAs). These nodes check whether the request is well-formed (i.e., it has all the parameters that a request should have), and if so, the request is flooded to the neighboring nodes (LAs or MAs). These nodes forward the requests to the connected servers. The servers send reply packets to their neighbor LAs. These packets contain the status (memory available, number of resources available, performance prediction, ...) of the server. Each LA compares the reply packets sent by its connected servers and selects the best server among them. The reply packet of the selected server is then sent by the LA to its neighboring LA or MA. The best server (or a list of available servers, ranked in order of availability) among the selected servers is reported by the MA to the client. The client attempts to contact a server (from the list, starting with the first and moving down through the list), and then sends the input data to the server. Finally, the server executes the function on behalf of the client and returns the results.
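The request/reply flow described above can be sketched as a simple recursion over the agent tree. The dictionaries and the "predicted_time" score below are assumed simplifications, not the real DIET reply-packet format or CORBA interfaces.

```python
# Minimal sketch of the hierarchical best-server selection described
# above. The node structure and the scoring rule are assumptions.

def best_reply(replies):
    """An agent keeps the reply of the server with the best status;
    here 'best' is simply the shortest predicted execution time."""
    return min(replies, key=lambda r: r["predicted_time"])

def resolve(node):
    """Flood the request down the tree, propagate the best reply up."""
    if node["children"]:                       # an agent (MA or LA)
        return best_reply([resolve(c) for c in node["children"]])
    return node["status"]                      # a server reports its status

# Toy hierarchy: MA -> two LAs -> three servers.
ma = {"children": [
    {"children": [
        {"children": [], "status": {"server": "S1", "predicted_time": 3.0}},
        {"children": [], "status": {"server": "S2", "predicted_time": 1.5}},
    ]},
    {"children": [
        {"children": [], "status": {"server": "S3", "predicted_time": 2.0}},
    ]},
]}
print(resolve(ma)["server"])                   # → S2
```

In the real system each agent forwards only the best reply upward, so the MA never sees every server's status; the recursion above models exactly that pruning.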

The main problem is to determine a steady-state scheduling policy for each processor, i.e., the fraction of time spent computing the requests coming from the client to the servers, the fraction of time spent computing to select the best server, the fraction of time spent sending requests, and the fraction of time spent receiving reply packets (replies to the requests), so that the (average) overall number of requests processed at each time-step is maximized. In the homogeneous case we take these as known constants, but in the heterogeneous case they depend on certain conditions.

[Tree of Client → Master Agent → Local Agents → Servers, with node computing powers w0, ..., w7 and link bandwidths b01, b12, b13, b24, b25, b36, b37.]

Figure 6: Architectural Model

5.1 Architectural Model

The target architectural/application framework is represented by a node-weighted, edge-weighted graph G = (V, E, w, b) as shown in Figure 6. Let N = |V| be the number of nodes. Each Pi ∈ V represents a computing resource of computing power wi, meaning that node Pi executes wi megaflops/second (so the bigger the wi, the faster the computing resource Pi). There is a client node,


i.e., a node Pc, which generates the requests that are passed to the following nodes. The size of the request generated by the client is Sin, and Souti is the size of the reply at node Pi in a time step; both are measured in megabytes/request. The size of the reply is different for each node, as it depends on the number of the node's neighbors. alpha_in_i is the fraction of time needed to compute an incoming request (a request coming from the client) and alpha_out_i is the fraction of time needed to compute the outgoing reply (selecting the best server based on the reply packets) in a time step by node Pi; both are measured in seconds/request. Servers are connected to the local agents at the last level of the graph.

Each edge eij : Pi ↔ Pj is labelled by the bandwidth value bij, which represents the amount of data that can be sent per second between Pi and Pj, measured in megabytes/second. Two assumptions are made about the communication links and the computing power of the nodes:

• Links are bidirectional and symmetric, i.e., the same amount of data can be sent from Pi to Pj in one second as from Pj to Pi. If there is no communication link between Pi and Pj then bij = 0; bij is some positive value if Pi and Pj are neighbors in the communication graph.

• The computing power of a node is a positive rational number; wi = 0 is not possible, since it would mean node Pi has no computing power.

We allocated the servers to the base local agents in round-robin fashion in the homogeneous network. We selected the round-robin method based on the comparison given in [16]. Instead of using FAST for computing the performance forecast of the servers, we considered a fixed time as the time taken by the servers to reply to a request.
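The round-robin allocation of servers to the base-level LAs can be sketched as follows; the server and agent names are hypothetical.

```python
# Round-robin allocation of servers to the base-level LAs, as used
# for the homogeneous case. Names are illustrative only.

def round_robin(servers, agents):
    """Assign server i to agent i mod len(agents)."""
    allocation = {a: [] for a in agents}
    for i, s in enumerate(servers):
        allocation[agents[i % len(agents)]].append(s)
    return allocation

alloc = round_robin([f"S{i}" for i in range(7)], ["LA1", "LA2", "LA3"])
print(alloc["LA1"])   # → ['S0', 'S3', 'S6']
```

With identical processors and links, this keeps the number of servers per LA balanced to within one, which is why it suits the homogeneous case.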

5.2 Steady State operation

Our objective is to compare the maximum number of requests answered per second by specific types of architecture, so that the best architecture can be selected. If there is a large number of requests to be answered, then it is better to operate in a periodic fashion, and the throughput of the architecture should be calculated when every node is working at its maximum speed. So we study the throughput of the architecture at steady state. Let n(i) denote the index set of the neighbors of node Pi, and let R be the number of requests that are answered during one time unit.

The number of requests answered in a time step depends on the bandwidth of the links, the size of the requests, the fraction of a request computed by a processor in a time step, and the computing power of the processor. The constraints for two operating models, with the time step taken to be one second, are:

M(r, s, w): No internal parallelism - In this model, the computation and the other operations performed by a node are done sequentially, so the sum of all the operations performed by an agent should be less than the time step.

(A)  Ri × Si ≤ bij,  ∀ eij

(B)  Σ_{j ∈ n(i)} (Ri × Si)/bij + Ri × alpha_i ≤ 1,  ∀ Pi

where Si = Sin + Souti and alpha_i = alpha_in_i + alpha_out_i.

Lemma 1: The number of requests answered by each node in a time step is the minimum value obtained from the two constraints A and B:

Ri = min( min_{j ∈ n(i)} bij/Si ,  1 / ( Σ_{j ∈ n(i)} Si/bij + alpha_i ) )
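Lemma 1 can be checked numerically with a few lines of code; the parameter values in the example are assumed, not taken from DIET.

```python
# Numerical check of Lemma 1 for the sequential M(r,s,w) model.
# The parameter values below are illustrative assumptions.

def node_throughput_rsw(S_i, alpha_i, bandwidths):
    """R_i = min( min_j b_ij/S_i , 1/(sum_j S_i/b_ij + alpha_i) ).
    bandwidths: list of b_ij for every neighbor j of P_i."""
    link_limit = min(b / S_i for b in bandwidths)                    # constraint A
    time_limit = 1.0 / (sum(S_i / b for b in bandwidths) + alpha_i)  # constraint B
    return min(link_limit, time_limit)

# S_i = 0.2 MB/request, alpha_i = 0.05 s/request, two 1 MB/s links
print(round(node_throughput_rsw(0.2, 0.05, [1.0, 1.0]), 3))   # → 2.222
```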

Proof: Assume instead that the maximum of the values obtained from constraints A and B is the throughput of a node. Then two cases are possible.

Case 1 (A > B): The value of constraint A is greater than the value obtained from constraint B, so by the assumption the throughput of the node is A. But the value obtained from constraint B shows that the computing power of the node is smaller. A high-bandwidth link can only speed up the sending and receiving of requests; it cannot increase the computing power of the node. Thus the number of requests answered is limited by the computing power of the processor.

Case 2 (A < B): The value of constraint B is greater than the value obtained from constraint A, so by the assumption the throughput of the node is B. But the value obtained from constraint A shows that the bandwidth is low, and thus the node cannot answer more requests than it receives.

From these two cases it is clear that we have to take the minimum value obtained from the two constraints.

M(r||s||w): Full overlap, single-port - In this model, receiving, sending and computing are done in parallel, so constraint B has to be replaced by two constraints:

(C)  Σ_{j ∈ n(i)} (Ri × Si)/bij ≤ 1,  ∀ Pi

(D)  Ri × alpha_i ≤ 1,  ∀ Pi

Lemma 2: The number of requests answered by each node in a time step is the minimum value obtained from the constraints A, C and D:

Ri = min( min_{j ∈ n(i)} bij/Si ,  1 / Σ_{j ∈ n(i)} Si/bij ,  1/alpha_i )
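Lemma 2 differs only in that communication and computation limits apply separately; again, the example values are assumed.

```python
# Numerical check of Lemma 2 for the full-overlap, single-port model,
# with the same assumed parameter values as before.

def node_throughput_full_overlap(S_i, alpha_i, bandwidths):
    """R_i = min( min_j b_ij/S_i , 1/sum_j S_i/b_ij , 1/alpha_i )."""
    return min(min(b / S_i for b in bandwidths),       # constraint A
               1.0 / sum(S_i / b for b in bandwidths), # constraint C
               1.0 / alpha_i)                          # constraint D

print(round(node_throughput_full_overlap(0.2, 0.05, [1.0, 1.0]), 3))   # → 2.5
```

With the same inputs this gives 2.5 requests/second versus 2.222 for the sequential model: overlapping communication and computation strictly raises the per-node rate.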

The final throughput of each node, which is counted as part of the throughput of the structure, also depends on the throughput of its children.

Lemma 3: The final throughput of each node is the minimum of its own throughput and the summed final throughput of its children:

FTi = min( Ri ,  Σ_{j ∈ c(i)} FTj ),  where c(i) denotes the children of Pi.
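Lemmas 3 and 4 together give a simple recursion over the tree. The per-node rates below are assumed inputs (they would come from Lemma 1 or 2).

```python
# Recursion for Lemmas 3 and 4: a node's final throughput is capped by
# its own rate R_i and by the sum of its children's final throughputs;
# the structure's throughput is the final throughput of the root.

def final_throughput(node):
    children = node.get("children", [])
    if not children:                       # a server: limited only by itself
        return node["R"]
    return min(node["R"], sum(final_throughput(c) for c in children))

root = {"R": 10.0, "children": [
    {"R": 4.0, "children": [{"R": 3.0}, {"R": 2.0}]},   # capped at 4
    {"R": 5.0, "children": [{"R": 1.0}]},               # capped at 1
]}
print(final_throughput(root))              # → 5.0
```

Note how the right subtree's single slow server (R = 1) drags the whole structure down even though the agent above it could handle 5 requests/second.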

Lemma 4: The number of requests answered per second by the structure is the final throughput of the root node.


[Five homogeneous structures: Binary, Star, Chain, 2-Chain and 2-Depth Star, each built from a client, a Master Agent, Local Agents and servers.]

Figure 7: Different homogeneous networks

5.3 Homogeneous Network

Initially we took homogeneous structures: all the nodes of the graph have the same computing power, the bandwidth is the same for all links in the graph, the size of the outgoing message is the same for all nodes, and the fraction of a request computed per time step is the same for all processors. In short, identical processors were used in the homogeneous network. The variables of the homogeneous architecture are of two types: constants and constraints.

Constants:

• Bandwidth of a link between two nodes = b megabytes/sec

• Computing power of the nodes = w megaflops/sec

Constraints:

• Sin = 90 + 50 × pm

• Souti = 8 + nci × (100 + 8 × pm)

• alpha_in_i = npi / wi

• alpha_out_i = (nci × log nci) / wi  (+0.005 if the node is a LA next to servers)

where pm is the number of parameters of a request, npi is the number of parents of node Pi, and nci is the number of children of node Pi.

In the constraints described above, the constant integer values are given by the DIET programmers. According to the DIET hierarchical structure, each node has only one parent, thus npi = 1. We took the complexity of selecting a server from a list of servers to be nci log nci, as we considered that the sorting of servers is done with the quick-sort algorithm. We added 0.005 seconds for the LAs next to servers because it is the maximum time taken by the servers to report their status to the nearest LA.
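The constants above can be sketched as small helper functions; the natural logarithm and the function names are assumptions of this sketch, since the base of the log is not specified.

```python
import math

# Helper functions for the per-node constants above; names and the
# choice of natural logarithm are assumptions of this sketch.

def request_sizes(pm, nc_i):
    """Sizes of a request and of node i's reply (thesis formulas)."""
    s_in = 90 + 50 * pm                   # request generated by the client
    s_out = 8 + nc_i * (100 + 8 * pm)     # reply grows with the children count
    return s_in, s_out

def alphas(np_i, nc_i, w_i, la_next_to_servers=False):
    """Per-request compute times alpha_in_i and alpha_out_i (seconds)."""
    a_in = np_i / w_i                     # np_i = 1 in the DIET hierarchy
    a_out = (nc_i * math.log(nc_i)) / w_i if nc_i > 0 else 0.0
    if la_next_to_servers:
        a_out += 0.005                    # max server status-report delay
    return a_in, a_out

print(request_sizes(pm=2, nc_i=3))        # → (190, 356)
```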

We ran simulations with the different structures shown in Figure 7, to check which structure performs best and what the effect is of increasing the number of agents (or the number of servers, or both) in a structure.

We took the computing power of each node to be w = 25 Mflops/sec and the bandwidth of the link between two nodes to be b = 1 megabytes/sec. From Figure 8, it can be seen that the performance of the binary structure is best when the number of servers is less than 20, but as the number of servers increases the performance of all the structures decreases, except the star structure. The performance of the star structure also decreases when the number of servers is increased to 160, and the performance of the star and binary structures is approximately the same when the servers are increased to 600. So if the number of agents is small and the number of servers is large, it is better to take a star-type network.

Figure 8: Throughput of different graphs with 8 nodes
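As an illustration of why a lone agent eventually saturates, the following toy sketch applies the Lemma 1 formula to a star structure with uniform, assumed parameters; it is not the thesis's simulator.

```python
# Toy star-structure throughput using the Lemma 1 formula with assumed
# uniform parameters: S = 0.2 MB/request, alpha = 0.05 s/request,
# b = 1 MB/s, sequential M(r,s,w) model.

S, ALPHA, B = 0.2, 0.05, 1.0

def rate(degree):
    """Lemma 1 for a node with `degree` identical links of bandwidth B."""
    return min(B / S, 1.0 / (degree * S / B + ALPHA))

def star_throughput(n_servers):
    ma = rate(n_servers + 1)        # the MA talks to the client and all servers
    servers = n_servers * rate(1)   # each server uses a single link
    return min(ma, servers)         # Lemmas 3 and 4 at the root

for n in (1, 4, 16, 64):            # the lone MA saturates as servers grow
    print(n, round(star_throughput(n), 2))
```

The MA's degree grows with the number of servers, so its own rate falls, which mirrors the declining curves in Figures 8 and 9.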

[Number of requests responded per second versus number of servers, for the five structures.]

Figure 9: Throughput of different graphs with 32 nodes

To check the performance of the structures as the number of agents increases, the simulation was performed with the same structures but with 32 nodes. In Figure 9, we can see that the performance of the binary structure is very good compared to the other structures.

There is one important thing to notice in Figure 9: we cannot calculate the throughput of the binary, star and 2-depth star structures when the number of servers is less than 20, 30 and 30 respectively, because in the DIET architecture a LA cannot exist without any server (child). Thus if all the servers of a LA go down, that LA has to be disconnected from the architecture.

Figure 10: Throughput of binary graph with different number of nodes

Following this, we cannot have more LAs than servers at the last level of the architecture. For example, in the case of the binary structure with 32 nodes and 10 servers, 6 nodes have no children; if no server is connected to a LA, then that LA cannot reply to the client request, while the LA at the upper level (or the MA) will still wait for a reply from that LA, which is not possible.

Figure 9 confirms that the binary architecture performs best compared to the other architectures. We then ran simulations to check the performance of the binary graph while varying the numbers of servers and agents, as shown in Figure 10. From these simulations we learned when new agents should be added to increase the throughput of the graph, and how many servers can be added to the architecture without increasing the number of agents.

5.4 Heterogeneous Network

Distributed networks are heterogeneous in practice: every node has its own computing power (possibly different from the other nodes), and the bandwidths of the links between nodes are usually different as well.

We ran a simulation of the real heterogeneous network shown in Figure 11. We consider Veloce to be the client, and a system at Rocquencourt as the MA, which is connected to two LAs: one at Rennes and another at Grenoble. Rennes is connected to another LA named Paraski with 40 servers. Grenoble is connected to two LAs, Sophia and Icluster. Icluster has 200 servers. Sophia is connected to another LA named Galere, which has 15 servers. The links between the nodes have different bandwidths, as shown in Figure 11.

The configuration of a network can be read using ALNeM (Application Level Network Mapper), a software tool written by Arnaud Legrand and Martin Quinson. The configuration of the network is written in a text file, which we used to calculate the performance of the network.

The variables of the heterogeneous network are of two types: random and constraints.


[The network: client Veloce; MA at Rocquencourt; LAs at Rennes, Grenoble, Paraski (40 servers), Sophia, Icluster (200 servers) and Galere (15 servers); nodes of 25-30 Mflops connected by VTHD, Fast Ethernet and Ethernet links.]

Figure 11: Heterogeneous network

Random:

• Bandwidth of a link between two nodes Pi and Pj = bij megabytes/sec (range from 1 megabyte to 2.5 gigabytes per second)

• Computing power of a node Pi = wi megaflops/sec (range from 25 to 30 megaflops per second)

Constraints:

• Sin = 90 + 50 × pm

• Souti = 8 + nci × (100 + 8 × pm)

• alpha_in_i = npi / wi

• alpha_out_i = (nci × log nci) / wi  (+0.005 if the node is a LA next to servers)

where pm is the number of parameters of a request, npi is the number of parents of node Pi, and nci is the number of children of node Pi.

If we want to improve the performance of a real network, we cannot redeploy it with a new design, as a lot of time, labor and money has already been invested. But we can improve the throughput of the network by breaking its bottleneck. With the network selection tool we can find the bottleneck in the network and break it by adding more LAs to the parent of an overloaded LA, so as to divide the load of that particular LA. We add new LAs according to Algorithm 1.

In Algorithm 1, the condition of a node tells whether it is possible to divide the load of that node. There may be many reasons for the condition of a node to be "no". For example, a node Pi having only one child cannot divide its load, so the condition of this node Pi is "no".

1: Calculate the throughput Ri of each node Pi.
2: Calculate the throughput R of the structure.
3: if (number of available nodes > 0) then
4:   Find the node with minimum throughput
5:   if (condition of this node == yes) then
6:     Split the load by adding a new node to its parent
7:   else
8:     Find the node with the next minimum throughput
9:     Goto step 5
10:  Decrease the number of available nodes
11:  Goto step 1.

Algorithm 1: Algorithm to add LA

If a new LA is added to share the load of node Pi, whose parent is Pj, then the new LA has computing power wi and the bandwidth of the link between the new LA and the parent Pj is bij. We have fixed the condition of each newly added LA to be "no". But if required, we can specify the condition, computing power and bandwidth link of each new LA.
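Algorithm 1 can be sketched as follows; the node records, the load-halving rule and the throughput function are simplified, hypothetical stand-ins for the real tool.

```python
# Sketch of Algorithm 1: repeatedly add a new LA under the parent of
# the splittable node with minimum throughput. All structures here are
# simplified assumptions, not the tool's real data model.

def add_las(nodes, throughput, available):
    """Add up to `available` LAs; return the names of the LAs added."""
    added = []
    while available > 0:
        # Steps 4-5: splittable node with minimum throughput, if any
        candidates = [n for n in nodes if n["splittable"]]
        if not candidates:
            break                            # unbreakable bottleneck reached
        worst = min(candidates, key=throughput)
        # Step 6: the new LA takes half of the worst node's load;
        # new LAs are created with condition "no" (not splittable)
        new_la = {"name": f"LA{len(added) + 1}", "splittable": False,
                  "load": worst["load"] / 2}
        worst["load"] /= 2
        nodes.append(new_la)
        added.append(new_la["name"])
        available -= 1                       # Step 10
    return added

nodes = [{"name": "Icluster", "splittable": True, "load": 200},
         {"name": "Paraski", "splittable": True, "load": 40}]
added = add_las(nodes, throughput=lambda n: 1.0 / n["load"], available=2)
print(added)                                 # → ['LA1', 'LA2']
```

With these inputs both new LAs go to Icluster, whose 200 servers make it the bottleneck, mirroring the splitting of Icluster described below.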

We calculated the throughput of this real heterogeneous network using the formula given before for the throughput of a structure. The performance of the network is poor: only 2 requests/second. But we can improve the throughput of the network by breaking the bottleneck through the addition of more LAs, as described in Algorithm 1. In the network shown in Figure 11, the throughput of the LA named Icluster is the minimum, so we add a new LA to Grenoble to divide the load of Icluster. The new LA and Icluster then have 100 servers each, and the throughput of the network increases to 2.2 requests/second. In Figure 12, it can be seen that by adding just three more LAs we can double the throughput of the network; we reach an unbreakable bottleneck (a bottleneck caused by a node whose condition is "no") after adding 9 LAs, at which point the throughput of the network is 17.88 requests/sec.

[Number of requests per second versus number of agents added, rising from 2 to about 18.]

Figure 12: Throughput of the heterogeneous network as more LAs are added


Suppose we want to find a new architecture with a specific number of nodes and servers, where the nodes can have computing powers within a given range and links can be established with bandwidths within a specific range. On the basis of random values for the computing powers and link bandwidths, many graphs are generated. After comparing the throughput of each graph, the architecture can be configured by actually linking the processors according to the best graph. By random selection of the computing powers and bandwidths, we can even select an architecture that gives a specific throughput. We can set different conditions, such as the number of nodes, the number of servers, the range of computing powers and the range of bandwidths, to reach the desired throughput.
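The random-architecture search can be sketched as follows; the scoring rule is an illustrative assumption, not the tool's exact throughput formula.

```python
import random

# Sketch of the random-architecture search: draw many candidate
# parameter sets within the given ranges, score each with a simplified
# throughput proxy, and keep the best one.

def random_candidate(n_nodes, w_range=(25, 30), b_range=(1, 2500)):
    """Random computing powers (Mflops) and link bandwidths (MB/s)."""
    return {"w": [random.uniform(*w_range) for _ in range(n_nodes)],
            "b": [random.uniform(*b_range) for _ in range(n_nodes - 1)]}

def score(cand, S=0.2):
    """Crude throughput proxy: the slowest link or node is the cap."""
    return min(min(cand["b"]) / S, min(cand["w"]))

random.seed(0)
best = max((random_candidate(8) for _ in range(100)), key=score)
print(len(best["w"]), len(best["b"]))   # → 8 7
```

In the real tool each candidate would be a full tree scored with the steady-state formulas of Section 5.2; the structure of the search (generate, score, keep the maximum) is the same.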

6 Conclusion and Future Work

This report presents a tool for the automatic deployment of DIET on the grid. With the help of this tool we can find which of a set of given structures is the best in terms of performance. Using this tool we can also predict the effect on performance of specific changes in the network.

It is very important to find when a bottleneck occurs in a network, and also its cause. The methods to remove a bottleneck may differ depending on what caused it. This tool can find the bottleneck in the deployed DIET network, and we can break the bottleneck by adding new LAs to improve the performance.

From the simulation results on homogeneous structures, it can be seen that the binary structure is the best. If the number of nodes is small and the number of servers is large (more than 70), it is better to take a star-type structure. To maintain the throughput of the graph, it is better to increase the number of nodes in proportion to the number of servers.

From the simulation results on heterogeneous networks, it can be seen that by adding new LAs we can improve the performance of the network. This tool helps in modeling DIET, and hence in finding the structure that provides the best throughput.

Currently, we have to compare our simulation results with DIET experiments on a cluster. Then we have to upgrade our tool for dynamic updating of the network configuration using the ALNeM (Application Level Network Mapper) package. After that we have to add a timer to the tool to get real values for the CORBA implementation of DIET. Finally, we have to integrate this tool into the DIET code and validate the work by real deployment.


References

[1] A. Furtado, A. Rebouças, E. Argollo, J. R. de Souza, D. Rexachs, and E. Luque. How can geographically distributed clusters collaborate? Submitted to 10th European PVM/MPI Users' Group Conference (EuroPVM/MPI03).

[2] Cyril Banino, Olivier Beaumont, Arnaud Legrand, and Yves Robert. Scheduling strategies for master-slave tasking on heterogeneous processor grids. Technical Report 2002-12, LIP, mar 2002.

[3] Olivier Beaumont, Larry Carter, Jeanne Ferrante, Arnaud Legrand, and Yves Robert. Bandwidth-centric allocation of independent tasks on heterogeneous platforms. In International Parallel and Distributed Processing Symposium IPDPS'2002. IEEE Computer Society Press, 2002.

[4] Fran Berman, Geoffrey C. Fox, and Anthony J.G. Hey. The grid: past, present, future. In Fran Berman, Geoffrey C. Fox, and Anthony J.G. Hey, editors,Grid Computing, Making the Global Infrastructure a Reality, Communications Networking and Distributed Systems, chapter 1, pages 3–50. Wiley Series, 2003. ISBN 0-470-85319-0.

[5] Luis Bernardo and Paulo Pinto. Scalable service deployment using mobile agents. Lecture Notes in Computer Science, 1477:261–??, 1998.

[6] P. Bhat, C. S. Raghavendra, and V. Prasanna. Efficient collective communication in dis- tributed heterogeneous systems. In19th International Conference on Distributed Comput- ing Systems (19th ICDCS’99), Austin, Texas, May 1999. IEEE.

[7] Eddy Caron, Fr´ed´eric Desprez, Fr´ed´eric Lombard, Jean-Marc Nicod, Martin Quinson, and Fr´ed´eric Suter. A Scalable Approach to Network Enabled Servers. In B. Monien and R. Feldmann, editors, Proceedings of the 8th International EuroPar Conference, volume 2400 of Lecture Notes in Computer Science, pages 907–910, Paderborn, Germany, August 2002. Springer-Verlag.

[8] Eddy Caron and Fr´ed´eric Suter. Parallel Extension of a Dynamic Performance Forecasting Tool. InProceedings of the International Symposium on Parallel and Distributed Comput- ing, pages 80–93, Iasi, Romania, Jul 2002.

[9] C.C.Shen and W.H.Tsai. A graph matching approach to optimal task assignment in distributed computing system using a minimax criterion. IEEE Trans. Computers, C- 34(3):197–203, March 1985.

[10] Grid Computing1. http://www.gridcomputing.org/.

[11] Grid Computing2. http://www-1.ibm.com/grid/index.shtml.

[12] DIET. http://graal.ens-lyon.fr/~diet.

[13] GLOBUS. http://www.globus.org/.

[14] H. Casanova and J. Dongarra. Network-enabled server systems: Examples and applications. February 1999.

[15] H.Lican, W.Zhaohui, and Pan Yunhe. Virtual and dynamic hierarchical architecture for e- science grid. In Track on Clusters and Grids ICCS 2003, editors,Lecture Notes in Computer Science, volume 2659, Part III, page 316, Melbourne, Australia and St. Petersburg, Russia, June 2-4 2003. Springer.


[16] H. A. James, K. A. Hawick, and P. D. Coddington. Scheduling Independent Tasks on Metacomputing Systems. Technical Report DHPC-066, Distributed High Performance Computing Group, Adelaide University, March 1999. To be published in Proc. of Parallel and Distributed Computing Systems (PDCS’99), Fort Lauderdale, August 1999.

[17] Muhammad Kafil and Ishfaq Ahmad. Optimal task assignment in heterogeneous dis- tributed computing systems. IEEE Concurrency, 6(3):42–51, July/September 1998.

[18] K.L.Park, H.J.Lee, K.W.Koh, O.Y.Kwon, S.Y.Park, H.W.Park, and S.D.Kim. Dynamic topology selection for high performance mpi in the grid environments. InTo appear at 10th European PVM/MPI Users’ Group Conference (EuroPVM/MPI03), Venice, Italy, Sep 29 - Oct 4 2003.

[19] Legion. http://legion.virginia.edu/.

[20] Damien Magoni. Hierarchical addressing and routing mechanisms for distributed applica- tions over heterogeneous networks. In Workshop on Innovative Solutions for Grid Com- puting ICCS 2003, editor,Lecture Notes in Computer Science, volume 2659, Part III, page 1093, Melbourne, Australia and St. Petersburg, Russia, June 2-4 2003. Springer.

[21] M.C.Ferris, M.P.Mesnier, and J.J.More. Neos and condor: Solving optimization problems over the internet. ACM Transactions on Mathematical Software, 26(1):1–18, March 2000.

[22] NEOS. http://www-neos.mcs.anl.gov/.

[23] NetSolve. http://icl.cs.utk.edu/netsolve/.

[24] NINF. http://ninf.apgrid.org/.

[25] PSE. http://www-cgi.cs.purdue.edu/cgi-bin/acc/pses.cgi/.

[26] R.Buyya, T.Didson, D.Gannon, E.Laure, S.Matsuoka, T.Priol, J.Saltz, E.Seidel, and Y.Tanaka. Problem solving environment comparision whitepaper. February 2001.

[27] Manuel Rodr´ıguez-Mart´ınez and Nick Roussopoulos. Automatic deployment of application- specific metadata and code in MOCHA. Lecture Notes in Computer Science, 1777:69–??, 2000.

[28] Keith Seymour, Hidemoto Nakada, Satoshi Matsuoka, Jack Dongarra, Craig Lee, and Henri Casanova. Overview of GridRPC: A remote procedure call API for Grid computing.Lecture Notes in Computer Science, 2536:274–??, 2002.

[29] S.Matsuoka, H.Casanova, and J.Dongarra. Network-enabled server systems and the com- putational grid. 2001.

[30] Jon B. Weissman. Scheduling multi-component applications in heterogeneous wide-area networks. In9th Heterogeneous Computing Workshop, page 209. University of Minnesota, May 2000. Cancun, Mexico.
