
8.2 Heterogeneous Distributed Computing Model


In this section, we describe a general MapReduce-inspired coded distributed computing problem with heterogeneous computational powers, and define the metric that we will use to assess the performance of our proposed algorithms.

We consider the problem of computing $K$ arbitrary output functions from a dataset consisting of $N$ files of $F$ bits each on a cluster of $\Lambda$ distributed computing nodes, for some positive integers $K, N, F, \Lambda \in \mathbb{N}$. We assume that each node $\lambda \in [\Lambda]$ has a computational power described by the dimensionless quantity $c_\lambda$, such that $c = (c_1, c_2, \ldots, c_\Lambda)$ denotes the vector of the computational powers of the nodes. Without loss of generality we assume that $c_i \geq c_j$ for $j \geq i$, and we also assume that the computational powers are normalized such that the total computational power of the $\Lambda$ nodes sums to one, i.e., $\sum_{\lambda=1}^{\Lambda} c_\lambda = 1$. The distributed computation is coordinated by a central controller which, for each $\lambda \in [\Lambda]$, assigns $L_\lambda$ output functions to computing node $\lambda$, such that $\sum_{\lambda=1}^{\Lambda} L_\lambda = K$. We will refer to the vector $L = (L_1, L_2, \ldots, L_\Lambda)$ as the reduce load vector, which we assume to be sorted in descending order, i.e., $L_i \geq L_j$ for $j \geq i$.

More specifically, given the input files $\mathcal{F} = \{f_1, f_2, \ldots, f_N\}$, with $f_n \in \mathbb{F}_{2^F}$, the objective of the computing problem is to evaluate the output values $u_1, u_2, \ldots, u_K$, where $u_k = \phi_k(f_1, f_2, \ldots, f_N) \in \mathbb{F}_{2^B}$, $k \in [K]$, for some $B \in \mathbb{N}$, and $\phi_k : (\mathbb{F}_{2^F})^N \to \mathbb{F}_{2^B}$ is a function that maps all the input files to a binary stream of length $B$. Inspired by the distributed computing framework MapReduce, we decompose the computation as
$$\phi_k(f_1, f_2, \ldots, f_N) = h_k\big(g_{k,1}(f_1), g_{k,2}(f_2), \ldots, g_{k,N}(f_N)\big), \quad k \in [K], \qquad (8.2)$$
where $g_n \triangleq [g_{1,n}, g_{2,n}, \ldots, g_{K,n}]$ is the map function that maps input file $f_n$ into $K$ intermediate values (IVs) $v_{k,n} = g_{k,n}(f_n) \in \mathbb{F}_{2^T}$, $k \in [K]$, for some $T \in \mathbb{N}$, while $h_k : (\mathbb{F}_{2^T})^N \to \mathbb{F}_{2^B}$ is the reduce function that maps the IVs $v_{k,1}, v_{k,2}, \ldots, v_{k,N}$ into the output bit stream $u_k$. The evaluation of the output functions $\{\phi_k\}_{k=1}^{K}$ as in (8.2) is carried out in three sequential phases: the map phase, the shuffle phase and the reduce phase.

We assume that each phase starts only when the previous one has finished (sequential MapReduce).
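To make the decomposition in (8.2) concrete, the following minimal Python sketch evaluates $K = 3$ output functions over $N = 3$ files through per-file map functions and per-function reduce functions. The word-count task, the keyword list and all variable names are illustrative assumptions and are not part of the model above.

```python
# Minimal sketch of the decomposition in (8.2): each output function phi_k is
# computed as a reduce function h_k applied to per-file intermediate values
# v_{k,n} = g_{k,n}(f_n). The toy task (counting K keywords across N text
# files) is purely illustrative.

keywords = ["cache", "node", "shuffle"]                               # K = 3
files = ["cache node cache", "shuffle node", "cache shuffle shuffle"]  # N = 3

def g(k, file_text):
    """Map function g_{k,n}: intermediate value for keyword k on one file."""
    return file_text.split().count(keywords[k])

def h(k, ivs):
    """Reduce function h_k: combines the N intermediate values v_{k,1..N}."""
    return sum(ivs)

outputs = []
for k in range(len(keywords)):
    ivs = [g(k, f) for f in files]     # map phase: one IV per (function, file)
    outputs.append(h(k, ivs))          # reduce phase: one output per function

print(dict(zip(keywords, outputs)))    # {'cache': 3, 'node': 2, 'shuffle': 3}
```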

We assume that a central controller coordinates the computation. Before the computation starts, the central controller partitions the set of files $\mathcal{F}$ into $\Lambda$ sets $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_\Lambda$, such that node $\lambda$ is assigned the set $\mathcal{M}_\lambda$, where $M_\lambda \triangleq |\mathcal{M}_\lambda|$. We will refer to $\mathcal{M} = (\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_\Lambda)$ as the file assignment vector. Furthermore, the central controller decides which output values each computing node will have to evaluate. We denote by $\mathcal{L}_\lambda \subseteq [K]$ the set of the indices of the output values that will be computed by node $\lambda$, with $|\mathcal{L}_\lambda| = L_\lambda$. Hereinafter, the quantity $L_\lambda$ will also be referred to as the reduce load of node $\lambda$, $\lambda \in [\Lambda]$. We now proceed with an important definition which specifies the class of file assignments $\mathcal{M}$ considered throughout this chapter.

Definition 4. We say that a file assignment $\mathcal{M}$ is homogeneous if all files $f \in \mathcal{F}$ are assigned to the same number of computing nodes.

8.2.1 Map Phase

In this phase, the central controller distributes to node $\lambda$, $\lambda \in [\Lambda]$, a set of files $\mathcal{M}_\lambda \subseteq \mathcal{F}$, which are mapped by node $\lambda$ to obtain the intermediate values $\{v_{k,n}\}_{k=1}^{K}$ for each $f_n \in \mathcal{M}_\lambda$. We define the computation load of node $\lambda$ as the number of files assigned to node $\lambda$ normalized by the total number of files $N$, i.e., $\gamma_\lambda \triangleq \frac{|\mathcal{M}_\lambda|}{N}$, while the total computation load is $t \triangleq \sum_{\lambda=1}^{\Lambda} \gamma_\lambda$. We will use the symbol $\gamma$ to refer to the vector of the computation loads of the nodes, i.e., $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_\Lambda)$. Furthermore, we assume that node $\lambda$ processes its assigned files with a mapping computational speed of $c^{(m)}_\lambda$ bits per second, where $c^{(m)}_\lambda \triangleq c_\lambda \cdot b_m$, for some $b_m \in \mathbb{R}$. It follows that the time that node $\lambda$ spends mapping the files in $\mathcal{M}_\lambda$ can be expressed as

$$T^{(m)}_\lambda = \frac{M_\lambda F}{c^{(m)}_\lambda}, \qquad (8.3)$$

thus resulting in a total map time of

$$T^{(m)} = \max_{\lambda \in [\Lambda]} \left\{ \frac{M_\lambda F}{c^{(m)}_\lambda} \right\} \qquad (8.4)$$
seconds.
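As a small numerical illustration of (8.3) and (8.4), the following Python sketch computes the per-node map times and the resulting overall map time; the values chosen for $c$, $M_\lambda$, $F$ and $b_m$ are placeholders and are not taken from the chapter.

```python
# Numerical sketch of (8.3)-(8.4): per-node map times and the overall map time.
# All numerical values below are illustrative only.

c = [0.5, 0.3, 0.2]     # computational powers c_lambda (sum to 1)
M = [6, 4, 2]           # number of files assigned to each node, M_lambda
F = 10**6               # file size in bits
b_m = 10**8             # mapping constant, so c_m[i] = c[i] * b_m bits per second

c_m = [ci * b_m for ci in c]                                # speeds c^{(m)}_lambda
T_map_per_node = [Mi * F / ci for Mi, ci in zip(M, c_m)]    # eq. (8.3), in seconds
T_map = max(T_map_per_node)                                 # eq. (8.4)

print(T_map_per_node, T_map)
```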

8.2.2 Shuffle Phase

After all the nodes have finished mapping their assigned files, the $\Lambda$ computing nodes exchange some of the computed intermediate values, such that at the end of the shuffle phase each node $\lambda$ has acquired all the intermediate values $v_{k,1}, v_{k,2}, \ldots, v_{k,N}$, $\forall k \in \mathcal{L}_\lambda$, required for computing the output values $\{u_k \mid k \in \mathcal{L}_\lambda\}$ in the subsequent reduce phase. To do so, each node $\lambda$ constructs a message $X_\lambda \in \mathbb{F}_{2^{\ell_\lambda}}$, for some $\ell_\lambda \in \mathbb{N}$, as a function of the intermediate values computed in the map phase. In particular, node $\lambda$ uses an encoding function $\Psi_\lambda : (\mathbb{F}_{2^T})^{K|\mathcal{M}_\lambda|} \to \mathbb{F}_{2^{\ell_\lambda}}$ to construct the symbol

$$X_\lambda = \Psi_\lambda\big(\{g_n(f_n) : n \in \mathcal{M}_\lambda\}\big),$$

which is later multicast to all other nodes through a channel of capacity $c_s$ bits per second. We define the communication load as the normalized total number of bits sent through the channel during the shuffle phase, i.e.,

$$\tau \triangleq \frac{\sum_{\lambda=1}^{\Lambda} \ell_\lambda}{K N T}, \qquad (8.5)$$

which corresponds to a shuffle time of
$$T^{(s)} = \frac{\sum_{\lambda=1}^{\Lambda} \ell_\lambda}{c_s} \qquad (8.6)$$
seconds.
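The following short Python sketch illustrates how the communication load (8.5) and the shuffle time (8.6) are obtained from the message lengths $\ell_\lambda$; all numerical values are illustrative placeholders.

```python
# Sketch of (8.5)-(8.6): communication load and shuffle time from the message
# lengths ell_lambda. All numbers below are illustrative placeholders.

K, N, T = 6, 12, 1000     # output functions, files, bits per intermediate value
ell = [8000, 6000, 4000]  # message lengths ell_lambda in bits (one per node)
c_s = 10**7               # shuffle channel capacity in bits per second

tau = sum(ell) / (K * N * T)   # eq. (8.5): normalized communication load
T_shuffle = sum(ell) / c_s     # eq. (8.6): shuffle time in seconds

print(tau, T_shuffle)
```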

We now proceed with an important definition.

Definition 5. We say that a shuffle scheme is one-shot if any computing node can recover each of its required bits (of the required intermediate values) from the intermediate values computed locally in the map phase and at most one message transmitted by any other node.

8.2.3 Reduce Phase

In the last phase, node $\lambda$ uses the received symbols $X_1, X_2, \ldots, X_\Lambda$ and the locally computed intermediate values $\{g_n(f_n) : n \in \mathcal{M}_\lambda\}$ to obtain the inputs of the reduce functions $\{h_k : k \in \mathcal{L}_\lambda\}$, i.e., for each $k \in \mathcal{L}_\lambda$ and some decoding function $\chi^k_\lambda : \mathbb{F}_{2^{\ell_1}} \times \mathbb{F}_{2^{\ell_2}} \times \cdots \times \mathbb{F}_{2^{\ell_\Lambda}} \times (\mathbb{F}_{2^T})^{K|\mathcal{M}_\lambda|} \to (\mathbb{F}_{2^T})^N$, node $\lambda$ computes
$$(v_{k,1}, v_{k,2}, \ldots, v_{k,N}) = \chi^k_\lambda\big(X_1, X_2, \ldots, X_\Lambda, \{g_n(f_n) : n \in \mathcal{M}_\lambda\}\big).$$

Finally, node $\lambda$ computes, for each $k \in \mathcal{L}_\lambda$, the reduce function $h_k(v_{k,1}, v_{k,2}, \ldots, v_{k,N})$ to complete its assigned jobs. We assume that the reduce computational speed of node $\lambda$ is $c^{(r)}_\lambda$ bits per second, where $c^{(r)}_\lambda = c_\lambda \cdot b_r$, for some $b_r \in \mathbb{R}$. Hence, the time that node $\lambda$ spends computing its assigned output values can be expressed as

$$T^{(r)}_\lambda = \frac{L_\lambda N T}{c^{(r)}_\lambda}, \qquad (8.7)$$

thus resulting in a total reduce time of
$$T^{(r)} = \max_{\lambda \in [\Lambda]} \left\{ \frac{L_\lambda N T}{c^{(r)}_\lambda} \right\} \qquad (8.8)$$
seconds.
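Analogously to the map phase, a small Python sketch of (8.7) and (8.8) is given below; the reduce loads and constants are again illustrative.

```python
# Sketch of (8.7)-(8.8): per-node reduce times and the overall reduce time.
# The reduce loads L and the constants below are illustrative only.

c = [0.5, 0.3, 0.2]   # computational powers c_lambda
L = [3, 2, 1]         # reduce loads L_lambda (sum to K = 6)
N, T = 12, 1000       # files and bits per intermediate value
b_r = 10**6           # reduce constant, c_r[i] = c[i] * b_r bits per second

c_r = [ci * b_r for ci in c]
T_red_per_node = [Li * N * T / ci for Li, ci in zip(L, c_r)]   # eq. (8.7)
T_reduce = max(T_red_per_node)                                 # eq. (8.8)

print(T_red_per_node, T_reduce)
```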

8.2.4 Problem Formulation

For the heterogeneous coded distributed computing model described above, the metric of interest is the total computing time

$$T_\Sigma \triangleq T^{(m)} + T^{(s)} + T^{(r)}.$$

The optimal duration of each phase is denoted by $T^{*(m)}$, $T^{*(s)}$ and $T^{*(r)}$, respectively. In this work, we consider three different scenarios, which we describe in the following paragraphs.
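As a minimal illustration, the metric is simply the sum of the three phase times; the numbers below reuse the illustrative values produced by the sketches of the previous subsections and are placeholders, not results.

```python
# Total computing time T_Sigma = T^(m) + T^(s) + T^(r), reusing the illustrative
# phase times computed in the earlier sketches (placeholder values, in seconds).
T_map, T_shuffle, T_reduce = 0.133, 0.0018, 0.08
T_sigma = T_map + T_shuffle + T_reduce
print(T_sigma)   # the quantity to be minimized
```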

Scenario with fixed computation loads and fixed reduce loads

In this scenario we assume that the computation load of each computing node is fixed, i.e., the vector $\gamma$ is fixed and given, which implies that the map time $T^{(m)}$ is fixed and cannot be optimized. At the same time, we assume that the reduce load vector $L$ is also given, which in turn implies that the reduce time $T^{(r)}$ is also fixed, thus leaving the shuffle time $T^{(s)}$ as the only quantity to be minimized. We notice that minimizing the shuffle time $T^{(s)}$ is equivalent to minimizing the communication load $\tau$. Hereinafter, we denote by $\tau^*(L, \gamma)$ the optimal communication load for any given reduce load vector $L$ and any given computation load vector $\gamma$. Furthermore, we will use $\tau^*_{\mathrm{hom,os}}(L, \gamma)$ to refer to the optimal communication load under the assumptions of homogeneous file assignment (cf. Definition 4) and one-shot shuffling (cf. Definition 5).

Scenario with flexible computation loads and fixed reduce loads

We now consider the scenario where the reduce load vector $L$ is given and the vector of computation loads $\gamma$ can be optimized as a function of $L$. Nevertheless, we assume that the total computation load to be distributed among the nodes equals the value $t$, i.e., $\sum_{\lambda=1}^{\Lambda} \gamma_\lambda = t$. Unlike the previous scenario, here we can optimize the time that the system spends in the map and shuffle phases, while the reduce time is fixed.

For this scenario, we denote by $\tau^*_{\mathrm{hom}}(L, t)$ the optimal communication load under the assumption of homogeneous file assignment.

Scenario with flexible computation loads and flexible reduce loads

In this third scenario, we assume that both the computation loads and the reduce loads can be optimized as a function of the computational power vector $c$.

8.3 Main Results

In this section we present our main contributions on heterogeneous coded distributed computing for the three scenarios described above. Hereinafter, unless explicitly stated otherwise, all our results hold for integer values of the total computation load $t$.

8.3.1 Fixed Computation Loads and Fixed Reduce Loads

We start by presenting a lower bound under the assumption of homogeneous file assignment.

Theorem 11. For the heterogeneous coded distributed computing problem with given computation load vector $\gamma$ and given reduce load vector $L$, the optimal communication load under the assumption of homogeneous file assignment satisfies
$$\tau^*_{\mathrm{hom}}(L, \gamma) \geq \frac{\sum_{\lambda \in [\Lambda]} L_\lambda (1 - \gamma_\lambda)}{K t}, \qquad (8.9)$$
where $t = \sum_{\lambda=1}^{\Lambda} \gamma_\lambda$ and $t \in [\Lambda]$.

The proof is relegated to Section 8.6.
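For concreteness, the right-hand side of (8.9) can be evaluated numerically as in the following sketch, where the chosen values of $K$, $L$ and $\gamma$ are illustrative and satisfy $t = \sum_\lambda \gamma_\lambda \in [\Lambda]$.

```python
# Numerical sketch of the lower bound (8.9) on the communication load.
# The reduce loads L, computation loads gamma and K below are illustrative only.

K = 6
L = [3, 2, 1]              # reduce loads L_lambda (sum to K)
gamma = [0.6, 0.3, 0.1]    # computation loads gamma_lambda
t = sum(gamma)             # total computation load (here t = 1)

lower_bound = sum(Li * (1 - gi) for Li, gi in zip(L, gamma)) / (K * t)  # eq. (8.9)
print(lower_bound)
```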

Equipped with the above lower bound, we continue by identifying a set of pairs $(L, \gamma)$ for which we can provide an optimal achievable scheme. To this end, we start by defining a computation load vector $\bar{\gamma}(L)$ for any given reduce load vector $L$.

Definition 6. For any reduce load vector $L$, any $t \in [\Lambda]$ and any $\lambda \in [\Lambda]$, we define
$$\bar{\gamma}_\lambda(L) \triangleq \frac{L_\lambda \sum_{\eta \in C_{t-1}^{[\Lambda] \setminus \{\lambda\}}} \prod_{j=1}^{t-1} L_{\eta(j)}}{\sum_{\omega \in C_{t}^{[\Lambda]}} \prod_{j=1}^{t} L_{\omega(j)}}, \qquad (8.10)$$
which in turn defines the vector $\bar{\gamma}(L) = (\bar{\gamma}_1(L), \bar{\gamma}_2(L), \ldots, \bar{\gamma}_\Lambda(L))$.
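The following Python sketch evaluates (8.10) by direct enumeration, interpreting $C_s^{\mathcal{A}}$ as the set of $s$-element subsets of $\mathcal{A}$ (so that numerator and denominator are elementary symmetric sums of the reduce loads); it also checks that the resulting loads sum to $t$ and evaluates the right-hand side of (8.9) at $\gamma = \bar{\gamma}(L)$. The example values of $L$ and $t$ are illustrative only.

```python
# Sketch evaluating gamma_bar_lambda(L) in (8.10) by enumerating combinations,
# interpreting C_s^A as the set of s-element subsets of A. Example values of
# L and t are illustrative only.
from itertools import combinations
from math import prod

def gamma_bar(L, t):
    Lam = len(L)
    # denominator: sum over all t-subsets omega of [Lambda] of prod_{j in omega} L_j
    den = sum(prod(L[j] for j in omega) for omega in combinations(range(Lam), t))
    gbar = []
    for lam in range(Lam):
        others = [j for j in range(Lam) if j != lam]
        # numerator: L_lambda times the sum over (t-1)-subsets of [Lambda] \ {lambda}
        num = L[lam] * sum(prod(L[j] for j in eta)
                           for eta in combinations(others, t - 1))
        gbar.append(num / den)
    return gbar

L = [3, 2, 1]              # reduce loads, so K = sum(L) = 6
t = 2                      # total computation load
gbar = gamma_bar(L, t)
print(gbar, sum(gbar))     # the computation loads sum to t

# Right-hand side of (8.9) evaluated at gamma = gamma_bar(L); the theorem that
# follows shows that this value is achievable for this choice of gamma.
K = sum(L)
tau = sum(Li * (1 - gi) for Li, gi in zip(L, gbar)) / (K * t)
print(tau)
```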

Observation 3. We notice that the expression of the computation load $\bar{\gamma}_\lambda(L)$ defined above coincides with equation (4.16), which gives the normalized memory allocated to cache $\lambda$ as a function of the cache occupancy vector $L$ for the topology-aware shared-cache problem discussed in Chapter 4.

Equipped with Definition 6, we are now ready to present our first optimality result of this chapter.

Theorem 12. For a heterogeneous distributed computing problem with reduce load vector $L$ and computation load vector $\gamma = \bar{\gamma}(L)$, the optimal communication load $\tau^*_{\mathrm{hom}}(L, \bar{\gamma}(L))$ under the assumption of homogeneous file assignment is given by
$$\tau^*_{\mathrm{hom}}(L, \bar{\gamma}(L)) = \frac{\sum_{\lambda \in [\Lambda]} L_\lambda (1 - \bar{\gamma}_\lambda(L))}{K t}. \qquad (8.11)$$

Proof. For the considered computation load vector $\bar{\gamma}(L)$, which is a function of the reduce load vector $L$, the achievable scheme is presented in Section 8.4, while the matching converse was presented in Theorem 11, which is here used with $\gamma = \bar{\gamma}(L)$.

Remark 10. We notice that the achievable scheme presented in Section 8.4 can be seen as a D2D version of the topology-aware scheme for the shared-cache problem presented in Chapter 4, Section 4.4.

Theorem 12 establishes the optimality of our achievable scheme, under the aforementioned assumption of homogeneous file assignment, for the pairs $(L, \bar{\gamma}(L))$ where the computation load vector is a function of $L$ (for any $L$ with $L_\lambda \in \mathbb{N}$) as in Definition 6. The optimal communication load for an arbitrary pair $(L, \gamma)$ remains an open problem.

We notice that, for any non-uniform reduce load vector $L$, the achievable communication load for the tuple $(L, \bar{\gamma}(L))$ is lower than the communication load $\tau_{CMR}(\Lambda, \gamma)$ in equation (8.1) achieved by the standard coded MapReduce algorithm, which uses $L = (\frac{K}{\Lambda}, \ldots, \frac{K}{\Lambda})$ and $\gamma = (\frac{t}{\Lambda}, \ldots, \frac{t}{\Lambda})$. While this implies that, if we want to reduce the communication load, we have to choose a reduce load vector $L$ that is as skewed as possible, we also notice that, in the scenario where all nodes have the same computational power, a skewed (non-uniform) $L$ implies both a higher map time and a higher reduce time than the case where the reduce load vector is chosen to be uniform, i.e., $L = (\frac{K}{\Lambda}, \ldots, \frac{K}{\Lambda})$. Therefore, it is evident that, in order to minimize the overall computing time (comprising the map time, the shuffle time and the reduce time), we have to take into account the specific computing problem, which affects the time that each node spends mapping a file of the dataset as well as the time required to produce a given output value in the reduce phase. When we consider the scenario where both the computation load vector and the reduce load vector can be optimized, we will present a strategy that aims at minimizing the reduce time while also reducing the shuffle time.

8.3.2 Flexible Computation Loads and Fixed Reduce Loads

In the next theorem we present an order-optimality result for the considered scenario where the computation load vector $\gamma$ can be optimized as a function of the given reduce load vector $L$ in order to minimize the communication load $\tau$.

Theorem 13. For a heterogeneous distributed computing problem with reduce load vector $L$, the optimal communication load $\tau^*_{\mathrm{hom,os}}(L, t)$ under the assumptions of homogeneous file assignment and one-shot shuffling satisfies
$$\frac{1}{2} \cdot \tau^*_{\mathrm{hom}}(L, \bar{\gamma}(L)) \leq \tau^*_{\mathrm{hom,os}}(L, t) \leq \tau^*_{\mathrm{hom}}(L, \bar{\gamma}(L)), \qquad (8.12)$$
where $\tau^*_{\mathrm{hom}}(L, \bar{\gamma}(L))$ is provided in Theorem 12, for some $t \in [\Lambda]$.

Proof. The converse bound and the derivation of the optimality gap for Theorem 13 are presented in Section 8.5, while the achievable scheme is the one that achieves the communication load in Theorem 12 and is presented in Section 8.4.

Theorem 13 tells us that an order-optimal selection of the computation load vector $\gamma$, given a total computation load $t$, is the one described by the vector $\bar{\gamma}(L)$. This strategy allocates more computation load to the computing nodes that have to evaluate more output functions, in such a way that a coding gain of $t$ is always achievable in the shuffle phase regardless of $L$. At the same time, allocating more computation load to the nodes that are more loaded in the reduce phase also reduces the total number of intermediate values that have to be exchanged among the nodes, i.e., this approach yields a higher "local caching gain"² than the one that would be achieved with a uniform distribution of the total computation load among the nodes.

Equipped with an order-optimal solution for reducing the communication load in a setting with a non-uniform reduce load vector $L$, in the next paragraph we show a simple yet effective way to select the reduce load vector $L$ as a function of the computational power vector $c$.

8.3.3 Flexible Computation Loads and Flexible Reduce Loads

In this scenario the computation load vector γ and the reduce load vector L can be selected as a function of the computational power vector c in order to minimize the overall computing time TΣ. To this end, we take a separation approach and we first focus on minimizing the reduce time T(r), before proceeding with the minimization of the shuffle time T(s).

The optimal reduce time, minimized over all possible reduce load vectors $L \in \mathbb{N}^\Lambda$, can be evaluated as
$$T^{*(r)} = \min_{L \in \mathbb{N}^\Lambda} \max_{\lambda \in [\Lambda]} \left\{ \frac{L_\lambda N T}{c^{(r)}_\lambda} \right\} = \frac{N T}{b_r} \min_{L \in \mathbb{N}^\Lambda} \max_{\lambda \in [\Lambda]} \frac{L_\lambda}{c_\lambda}, \qquad (8.13)$$
which can be seen to be minimized for $L_\lambda = K c_\lambda$. Therefore, the minimum reduce time takes the value

$$T^{*(r)} = \frac{K N T}{b_r}. \qquad (8.14)$$
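Since the minimization in (8.13) is over integer-valued reduce load vectors while $K c_\lambda$ need not be an integer, a practical selection can round the proportional allocation while preserving $\sum_\lambda L_\lambda = K$. The sketch below uses a simple largest-remainder rounding rule; this heuristic is an illustrative assumption and not the allocation analyzed in the chapter.

```python
# Sketch of the reduce load selection L_lambda ~ K * c_lambda behind (8.13)-(8.14).
# Since L must be a vector of integers summing to K, the proportional allocation
# is rounded with a largest-remainder rule (illustrative heuristic only).

def proportional_reduce_loads(c, K):
    raw = [K * ci for ci in c]                  # ideal (possibly fractional) loads
    L = [int(x) for x in raw]                   # floor of each load
    order = sorted(range(len(c)), key=lambda i: raw[i] - L[i], reverse=True)
    for i in order[: K - sum(L)]:               # hand out the leftover functions
        L[i] += 1
    return L

c = [0.5, 0.3, 0.2]   # computational powers (illustrative)
K = 10                # number of output functions
L = proportional_reduce_loads(c, K)
print(L)                                        # e.g. [5, 3, 2]
print(max(Li / ci for Li, ci in zip(L, c)))     # reduce-time bottleneck max L/c
```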

With the optimal reduce load vector $L(c) \triangleq K c$ at hand, we can proceed to minimizing the shuffle time $T^{(s)}$ as a function of $L(c)$, which in turn is equivalent to minimizing the communication load $\tau(Kc, t)$. The following corollary holds.

Corollary 2. For the coded distributed computing problem with computational power vector $c$ and optimal reduce load vector $L(c) = K c$, the optimal communication load, under the assumptions of homogeneous file assignment and one-shot shuffling, satisfies

$$\frac{1}{2} \cdot \tau^*_{\mathrm{hom}}(Kc, \bar{\gamma}(Kc)) \leq \tau^*_{\mathrm{hom,os}}(Kc, t) \leq \tau^*_{\mathrm{hom}}(Kc, \bar{\gamma}(Kc)). \qquad (8.15)$$
Proof. Corollary 2 follows directly from Theorem 13.

² Here, the term "local caching gain" is borrowed from the coded caching literature.

Therefore, in this scenario where we can select the assignment of the output functions (i.e., the reduce loads of the nodes) as well as the computation load of each node as a function of the computational powers, selecting the reduce load vector $L(c) = K c$ minimizes the reduce time and allows for an order-optimal shuffle time, achieved by the algorithm described in Section 8.4. This solution has the drawback of slightly increasing the map time $T^{(m)}$ compared to the case where all computing nodes have the same storage size.

In particular, for the optimal choice of the reduce load vector $L(c) = K c$, the map time takes the value
$$T^{(m)} = \max_{\lambda \in [\Lambda]} \left\{ \frac{\bar{\gamma}_\lambda(Kc) N F}{c^{(m)}_\lambda} \right\}, \qquad (8.16)$$

which is in general higher than the map time given by a uniform computation load vector $\gamma = (\frac{t}{\Lambda}, \ldots, \frac{t}{\Lambda})$. However, we recall that in practical scenarios the time spent by the system in the map phase is much smaller than the shuffle time and the reduce time (cf. [16]); therefore this increased map time does not cancel out the benefits of our proposed solution, which aims at minimizing the more dominant reduce time and shuffle time.

8.4 A Novel File Assignment and Shuffle Scheme for
