• Aucun résultat trouvé

Designing Algorithms in MapReduce

N/A
N/A
Protected

Academic year: 2022

Partager "Designing Algorithms in MapReduce"

Copied!
14
0
0

Texte intégral

(1)

Designing Algorithms in MapReduce

Mauro Sozio

Institut Mines-Telecom

sozio@telecom-paristech.fr

(2)

BigData Paris, 21/01/13

Massive amount of data generated daily

Some facts:

Facebook >900M users, 900M objects (pages, groups, events).

Flickr >50 million users, 6 billion images!

Twitter > 300M users, 1.6 billion search queries per day

Google > 3 billion search queries per day

How to make sense of all these data?

Tuesday, June 18, 13

(3)

MapReduce

Reduce

Map Map Map Map

Reduce

Output

Reduce Reduce

Initially developed by Google, it is nowadays used by several companies (Yahoo!, IBM) and universities (Cornell, CMU...).

Many sequential algorithms have been adapted to MapReduce.

(4)

BigData Paris, 21/01/13

MapReduce Algorithms

 Matrix-vector iterative algorithms efficient in MapReduce:

PageRank;

Linear and logistic regression, naive Bayes, k-means clust., SVM;

Pair-wise document similarity, language modeling.

Tuesday, June 18, 13

(5)

MapReduce in Brief

 MapReduce code consists of two functions:

Map transforms the input into key-value pairs to process Reduce aggregates the list of values for each key

 The MapReduce environment takes care of distributing the computation

 Non-trivial MapReduce programs consist of many Map and Reduce tasks.

 Higher-level languages (Pig, Hive, etc.).

(6)

BigData Paris, 21/01/13

Three operations on key-value pairs

 User-defined: map : (key,value) → List(key’,value’) e.g. map (docId, document) → ((jaguar,1),(mac,1),(jaguar,1),...)

 Built-in function: shuffle. Group pairs with a same key and assign them to a reducer. Seamless to the user.

 User-defined: reduce: (key,List(values)) → List(key’,value’) e.g. reduce: (jaguar,(1,2,1)) → ((jaguar,4)

Tuesday, June 18, 13

(7)

Some facts

 Map and Reduce tasks are independent from each other!

 Reducer is an independent unit of computation.

 The output from a reduce could be the input for another

MapReduce iteration.

(8)

BigData Paris, 21/01/13

Map,Reduce,Shuffle

Tuesday, June 18, 13

(9)

MR: Counting Words

void map(String docId, String document):

// docId: name or physical address of the documents // document: document content

for each word w in document:

Output(w, "1");

void reduce(String word, Iterator partialSums):

// word: a word

// partialSums: a list of partial sums int sum = 0;

for each psum in partialSums:

sum += ParseInt(psum);

Output(word, AsString(sum));

(10)

BigData Paris, 21/01/13

Example

DocID Content

d1 the jaguar is a new world mammal of the felidae family.

d2 for jaguar, atari was keen to use a 68k family device.

d3 mac os x jaguar is available at a price of us $199 for apple’s

new “family pack”.

d4 one such ruling family to

incorporate the jaguar into their name

is jaguar paw.

d5 It is a big cat. 

Tuesday, June 18, 13

(11)

Example

(d1,content),,...,(d5,content) Map

jaguar 1 mammal 1 family 1 jaguar 1 jaguar 2 ...

Reduce Shuffle

jaguar 1,1,2 mammal 1 family 1 ...

jaguar 5

mammal 1

family 1

...

(12)

BigData Paris, 21/01/13

MapReduce Code (Java)

public class MapperText extends Mapper<LongWritable, Text, Text, IntWritable> {

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String s = value.toString();

String[] tab = s.split("[^-a-zA-Z0-9]");

for(String w : tab){

if(!w.isEmpty())

context.write(new Text(w.toLowerCase()),new IntWritable(1));

} }

}

public class ReducerText extends Reducer<Text, IntWritable, Text, LongWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException{

long total = 0;

for(IntWritable vals:values){total += vals.get();}

context.write(key, new LongWritable(total));

} }

Mapper Class

Reducer Class

Tuesday, June 18, 13

(13)

A MapReduce cluster

 Nodes inside a MapReduce cluster are decomposed as follows:

 A jobtracker acts as a master node; MapReduce jobs are submitted to it.

 Several tasktrackers run the computation itself, i.e., map and reduce tasks

 A given tasktracker may run several tasks in parallel

 Tasktrackers usually also act as data nodes of a distributed filesystem (e.g., GFS, HDFS)

+ a client node where the application is launched.

(14)

BigData Paris, 21/01/13

PageRank

 Find r s.t. r = Ar

 Suppose there are N web pages

 Initialize: r 0 = [1/N,….,1/N] T

 Iterate:

 r k+1 = Ar k

 Stop when |r k+1 - r k | 1 < ε

 |x| 1 = ∑ 1≤i≤N |x i | is the L 1 norm

Tuesday, June 18, 13

Références

Documents relatifs

[r]

[r]

More precisely, it is shown that the number of distinct répétition roots u that are bond to the occurrence of their cube u 3 somewhere along the textstring is bounded by n, whence

public void doGet (HttpServletRequest, HttpServletResponse ) throws IOException public void doPost (HttpServletRequest, HttpServletResponse res) throws IOException private

public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException{. String

public static void main(String [] args) throws IOException{. ServerSocket serveur =

public void ajouter(String id, double somme) throws java.rmi.RemoteException {((Compte)clients.get(id)).ajouter(somme); }. public void retirer(String id, double somme)

public static void main(String[] args) throws IOException { InputStream in=new FileInputStream(args[0]);.. Et avec