Distributed Indexing and Computing
Web data management and distribution
Serge Abiteboul, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart
http://gemo.futurs.inria.fr/wdmd
February 2, 2010
Gemo, Lamsade, LIG, Télécom (WDMD) Distributed Indexing and Computing February 2, 2010 1 / 21
Outline
1 Introduction to Distributed Systems
2 Distributed Search Trees
3 A Case Study: Bigtable
4 Distributed Computing: MapReduce
Local Networks
Design issues for distributed trees
Basic features of the DST
Balancing the tree with a rotation
The client caching mechanism
Example of an out-of-range request followed by an adjustment
Four replication strategies in a binary distributed tree
Overview of Bigtable structure
Persistence management in Bigtable
Centralized computing with distributed data storage
Distributed computing with distributed data storage
The programming model of MapReduce
Counting term occurrences
The map() function
mapCW(String key, String value):
    // key: document name
    // value: document contents
    for each term t in value:
        // Emit one pair per term; the loop continues after each emission
        emit (t, 1);
The reduce() function
reduceCW(String key, Iterator values):
    // key: a term
    // values: a list of counts
    int result = 0;
    // Loop on the values list; accumulate in result
    for each v in values:
        result += v;
    // Send the result
    return result;
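The map() and reduce() pseudocode for term counting can be tried out on a single machine. The following Python sketch is only an illustration under stated assumptions: the names map_cw, reduce_cw and run_local are hypothetical, terms are obtained by naive whitespace splitting, and the grouping step that a real MapReduce runtime performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_cw(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Emit (term, 1) for every term of a document.

    key is the document name, value its contents.
    Assumption: terms are separated by whitespace.
    """
    for term in value.split():
        yield (term, 1)


def reduce_cw(key: str, values: Iterable[int]) -> int:
    """Sum the partial counts collected for one term."""
    result = 0
    for v in values:
        result += v
    return result


def run_local(documents: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase: apply map_cw to every document and group values by term.
    for name, contents in documents.items():
        for term, count in map_cw(name, contents):
            groups[term].append(count)
    # Reduce phase: one call to reduce_cw per distinct term.
    return {term: reduce_cw(term, counts) for term, counts in groups.items()}


docs = {"d1": "the cat sat", "d2": "the cat ran"}
counts = run_local(docs)
print(counts)
```

Because map_cw is a generator, it keeps producing pairs for the whole document, which is the behaviour the "for each term" loop is meant to express.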
int main(int argc, char** argv) {
    // A specification object for MapReduce execution
    MapReduceSpecification spec;
    // Define input files
    MapReduceInput* input = spec.add_input();
    input->set_filepattern("documents.xml");
    input->set_mapper_class("MapWC");
    // Specify the output files:
    MapReduceOutput* out = spec.output();
    out->set_filebase("wc.txt");
    out->set_num_tasks(100);
    out->set_reducer_class("ReduceWC");
    // Now run it
    MapReduceResult result;
    if (!MapReduce(spec, &result)) abort();
    // Done: 'result' structure contains result info
    return 0;
}
Distributed execution of a MapReduce job.
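One step of the distributed execution that is easy to sketch is how intermediate (key, value) pairs are routed from map tasks to reduce tasks. A common scheme is to hash the key modulo the number of reduce tasks, so that all values for a given key reach the same reducer. The Python sketch below assumes this scheme; the function names partition and shuffle are hypothetical, and CRC32 is chosen here only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Choose a reduce task for a key (assumption: CRC32 hash modulo R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(
    pairs: Iterable[Tuple[str, int]], num_reducers: int
) -> List[Dict[str, List[int]]]:
    """Route each intermediate pair to the bucket of its reduce task."""
    buckets: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_reducers)
    ]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


intermediate = [("the", 1), ("cat", 1), ("the", 1), ("ran", 1)]
buckets = shuffle(intermediate, num_reducers=3)
for task_id, bucket in enumerate(buckets):
    print(task_id, dict(bucket))
```

Each bucket can then be handed to an independent reduce task, which is what allows the reduce phase to run in parallel.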
The end for today
Thank you