Distributed Indexing and Computing
Web data management and distribution
Serge Abiteboul Philippe Rigaux Marie-Christine Rousset Pierre Senellart
http://webdam.inria.fr/textbook
March 19, 2010
Outline
1 Introduction to Distributed Systems
2 Distributed Search Trees
3 A Case Study: Bigtable
4 Distributed Computing: MapReduce
Gemo, Lamsade, LIG, Télécom (WebDam) Distributed Indexing and Computing March 19, 2010 2 / 23
Introduction to Distributed Systems
Local Networks
Replication protocols: distributed transactions
Main algorithm: two-phase commit (2PC).
Downside: 2PC is a blocking protocol. Each participant must wait for the coordinator's decision, so it stays blocked if the coordinator fails.
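The two phases can be sketched as follows. This is a minimal, single-process illustration, not a networked implementation: the class Participant, its can_commit flag and the function two_phase_commit are hypothetical names introduced here, and failures and timeouts are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True  # hypothetical flag: the vote this participant casts
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: cast a vote, then wait for the coordinator's decision.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def decide(self, commit: bool) -> None:
        # Phase 2: apply the decision received from the coordinator.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1 (voting): the coordinator collects one vote per participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (decision): commit only if every participant voted yes.
    for p in participants:
        p.decide(decision)
    return decision


ok = two_phase_commit([Participant("a"), Participant("b")])
ko = two_phase_commit([Participant("a"), Participant("b", can_commit=False)])
print(ok, ko)  # True False
```

A participant in state "ready" has no way to progress until decide() is called, which is where the blocking behavior comes from.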
Replication protocols: distributed recovery
Extension of standard techniques for recovery (left: centralized; right:
distributed).
If a node fails, the replicated log file can be used to recover the last
transactions on one of its mirrors.
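A rough sketch of recovery from a replicated log, under simplifying assumptions: log entries are plain (key, value) updates, replication to the mirrors is synchronous, and the names Node, replica_logs and recover are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}          # in-memory state, lost when the node fails
        self.log = []           # local log of (key, value) updates
        self.mirrors = []       # nodes that keep a copy of this node's log
        self.replica_logs = {}  # log copies received from other nodes, by name

    def write(self, key, value):
        entry = (key, value)
        # Log first, locally and on every mirror, then apply the update.
        self.log.append(entry)
        for mirror in self.mirrors:
            mirror.replica_logs.setdefault(self.name, []).append(entry)
        self.data[key] = value


def recover(failed_name, mirror):
    """Rebuild the failed node's state by replaying the log copy held by a mirror."""
    node = Node(failed_name)
    for key, value in mirror.replica_logs.get(failed_name, []):
        node.log.append((key, value))
        node.data[key] = value
    return node


n1, n2 = Node("n1"), Node("n2")
n1.mirrors = [n2]
n1.write("x", 1)
n1.write("x", 2)
n1.write("y", 3)
restored = recover("n1", n2)  # n1 has failed; n2 still holds its log
print(restored.data)          # {'x': 2, 'y': 3}
```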
Distributed Search Trees
Design issues for distributed trees
Basic features of the DST
Balancing the tree with a rotation
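As a reminder of what a rotation does, here is a sketch of a right rotation on an ordinary in-memory binary search tree. It assumes a purely local tree; in a distributed tree each node would sit on a server and the pointer updates would be messages between servers. TreeNode, height and rotate_right are illustrative names.

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def in_order(node):
    return [] if node is None else in_order(node.left) + [node.key] + in_order(node.right)


def rotate_right(y):
    # The left child x becomes the new root of the subtree;
    # x's former right subtree is re-attached as y's left subtree.
    x = y.left
    y.left = x.right
    x.right = y
    return x


# A degenerate tree 3 -> 2 -> 1 (all left children) has height 3.
root = TreeNode(3, left=TreeNode(2, left=TreeNode(1)))
balanced = rotate_right(root)
print(balanced.key, height(balanced))  # 2 2
```

The rotation changes the shape of the tree but not the in-order sequence of keys, so searches remain correct.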
The client caching mechanism
Example of an out-of-range request followed by an
adjustment
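The interplay between a cached image and an out-of-range request can be sketched as follows. This is a strong simplification: servers own flat key intervals, the Cluster class stands in for the routing done by the tree, and all names (Server, Cluster, Client, forwards) are hypothetical.

```python
class Server:
    def __init__(self, name, low, high):
        self.name, self.low, self.high = name, low, high  # owns keys in [low, high)
        self.store = {}

    def covers(self, key):
        return self.low <= key < self.high


class Cluster:
    """Stand-in for tree routing: finds the server whose range covers a key."""

    def __init__(self, servers):
        self.servers = servers

    def route(self, key):
        return next(s for s in self.servers if s.covers(key))


class Client:
    def __init__(self, cluster, image):
        self.cluster = cluster
        self.image = image  # cached (low, high, server) entries; may be outdated
        self.forwards = 0   # number of out-of-range requests so far

    def lookup(self, key):
        # Pick the server according to the (possibly stale) local image.
        entry = next(e for e in self.image if e[0] <= key < e[1])
        target = entry[2]
        if not target.covers(key):
            # Out-of-range request: it is forwarded to the right server,
            # and the client receives an adjustment of its image.
            self.forwards += 1
            actual = self.cluster.route(key)
            self.image.remove(entry)
            self.image.append((target.low, target.high, target))
            self.image.append((actual.low, actual.high, actual))
            target = actual
        return target.store.get(key)


# Server A used to own [0, 100); it has since been split into A and B.
a, b = Server("A", 0, 50), Server("B", 50, 100)
b.store[70] = "v70"
client = Client(Cluster([a, b]), image=[(0, 100, a)])
first = client.lookup(70)   # image is stale: one forward, then adjustment
second = client.lookup(70)  # image is now accurate: direct access
print(first, second, client.forwards)  # v70 v70 1
```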
Four replication strategies in a binary distributed tree
A Case Study: Bigtable
Overview of Bigtable structure
Persistence management in Bigtable
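A toy sketch of log-plus-memtable persistence in the style of Bigtable: updates go to a redo log and an in-memory table, which is flushed to an immutable sorted file once it grows too large. The class Tablet, the memtable_limit parameter and the in-memory lists standing in for files are assumptions made for illustration only.

```python
class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # redo log of updates not yet flushed
        self.memtable = {}   # recent updates, kept in memory
        self.sstables = []   # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))  # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.log.clear()  # flushed updates no longer need the log

    def read(self, key):
        # The most recent value wins: memtable first, then newest table.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


t = Tablet(memtable_limit=2)
t.write("a", 1)
t.write("b", 2)  # reaches the limit: the memtable is flushed
t.write("a", 3)  # newer value for "a" stays in the memtable
print(t.read("a"), t.read("b"), len(t.sstables))  # 3 2 1
```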
Distributed Computing: MapReduce
Centralized computing with distributed data storage
Distributed computing with distributed data storage
The programming model of MapReduce
Counting term occurrences
The map() function
mapCW(String key, String value):
  // key: document name
  // value: document contents
  for each term t in value:
    // emit one (term, 1) pair per occurrence
    emit (t, 1);
The reduce() function.
reduceCW(String key, Iterator values):
  // key: a term
  // values: a list of counts
  int result = 0;
  // Loop on the values list; cumulate in result
  for each v in values:
    result += v;
  // Send the result
  return result;
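The two functions above can be tried out with a small single-process simulation. The sketch below assumes whitespace tokenization and an in-memory grouping step; run_mapreduce is a hypothetical helper introduced here, not part of any MapReduce library.

```python
from collections import defaultdict


def map_cw(key, value):
    # key: document name; value: document contents
    for term in value.split():
        yield (term, 1)


def reduce_cw(key, values):
    # key: a term; values: the list of counts collected for that term
    result = 0
    for v in values:
        result += v
    return result


def run_mapreduce(documents, mapper, reducer):
    # Group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in mapper(name, contents):
            groups[k].append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}


docs = {"d1": "a b a", "d2": "b c"}
counts = run_mapreduce(docs, map_cw, reduce_cw)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```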
int main(int argc, char** argv) {
  // A specification object for MapReduce execution
  MapReduceSpecification spec;

  // Define input files
  MapReduceInput* input = spec.add_input();
  input->set_filepattern("documents.xml");
  input->set_mapper_class("MapWC");

  // Specify the output files:
  MapReduceOutput* out = spec.output();
  out->set_filebase("wc.txt");
  out->set_num_tasks(100);
  out->set_reducer_class("ReduceWC");

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains result info
  return 0;
}
Distributed execution of a MapReduce job.
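One step of the distributed execution is the assignment of intermediate keys to reduce tasks. A common choice, assumed here, is hash partitioning: a key goes to task hash(key) mod R, so all pairs with the same key reach the same reducer. The functions partition and shuffle are illustrative names, and CRC32 is used only because it gives a hash that is stable across runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stable hash of the key, reduced to a task number in [0, num_reducers).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    # One bucket per reduce task; each bucket groups values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
buckets = shuffle(pairs, num_reducers=3)
for task, bucket in enumerate(buckets):
    print(task, dict(bucket))
```

Each key appears in exactly one bucket, so the reduce tasks can run independently on different machines.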