
Distributed Computing and Data Base Systems for the Big Data


Academic year: 2022



Parallelism

Master 1 International

Andrea G. B. Tettamanzi

Université de Nice Sophia Antipolis Département Informatique



Lecture 7 – Part a

Distributed Computing and Data Base Systems

for the Big Data



• NoSQL Databases

• Hadoop and MapReduce

• Apache Spark

• Document-based DBs: MongoDB

• Graph-based DBs

• RDF, triple stores, and SPARQL (→ MOOC)

• The CAP Theorem


• Apache Hadoop is a software library

• Framework for the distributed processing of large data sets

• Designed to

– scale up from single servers to 1000s of machines
– detect and handle failures at the application layer

• Four modules (components)

– Hadoop Common: file-system and OS-level abstractions
– Hadoop Distributed File System (HDFS)

– Hadoop YARN: job scheduling and cluster resource management
– MapReduce: parallel processing of large data sets

• Related projects: Cassandra, HBase, Spark, and several others


Hadoop Architecture

• Location awareness: datanodes report their rack (= switch) name
• Requires JRE 1.6 or higher




• Distributed, scalable, and portable file-system written in Java

• A Hadoop cluster has nominally a single namenode plus a cluster of datanodes, although redundancy options are available

• Each datanode serves up blocks of data over the network using a block protocol specific to HDFS.

• TCP/IP sockets used for communication

• Clients use RPC to communicate with the namenode and the datanodes

• Large files stored across multiple machines

• Reliability achieved through replication (cf. Part b of this lecture)
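To make the replication idea concrete, here is a toy Python sketch (the `place_blocks` helper and datanode names are hypothetical, and this round-robin policy is far simpler than HDFS's actual rack-aware placement) that spreads each block of a file over three distinct datanodes:

```python
import itertools

def place_blocks(num_blocks, datanodes, replication=3):
    """Toy block placement: each block is assigned to `replication`
    distinct datanodes, chosen round-robin. Real HDFS placement is
    rack-aware (e.g., one replica local, two on a remote rack)."""
    placement = {}
    ring = itertools.cycle(range(len(datanodes)))
    for block in range(num_blocks):
        start = next(ring)
        placement[block] = [datanodes[(start + i) % len(datanodes)]
                            for i in range(replication)]
    return placement

placement = place_blocks(4, ["dn1", "dn2", "dn3", "dn4"])
# each of the 4 blocks is stored on 3 distinct datanodes
```

With 3 replicas, any single datanode failure leaves at least two copies of every block available.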


MapReduce (1)

• A programming model for processing parallelizable problems on huge datasets using a large number of computers (nodes)

• Inspired by map and reduce functions in functional programming

• Originally a proprietary Google technology

• MapReduce libraries have been written in many programming languages

• A popular open-source implementation is part of Apache Hadoop

• Key contributions of the MapReduce framework are
– Scalability (optimized distributed shuffle operation)
– Fault tolerance


MapReduce (2)

Three-Step Parallel and Distributed Computation:

"Map" operation: Each worker node applies the map(key, value) function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.

"Shuffle" operation: Worker nodes redistribute data based on the output keys (produced by the map() function), such that all data belonging to one key is located on the same worker node.

"Reduce" operation: Worker nodes apply the reduce(key, list of values) function to each group of output data, per key, in parallel.


MapReduce: Example

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)
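The word-count pseudocode above can be run as plain Python; this minimal sketch (the function names are mine, not Hadoop's API) simulates the map, shuffle, and reduce phases sequentially on a single machine:

```python
from collections import defaultdict

def map_fn(name, document):
    # emit (word, 1) for every word in the document
    for word in document.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # group all emitted values by key, as the framework does across nodes
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, partial_counts):
    # sum the partial counts for one word
    return (word, sum(partial_counts))

docs = {"d1": "big data big", "d2": "data"}
mapped = [pair for name, doc in docs.items() for pair in map_fn(name, doc)]
counts = dict(reduce_fn(w, pcs) for w, pcs in shuffle(mapped).items())
# counts == {"big": 2, "data": 2}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the logic per key is exactly the same.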


Apache Spark

• The Spark project was started by Matei Zaharia at UC Berkeley to provide programming tools for big data volumes that are as easy to use and versatile as those for single machines

• Main motivation: overcome limitations in the MapReduce model

• It has language-integrated APIs in Python, Java, R, and Scala

• Supports streaming, batch, and interactive computations

• Open source (donated to the Apache Software Foundation)

• Implicit data parallelism and fault-tolerance

• Spark requires:

– a cluster manager (e.g., native Spark or Apache YARN)
– a distributed storage system (e.g., HDFS, Cassandra, …)
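To give a feel for the programming model, here is a plain-Python sketch of Spark-style chained transformations, using lazy generators as a stand-in for RDDs (no Spark needed; in real PySpark the same pipeline would read `sc.parallelize(...).map(...).filter(...).reduce(...)`):

```python
from functools import reduce

# Transformations are lazy: nothing is computed until the final
# "action" (here, reduce) pulls data through the pipeline.
data = range(1, 11)                            # a "parallelized" collection
squares = (x * x for x in data)                # map
evens = (x for x in squares if x % 2 == 0)     # filter
total = reduce(lambda a, b: a + b, evens)      # action triggers evaluation
# total == 4 + 16 + 36 + 64 + 100 == 220
```

Lazy evaluation is one of the ways Spark overcomes MapReduce's limitations: a whole chain of transformations can be optimized and kept in memory instead of being materialized to disk between every step.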


NoSQL Databases

• Storage and retrieval of data modeled in means other than the tabular relations used in relational databases

• Non-relational databases have existed since the 1960s

• However, real NoSQL (“not only SQL”) databases appeared as a solution for the needs of Web 2.0 companies such as Google, Facebook, and Amazon.

• Increasingly used in big data and real-time web applications

• Motivations include:

– simplicity of design

– simpler horizontal scaling to clusters of machines
– finer control over availability


Document-Oriented Databases

• Document Stores

• Documents encapsulate and encode data (or information) in some standard formats or encodings:

– XML
– JSON
– etc.

• Documents contain both the data and the metadata (think of an XML document!)

• One of the most popular document-oriented databases is nowadays MongoDB



• MongoDB is a scalable, high-performance, open-source, schema- free, document-oriented DB based on BSON (= binary JSON)

• Documents are organized in collections (~ DB tables)

• Performance: no joins, no complex transactions

• Scalability: sharding (= horizontal data distribution)

• Rich JavaScript-based query syntax
– Allows deep nested queries:

db.order.find( { shipping : { carrier : "DHL" } } )

– Documents are just JSON objects that Mongo stores in binary
– Queries return a cursor (= a result set iterator): more efficient
– Cursor methods: hasNext(), forEach()
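As a sketch of what nested matching involves, the following hypothetical Python `matches` function treats nested query documents as partial matches. (Note this is closer to MongoDB's dotted `"shipping.carrier"` form; Mongo's literal `{ shipping: { carrier: ... } }` query actually requires exact sub-document equality.)

```python
def matches(doc, query):
    """Minimal sketch of matching a (possibly nested) query document
    against a JSON-like document. Hypothetical helper, not Mongo code:
    nested dicts are matched field-by-field, other values by equality."""
    for key, expected in query.items():
        actual = doc.get(key)
        if isinstance(expected, dict) and isinstance(actual, dict):
            if not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True

orders = [
    {"_id": 1, "shipping": {"carrier": "DHL", "weight": 2}},
    {"_id": 2, "shipping": {"carrier": "UPS"}},
]
hits = [o for o in orders if matches(o, {"shipping": {"carrier": "DHL"}})]
# hits contains only the order shipped via DHL
```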


MongoDB's Features

• Capped collections: fixed-size, high-performance, auto-LRU age-out
– Fixed insertion order

– Super fast, ideal for logging and caching
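The behaviour can be sketched in a few lines of Python: a `collections.deque` with `maxlen` preserves insertion order and silently ages out the oldest entries, much like a capped collection (a loose analogy, not MongoDB code):

```python
from collections import deque

# A "capped collection" of at most 3 documents: once full,
# each insertion evicts the oldest entry, preserving order.
log = deque(maxlen=3)
for i in range(5):
    log.append({"event": i})
# only the 3 newest entries remain, in insertion order
```

This is exactly the property that makes capped collections attractive for logs and caches: writes are append-only and old data disappears without explicit cleanup.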

• Sharding: a method for storing data across multiple machines
– Horizontal vs. vertical scaling:

• Vertical: add more CPUs and RAM/disks

• Horizontal: divide the dataset and distribute over multiple servers (shards).

– Each shard is an independent DB

– Collectively, the shards make up a single logical DB

– MongoDB supports sharding through the configuration of a sharded cluster
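A toy sketch of hashed shard-key routing (the `shard_for` helper is hypothetical; MongoDB actually partitions ranges of hashed or ranged shard-key values among the shards):

```python
import hashlib

def shard_for(shard_key, num_shards):
    """Hypothetical hashed-sharding routing: hash the shard key
    and map it deterministically onto one of `num_shards` shards."""
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % num_shards

docs = [{"user_id": u} for u in ("alice", "bob", "carol")]
placement = {d["user_id"]: shard_for(d["user_id"], 4) for d in docs}
# every document is routed deterministically to one of 4 shards
```

The shard key determines the distribution: a well-chosen key spreads documents (and load) evenly; a poor one funnels traffic to a single shard.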


MongoDB Sharded Cluster

[Figure: a sharded cluster — query routers direct operations to the shards, the cluster's metadata is stored separately, and shard keys are used to assign documents to shards.]


Apache Cassandra

• An open-source distributed DB management system
– handles large amounts of data across many servers
– provides high availability with no single point of failure
– asynchronous masterless replication

• Initially developed at Facebook by Avinash Lakshman

• Based on DHTs

• Borrows many elements from Amazon's Dynamo

• Hybrid data model: key-value, column-oriented, row-partitioned

• Cassandra Query Language (CQL)
– Syntax similar to SQL
– Alternative to the RPC-based interface
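Since Cassandra is based on DHTs, a minimal consistent-hashing sketch conveys the core idea (the `Ring` class and node names are hypothetical; Cassandra's real partitioners and virtual nodes are considerably more elaborate):

```python
import bisect
import hashlib

def token(key):
    # hash a key (or node name) to a position on the ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hashing ring in the spirit of Dynamo/Cassandra:
    a row key is owned by the first node clockwise from its token."""
    def __init__(self, nodes):
        self.tokens = sorted((token(n), n) for n in nodes)

    def coordinator(self, key):
        i = bisect.bisect(self.tokens, (token(key),))
        return self.tokens[i % len(self.tokens)][1]  # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.coordinator("some-row-key")
```

The payoff of this scheme is that adding or removing a node moves only the keys on the adjacent arc of the ring, rather than rehashing everything, which is what makes masterless scaling practical.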


Graph-Oriented Databases

• A graph database is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data

• Most graph databases store their data in a key-value store or document-oriented database

• The two most popular graph databases are OpenLink Virtuoso and AllegroGraph

• One particularly influential graph model used by these databases is RDF (the resource description framework)



• The data model of the Semantic Web

– Published as a W3C recommendation in 1999
– Initially a model for metadata

– Now used to represent all sorts of data and knowledge

• Resource Description Framework

– Resource: anything that can have a URI; a node of the graph
– Description: attributes, features, and relations of the resources
– Framework: model, languages, and their syntax and semantics

• Based on triples of the form (subj pred obj)

– Subj and obj are resources (= nodes); obj can be a literal
– Pred is an arc label
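The triple model is easy to picture in code: in this sketch an RDF graph is just a Python set of (subject, predicate, object) tuples (the `voc:` predicate names follow the examples later in this lecture; this is an illustration, not a real triple store):

```python
# Two resources and three triples, mirroring the running example.
ORG = "http://example.org/org/1"
PERSON = "http://example.org/person/1"

graph = {
    (ORG, "voc:hasLegalName", "Acme Ltd"),       # object is a literal
    (ORG, "voc:hasCEO", PERSON),                 # object is a resource
    (PERSON, "voc:DOB", "January 1, 1970"),      # object is a literal
}

# Following an arc: who is the CEO of the organization?
ceo = next(o for s, p, o in graph if s == ORG and p == "voc:hasCEO")
```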


RDF Data Model

RDF graph = collection of RDF triples

[Figure: a small RDF graph — http://example.org/org/1 has legal name "Acme Ltd" and is linked by hasCEO to http://example.org/person/1, whose DOB is January 1, 1970.]



RDF Syntax


<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:voc="http://example.org/vocabulary/">
  <voc:RegisteredOrganization rdf:about="http://example.org/org/1">
    <voc:hasLegalName>Acme Ltd</voc:hasLegalName>
    <voc:hasCEO rdf:resource="http://example.org/person/1"/>
  </voc:RegisteredOrganization>
  <voc:Person rdf:about="http://example.org/person/1">
    <voc:DOB>January 1, 1970</voc:DOB>
  </voc:Person>
</rdf:RDF>


• Turtle

@prefix voc: <http://example.org/vocabulary/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://example.org/org/1> rdf:type voc:RegisteredOrganization ;
    voc:hasLegalName "Acme Ltd" ;
    voc:hasCEO <http://example.org/person/1> .

<http://example.org/person/1> rdf:type voc:Person ;
    voc:DOB "January 1, 1970" .



• SPARQL = SPARQL Protocol And RDF Query Language

• SPARQL is the query language for RDF

– Based on the RDF data model (triples/graph)
– Main idea: pattern matching
– Declarative: describe subgraphs of the queried RDF graph
– Graph patterns (= RDF graphs with variables)

• SPARQL is a Protocol

– Transmission of SPARQL queries and results

– SPARQL Endpoint: a Web service implementing the protocol

• W3C standard since January 2008


SPARQL Query Types

• SELECT ?x ?y … WHERE graph-pattern(?x, ?y, …)

– Return a table of all x, y, … matching the description of the graph contained in the graph pattern.

• CONSTRUCT {?s ?p ?o} WHERE graph-pattern(?s, ?p, ?o)
– Find all s, p, o satisfying the given conditions and substitute them into the given triple templates to construct a new RDF graph

• DESCRIBE ?x WHERE graph-pattern(?x)

– Find all declarations providing information about the given resource(s) matching the given graph conditions

• ASK WHERE graph-pattern(?x, ?y, …)
– Return a Boolean: true if the graph pattern has at least one match, false otherwise


SPARQL: Example

• SELECT ?companyName WHERE {
    ?company voc:hasLegalName ?companyName .
    ?company voc:hasCEO ?ceo .
    ?ceo voc:DOB "January 1, 1970" . }

• Returns:


"Acme Ltd"
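The pattern-matching idea behind this query can be sketched in pure Python: treat `?`-prefixed strings as variables and join the three triple patterns over a tiny in-memory graph (hypothetical helper functions, not a real SPARQL engine):

```python
graph = [
    ("org1", "hasLegalName", "Acme Ltd"),
    ("org1", "hasCEO", "person1"),
    ("person1", "DOB", "January 1, 1970"),
]

def match(pattern, binding, graph):
    """Match one triple pattern against every triple in the graph,
    extending `binding` (a dict: variable -> value) where consistent."""
    results = []
    for triple in graph:
        b = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if b.get(term, value) != value:  # conflicting binding
                    ok = False
                    break
                b[term] = value
            elif term != value:                  # constant mismatch
                ok = False
                break
        if ok:
            results.append(b)
    return results

def select(patterns, graph):
    # join the patterns: each one filters and extends the bindings
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b, graph)]
    return bindings

query = [("?company", "hasLegalName", "?name"),
         ("?company", "hasCEO", "?ceo"),
         ("?ceo", "DOB", "January 1, 1970")]
answers = [b["?name"] for b in select(query, graph)]
# answers == ["Acme Ltd"]
```

Each pattern narrows the set of candidate bindings, which is exactly how the declarative WHERE clause selects subgraphs of the queried RDF graph.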


Semantic Web and Linked Data

• To learn more about RDF, SPARQL, and other Semantic Web related technologies...

• If you can understand French:

– MOOC “Web sémantique et Web de données”

– www.fun-mooc.fr/courses/inria/41002S02/session02/about

• Otherwise:

– MOOC “Knowledge Engineering with Semantic Web Technologies”, 2015 edition

– https://open.hpi.de/courses/semanticweb2015


The CAP Theorem

• Probably the most-cited distributed-systems theorem these days

• Stated by Eric Brewer, 2000, proved by Gilbert and Lynch, 2002

• Relates the following 3 properties
– C: Consistency

• Cf. Part b of this lecture
– A: Availability

• Every client’s request is served (receives a response) unless that client fails (despite a strict subset of server nodes failing)

– P: Partition-tolerance

• System functions properly even if the network is allowed to lose arbitrarily many messages between nodes (i.e., to partition)


CAP Theorem




C, A, P: pick two!


CAP Theorem: an Illustration


[Figure: two replicas of a shopping cart during a network partition — "Add item" succeeds against one replica, but "Check out" against the other still sees the old cart contents.]



The CAP Theorem in Practice

• In practical distributed systems
– Partitions may occur

– This is not under your control (as a system designer)

• Designer’s choice

– You choose whether you want your system in C or A when/if (temporary) partitions occur

– Note: You may choose neither C nor A, but this is not a very smart option

• Bottom line

– Practical distributed systems are either in CP (e.g., MongoDB, RDBMSs) or in AP (e.g., Cassandra)
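The designer's choice can be caricatured in a few lines of Python (a deliberately simplistic sketch; the `Replica` class and `write` function are hypothetical): during a partition, a CP system refuses the write, while an AP system accepts it and lets replicas diverge:

```python
class Replica:
    def __init__(self, value=0):
        self.value = value

def write(replicas, reachable, value, mode):
    """During a partition only `reachable` replicas can be contacted.
    CP: refuse the write unless every replica is reachable
        (consistency kept, availability sacrificed).
    AP: apply the write to the reachable replicas only
        (request served, but replicas diverge)."""
    if mode == "CP" and len(reachable) < len(replicas):
        return False
    for r in reachable:
        r.value = value
    return True

replicas = [Replica(), Replica()]
partitioned = replicas[:1]   # the partition hides the second replica

ok_cp = write(replicas, partitioned, 1, "CP")  # refused: not available
ok_ap = write(replicas, partitioned, 1, "AP")  # accepted: inconsistent
```

Once the partition heals, an AP system must reconcile the divergent replicas (e.g., via the asynchronous replication mentioned for Cassandra above), which is the price paid for staying available.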


Thank you for your attention!

