• Aucun résultat trouvé

STREAMLINE - Streamlined Analysis of Data at Rest and Data in Motion

N/A
N/A
Protected

Academic year: 2022

Partager "STREAMLINE - Streamlined Analysis of Data at Rest and Data in Motion"

Copied!
1
0
0

Texte intégral

(1)

STREAMLINE -

Streamlined Analysis of Data at Rest and Data in Motion

Philipp M. Grulich

1

, Tilmann Rabl

1,2

, Volker Markl

1,2

, Csaba Sidló

3

, Andras Benczur

3

1 German Research Center for Artificial Intelligence (DFKI),2TU Berlin,3Hungarian Academy of Sciences (MTA SZTAKI)

1[email protected],2[email protected], 3[email protected]

ABSTRACT

STREAMLINE aims for improving the overall workflow of big data analytics systems. For this goal, it combines re- search in different areas to reduce the complexity of the work withdata at restanddata in motionin a unified fashion. As a foundation STREAMLINE offers a uniform programming model on top of Apache Flink, for which it drives innovations in a wide range of areas, such as interactivedata in motion visualization and advanced window aggregation techniques.

1. PROJECT SUMMARY

The STREAMLINE project aims to improve the workflow and usability of current big data analysis systems. Therefore it provides a uniform system, which is able to handle the analysis of big data at rest as well as fast data in motion.

With this platform, STREAMLINE enables a reduction of complexity, costs, and latency.

Traditionally batch- and stream-processing were consid- ered as two very different types of applications, but in the last years, it has been shown that the most real-world use- cases required systems for both workloads. This forces com- panies to integrate different specialized systems, which leads not only to complex system architecturesand introduces main- tenance overhead, it also introduces a high latency to the general data analysis workflow. This is also known as the problem of system and human latency in big data analysis.

Even technologies that are able to combinedata in motion anddata at rest are currently very complex and difficult to deploy, maintain and use. Beside this many companies have a demand for much more advanced analyses, which are still hard to implement in current systems.

To reduce this complexity STREAMLINE combines re- search and innovations in the areas of distributed systems, data management, and machine learning. Whereby STREAM- LINE’s key goal is to arrive at sustainable innovation by technology transfer to an established and growing open source project. STREAMLINE focuses on innovations in the area of the following four reactive and proactive applications:

2017, Copyright is with the authors. Published in Proc. 20th Inter-c national Conference on Extending Database Technology (EDBT), March 21-24, 2017 - Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceed- ings.org. Distribution of this paper is permitted under the terms of the Cre- ative Commons license CC-by-nc-nd 4.0

customer retention, personalized recommendations, target advertisement and multilingual Web processing. To inte- grate the innovations into the industry, STREAMLINE part- ners with multiple companies.

As its system foundation STREAMLINE relies on the open source data processing system Apache Flink, which is able to handle batch and stream processing on a single pipelined execution engine [1]. On top of this STREAM- LINE offers a single uniform programming model that can automatically be optimized, parallelized, and adopted to the system load, data distribution, and architecture.

Research Highlights:Cutty [2] introduces a general ag- gregation sharing framework for streaming windows, which outperforms previous solutions in order of magnitudes. This technique utilizes the fact that window aggregations are one of the most redundancy-prone operations in current stream processing. Cutty is also suitable for multi query aggregation sharing and non-periodic windows, such as ses- sion window, which can be used for more complex busi- ness logic. Based on this technique STREAMLINE enables higher throughput and improves the efficiency of its data processing platform.

I2[3] in contrast, focuses on the visualization and inter- active aggregation ofdata in motion, which is a key enabler for fast and efficient real-time data analysis. It contributes an interactive development environment, which coordinates the cluster application and includes interactive stream visu- alization techniques. With this I2is able to handle advanced and adaptive aggregations directly on the cluster. As one ex- ample we provide an aggregation algorithm for timer-series data, which reduces the amount of data in a data-rate in- dependent manner and is proven to be correct and minimal in terms of transferred data. Therefore I2is an important part of STREAMLINE, because it enhances the usability and accessibility of its platform.

2. ACKNOWLEDGEMENTS

This work was supported by the EU Horizon 2020 project Streamline (688191).

3. REFERENCES

[1] Carbone et al. Apache flink: Stream and batch processing in a single engine.IEEE Data Eng. Bull., 38(4), 2015.

[2] Carbone et al. Cutty: Aggregate sharing for

user-defined windows. InCIKM, pages 1201–1210, 2016.

[3] Traub et al. I2: Interactive real-time visualization for streaming data. InEDBT, EDBT, 2017.

Références

Documents relatifs

Motion estimation on ocean satellite images by data assimilation in a wavelets reduced modelE. Etienne Huot, Giuseppe Papari, Isabelle Herlin,

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

To determine the critical angle g at which a stationary ruck starts rolling, we tilt the plane smoothly until the ruck moves, wherein it quickly reaches a steady speed V.. g is

one of the newest and fastest time series databases that received much attention lately. This category of databases is heavily used for data mining and forecasting [17], [18].

Attempts to curb the spread of coronavirus by introducing strict quarantine measures apparently have different effect in different countries: while the number of new cases

Read carefully the text and answer the different questions.. What was going on in

To compare with a time series database, namely In- fluxDB, we use the InfluxDB benchmark [? ]. The second test creates 250, 000 nodes and inserts 1, 000 values in each of the

1) Storing Graphs as Time Series: To build this candi- date solution we leveraged a plain graph whose values are versioned in I NFLUX DB [14] (version 1.1.2). InfluxDB is one of