• Aucun résultat trouvé

IbmdbPy: Accelerating Python Analytics by In-Database Processing

N/A
N/A
Protected

Academic year: 2022

Partager "IbmdbPy: Accelerating Python Analytics by In-Database Processing"

Copied!
1
0
0

Texte intégral

(1)

IbmdbPy: Accelerating Python Analytics by In-Database Processing

Edouard Fouché and Michael Wurst IBM Deutschland Research & Development GmbH

Abstract. The Python programming language is becoming widely used in data science and machine learning. Thus, Python ecosystem is very rich and provides intuitive tools for data analysis. However, most Python libraries require the data to be extracted from the database to working memory and ressources are limited by computational power and memory.

Analyzing a large amount of data is often impractical or even impossible.

IbmdbPy is an open-source python package, developed by IBM, which provides a Python interface for data manipulation and machine learn- ing algorithms such as Kmeans or Linear Regression to make working with databases more efficient by seamlessly pushing operations written in Python into the underlying database for execution. This does not only lift the memory limit of Python, but also allows users to profit from performance-enhancing features of the underlying database management system. IbmdbPy is designed for IBM dashDB, a database system avail- able on IBM BlueMix, the IBM cloud application development and ana- lytics platform. Via remote connection, user operations can benefit from dashDB specific features, such as columnar technology and parallel pro- cessing, without having to interact with the database explicitly. Some in-database functions additionally use lazy loading to load only parts of the data that are actually required to further increase efficiency. Keeping the data in the database also avoids security issues that are associated with extracting data and ensures that the data that is being analyzed is as current as possible. IbmdbPy can be used by Python developers with very little additional knowledge, since it imitates the well-known inter- face of Pandas library for data manipulation and Scikit-learn library for machine learning algorithms. The project is still at an early stage, but several experiments have already been conducted to measure the advantage of using IbmdbPy over the corresponding in-memory imple- mentation. The results show that it provides a great runtime advantage for operations on medium to large dataset, i.e. on tables that have 1 mil- lion rows or more. The project aims to extend the BlueMix ecosystem, by providing a Python interface for dashDB, bridging the gap between the analytics platform and end-user environment, so that developers can benefit both from the expressivity of Python and from the speed-up pro- vided by SQL execution in dashDB, which can be run on a cluster.

Copyright © 2015 by the paper’s authors. Copying permitted only for private and academic purposes. In: R. Bergmann, S. Görg, G. Müller (Eds.): Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB. Trier, Germany, 7.-9.

October 2015, published at http://ceur-ws.org

77

Références

Documents relatifs

After adjustment for age, smoking status, – – pack-years, pack-years , energy intake, race/ethnicity, US region, body mass index and physical activity, the consumption of cured meats

The three cases presented enclose three different paths in which minority communities have developed a presence on the cyberspace; for instance: (1) sites in which

Thanks to Python and its modular approach, this framework makes it possible to integrate a variety of tools defined in different modeling context, in particular tools from

-La deuxième analyse se donne pour objectif d’évaluer l’effet du champ lexical dans les textes d’aide sur le nombre et le type (base de texte (T1) vs modèle de situation (T2)) des

Relationship between QTL for grain shape, grain weight, test weight, milling yield, and plant height in the spring wheat cross RL4452/‘AC Domain’.. Cabral, Adrian L.; Jordan, Mark

Figures 11 and 12 represent the user code needed to get the parallel version of the flow direction calculation above-cited.. The main function is in charge of the initialization of

Figure 2 shows an example with magnetic field as color plot and norm of the core flow for the stream lines, for a re-analysis of VO and GO data using a diagonal AR-1 model. The plot

Finally, the package implements one triplet learner and one quadruplet learner: Sparse Compositional Metric Learning (SCML, Shi et al., 2014) and Metric Learning from