• Aucun résultat trouvé

Exploiting Latent Information in Databases via Database Embedding: technology, applications, ethics

N/A
N/A
Protected

Academic year: 2022

Partager "Exploiting Latent Information in Databases via Database Embedding: technology, applications, ethics"

Copied!
1
0
0

Texte intégral

(1)

Exploiting Latent Information in Databases via Database Embedding:

technology, applications, ethics (Invited Talk)

Oded Shmueli

Technion – Israel Institute of Technology Haifa, Israel

oshmu@cs.technion.ac.il We are witnessing the emergence of AI-powered database sys-

tems, embedding AI-ideas and techniques in query processors, concurrency controllers, and more. We aim at improving rela- tional querying, as well as other functionalities, by introducing another layer of data, word vectors, into traditional database systems. Word vectors originate in Natural Language Processing (NLP) where they are used to represent words in a language. In NLP, there are a number of methods for obtaining word vectors from text, we use a variation of one of these methods, word2vec.

The idea in a nutshell is as follows: we produce text from a relation (or a view thereof) and then use this text to generate a model, i.e., a set of vectors, for all terms in the database. Once the model is available, we can formulate Cognitive Intelligence (CI) queries. These queries may be realized by SQL queries, enhanced by User Defined Functions (UDFs) that take advantage of the model to formulate conditions that were previously practically not expressible in SQL.

The process of vector construction is different than in NLP. It reflects the characteristics of relations, with integrity constraints and named columns which contain various data types, strings, dates, numeric values, images and more. We call this process db2vec. There are a number of options for model generation:

based on the textification of a single or multiple relations, incor- porating external text sources (e.g., Wikipedia), incorporating externally produced models, standalone, or as building material for constructing a local model.

There are many application areas that may benefit from our approach: Commerce, Finance, HR, Science, and more. Whereas there are generic UDFs, some application areas require develop- ing specialized UDFs. One example is a food database application in which a record has a list of ingredients, in decreasing order of importance.

A model reflects the textual and vector sources used to produce it. As decisions may be based on queries using the model, the production of models brings to the forefront issues of fairness and ethics. An important issue is the specifics of the data and text sources, their weighting in producing the model, and whether they are biased in some way.

© 2020 Copyright held by the owner/author(s). Published in the Workshop Pro- ceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Cre- ative Commons license CC BY 4.0.

Limiting information disclosure is also an important consider- ation. Especially within an organization, there is a need to share information. However, it would be desirable that this sharing enable productive work while hiding information that is not essential for that work. To this end, we introduce degrees of disclosure. Here, some information in the database is encrypted, some is simply not supplied, while additional information is in- tentionally supplied in the form of a model.

Références

Documents relatifs

Location vignettes also contain the spatial arrangement of objects. In ad- dition to the extracted relations from the description corpus, we also designed a series of AMT tasks

Operators in the query plan tree leaves corresponding to keywords in the query will simply read an inverted file for this word and return tuples consisting of

Furthermore we performed a keyword-analysis extracting top-terms in the categories blood pres- sure (21 terms), food & nutrition (20 terms), weight (16 terms), medication (7

Procedure 2 First pass of identifying key phrases: indexesIntersection() Require: Indexes of titles of Web pages (Index web ) & Wikipedia articles (Index wiki ),.. Wikipedia

Referring to the U-OWL case, an error discovered by semantic tests is the following: the input model is composed by a class diagram containing a hierarchy of classes each

Eine neue Lieblingsspeise habe ich in Deutschland auch kennengelernt: Leberknödel.. Jetzt koche ich für meinen Mann und vor

Print list of files in a directory on the remote machine to local- file (or to the screen if local-file is not specified).. If remote- directory is unspecified, the current

Typical M2T tools such as Acceleo or Xtend provide a forward escape to embed a control expression within literal text.. The name