• Aucun résultat trouvé

TableCNN: deep learning framework for learning tabular data

N/A
N/A
Protected

Academic year: 2022

Partager "TableCNN: deep learning framework for learning tabular data"

Copied!
2
0
0

Texte intégral

(1)

TableCNN: Deep Learning Framework for Learning Tabular Data

?

Pranav Sankhe, Elham Khabiri, Bhavna Agrawal, and Yingjie Li

1 University at Buffalo, Buffalo, NY

2 IBM Research, Yorktown Heights, USA pranavgi@buffalo.edu

{khabiri,bhavna,yingjie}@us.ibm.com

Abstract. Databases and tabular data are among the most common and rapidly growing resources. But many of these are poorly annotated (lack sufficient metadata), and are filled with domain specific jargon and alpha-numeric codes. Because of the domain specific jargon, no pre- trained language model could be applied readily to encode the cell con- tent. We propose a deep learning based framework, TableCNN, that en- codes the semantics of the surrounding cells to predict the meaning of the columns. We propose application of Byte Pair Encoding (BPE)[5]

to create tokens for each cell and treat each cell as a phrase of existing tokens. Once tokenized, we process it with a CNN network to develop a classifier.

1 Introduction

Tables are rich in data and can provide vital information about the object due to the virtue of its structure. Extracting useful insights from tabular data may re- quire domain expertise, especially if the information is comprised of domain spe- cific jargon’s or alpha-numeric codes. Our method provides a supervised learning solution to classify an unknown column in such tables into predefined column classes which can also come from a knowledge graph. Existing methods[2, 3, 1]

cannot accommodate such data.

2 Methodology and Results

Cell entries are tokenized using Byte-Pair Encoding with a stopping condition defined by the token frequency threshold. Cell embedding is generated using Word2Vec[4]; each row across the tokenized table is treated as a sentence for Word2Vec model learning. We extract micro tables from the table, with a target column and surrounding columns having set number of rows; which are model parameters. Micro tables are then processed through TableCNN to classify which

?Supported by IBM Research

Copyright c2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0).

(2)

2 Pranav Sankhe, Elham Khabiri, Bhavna Agrawal, and Yingjie Li

class it belongs to. Class is defined as the header of a column in the table used for training.

Network: TableCNN Fig 1 is an abstract view of the TableCNN network ar- chitecture. We extract column and row features in separate networks and com- bine them to regress final output. Row and column features are computed by a convolution operation over the first row and the target column of micro table respectively. Outputs from row and column features are concatenated and fed to a fully connected layer with a SoftMax layer to make final prediction.

For our experiment, we use a manufacturing database table that contains 112 columns and 115 thousand rows. Fig 2 shows that we obtain correct column predictions, except for column 49. Upon further inspection, we found that most of the cells in column 49 were empty, and thus the loss of classification accuracy.

This shows that network is able to learn features from the surrounding columns along with the target column entries.

Fig. 1: TableCNN Architecture.

Fig. 2: Confusion Matrix.

3 Conclusions

In this poster, we present a supervised learning framework that can classify columns of a table with arbitrary alpha-numeric data. The arbitrary alpha- numeric nature of data prevents us from using pre-trained language models.

References

1. Chen, J., Jim´enez-Ruiz, E., Horrocks, I., Sutton, C.A.: Learning semantic annota- tions for tabular data. CoRRabs/1906.00781(2019)

2. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching web tables with knowledge base entities: From entity lookups to entity embeddings.

In: International Semantic Web Conference (2017)

3. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. PVLDB3, 1338–1347 (09 2010)

4. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word repre- sentations in vector space. CoRRabs/1301.3781(2013)

5. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. ArXiv abs/1508.07909(2016)

Références

Documents relatifs

Doing so, we address our second research question, (How can we characterize the managerial approach (and its associated principles in terms of investment decision, economic

“Extracting gamma-ray information from images with convolutional neural network methods on simulated cherenkov telescope array data,” in IAPR Workshop on Artificial Neural Networks

In this work, we employ a transductive label propagation method that is based on the manifold assumption to make predictions on the entire dataset and use these predictions to

In Thrun, S., Saul, L., and Sch¨olkopf, B., editors, Advances in Neural Information Processing Systems 16, Cambridge, MA.. Semi-supervised learning using gaussian fields and

We successfully use a graph-based semi-supervised learning approach to map nodes of a graph to signal amplitudes such that the resulting time series is smooth and the

Recent works have applied Deep learning techniques to Knowledge tracing (Deep Knowledge Tracing – DKT) in order to model single student’s behavior during the time, in terms of

Deep Learning models proposed so far [11] have used external informa- tion to enrich entities, such as synonyms from the ontology, definitions from Wikipedia and context from

Keywords: Music Information Retrieval, Singing voice separation, Deep learning, Train- ing datasets, Data augmentation.... 1.3 The singing voice