Quantum-inspired Multimodal Representation

(1)

Quantum-inspired Multimodal Representation

Qiuchi Li

[email protected]

Department of Information Engineering University of Padova

Padova, Italy

Massimo Melucci

[email protected]

Department of Information Engineering University of Padova

Padova, Italy

ABSTRACT

We introduce our work in progress that targets on building multimodal representation under quantum inspiration. The challenge for multimodal representation falls on a fusion strategy to capture the interaction between different modalities of data. As the most successful approaches, neural networks lack a mechanism of explicitly showing how different modalities are related to each other.

We address this issue by seeking inspirations from Quantum The- ory (QT), which has been demonstrated advantageous in explicitly capturing the correlations between textual features. In this paper, we give an overview of the related works and present the proposed methodology.

KEYWORDS

multimodal data fusion, quantum physics, neural networks

1 INTRODUCTION

In human communication, messages are often conveyed through a combination of different modalities, such as visual, audio and linguistic modalities. In order for an automatic understanding of the multimodal messages, one needs to fuse the information from different modalities to construct a joint multimodal representation.

The challenge falls on how to characterise the interactions between different modalities, which can be complicated in some scenarios. In the example shown in Fig. 1, the visual-linguistic query cannot be understood solely by either the text or image individually, but the relation between the image and text must be correctly recognized.

This is where the significance and challenge sit for multimodal representation learning.

Most existing research focus on neural network-based fusion strategies for constructing multimodal representation, ranging from the earlier Hidden Markov Model (HMM)-based models [7] to different RNN variants [1] and more recently tensor-based approaches [12] and seq-to-seq structures [8]. Despite their strong accuracy performances, the interplay between different data modalities are encoded in an inherent way by the neural network components, making it difficult for humans to understand the contribu- tions of each modality in a particular task.

Quantum Theory (QT) provides a well-established theoretical and mathematical formalism for describing the physical world on a microscopic level. Beyond physics, QT frameworks have been ap- plied to many research areas [2, 9, 10, 15], among which successful

Copyright @ 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

IIR 2019, September 16–18, 2019, Padova, Italy

Web Search of Fashion Items with Multimodal Querying

Katrien Laenen

KU Leuven

Human Computer Interaction Heverlee, Belgium [email protected]

Susana Zoghbi

KU Leuven

Marie-Francine Moens

KU Leuven

ABSTRACT

In this paper, we introduce a novel multimodal fashion search para- digm where e-commerce data is searched with a multimodal query composed of both an image and text. In this setting, the query image shows a fashion product that the user likes and the query text allows to change certain product attributes to fit the product to the user’s desire. Multimodal search gives users the means to clearly express what they are looking for. This is in contrast to current e-commerce search mechanisms, which are cumbersome and often fail to grasp the customer’s needs. Multimodal search requires intermodal representations of visual and textual fashion attributes which can be mixed and matched to form the user’s desired product, and which have a mechanism to indicate when a visual and textual fashion attribute represent the same concept. With a neural network, we induce a common, multimodal space for visual and textual fashion attributes where their inner product measures their semantic similarity. We build a multimodal retrieval model which operates on the obtained intermodal representations and which ranks images based on their relevance to a multimodal query.

We demonstrate that our model is able to retrieve images that both exhibit the necessary query image attributes and satisfy the query texts. Moreover, we show that our model substantially outperforms two state-of-the-art retrieval models adapted to multimodal fashion search.

ACM Reference format:

Katrien Laenen, Susana Zoghbi, and Marie-Francine Moens. 2018. Web Search of Fashion Items with Multimodal Querying. InProceedings of WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, February 5–9, 2018 (WSDM 2018),9 pages.

https://doi.org/10.1145/3159652.3159716

1 INTRODUCTION

Today’s consumers have become very exigent. When shopping online, they have in mind a specific clothing item in a particular color and style, and they want to find it without too much effort.

However, current e-commerce search mechanisms are often too limited to provide this kind of service. A common way for searching

1Image material adapted from www.amazon.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA

ACM ISBN 978-1-4503-5581-0/18/02...$15.00 https://doi.org/10.1145/3159652.3159716

Figure 1: Example of a multimodal query¹. It consists of a query image and a query text that alters the query image.

The query text mentions two fashion attributes: short and lace. The query image is knee length and does not have a lace type appearance.

products in a webshop is to navigate through a product category hierarchy. Users end up in a subcategory which they need to search completely, which is time-consuming. They need to go through many irrelevant products, without the guarantee of actually find- ing the product they are looking for. To narrow down the search, users can sometimes select certain filters. However, desired product attributes might not be amongst the available filters. With a text-based search approach, users can describe the desired product by entering keywords into a search bar. The webshop then finds relevant products by matching these keywords to the words in the product descriptions. It is often difficult for users to write the right keywords that will induce the search engine to provide the products they are interested in. For example, some users might be interested in “jeans with holes”, but the relevant products are described as

“distressed jeans”. Additionally, this approach hampers the search for product attributes which are not mentioned in the product descriptions. Alternatively but rarely, webshops offer image-based search where the user uploads an image of the desired product and receives visually similar products. Recently, there is an increasing interest of users in this kind of image-based search. One of the main reasons is the growing usage of visual social media such as Pinterest and Instagram, where users see products they want to buy. Using an image as a query allows users to convey much more information about the desired product than with a textual query. Additionally, another advantage of image-based search over text-based search is that the language of images is universal. However, the user might be interested in changing or adding attributes to the product in the query image to obtain very specific results. For example, for an image of a red dress, the user likes the sleeve length to be different.

Technical Presentation WSDM’18, February 5-9, 2018, Marina Del Rey, CA, USA

342

Figure 1: Example of a Multi-modal query.

results have been observed in text-based language understanding [10, 15]. In this context, quantum-inspired frameworks have the natural advantage of explicitly capturing the correlations between features by the concepts ofquantum superpositionandquantum entanglement.

This inspires us to adopt quantum-inspired frameworks for constructing multimodal representation, and propose effective and interpretable multimodal fusion methods. In particular, we propose to capture the interactions within a single modality withquantum superposition, and model the cross-modal interactions by means ofquantum entanglement. In this way, we establish a pipeline that extracts the interactions within multi-modal data in a way un- derstandable from a quantum perspective. We expect to obtain comparable performances to state-of-the-art systems in concrete multimodal tasks.

2 PROPOSED METHODOLOGY

Here we introduce our proposed approach for building multimodal data representation inspired by quantum theory. Essentially, we represent multimodal data as many-body systems composed of different data modalities as subsystems. The interaction of different modalities is inherently captured by the notion of entanglement between the subsystems. We propose to build a complex-valued learning network to implement the quantum theoretical framework, which facilitates learning the cross-modal interactions in a data- driven fashion.

2.1 Complex-valued Unimodal Representation

Complex values are essential for the mathematical formalism of quantum physics. However, most existing quantum-inspired models for text representation are based on the real vector space, ignor- ing the complex-valued nature of quantum notions. Recently, our prior works [3, 4, 11] leveraged quantumsuperpositionto model

49

(2)

IIR 2019, September 16–18, 2019, Padova, Italy Q. Li, M. Melucci

Figure 2: Our proposed multimodal representation framework.

correlations between textual features, and the complex-valued representation leads to improved performance and enhanced interpretability. We attempt to employ the concept ofsuperpositionfor modeling intra-modal interactions, and investigate the complex- valued embedding approach to capture the interactions within other modalities.

2.2 Tensor-based Approaches for Capturing Inter-modal Interactions

We represent multimodal data as a many-body quantum system in entanglement. The mathematical formulation will be a complex- valued tensor constructed from unimodal complex-valued vectors by tensor-based approaches. Tensor-based models have been ap- plied for classification [5] and matching [14] tasks, but they avoid directly computing the tensor by decomposing the tensor and learning the decomposed weights via neural networks. Explicit tensor combination of different data modalities into a holistic multimodal representation remains unexplored. [6] proposed a framework to investigate the entanglement between user and document for relevance feedback, but it remains a challenge on how to apply it to multimodal data.

2.3 Quantum-inspired Framework for Multimodal Sentiment Analysis

We focus on the multimodal sentiment analysis task, and work with the benchmarking datasets CMU-MOSI [13] and CMU-MOSEI [1].

The task is to classify sentiment of a video into 2, 5 or 7 classes with textual, visual and acoustic features. As is shown in Fig. 2, our framework represents unimodal data as a set of pure states through complex-value embedding, and constructs the many-body state of an video utterance through tensor-based approaches. Fi- nally, quantum-like measurement operators are implemented for sentiment classification. The whole process is implemented into a complex-valued neural network, and the parameters in the pipeline can be learned from labeled data in an end-to-end manner.

The network is born with an advantage in interpretability com- pared to classical neural networks, in that the role of each com- ponent is made explicit prior to the network training phase. We

are working on deploying the network to CMU-MOSI and CMU- MOSEI, and we expect to see comparable values to state-of-the-art models in terms of effectiveness.

ACKNOWLEDGMENTS

This PhD project is supported by the European Union‘s Horizon 2020 research and innovation programme under the Marie Sklodowska- Curie grant agreement No. 721321.

REFERENCES

[1] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis- Philippe Morency. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia, July 2018. Association for Com- putational Linguistics.

[2] Jerome R. Busemeyer and Peter D. Bruza. Quantum Models of Cognition and Decision. Cambridge University Press, New York, NY, USA, 1st edition, 2012.

[3] Qiuchi Li, Sagar Uprety, Benyou Wang, and Dawei Song. Quantum-Inspired Complex Word Embedding.Proceedings of The Third Workshop on Representation Learning for NLP, pages 50–57, 2018.

[4] Qiuchi Li, Benyou Wang, and Massimo Melucci. CNM: An Interpretable Complex- valued Network for Matching. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long and Short Papers), pages 4139–4148, Min- neapolis, Minnesota, June 2019. Association for Computational Linguistics.

[5] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient Low-rank Multi- modal Fusion With Modality-Specific Factors. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[6] Massimo Melucci. Towards modeling implicit feedback with quantum entanglement.Quantum Interaction 2008, 2008.

[7] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. InProceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, pages 169–176, New York, NY, USA, 2011. ACM.

[8] Hai Pham, Thomas Manzini, Paul Pu Liang, and Barnabas Poczos.

Seq2seq2sentiment: Multimodal Sequence to Sequence Models for Senti- ment Analysis.arXiv:1807.03915 [cs, stat], July 2018. arXiv: 1807.03915.

[9] C. J. van Rijsbergen.The Geometry of Information Retrieval. Cambridge University Press, New York, NY, USA, 2004.

[10] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling Term De- pendencies with Quantum Language Models for IR. InProceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, pages 653–662, New York, NY, USA, 2013. ACM.

[11] Benyou Wang, Qiuchi Li, Massimo Melucci, and Dawei Song. Semantic Hilbert Space for Text Representation Learning. InThe World Wide Web Conference, WWW ’19, pages 3293–3299, New York, NY, USA, 2019. ACM. event-place: San Francisco, CA, USA.

[12] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor Fusion Network for Multimodal Sentiment Analysis.

arXiv:1707.07250 [cs], July 2017. arXiv: 1707.07250.

[13] Amir Zadeh, Rown Zellers, and Eli Pincus. MOSI: Multimodal Corpus of Senti- ment Intensity and Subjectivity Analysis in Online Opinion Videos. page 10.

[14] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A Quantum Many-body Wave Function Inspired Language Modeling Approach. InProceed- ings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pages 1303–1312, New York, NY, USA, 2018. ACM.

[15] Yazhou Zhang, Dawei Song, Peng Zhang, Panpan Wang, Jingfei Li, Xiang Li, and Benyou Wang. A quantum-inspired multimodal sentiment analysis framework.

Theoretical Computer Science, April 2018.

50