[BCC19] MUREL: Multimodal Relational Reasoning for Visual Question Answering

International peer-reviewed conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, Long Beach, USA

Keywords: Reasoning ; Deep Learning ; Visual Question Answering

Abstract: Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows focusing on the visual content relevant to the question, this simple mechanism is arguably insufficient to model complex reasoning features required for VQA or other high-level tasks. In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive to or outperforms state-of-the-art results in this challenging context.
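The general idea behind the MuRel cell described above (fusing region features with the question, then modeling pairwise region relations and updating each region representation) can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the elementwise fusion, the pairwise sum, the max-pooling aggregation, and all dimensions are simplifying assumptions standing in for the learned bilinear fusion and network components of the paper.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Demonstrates the generic pattern of question-conditioned pairwise
# relational reasoning over image regions, in plain NumPy.
import numpy as np

rng = np.random.default_rng(0)

n_regions, d = 4, 8                          # toy sizes (assumptions)
regions = rng.normal(size=(n_regions, d))    # region features s_i
question = rng.normal(size=(d,))             # question embedding q

# Fuse each region with the question (elementwise product as a
# stand-in for a learned bilinear fusion).
fused = regions * question                   # (n_regions, d)

# Pairwise combinations: form a joint representation for every
# pair (i, j), then aggregate over j by max-pooling.
pair = fused[:, None, :] + fused[None, :, :]  # (n, n, d)
context = pair.max(axis=1)                    # (n, d) per-region context

# Residual update of each region representation, as in an
# iterative refinement scheme.
updated = fused + context
print(updated.shape)                          # (4, 8)
```

Stacking several such updates, each re-fused with the question, mirrors the "progressive refinement" of visual and question interactions that the full network performs.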


@inproceedings{BCC19,
  title="{MUREL: Multimodal Relational Reasoning for Visual Question Answering}",
  author="H. Ben Younes and R. Cadene and M. Cord and N. Thome",
  booktitle="{IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}",
  address="Long Beach, USA",
  year="2019",
}