[TLC16a] Cross-Modal Classification by Completing Unimodal Representations

International peer-reviewed conference: Workshop Vision and Language Integration Meets Multimedia Fusion, held with ACM Multimedia, October 2016, pp. 17-25, Amsterdam, NL (DOI: 10.1145/2983563.2983570)
Abstract: We argue that cross-modal classification, where models are trained on data from one modality (e.g. text) and applied to data from another (e.g. image), is a relevant problem in multimedia retrieval. We propose a method that addresses this specific problem, related to but different from cross-modal retrieval and bimodal classification. This method relies on a common latent space where both modalities have comparable representations and on an auxiliary dataset from which we build a more complete bimodal representation of any unimodal data. Evaluations on Pascal VOC07 and NUS-WIDE show that the novel representation method significantly improves the results compared to the use of a latent space alone. The level of performance achieved makes cross-modal classification a convincing choice for real applications.
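
As a rough illustration of the pipeline the abstract outlines, the sketch below builds a common latent space on a paired auxiliary set, completes each unimodal sample with the missing modality's latent code averaged over its nearest auxiliary neighbours, then trains a classifier on completed text and applies it to completed images. CCA, k-NN averaging, and the synthetic features are assumptions made for illustration, not the method published in the paper.

import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic paired (text, image) features standing in for real descriptors.
n, d_text, d_img, d_lat = 600, 120, 80, 10
shared = rng.normal(size=(n, d_lat))
text = shared @ rng.normal(size=(d_lat, d_text)) + 0.1 * rng.normal(size=(n, d_text))
img = shared @ rng.normal(size=(d_lat, d_img)) + 0.1 * rng.normal(size=(n, d_img))
labels = (shared[:, 0] > 0).astype(int)

aux_text, aux_img = text[:400], img[:400]             # auxiliary bimodal set
train_text, train_y = text[400:500], labels[400:500]  # labelled texts only
test_img, test_y = img[500:], labels[500:]            # images only at test time

# Step 1: learn a common latent space from the paired auxiliary data
# (CCA is one standard choice, assumed here).
cca = CCA(n_components=d_lat).fit(aux_text, aux_img)
aux_tz, aux_iz = cca.transform(aux_text, aux_img)

def project_img(y):
    # sklearn's CCA exposes only a joint transform; feed zeros on the
    # text side and keep the image scores.
    return cca.transform(np.zeros((len(y), d_text)), y)[1]

nn_text = NearestNeighbors(n_neighbors=5).fit(aux_tz)
nn_img = NearestNeighbors(n_neighbors=5).fit(aux_iz)

# Step 2: complete a unimodal sample with the missing modality's latent
# code, averaged over its nearest auxiliary neighbours, yielding a
# bimodal representation with a fixed (text latent, image latent) order.
def complete_from_text(x):
    z = cca.transform(x)
    _, idx = nn_text.kneighbors(z)
    return np.hstack([z, aux_iz[idx].mean(axis=1)])

def complete_from_img(y):
    z = project_img(y)
    _, idx = nn_img.kneighbors(z)
    return np.hstack([aux_tz[idx].mean(axis=1), z])

# Cross-modal classification: train on completed texts, test on completed images.
clf = LogisticRegression(max_iter=1000).fit(complete_from_text(train_text), train_y)
print("cross-modal accuracy:", clf.score(complete_from_img(test_img), test_y))

Keeping the slot order (text latent, image latent) identical in both completion functions is what allows a classifier fitted on completed text to be applied unchanged to completed images.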

Collaboration: CEA

BibTeX

@inproceedings{TLC16a,
  title = "{Cross-Modal Classification by Completing Unimodal Representations}",
  author = "T. Tran and H. Le Borgne and M. Crucianu",
  booktitle = "{Workshop Vision and Language Integration Meets Multimedia Fusion, with ACM Multimedia}",
  year = 2016,
  month = "October",
  pages = "17-25",
  address = "Amsterdam, NL",
  doi = "10.1145/2983563.2983570",
}