All you can embed: Natural language based vehicle retrieval with spatio-temporal transformers

Scribano C.; Sapienza D.; Franchini G.; Verucchi M.; Bertogna M.
2021-01-01

Abstract

Combining natural language with vision is a distinctive challenge in artificial intelligence. The AI City Challenge Track 5, Natural Language-Based Vehicle Retrieval, addresses the problem of combining visual and textual information in a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution that correlates single-vehicle tracking sequences with natural language descriptions. The main building blocks of the proposed architecture are (i) BERT, which embeds the textual descriptions, and (ii) a convolutional backbone combined with a Transformer model, which embeds the visual information. To train the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
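
To make the training objective concrete, here is a minimal sketch of the standard Triplet Margin Loss and of distance-based retrieval over embeddings. It is not the variation proposed in the paper, whose details are not given in this abstract. The function names, the Euclidean distance, the margin value, and the toy 2-D vectors are all assumptions chosen for illustration; in AYCE the embeddings would come from BERT and from the convolutional backbone with a Transformer.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(d(a, p) - d(a, n) + margin, 0).

    Assumption: this is the textbook formulation, not the paper's variation.
    """
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def rank_tracks(query, tracks):
    """Return track ids sorted by increasing distance to the query embedding."""
    return sorted(tracks, key=lambda track_id: euclidean(query, tracks[track_id]))

# Toy 2-D embeddings (hypothetical values, for illustration only).
text_query = [0.0, 0.0]      # stands in for a language embedding
tracks = {
    "track_a": [3.0, 4.0],   # far from the query
    "track_b": [0.6, 0.8],   # close to the query
}

# The matching track is already closer than the non-matching one by more
# than the margin, so the loss for this triplet is zero.
loss = triplet_margin_loss(text_query, tracks["track_b"], tracks["track_a"], margin=1.0)

# Retrieval: order candidate tracks by distance to the text embedding.
ranking = rank_tracks(text_query, tracks)
```

The loss pushes a description closer to its matching track than to a non-matching one by at least the margin, which is what lets retrieval reduce to sorting candidate tracks by distance to the query embedding.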
All you can embed: Natural language based vehicle retrieval with spatio-temporal transformers / Scribano, C.; Sapienza, D.; Franchini, G.; Verucchi, M.; Bertogna, M. - (2021), pp. 4248-4257. (Paper presented at the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021, held in the USA in 2021) [10.1109/CVPRW53098.2021.00481].

Use this identifier to cite or link to this document: https://hdl.handle.net/11381/2975932