FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami, G Couairon, W Galuba, M Rohrbach, D Kiela

Proceedings of the IEEE/CVF Conference on Computer Vision and …, 2022 (openaccess.thecvf.com)

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.


