TL;DR – Recent multimodal models like DALL-E, Midjourney, Flamingo, and GPT-4 generate very impressive images and text. But these models store all their knowledge implicitly in their parameters, which drives up training cost (more parameters and data are needed) and causes hallucination (inaccurate generation due to missing knowledge). How can we build a more efficient and grounded model? In this post, we introduce a retrieval-augmented vision-language multimodal model (ICML 2023) that can retrieve relevant knowledge from an external memory (e.g. web data) and refer to it when generating outputs. By letting the model consult an explicit memory, we can significantly improve generation accuracy and reduce training cost.

Table of Contents

  1. Background: Vision-Language Multimodal Models
  2. Background: Retrieval Augmentation
  3. Retrieval-Augmented Multimodal Model (RA-CM3)
  4. Let’s use RA-CM3
  5. Summary

Background: Vision-Language Multimodal Models (VLMs)

Over the past few years, there has been an increasing interest in developing generative models that can understand and synthesize multimodal content. One area of particular interest is vision-language generative models, which aim to generate images from textual descriptions (text-to-image) or generate textual descriptions from images (image-to-text). These models have a wide range of applications, such as in chatbots, image synthesis, and content creation for social media.

Recent text-to-image models include autoregressive Transformer-based models (e.g. DALL-E, CogView2, Parti) and diffusion-based models (e.g. DALL-E 2, Stable Diffusion, Imagen, Midjourney). The former casts each image as a sequence of visual tokens, and generates an image using a Transformer. The latter generates an image using a diffusion model conditioned on the embedding of input text.

Recent image-to-text models include Flamingo and GPT-4, which are Transformer-based models that take images/text and generate text. They can be used for image-to-caption generation or question answering involving images.

Moreover, efforts have recently been made to develop unified vision-language generative models that can generate both text and images. These include CM3 and RA-CM3, which train a Transformer that can understand and generate arbitrary sequences of text and images.


The above-mentioned vision-language models have achieved impressive performance in image and text generation. However, they store all their knowledge (e.g. the appearance of a labrador) implicitly in the parameters of the underlying neural networks. Consequently, these models require ever more parameters (e.g. >10B) and training data (e.g. >1B images) to capture more knowledge. Additionally, because the knowledge is stored implicitly, it is difficult to update or verify, which may cause the models to hallucinate.

How can we make these multimodal models more scalable and grounded? A solution we explore in this post is retrieval augmentation, which aims to allow the models to refer to an explicit external memory (e.g. web data) for improved capacity, accuracy, and controllability of knowledge. Below is a quick introduction to the retrieval augmentation framework.

Background: Retrieval Augmentation

Retrieval augmentation has recently shown promise and gained attention in language modeling and natural language processing (e.g. kNN-LM, DPR, RAG, REALM, MARGE, RETRO). Given an input text (e.g. a question), a retrieval-augmented language model uses a retriever to fetch documents relevant to the input from an external memory such as Wikipedia or web search, and a generator that refers to the retrieved documents to make a prediction (e.g. an answer).
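Conceptually, this retrieve-then-generate loop can be sketched in a few lines of Python. The `retriever` and `generator` below are hypothetical stand-ins for the relevance-scoring function and the generative model, not a real API:

```python
def retrieval_augmented_generate(query, memory, retriever, generator, k=3):
    """Minimal sketch of retrieval-augmented generation.

    `retriever(query, doc)` returns a relevance score, and
    `generator(context, prompt)` conditions on retrieved documents.
    Both are placeholders for the actual learned components.
    """
    # Score every memory document against the query and keep the top-k.
    scored = [(retriever(query, doc), doc) for doc in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    retrieved = [doc for _, doc in scored[:k]]
    # The generator attends to the retrieved documents plus the query.
    return generator(context=retrieved, prompt=query)
```

In a real system the brute-force scoring loop would be replaced by an approximate nearest-neighbor search over precomputed document vectors.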

By externalizing the knowledge of a model, retrieval augmentation can provide benefits such as

  • Scalability: reducing the model size and training cost, as well as allowing easy expansion of knowledge
  • Accuracy: grounding the model to facts and reducing hallucination
  • Controllability: allowing updating or customizing the knowledge
  • Interpretability: retrieved items serve as citable references for the model's predictions

However, retrieval augmentation has mostly been studied for language models, and much less for multimodal or vision-language models. To bridge this gap, we introduce retrieval-augmented multimodal modeling in this post (the full paper is also available here).

Retrieval-Augmented Multimodal Model (RA-CM3)

Here we build a retrieval-augmented multimodal model that can retrieve and generate both text and images.


We define a multimodal document/prompt to be an arbitrary sequence of text and images, such as a caption, an image, or an interleaved text-image sequence. For instance, multimodal documents can be collected from the web. As illustrated in the figure above, given a multimodal prompt, the model uses a retriever to fetch relevant multimodal documents from an external memory, and uses a generator to refer to the retrieved documents to generate a response to the prompt (e.g., generate an image for a text prompt, generate a caption for an image prompt). For instance, when generating an image for a caption “Labrador sitting on a bench near water”, now the model can retrieve and attend to an image of a labrador as a reference, without requiring the model to memorize the appearance of a labrador in its internal parameters.

Our generator adopts CM3, the unified vision-language model architecture introduced at the beginning of this post. Consequently, the resulting model can retrieve and generate any combination of text and images in a unified way, and we call this model Retrieval-Augmented CM3 (RA-CM3).

For anyone interested, below are more technical details about RA-CM3, including the retriever/generator implementation and how RA-CM3 conceptually compares with other recent models.


A retriever takes a query and a candidate document from the memory and returns a relevance score between them: f(query, memory). We implement our retriever as a multimodal dense retriever. Specifically, we use an encoder to map the query and the memory document into dense vectors (a query vector and a memory vector, respectively), and then compute their cosine similarity as the relevance score. For the encoder, we use pre-trained CLIP, which can map both text and images into a shared vector space.
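As a sketch, the snippet below computes the cosine-similarity relevance score between two dense vectors. In the actual system these vectors would come from the frozen pre-trained CLIP encoder; here they are plain NumPy arrays standing in for CLIP embeddings:

```python
import numpy as np

def cosine_relevance(query_vec: np.ndarray, memory_vec: np.ndarray) -> float:
    """Relevance score f(query, memory) = cosine similarity of the
    two dense vectors (each assumed to come from a shared encoder,
    e.g. CLIP in the actual model)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vec / np.linalg.norm(memory_vec)
    return float(q @ m)
```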

Using this retriever, we perform Maximum Inner Product Search (MIPS) over the entire memory to obtain the most relevant memory documents (i.e., retrieved documents). MIPS is implemented using the FAISS library. In our current model, we retrieve up to two documents.
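To illustrate the MIPS step, here is a brute-force NumPy stand-in for what FAISS does at scale (FAISS uses optimized, optionally approximate indexing; this small version computes exact inner products). Vectors are L2-normalized first so that inner product matches the cosine-similarity relevance score:

```python
import numpy as np

def mips(query_vec: np.ndarray, memory_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k memory vectors with the highest inner
    product against the query. After L2 normalization, inner product
    equals cosine similarity, so this recovers the top-k relevance
    scores of the dense retriever."""
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q                      # one inner product per memory doc
    return np.argsort(-scores)[:k]      # indices of the k best scores
```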


We use the CM3 Transformer as the base of our multimodal generator. To incorporate the retrieved documents (m1, ..., mK) into the generator, we prepend them to the main prompt x, and feed the resulting sequence (m1 , ..., mK, x) into the Transformer. In other words, the model can attend to the retrieved documents as in-context examples when generating a response to the main prompt.
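This prepending step amounts to simple sequence concatenation. The sketch below uses illustrative integer token IDs rather than the real CM3 vocabulary:

```python
def build_generator_input(retrieved_docs, prompt_tokens):
    """Concatenate (m1, ..., mK, x) into one flat token sequence.

    `retrieved_docs` is a list of token sequences (one per retrieved
    multimodal document); `prompt_tokens` is the main prompt x.
    The Transformer then attends to the retrieved documents as
    in-context examples while generating a response to x.
    """
    sequence = []
    for doc_tokens in retrieved_docs:   # m1, ..., mK
        sequence.extend(doc_tokens)
    sequence.extend(prompt_tokens)      # main prompt x comes last
    return sequence
```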

Comparison of RA-CM3 with related models

Vision-language models are an extremely active area of research, and many recent and concurrent models have been developed. The table below compares RA-CM3 with them in terms of model capabilities. RA-CM3 can retrieve and generate both text and images, generalizing the capabilities of the existing vision-language models.

| Model | Image generation | Text generation | Retrieval augmentation |
| --- | --- | --- | --- |
| DALL-E, Stable Diffusion, Imagen, and many others | ✓ | – | – |
| kNN-diffusion, Re-Imagen, etc. | ✓ | – | ✓ |
| Flamingo, GPT-4, and many others | – | ✓ | – |
| MuRAG, Re-ViLM, REVEAL, SmallCap, etc. | – | ✓ | ✓ |
| CM3 | ✓ | ✓ | – |
| RA-CM3 (this post) | ✓ | ✓ | ✓ |

Let’s use RA-CM3

We trained RA-CM3 using LAION, a popular dataset that consists of text-image pairs collected from the web. We took 150M text-image pairs from LAION and used them both as the memory for retrieval and as the training data for the generator.

Result: Retrieval improves image and caption generation quality

We compare RA-CM3 with the baseline CM3 model without retrieval, as well as existing text-to-image models (DALL-E and Stable Diffusion) and image-to-text models (Flamingo). On standard benchmarks for caption-to-image and image-to-caption generation, RA-CM3 achieves significantly improved accuracy over the baseline models without retrieval (tables below).

More excitingly, shown below are actual examples of outputs from RA-CM3 and the baseline models. For instance, given the input caption “French flag waving on the moon’s surface”, baseline models without retrieval tend to generate inaccurate images showing US flags on the moon. In contrast, thanks to its retrieval capability, RA-CM3 can refer to a retrieved image of a French flag and generate accurate images that do show French flags. This example demonstrates that retrieval helps provide accurate knowledge and mitigates hallucination in generative models.

In general, we find that retrieval is especially useful for knowledge-intensive generation, i.e., generating outputs that involve entity knowledge or composition of knowledge. For instance, as shown below, RA-CM3 can correctly generate images that involve a relatively rare entity (e.g. “Ming Dynasty vase”) or multiple entities (e.g. “Statue of Liberty” and “Washington Monument”).

Result: Retrieval improves training efficiency

We find that retrieval improves not only the generation accuracy but also the training speed. The figure below visualizes image generation quality (y-axis; lower score is better) versus the amount of compute used in model training (x-axis) for RA-CM3 and baseline models without retrieval (CM3, DALL-E, Parti). RA-CM3 sits significantly below the line connecting the baseline models, meaning it achieves better generation accuracy while using much less training compute (e.g. <30% of the compute used by DALL-E).

New capability: RA-CM3 exhibits multimodal in-context learning

In-context learning has primarily been studied in the text modality. A very exciting result is that RA-CM3 exhibits multimodal in-context learning: its generator can be prompted with both text and images to control image/text generation. For instance, as shown on the right of the figure, instead of feeding retrieved documents into the generator's context, we can feed any demonstration images we want (e.g., a triangular wooden house and orange autumn leaves) so that the model generates an image that follows the visual characteristics of these in-context images. This is very useful and versatile because the model can now be prompted with both a textual description (caption) and a visual demonstration (image) to control generation.


In this blog post, we reviewed the recent landscape of vision-language multimodal models (Section 1) and why retrieval augmentation is becoming important (Section 2). We then introduced how we can bridge these two efforts and build a retrieval-augmented multimodal model, RA-CM3 (Section 3) and discussed the exciting capabilities of RA-CM3 (Section 4).

RA-CM3 is the first multimodal generative model that can retrieve and generate both text and images. We believe this is a very exciting research direction toward building a fully general-purpose multimodal model (imagine a multimodal personal assistant that can comprehend and respond with any combination of modalities, such as text, image, audio, video).

This blog post is based on the paper:


Many thanks to my collaborators and mentors, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih for their help, and to the members of the MetaAI team, Stanford SNAP group and P-Lambda group for their valuable feedback.