Until a few years ago, most mainstream large language models (LLMs) were trained on text alone. While such LLMs might learn to associate words like “apple” with words like “red”, they do not learn to associate words like “apple” with pictures of what an apple actually looks like. More recently, however, we’ve seen a flurry of multimodal LLMs: systems trained not only to predict words, but also to associate words (or word sequences) with images, sounds, or even actions in the world. Much of the work on multimodal LLMs has been on vision-language models (VLMs) in particular. There are a few reasons for this, including the fact that there are already large image databases to work with.
From a philosophical perspective, these VLMs are interesting because they’re a start towards addressing a major critique of text-only LLMs, i.e., that they lack “sensorimotor grounding” (something I’ve discussed at length before). And from a practical perspective, visual grounding makes them much more powerful and useful tools: images, after all, can convey information that’s hard or cumbersome to convey in language. Some of the more recent state-of-the-art models, like Molmo, have even been trained to “point” at objects in an image—which can often be more expedient (and less ambiguous) than describing the location of an object using natural language.
Now that ChatGPT’s been out for a couple years, many people have a better sense of how text-only LLMs work. Yet for many people interested in AI, there are still major knowledge gaps about how exactly VLMs work or what it means to “ground” an LLM. (This is also corroborated by the most recent poll results.) For many—including myself—VLMs can feel a bit mystifying. The primary goal of this explainer is at least partial demystification, much like the original explainer on LLMs I wrote with Timothy Lee. These systems are, at the end of the day, designed and built by human engineers, and those engineers have made design decisions that we can evaluate. Still, just like text-only LLMs, a pre-trained VLM can feel like a black box, and knowing the architecture is not a complete explanation. Thus, a secondary goal (which I’ll address in part 2) is reviewing the literature on VLMs from a cognitive perspective: how does grounding a language model change its representations and capabilities, and what are some of the key limitations of current models?
A final preliminary note: because my goal is demystification, this explainer will focus on open-source models—systems for which the architectural and training details have been made publicly available. That’s a bit of a shame, because state-of-the-art systems like GPT-4o are clearly very powerful. But because those systems are proprietary, I simply can’t speak much to how they actually work.
A very brief review of text-only LLMs
One of the nice things about the current generation of VLMs is that they actually share many architectural features and mechanisms with text-only LLMs. Because of that, I thought it’d be useful to quickly review text-only LLMs to establish a few key terms (as always, check out our explainer for a more in-depth discussion).
At a high level, the goal of a text-only LLM is very simple—it is trained to predict “tokens” (kind of like words) based on their context, such as:
> After the game, I had a large glass of ____
Words like “water” or “juice” (or maybe “beer”, depending on one’s proclivities) make sense here. Words like “dirt” don’t. How could we build a system to learn the difference?
First, an LLM needs to represent the basic units of its vocabulary (e.g., “water” and “dirt”). How do we map the raw signal (a string) into something the system can represent meaningfully? LLMs map input strings to large vectors of real-valued numbers (sometimes called embeddings). Words with more similar meanings or grammatical functions will have more similar vectors and will thus be closer to each other in this “vector-space”.
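To make “closer in vector-space” concrete, here’s a minimal sketch in Python of how cosine similarity captures that intuition. The three-dimensional vectors are made up purely for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy embeddings: real models learn vectors with hundreds or thousands of
# dimensions; these 3-d values are made up purely for illustration.
embeddings = {
    "water": np.array([0.9, 0.1, 0.3]),
    "juice": np.array([0.8, 0.2, 0.4]),
    "dirt":  np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["water"], embeddings["juice"]))  # high
print(cosine_similarity(embeddings["water"], embeddings["dirt"]))   # much lower
```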
Second, we need an architecture that enables these vector representations to influence each other in context. How do we modify those representations and what operations are involved? Words, after all, change their meanings in different contexts, so this allows an LLM to adjust its representation of ambiguous words like “bank” as a function of the surrounding words (e.g., “financial” vs. “river”). One of the most powerful mechanisms for doing this is called self-attention, which is at the heart of the modern transformer language model. I won’t go into too much detail on how self-attention works here, but you can think of it as a mechanism for sharing information between the words in a sentence.
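For readers who want a slightly more concrete picture, here’s a minimal sketch of scaled dot-product self-attention for a single head. The inputs and weight matrices are random stand-ins for what a trained model would learn; the point is just the shape of the computation: each token’s updated vector is a weighted mix of information from all the tokens in the sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head.

    X: (seq_len, d_model) token embeddings. Each output row is a weighted
    mix of all the value vectors, so every position can pull in
    information from every other position.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how relevant is each token to each other token?
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 5            # e.g., a five-token sentence
X = rng.normal(size=(seq_len, d_model))        # stand-ins for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one updated vector per token
```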
And finally, we need a training objective for our system. What kind of feedback or error signal can be used to optimize the system to produce better representations with the architecture we’ve designed? That takes us back to the goal we started with: predicting the next word. Because language is used in regular, systematic ways (words don’t just occur in random order), we can train a system on how language is actually used without needing explicit labels (this is called “self-supervision”). And as recent years have shown, this language modeling task has produced systems that—despite not being explicitly trained to do so—seem to be good at other tasks as well.
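Here’s a rough sketch of what that training signal looks like. The logits below are random stand-ins for a model’s output scores; the crucial bit of self-supervision is that the “labels” are simply the tokens that actually came next in the text, so no human annotation is required.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy between the model's predicted distribution over
    the vocabulary and the token that actually came next.

    logits:  (seq_len, vocab_size) unnormalized scores at each position
    targets: (seq_len,) index of the true next token at each position
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size, seq_len = 100, 6
logits = rng.normal(size=(seq_len, vocab_size))       # stand-in for model outputs
targets = rng.integers(0, vocab_size, size=seq_len)   # the words that actually followed
print(next_token_loss(logits, targets))
```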
These three challenges are also the ones involved in building vision-language models. How should a system encode or represent its inputs (words and images)? What structure or operations should define that system’s architecture? And how should we train the system to produce better representations?
Problems of representation: the basics of Vision Transformers
Let’s tackle the representation problem first. We already know we can represent words as vectors, so the question is how to represent images.
Computer vision has a long history: for many years, the state-of-the-art was achieved using something called a convolutional neural network, or CNN (Tim Lee has a great explainer on how these systems work if you’re interested). More recently, however, researchers have found success adapting the transformer architecture to vision, which is called a vision transformer (ViT). This explainer by Skylar Jean Callis walks through the ViT in detail, and is based on this more academic paper on ViTs. Below, I give a brief description of how ViTs encode their inputs, but you can follow those links for a more in-depth description.
A digital image is a two-dimensional grid of pixels, where each pixel represents a color or intensity value. Combined, those pixels produce what we perceive as a holistic image (like the image of my two cats below). In principle, we could encode each pixel of the image individually, but that’s really computationally intensive, and also maybe too fine-grained to be strictly necessary: lots of the higher-level abstractions that are relevant to our own perception of the image can only be detected from larger groups of pixels anyway.
Recall that with language models, we tokenize a sentence into its constituent parts (usually words). Language is, of course, pretty different from a digital image: it’s one-dimensional (i.e., a long sequence), and the basic units are discrete and arbitrary. But just as we can tokenize sentences, we can divide or “segment” images into smaller, spatially-determined areas we call patches. By doing this, we can potentially capture higher-level patterns (like textures or shapes) that are hard to determine from single pixels alone.
The number of patches produced by this process is necessarily a function of both the size of the image and the desired size of each patch. Smaller patches will capture more fine-grained details, but will also introduce additional complexity into the model; larger patches are less computationally intensive, but risk losing important details. In the example below, we’re working with a 3’’x4’’ image; if our desired patch size is 1’’x1’’, we’ll divide the image into a grid of 3 rows and 4 columns, resulting in 12 square patches.
Each 2D patch is then linearized, which just means it’s turned into a very long 1D vector of pixel values. The length of this vector is determined by the number of pixels in the patch and the number of channels in the image (grayscale images have one channel, whereas color images have three channels). Thus, if the original RGB patch is a 16x16 grid of pixels (256 total), the size of the linearized vector will be 768 (16x16x3). ViTs then apply a learned projection that maps this 1D vector onto a new vector whose length matches the model’s embedding dimension, so every patch embedding ends up with a consistent length regardless of patch size or channel count. As this paper points out, the output of this projection process is called a patch embedding, which is roughly analogous to the word embedding for a particular token. A schematic of this process is depicted below (note that the “linearization” depicted is not really linearization, as the resulting image is still technically 2D).
We can then put all these patch embeddings together. Thus, what was originally a 2D grid of pixel values is now a sequence of fixed-length patch embeddings, each consisting of a bunch of numbers. If there are 12 patches, and each patch embedding has 768 values, the total input length would now be 9216.
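Here’s a minimal sketch of that pipeline, using a toy 48x64-pixel RGB image and 16x16 patches so the numbers line up with the example above (3 rows and 4 columns of patches, 768 raw values per patch). The projection matrix is a random stand-in for the learned one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 48x64 RGB image: 3 rows x 4 columns of 16x16 patches = 12 patches.
image = rng.random((48, 64, 3))
patch_size, d_model = 16, 768   # 768 happens to equal 16*16*3 here, but it needn't

def patchify(image, patch_size):
    """Cut the image into non-overlapping patches and flatten ("linearize") each one."""
    H, W, C = image.shape
    patches = []
    for top in range(0, H, patch_size):
        for left in range(0, W, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size, :]
            patches.append(patch.reshape(-1))   # 16*16*3 = 768 raw pixel values
    return np.stack(patches)

raw_patches = patchify(image, patch_size)                  # (12, 768)
W_proj = rng.normal(size=(raw_patches.shape[1], d_model))  # stands in for the learned projection
patch_embeddings = raw_patches @ W_proj                    # (12, 768): one embedding per patch
print(patch_embeddings.shape, patch_embeddings.size)       # (12, 768) 9216
```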
There are just a couple more small steps involved in preparing our input. First, something called a [class] token is prepended to that input vector. [class] tokens are also used in training certain kinds of language models (like BERT), and are useful for learning to store information that applies to the entire image (or sentence). In both cases, you can think of them as representing a kind of “summary” of all the individual embeddings put together. Practically speaking, people often use the [class] token as the input to classifiers (hence “class”). For example, the [class] token in a ViT could be useful for aggregating information across all the patches in an image for determining whether or not the image contains a cat; similarly, a [class] token in BERT could be useful for classifying the sentiment of the entire sentence (as opposed to individual words).
Second, just as with transformer language models, we add position embeddings, which encode information about the original position of each patch. These are really important because the “meaning” of a sequence of pixels (or a word) depends in part on where it is, so the position embeddings help the system track the relationships between parts of the input.
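In code, those last two steps are pretty simple. Continuing the sketch from above, here’s roughly how they look; in a real ViT the [class] token and position embeddings are learned parameters, so the random vectors here are just stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model = 12, 768
patch_embeddings = rng.normal(size=(num_patches, d_model))  # from the patchify step above

# Learned parameters in a real ViT; random stand-ins here.
class_token = rng.normal(size=(1, d_model))
position_embeddings = rng.normal(size=(num_patches + 1, d_model))  # one per position, incl. [class]

# Prepend the [class] token, then add position information element-wise.
sequence = np.concatenate([class_token, patch_embeddings], axis=0)
vit_input = sequence + position_embeddings
print(vit_input.shape)  # (13, 768): ready for the transformer layers
```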
And that’s it! The original image (a grid of pixels) is segmented into patches, which are each linearized and converted into patch embeddings; the final input is then combined with a [class] token and position embeddings for each patch. All of these stages are shown together in the graphic below.
Problems of architecture: more than one way to “see”
Now that we can represent images as vectors of numbers (patch embeddings), we can use them as inputs to a transformer neural network. But what kind of architecture should that network have? And how should it be integrated with the language input to a VLM?
The good news is that ViTs have the same building blocks as transformer language models. In each layer of the model, the patch embeddings undergo a series of operations, including self-attention, a feedforward projection (using a multi-layer perceptron, or MLP), and “layer norming” (kind of like centering the data). A ViT without a language model component—i.e., just a vision network—could be used for things like image classification, as in the example below. I’m not going to describe all these operations in detail, but you can get a sense for the structure from the example I’ve provided, which is based on Figure 1 from this overview paper. The self-attention mechanism is covered in more detail in the explainer on LLMs with Timothy Lee.
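To give a flavor of those operations without belaboring them, here’s a compact sketch of a single encoder layer. It uses one common arrangement (a “pre-norm” layer with a single attention head, for brevity), and all the weight matrices are random stand-ins for learned parameters; a real ViT stacks a dozen or more of these layers.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean and unit variance ("centering the data")."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(x, params):
    """One (pre-norm, single-head) transformer layer: self-attention, then an MLP,
    each wrapped in layer norm and a residual connection."""
    h = layer_norm(x)
    Q, K, V = h @ params["W_q"], h @ params["W_k"], h @ params["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = x + attn @ params["W_o"]                               # residual connection
    h = layer_norm(x)
    x = x + np.maximum(h @ params["W_1"], 0) @ params["W_2"]   # two-layer MLP with ReLU
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 768, 3072, 13                # 12 patches + [class] token
params = {name: rng.normal(size=shape) * 0.02 for name, shape in {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
    "W_o": (d_model, d_model), "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}.items()}
x = rng.normal(size=(seq_len, d_model))               # the prepared ViT input from above
print(encoder_layer(x, params).shape)                 # (13, 768): same shape in, same shape out
```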
The bad news is that this is only the vision component of a VLM. We need a way to connect our ViT (a vision network) to a text-only language model (e.g., something like GPT-2). And as the title of this section suggests, there are lots of ways one could do this!
Broadly, VLMs can be characterized in terms of the degree of interaction between their text and image inputs. The basic question (illustrated below) is: when should this information be merged or combined?
Dual-encoder models maintain mostly separate representations of text and images, only combining these streams after doing considerable processing on each. An example of a dual-encoder model is OpenAI’s CLIP model. In contrast, fusion models merge these representations early on, allowing them to influence each other using operations like self-attention. Somewhat confusingly, researchers distinguish further between two types of fusion models: single-stream (text and image representations are merged almost immediately) and dual-stream (separate processing pipelines are maintained, but they can influence each other using cross-attention). An example of a single-stream fusion model is ViLT (vision-and-language transformer), while an example of a dual-stream fusion model is BridgeTower.
In my experience, these distinctions are a lot easier to understand visually, using our schematic from above as a starting point.
In the graphics above, the language components are color-coded as green, the vision components are all red, and the multimodal components are yellow. As you can see, model architectures vary in the size and extent of their multimodal components.
Single-stream fusion models like ViLT concatenate their text and image representations very early on, and then apply the standard transformer architecture to a single “stream” of information. Dual-stream models like BridgeTower allow for separate representations to be maintained, but these representations “talk” to each other in later layers using cross-modal attention. And finally, dual-encoder models like CLIP mostly don’t allow cross-talk—once each input is fully processed by a language or vision transformer, it is projected to a shared embedding space. (The way this projection works will be covered in the next section on training objectives.)
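Here’s a schematic way to see the same distinction in code. The transformer and cross_attention functions are placeholders (hypothetical stand-ins, not any real model’s API); the only thing to notice is where the text and image streams actually meet.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
text_tokens   = rng.normal(size=(7, d))    # stand-ins for text token embeddings
image_patches = rng.normal(size=(12, d))   # stand-ins for image patch embeddings

def transformer(x):                # placeholder for a full stack of transformer layers
    return x

def cross_attention(x, context):   # placeholder: x pulls in information from the other modality
    return x + context.mean(axis=0)

# 1) Single-stream fusion (e.g., ViLT): concatenate the modalities early,
#    then process the single combined "stream" jointly.
joint = transformer(np.concatenate([text_tokens, image_patches], axis=0))

# 2) Dual-stream fusion (e.g., BridgeTower): keep separate stacks, but let them
#    "talk" to each other via cross-modal attention in later layers.
t, v = transformer(text_tokens), transformer(image_patches)
t, v = cross_attention(t, v), cross_attention(v, t)

# 3) Dual encoder (e.g., CLIP): fully separate encoders; only the final pooled
#    representations are projected into a shared embedding space.
text_vec  = transformer(text_tokens).mean(axis=0)   @ rng.normal(size=(d, d))
image_vec = transformer(image_patches).mean(axis=0) @ rng.normal(size=(d, d))

print(joint.shape, t.shape, v.shape, text_vec.shape, image_vec.shape)
```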
A natural question that arises here is whether one architectural approach is better than another. This turns out to be a hard question to answer. “Better” could be defined in terms of an architecture’s impact on downstream task performance, its relative memory or compute requirements, or even its similarity or dissimilarity to theories of multimodal integration in the human brain. I’ll tackle some of these questions in part two (coming soon), but for now, the key takeaway is that there are a few different ways to integrate vision and text information.
Problems of training: how to optimize our model?
We know how to encode images and text, and we have a few different ways to integrate these representations. The question now becomes: how do we train our network to learn to integrate these inputs in helpful and informative ways? With text-only language models, we use a token prediction task; and with vision transformers, we typically use something like an image classification task. What’s most appropriate for a VLM?
As with VLM architectures, there are a number of different approaches here. One helpful way to taxonomize these approaches is by their underlying objective: what exactly is the network “trying” to do?
One common kind of pre-training objective involves discrimination: that is, learning which image/text pairs match and which don’t. Discriminative objectives could involve a binary output (“match” vs. “mismatch”), or they might involve projecting visual and text representations into a shared embedding space. For example, in contrastive learning—the procedure used to train CLIP (Radford et al., 2021)—the VLM learns discriminative representations that pull matching image/text pairs together and push mismatching ones away. The embedding for the caption “two cats lying next to each other” should occupy a similar position in vector-space as an embedding for a picture of two cats—but it should be relatively distant from the embedding for a picture of a tree. A related objective (which some distinguish from discrimination) is alignment, which typically operates at a more fine-grained level: rather than comparing the entire caption to the entire image, alignment objectives might involve matching regions of the image to specific words in the caption.
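Here’s a minimal sketch of a CLIP-style contrastive loss over a batch of matching image/text pairs. The embeddings are random stand-ins, and the temperature is fixed here rather than learned (CLIP learns it), but the structure is the key point: each image is pulled toward its own caption and pushed away from every other caption in the batch, and vice versa.

```python
import numpy as np

def clip_style_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image/text pairs.

    Row i of image_embs and row i of text_embs describe the same example;
    every other pairing in the batch is treated as a mismatch to push apart.
    """
    # Normalize so the dot product is cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = image_embs @ text_embs.T / temperature     # (batch, batch) similarity matrix

    def cross_entropy(logits, targets):
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    targets = np.arange(len(image_embs))                # the diagonal holds the true pairs
    return (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2

rng = np.random.default_rng(0)
batch, d = 8, 512
print(clip_style_contrastive_loss(rng.normal(size=(batch, d)), rng.normal(size=(batch, d))))
```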
Another kind of objective is generative, in which the VLM is trained to produce text, images, or both. Generative objectives are the most similar to traditional language modeling approaches, in which an LLM learns representations that enable it to make accurate predictions of upcoming tokens. In the simplest case, a VLM trained with a generative objective might operate in a similar way as a text-only LLM, with the key difference being that it also accepts visual input—such that its predictions are based not only on textual context but also visual context. An example of a model trained with a generative objective (as well as alignment objectives) is FLAVA, or “A Foundational Language and Vision Alignment Model”.
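And here’s a sketch of how a simple generative objective differs from the text-only case. The decoder is a placeholder for a causal transformer stack and all the values are random stand-ins; the structural point is just that image features are prepended to the text sequence, so the next-token predictions can condition on both.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 100

image_features = rng.normal(size=(12, d_model))    # e.g., projected ViT patch embeddings
text_embeddings = rng.normal(size=(6, d_model))    # embeddings of the caption so far
next_tokens = rng.integers(0, vocab_size, size=6)  # the words that actually come next

def decoder(x):           # placeholder for a causal transformer decoder stack
    return x

W_out = rng.normal(size=(d_model, vocab_size))     # stand-in for the output projection

# The only structural change from a text-only LLM: visual context is prepended,
# so predictions at each text position can draw on the image as well as prior words.
sequence = np.concatenate([image_features, text_embeddings], axis=0)
hidden = decoder(sequence)
logits = hidden[len(image_features):] @ W_out      # predict only at the text positions

# Standard next-token cross-entropy, exactly as in the text-only case.
log_probs = logits - logits.max(axis=1, keepdims=True)
log_probs = log_probs - np.log(np.exp(log_probs).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(len(next_tokens)), next_tokens].mean()
print(loss)
```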
In each case, the goal is to produce a system that has learned, on some level, to integrate information across its modalities. As we saw in the section above, the degree of integration will depend in part on the system’s architecture. But even dual-encoder models like CLIP—which maintain separate representations for text and images—are trained in ways that fundamentally reshape how they represent those inputs. That’s because a process like contrastive learning forces CLIP to update the parameters of both its vision encoder and its language encoder in ways that meet the training objective.
State of the art
As we’ve seen, VLMs build directly on the architectural advances that came before them, such as the advent of self-attention. One of the most important conceptual advances for modern VLMs was the vision transformer, which gives models a way to process images using the same set of mathematical operations that have been so successful for language models. There are also many open questions about these systems, ranging from basic design decisions (e.g., dual-encoder vs. fusion architectures) to the science of the behaviors and mechanisms that emerge in trained models (i.e., an LLM-ology for VLMs). In part two of this explainer (coming soon), I tackle some of these questions—but to conclude this part, I want to briefly survey some of the more recent developments and models that have been released.
As I mentioned at the start of this post, one of the challenges with surveying the space of VLMs is that many of the most performant models are proprietary. This includes GPT-4o and Claude 3.5 Sonnet, both of which perform fairly well on standard visual reasoning benchmarks—but which don’t come with much in the way of documentation about their architecture or training procedure. For example, the system card for GPT-4o describes it as an “autoregressive omni model” that is “trained end-to-end across text, vision, and audio”. That indicates some degree of integration, but it is unclear exactly how that integration works. We can probably assume these components are all transformer models, but again, that’s not explicitly stated.
Fortunately, more details are available for another set of state-of-the-art VLMs, which were released by the Allen Institute for Artificial Intelligence (AI2). The Molmo (“Multimodal Open Language Model”) family is a collection of open-weight multimodal language models, which were trained on a novel open image caption dataset (PixMo, or “Pixels for Molmo”). The best models in the Molmo family perform quite well on visual reasoning benchmarks—comparably, in fact, to models like GPT-4o. They also score quite well using the Elo ranking method, in which outputs from different models are directly compared by human participants and aggregated to rank models according to which consistently produced “winning” responses. If you’re curious, you can play around with Molmo online using a web interface. One of the things that I find most interesting about Molmo—which, as far as I can tell, is not available in ChatGPT—is that it can “point” to areas of an image:
Beyond their impressive performance, the Molmo models are important from the perspective of scientific reproducibility and transparency: as I’ve written before, it’s hard to study a model if we don’t know the details. I’ll be publishing a deep dive on Molmo soon, but the short version is that the researchers hooked up two pre-trained components: a vision encoder (CLIP) and a language model. They then created an entirely new dataset called PixMo, which contained a variety of image annotations. Some of these were straightforward image captions, while others were more creative, such as asking people to point to different objects or instances of an object in an image. The researchers then trained Molmo on these annotations in two stages, producing a system capable of describing images and even—as demonstrated above—highlighting specific objects in an image.
For me, a key takeaway from Molmo is that data quality really seems to matter. By focusing on creating and curating a high-quality dataset of custom annotations, the authors built a system that competes with state-of-the-art proprietary VLMs on a number of benchmarks. Some aspects of this dataset were genuinely novel and theoretically interesting, such as PixMo-Point: I’ve never seen a VLM that “points” at things before, and given that pointing is an important form of gestural communication for humans, it’s clever to incorporate it into VLM training.
It’s hard to say what future developments in the VLM space will look like. But based on what we know about current VLMs, it stands to reason that one type of advance might be architectural: novel ways of encoding or integrating different modalities. Another kind of advance might be about training, e.g., coming up with new kinds of tasks with which to optimize model representations (like pointing). Earlier this year, researchers published a paper in Science describing a project in which a vision-language model was trained using 61 hours of head-cam footage from a child. It’s possible we’ll see a turn to more naturalistic, active forms of training—which brings me to the final point: vision is of course only one of many ways in which humans experience and act in the world. Much of what we know comes from the fact that we are embodied entities that interact with a physical environment—so I’m excited about models capable of processing other modalities (e.g., sound or touch) and ultimately executing actions (e.g., moving around or grasping objects). As Moravec’s paradox suggests, the latter set of problems might be particularly challenging, which means there’s a long way to go.
Thanks to Cameron Jones, Timothy Lee, and Pamela Rivière for reviewing drafts of this article and providing helpful suggestions and feedback. This explainer was also informed by this tutorial on ViTs and this preprint walking through the ViT architecture and training process.
To computers, everything (words, actions, images, sounds) is just binary: turtles all the way down.