Last month, I published an explainer on vision-language models (VLMs). Like large language models (LLMs), VLMs are trained to predict which words are most likely in a given context—but unlike text-only LLMs, VLMs are also trained to associate sequences of words with images or other visual information. The explainer gives an overview of the different architectures used to combine language and vision, as well as the different protocols used to train them.
That post also mentioned a new family of VLMs released by the Allen Institute for Artificial Intelligence (AI2): Molmo. In this post, I want to take a deeper dive into Molmo: how it works, and why I think it’s such an interesting model.
Brief refresher on VLMs
The original explainer contains many more details, but I thought I’d start with some of the high-level points to know about how VLMs work.
- VLMs combine a vision encoder and a language encoder. Typically, both encoders are transformer models.
- There are different ways of combining visual and linguistic information. Some architectures (“single-stream fusion”) combine this information very early on in the network, while others (“dual-encoder”) maintain mostly separate representations.
- VLMs are trained using either generation procedures (i.e., generating either images or text strings, conditioned on either images or text) or discrimination procedures (i.e., learning which images and text strings go together, and which don’t).
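To make that generation vs. discrimination distinction concrete, here is a minimal sketch (my own illustration in PyTorch, not code from any particular model): a CLIP-style contrastive loss for the discriminative case, and a next-token captioning loss for the generative case.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Discriminative objective: matching image/caption pairs should score
    higher than mismatched pairs (CLIP-style)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))          # i-th image goes with i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def captioning_loss(token_logits, caption_tokens):
    """Generative objective: predict each caption token; the conditioning on the
    image is assumed to happen inside whatever model produced the logits."""
    return F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                           caption_tokens.reshape(-1))

# Toy usage with random tensors
img_emb, txt_emb = torch.randn(4, 512), torch.randn(4, 512)
print(contrastive_loss(img_emb, txt_emb))
logits, tokens = torch.randn(4, 16, 1000), torch.randint(0, 1000, (4, 16))
print(captioning_loss(logits, tokens))
```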
Many state-of-the-art LLMs these days, such as GPT-4o, are multimodal. The problem is that most of these models are also closed-source, meaning we don’t know much about their architecture or how they’re trained.
Fortunately, that’s not the case with the Molmo family of models—which is why Molmo is perfect for a deep dive.
Molmo’s architecture
Let’s start with the architecture. The paper contains a schematic of Molmo’s architecture, which is shared across all models in the family; I’ve also included a version of it below.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9442d94b-eae6-4ddf-8dda-3f5c58358cfe_1004x1228.png)
Like other VLMs, Molmo works by combining a language model with a vision encoder. Also like other VLMs, the central challenge is figuring out how to get the two modalities to “talk” to each other. In this case, image inputs are encoded by a pre-trained vision transformer (CLIP), resulting in a matrix of patch embeddings (one embedding per image patch); using another neural network (“Connector” in the diagram), these embeddings are then projected to match the dimensionality of the language model’s embeddings.
Molmo also accepts language input, which is segmented into a sequence of tokens, each of which is assigned a token embedding. Finally, these different embeddings are all passed into a transformer LLM component (“Large Language Model” in the diagram). As with the CLIP vision encoder, the researchers relied on a pre-trained LLM. Different Molmo models are distinguished by which pre-trained LLM was used: some are entirely open-source (like OLMo), while others are only open-weight (like Qwen2).
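To make the data flow concrete, here is a rough PyTorch sketch of the pipeline just described. It is a deliberately simplified stand-in rather than Molmo’s actual implementation (which does quite a bit more, e.g. handling multiple image crops and pooling patch features): the vision encoder’s patch embeddings are projected by a connector into the LLM’s embedding space and concatenated with the text token embeddings.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Simplified stand-in for a Molmo-style pipeline (illustration only)."""
    def __init__(self, vision_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        # "Connector": projects vision-encoder patch embeddings into the
        # LLM's embedding space
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        # Toy stand-in for the pre-trained transformer LLM
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patch_embeddings, token_ids):
        visual = self.connector(patch_embeddings)       # (B, n_patches, llm_dim)
        textual = self.token_embed(token_ids)           # (B, n_tokens, llm_dim)
        sequence = torch.cat([visual, textual], dim=1)  # image tokens first, then text
        return self.llm(sequence)

model = ToyVLM()
patches = torch.randn(1, 144, 256)        # pretend output of a CLIP-style vision encoder
tokens = torch.randint(0, 1000, (1, 16))  # pretend tokenized prompt
print(model(patches, tokens).shape)       # torch.Size([1, 160, 512])
```

The shape printed at the end captures the core trick: once the connector has done its job, the LLM just sees a longer sequence of embeddings, some of which happen to come from an image.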
Data matters!
After hooking up the different components, the researchers trained each Molmo model in two stages. In the first stage, the models were trained to generate captions for web-sourced images using a custom dataset of image annotations (PixMo-Cap). PixMo-Cap was created by asking annotators to describe each image in detail for one minute, addressing questions like “What is the image at first glance?” and “What are the positions of the objects?” Altogether, PixMo-Cap contained ~712K images and ~1.3M captions.
In the second stage, models were fine-tuned on a variety of additional custom datasets, which ranged from annotations based on the “ask me anything” format to code for generating documents, tables, and diagrams. In my view, the most interesting of these custom datasets was PixMo-Points, in which annotators were asked to “point at something in an image, write a description of it, and then point to every instance of it in the image” (pg. 3). This is what’s depicted in the schematic of Molmo’s architecture: the pink dot highlights Mt. Rainier in the picture’s background.
The motivation for creating PixMo-Points was described as follows (pg. 3):
> We collected pointing data that achieves three goals: (1) enables the model to point to anything described by text, (2) enables the model to count by pointing, and (3) enables the model to use pointing as a natural form of visual explanation when answering questions.
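To give a sense of what such an annotation might look like, here is a hypothetical pointing record. The field names and structure are my own invention for illustration, not PixMo-Points’ actual schema; the key idea is simply that each instance of the described object gets its own (x, y) point.

```python
# Hypothetical PixMo-Points-style record (field names invented for illustration;
# not the dataset's actual schema). Each instance of the described object gets
# one (x, y) point, so a count falls out of the annotation for free.
annotation = {
    "image_id": "example_0001",
    "description": "snow-capped mountain peak",
    "points": [
        {"x": 61.5, "y": 18.2},   # coordinates as percentages of image width/height
    ],
}

print(f'"{annotation["description"]}": {len(annotation["points"])} instance(s)')
```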
As I mentioned above, pointing is one of the most interesting and novel aspects of the Molmo models. There’s a video demonstration, but you can also try it for yourself. Pointing is a genuinely useful form of communication, and serves as a complement or alternative to natural language. In the example below, I asked Molmo to “point” to the front right paw of the black cat. It does a pretty good job:
In this case, highlighting the cat’s paw is much more straightforward and easier to understand than describing its position in natural language, which is what GPT-4o resorts to:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ec2de1-63d5-445c-9db8-400e0ec0f89d_1522x992.png)
To be clear, GPT-4o’s answer is completely acceptable. It might even seem unfair to ask GPT-4o to “point” since it wasn’t trained to do that,[^1] and there’s no reason in principle to suspect that the model couldn’t be trained to do so. But that’s exactly my point (pardon the pun) here: the data and training objective are really important for a model’s behavior, and the AI2 researchers came up with a clever task.
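For what it’s worth, Molmo expresses its points in the generated text itself, as small XML-like tags whose coordinates are (as I understand it) percentages of the image’s width and height. Here is a rough sketch of how you might pull pixel coordinates out of that output; treat the exact tag format as my own reading rather than an official spec.

```python
import re

# Matches coordinate attributes like x="37.2" y="81.5" (single point) or
# x1=... y1=... x2=... y2=... (multiple points). The tag format here is my
# reading of Molmo's output, not an official specification.
COORD_RE = re.compile(r'x\d*="([\d.]+)"\s+y\d*="([\d.]+)"')

def extract_points(output_text: str, width: int, height: int):
    """Convert percentage coordinates in Molmo-style point tags to pixels."""
    return [(float(x) / 100 * width, float(y) / 100 * height)
            for x, y in COORD_RE.findall(output_text)]

sample = '<point x="37.2" y="81.5" alt="front right paw">front right paw</point>'
print(extract_points(sample, width=1024, height=768))
# -> [(380.928, 625.92)], give or take float precision
```

Rendering those pixel coordinates as dots over the image is then just a plotting exercise.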
An edge for counting
How does Molmo stack up against state-of-the-art models like GPT-4o?
The original paper reports a series of evaluations on standard benchmarks; Molmo was also compared to other state-of-the-art multimodal models using the Elo ranking method, in which outputs are compared head-to-head by human judges. The results of both evaluation approaches can be found in AI2’s blog post introducing Molmo, but the key takeaway is that the Molmo models perform quite well on both—competitively with models like GPT-4o and Claude 3.5 Sonnet.
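As a quick aside on the Elo method: each model carries a rating, and every head-to-head human judgment nudges the winner’s rating up and the loser’s down, with bigger adjustments for more surprising outcomes. Here is the standard update rule as a tiny sketch (the textbook formula, not necessarily AI2’s exact implementation):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One standard Elo update after a human judge prefers one model's output."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    r_winner += k * (1.0 - expected_win)   # winner gains more for an upset
    r_loser -= k * (1.0 - expected_win)
    return r_winner, r_loser

print(elo_update(1000.0, 1000.0))  # evenly matched: (1016.0, 984.0)
```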
One especially striking result was that Molmo did very well on tasks requiring counting, including CountBenchQA and PixMo-Count. On PixMo-Count, for example, the Molmo models ranged in performance from 79.6% (for the smallest model in the family) to 85.2% (for the biggest model in the family). GPT-4o, in contrast, scored only 59.6%, while Claude 3.5 Sonnet scored 58.3%. No model besides the Molmo family scored higher than 64.3% (which was achieved by Gemini 1.5 Pro).
Counting the number of instances of an object in a scene requires both figuring out which parts of the image “count” as an example of that object in the first place and tallying up all those examples. This is really hard to do at a glance, particularly as the number of instances grows. One intriguing possibility is that being trained to enumerate (via pointing) all the instances of an object scaffolds Molmo’s ability to count: rather than making an “approximate guess”, Molmo might be able to point at each instance one by one, thereby enabling a much more precise estimate. This is something I’m planning to study more formally with some collaborators—but it’s a compelling and theoretically interesting hypothesis.
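In code terms, the hypothesis is that counting reduces to enumerating: if the model emits one point per instance (in the multi-point tag format I assumed above), the count just falls out of the length of that list.

```python
import re

# Hypothetical multi-point output, in the tag format sketched earlier.
multi = ('<points x1="12.0" y1="30.5" x2="44.8" y2="28.9" '
         'x3="71.3" y3="33.0" alt="bird">bird</points>')

count = len(re.findall(r'x\d+="[\d.]+"', multi))  # one x-coordinate per pointed instance
print(count)  # -> 3
```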
[^1]: GPT-4o can interpret parts of an image that have already been highlighted, but as far as I know, it can’t “point” itself. And if it can, it certainly didn’t do so when asked.