How do VLMs combine their modalities?
To computers, everything (words, actions, images, sounds) it’s all just binary turtles all the way down
To computers, everything (words, actions, images, sounds) it’s all just binary turtles all the way down