Visual AI Models: Are They Really ‘Seeing’?

Veer Jain
Jul 14, 2024


The latest generation of language models, such as GPT-4 and Gemini 1.5 Pro, is marketed as “multimodal,” capable of understanding images and audio in addition to text. However, a new study suggests that these models might not “see” in the way we assume. In fact, their visual comprehension might be fundamentally flawed.

To clarify, no one explicitly claims that these AI models “see” like humans. However, the marketing and benchmarks promoting these models use terms like “vision capabilities” and “visual understanding.” They suggest that the model can analyze images and videos effectively, handling tasks that range from solving homework problems to watching a sports game.

The marketing subtly implies that the models have a form of vision. In reality, they process visual information similarly to how they handle math or storytelling — by matching input patterns to training data patterns. This approach results in models failing at seemingly simple tasks, similar to how they struggle with generating random numbers.

Researchers at Auburn University and the University of Alberta conducted a study to assess the visual understanding of current AI models. They tested leading multimodal models on basic visual tasks, such as determining if two shapes overlap, counting pentagons, or identifying a circled letter in a word. These tasks, trivial for a first-grader, posed significant challenges for the AI models.
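To get a feel for what these stimuli look like, here is a minimal sketch (not the study’s actual code) of how the two-circle images could be generated with Pillow. The canvas size, radius, and distances are arbitrary choices for illustration:

```python
# Minimal sketch of generating two-circle test images with Pillow.
# The canvas size, radius, and line width are illustrative choices,
# not the study's actual parameters.
from PIL import Image, ImageDraw

def two_circles(d, r=60, size=(400, 200)):
    """Draw two outline circles of radius r whose centers are d pixels apart."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    cy = size[1] // 2
    for cx in (size[0] // 2 - d // 2, size[0] // 2 + d // 2):
        draw.ellipse([cx - r, cy - r, cx + r, cy + r], outline="black", width=3)
    return img

two_circles(100).save("overlapping.png")  # centers closer than 2r
two_circles(120).save("touching.png")     # centers exactly 2r apart
two_circles(160).save("separate.png")     # centers farther than 2r
```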

“Our seven tasks are extremely simple, where humans would perform at 100% accuracy. We expect AIs to do the same, but they are currently NOT,” co-author Anh Nguyen explained to TechCrunch. “Our message is, ‘Look, these best models are STILL failing.’”

One basic task involved determining whether two circles were overlapping, just touching, or separate. While GPT-4 achieved over 95% accuracy for distant circles, its performance dropped to 18% for overlapping or touching circles. Gemini 1.5 Pro performed better but still only managed 70% accuracy at close distances.
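Part of what makes this failure striking is that the ground truth is a one-line geometric rule: two circles overlap when the distance between their centers is less than the sum of their radii, touch when it is equal, and are separate when it is greater. A quick sketch:

```python
import math

def classify_circles(c1, r1, c2, r2, tol=1e-6):
    """Label two circles as overlapping, touching, or separate."""
    d = math.dist(c1, c2)          # distance between the two centers
    if abs(d - (r1 + r2)) <= tol:  # equal (within tolerance): tangent circles
        return "touching"
    return "overlapping" if d < r1 + r2 else "separate"

print(classify_circles((0, 0), 1.0, (1.5, 0), 1.0))  # overlapping
print(classify_circles((0, 0), 1.0, (2.0, 0), 1.0))  # touching
print(classify_circles((0, 0), 1.0, (3.0, 0), 1.0))  # separate
```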

Another task involved counting interlocking circles. While all models accurately counted five rings, adding one more ring drastically reduced their accuracy. Gemini failed completely, Claude 3.5 Sonnet got it right a third of the time, and GPT-4 succeeded just under half the time. Adding more rings led to even more inconsistent results.
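Reproducing this kind of stimulus only requires varying the ring count. Below is a rough sketch that draws n rings in a single overlapping row; the paper’s layouts, like the real Olympic rings, are more varied, so this is purely illustrative:

```python
from PIL import Image, ImageDraw

def interlocking_rings(n, r=50, spacing=80, width=6):
    """Draw n rings in a row; spacing < 2r makes each ring overlap its neighbor."""
    w = spacing * (n - 1) + 2 * r + 20
    img = Image.new("RGB", (w, 2 * r + 40), "white")
    draw = ImageDraw.Draw(img)
    cy = img.height // 2
    for i in range(n):
        cx = 10 + r + i * spacing
        draw.ellipse([cx - r, cy - r, cx + r, cy + r],
                     outline="black", width=width)
    return img

interlocking_rings(5).save("rings_5.png")  # the count models get right
interlocking_rings(6).save("rings_6.png")  # the count that trips them up
```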

These inconsistencies highlight that these models do not truly “see.” Their varied performance across similar tasks suggests they lack genuine visual understanding. The models can count five circles accurately, likely because the Olympic Rings, a five-circle image, are prominent in their training data. However, they fail with six or more rings, which are less common in their training data.

Nguyen pointed out the complexity of this issue. “There is not yet a word for this type of blindness/insensitivity of AIs to the images we are showing,” he noted. “Currently, there is no technology to visualize exactly what a model is seeing. Their behavior is a complex function of the input text prompt, input image, and many billions of weights.”

Nguyen speculated that models might extract approximate and abstract information from images, such as identifying a circle on one side, but lack the ability to make detailed visual judgments. This results in responses akin to someone describing an image without actually seeing it.

For instance, when asked about overlapping circles of different colors, a model might respond based on logical assumptions rather than visual evidence, similar to someone answering with their eyes closed.

Despite these limitations, visual AI models are far from useless. While they struggle with elementary visual reasoning, they excel at specific tasks such as recognizing human actions and expressions or identifying everyday objects, because those are the kinds of interpretations they were built and trained to make.
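Anyone curious can probe this gap themselves. Here is a hedged sketch using the OpenAI Python SDK to ask a vision-capable model about one of the generated images; the model name and prompt are illustrative choices, and an API key is assumed to be set in the environment:

```python
# Illustrative sketch: send a locally generated test image to a
# multimodal model via the OpenAI Python SDK. The model name and
# prompt are assumptions, not the study's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("rings_6.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many circles are in this image? Answer with a number only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```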

In summary, while AI companies may imply their models have advanced visual capabilities, research reveals significant gaps in their fundamental visual understanding. The models can analyze and interpret images, but their “vision” is not comparable to human perception.


Written by Veer Jain

I am an undergraduate student who is eager to learn more!
