YES!
Posts by Steven Scholte
My point being: I think both frameworks fit the data. But the paper doesn't consider the natural-statistics alternative, and it is constructed entirely within the first framework.
The "position ID heads" look like: a network trained on spatially structured data developing position-sensitive representations. Expected under natural statistics. The mechanistic interpretability vocabulary maps this onto binding IDs — but that may say more about the vocabulary than the mechanism.
The low-entropy stimuli offer no such footing. It's not that features are shared — it's that the combinations are arbitrary, divorced from any natural co-occurrence structure. That's why the learned geometry fails. Not binding breaking down, but distributional shift.
And the low-entropy condition is the worst case for exactly this reason. The natural world contains plenty of shared features, but their co-occurrences are more lawful. A red apple and a red car share a color, but they are statistically predictable combinations that a trained network can exploit.
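The distributional-shift point can be made concrete with a toy sketch. This is my own construction, not anything from the paper: a "model" that simply absorbs lawful color-shape co-occurrences (a stand-in for representational geometry) answers perfectly on combinations that match the training statistics and fails on arbitrary ones, with no binding operation anywhere in sight.

```python
# Toy illustration (hypothetical, not from the paper): conjunctions
# encoded directly from co-occurrence statistics.

# "Natural" training data: color and shape co-occur lawfully.
natural = [("red", "round"), ("yellow", "long")] * 50

# Learning = absorbing the co-occurrence structure into a lookup,
# a crude stand-in for learned representational geometry.
geometry = {}
for color, shape in natural:
    geometry[color] = shape

def report_shape(color):
    # In-distribution, the geometry simply delivers the answer.
    return geometry.get(color)

# High-entropy-like probes: combinations match the training statistics.
in_dist = [("red", "round"), ("yellow", "long")]
in_acc = sum(report_shape(c) == s for c, s in in_dist) / len(in_dist)

# Low-entropy-like probes: arbitrary combinations violate the statistics.
arbitrary = [("red", "long"), ("yellow", "round")]
ood_acc = sum(report_shape(c) == s for c, s in arbitrary) / len(arbitrary)

print(in_acc, ood_acc)  # near-ceiling in-distribution, failure out of it
```

The failure here is pure distributional shift: nothing "breaks", the learned structure just has no footing on arbitrary combinations.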
Through this lens the stimuli are the first thing to notice: synthetic colored shapes on grids maximally violate natural co-occurrence statistics. We are studying the boundary condition, not the general case.
When it doesn't — occlusion, clutter, arbitrary feature combinations — the system doesn't invoke a binding operation. It dynamically reweights existing representations via attention and recurrence to resolve the ambiguity. The geometry is the foundation; selective routing handles the exceptions.
Framework 2: Natural image statistics. The visual world is structured — features co-occur lawfully. A system trained on natural data learns this structure and encodes conjunctions directly in representational geometry. For the vast majority of cases, the geometry simply delivers the answer.
Binding failures under low entropy, near-ceiling performance under high entropy. From a Treisman perspective this is a second triumph — the psychophysics predicts exactly what the mechanistic analyses find.
Through this lens the paper is a triumph. VLMs develop emergent index-like mechanisms — position ID heads, feature retrieval heads — that functionally resemble symbolic binding operations. Binding errors trace to failures of these mechanisms.
Framework 1: Treisman's feature integration theory. Features are processed separately and must be bound together. Binding is the central problem of vision. Attention is the binding mechanism. Failures reveal the architecture.
Very cool paper, impressive mechanistic data. But let's see how this data looks from two frameworks. (I have put some time into this and could be mistaken, so I am happy to be corrected, but I suspect this might be informative.)
What is the consumer of this information?
[Would have been good to discuss this IRL, but I am, unfortunately, not a VSS visitor anymore.]
I actually really like visual illusions. But I am trying to look at them from the perspective of the biases our visual system has, and to move away from the binding vocabulary (hence beyond binding, not no-binding). Anyway, we will see what the different approaches provide :-).
And then fail or not. Both outcomes are fine with me, but I want to get one or the other.
And I really think that projecting cognitive-science terms onto this (where information is arbitrary and you have operations on it) is a dead end. I have, of course, no idea whether this is true. But we are going to find out, and this is just a very different approach, one that should use its own terms.
And they make it clear to me that the problems of information processing beyond low- and low-to-mid-level perception are solved by using the structure of these natural stimuli. Of course we kind of suspected that, but these models make it clear. And now with VR environments, etc. etc.
or perception (they really suck at that), but what I like is that the bulk of the shape of the system develops as a combination of smart biases and learning from natural stimuli. Processing and learning cannot be detached anymore.
And for a time it was sensible to think there is information, and a cognitive system processing this information. But this has, imo, largely failed for the perception of natural scenes beyond low- and low-to-mid-level perception. What I like about DCNNs is not that they are good models of the brain
I have been thinking about this. I think (though I am very happy that I do not dictate research programs at large) that there have been great achievements from, let's call it, classical psychophysics, achievements that stand and that have an impressive range of practical applications.
And this is fine within a specific paradigm. But I strongly prefer another one, and I then do not have to accept the premises of this cognitive science / psychophysics mixture. The question is whether another paradigm leads to a better understanding of vision, not whether it uses the same terms.
And this makes (a level of) sense in an information-processing paradigm with information and operators, although we do not specify for whom this is bound. But, mirroring the Chinese room, I suspect it is the observer who still has to do the experiencing (so not that much is explained).
I disagree. There is a problem you can identify in a specific case. You can even have the intuition that these problems are similar across different paradigms (this is an assumption). Next you can have a solution: binding! (So binding is a mechanism, as with von der Malsburg or Treisman.)
Unless, of course, you assume there is a separate binding operation that finishes completely in a short cycle but is only available after multiple iterations. I do not.
The reason I did the experiment is that people in my environment were taking it as clearly proven that binding can be ultrafast, even for artificial stimuli. We show that the apparent speed of binding in your paper coincides with multiple cycles of integration time. Hardly fast.
That is only necessary, though, if binding is a real cognitive operation. Clearly information is integrated (you can call this binding or misbinding, but I think those are loaded terms), but that does not mean there is a separate binding operation that either finishes completely or fails entirely.
But cognitive scientists generalizing from artificial stimuli to natural vision is where it goes wrong. The world's structure is better captured by learning.
And the computer vision literature is instructive here: models built on cognitive science principles are comprehensively outperformed by DNNs on natural vision tasks. Psychophysicists working in controlled domains — fair enough.
The visual system is shaped by natural stimuli and tuned to process a specific part of the world's structure efficiently. Artificial stimuli are out of distribution. You can read the resulting failures as revealing basic rules — but there is a risk of overextending this approach.