The sharp point here is that the multi-step agent didn't drift from bad steps, it drifted because every handoff serialized the image down to a text description, and a description of a cover is a compression of the thing you're trying to keep.
Stack those and the brand DNA erodes at each boundary. Reads like a reasoning failure, it's actually a serialization failure. So the Image type fix generalizes: pass the artifact, not a description of it, anywhere the model needs to perceive the primitive directly.
The honest tradeoff is that one session swaps handoff loss for context pressure, so raw-image scanning has a cost ceiling on big audits. Right instinct going to the protocol level.
Nice breakdown Michael! Exactly as you said, in this way is literally following the image artifact and keeping that in context for further action, not getting isolated inputs where the real artifact was diluted by external/decoupled processing.
The sharp point here is that the multi-step agent didn't drift from bad steps, it drifted because every handoff serialized the image down to a text description, and a description of a cover is a compression of the thing you're trying to keep.
Stack those and the brand DNA erodes at each boundary. Reads like a reasoning failure, it's actually a serialization failure. So the Image type fix generalizes: pass the artifact, not a description of it, anywhere the model needs to perceive the primitive directly.
The honest tradeoff is that one session swaps handoff loss for context pressure, so raw-image scanning has a cost ceiling on big audits. Right instinct going to the protocol level.
Nice breakdown Michael! Exactly as you said, in this way is literally following the image artifact and keeping that in context for further action, not getting isolated inputs where the real artifact was diluted by external/decoupled processing.