For the record, the reason I think it's a bad analogy (and I mean this in the broadest sense possible: it's just not a fruitful way of thinking about how we see things) is that it leads precisely to that sort of regress.
A point to make against it might simply be that we're asked to assume that at some point the self/mind/whatever is confronted with data gathered by the senses. In other words, the schema is supposed to look like this:
Object --> light beam --> eye --> neural firings --> brain activity in the visual cortex --> brain activity in the frontal lobe (possibly; I'm not up on the relevant neuroscience) --> a seeing of whatever object.
But the reason I think this is a bad way of thinking about it is that it's never clear exactly where those arrows stop being causal mechanistic chains and start being 'seeings' of something. In other words, why did I represent the chain going all the way to the frontal cortex before calling it a seeing?
We could say that the reason is that any break in the chain up to that point leads to not having seen the object, and that's a decently plausible reading of the whole thing. The problem, as you've noticed, is that this doesn't really seem to be a good enough reason. There could be some other link we're missing, and even if not, why should we assume that the 'seeing' happens there?
For the record, I think a more plausible way of going about the whole thing is to simply differentiate between seeing something, and the mechanism involved when you see something. In other words, the chain going from the object to the frontal cortex is -- and I don't think anyone would deny this little bit -- at least a part of the mechanism of seeing something. However, that doesn't bear significantly on the epistemology involved.
In other words, when talking about 'seeing', what you should say isn't that at some point the causal chain stops and the 'seeing' happens. What you should say is that you see the object (literally you -- including your eyes), and that the mechanism of seeing involves your eyes, your visual cortex, activity in the frontal lobe, etc.
To take a different example, you wouldn't ask the following question (hopefully):
Now, suppose I just go for a run: my feet move quickly across the pavement. They are caused to do so by my muscles flexing in rhythm. The muscles are caused to do that by neurons firing in my legs, and the neurons are caused to fire by other neurons in the brain. But, where does the running happen? What is the last "thing" that is "running"? The chain of neuron firings ends in the brain, I know that (analogous to the print), but "who" or what is "running"? See what I mean?
Now, clearly that's ludicrous -- what's running is you, and that includes your feet, legs, neurons, and the whole deal. It's not something else that can eventually be tracked down the way a photograph can be.
So, why is it so intuitive to treat "looking at something" as different from "running"?
(I think it's because of that analogy, which seemed so obvious as a starting point. But it tends to lead to silliness -- we could possibly alter it enough to locate 'looking' in some part of the brain, and so forth (though, trust me, the problems that doing so runs into are somewhat... extreme, and no one has managed a satisfactory way of doing this so far), but wouldn't it simply be better to get rid of the whole thing entirely?)
[edited to change the terminology of a distinction, which looked confusing to me at any rate]