Idea of Visual Perception

Human Perception

Teaching a computer to see the way humans see is not an easy project.

The Dorm Room Example

Take a look at this picture of a dorm room. We can immediately identify most things in it, from the mess on the floor to the lamp in the corner. A computer has no such ability. It cannot even tell that this is a room; to the computer, the picture is just a grid of pixels.
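To make "just pixels" concrete, here is a minimal sketch of what a program actually receives. The 5x5 grid and its values are made up for illustration and do not come from the dorm room picture; the helper name describe is likewise hypothetical.

```python
# A tiny 5x5 grayscale "image": 0 is black, 255 is white.
# The values are invented for illustration only.
image = [
    [  0,   0, 255,   0,   0],
    [  0, 255,   0, 255,   0],
    [255,   0,   0,   0, 255],
    [255, 255, 255, 255, 255],
    [255,   0,   0,   0, 255],
]

def describe(img):
    """Return the only facts the raw data offers: size and brightness statistics."""
    height = len(img)
    width = len(img[0])
    pixels = [value for row in img for value in row]
    return {
        "height": height,
        "width": width,
        "min": min(pixels),
        "max": max(pixels),
        "mean": sum(pixels) / len(pixels),
    }

info = describe(image)
print(info)
```

Nothing in this output says "room", "lamp", or "clock". Any such label has to be inferred from the numbers, which is the hard part.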

To be more specific: consider this one element of that picture. As humans, we can instantly identify it as a clock... but how did we do it? Which features of a clock would you describe to a computer to help it identify one? Think about what it is that tells you this is a clock. Most likely, you would say it has a familiar, clock-like rectangular shape and size, an hour hand and a minute hand, and most importantly, numbers that you know tell time.

But what about this picture? It too is a clock, but compared to the first clock, it has a different, round shape, it is smaller, and it does not even have the same arrangement of numbers. It is a completely different representation of a clock from the previous picture, yet you recognize it as a clock as well.

Think again in terms of a computer, not a human, viewing these two pictures. We 'teach' the computer that the image in the first picture is a clock. Then, when the computer is later shown the second picture and asked to identify the image, it gets confused. This issue is a simple example of a very complex problem in the field of computer vision. A single kind of object has so many different representations and variations that it is close to impossible to get a computer to learn them all. Consider programming a computer to recognize any given type of automobile. The task is enormous.
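The confusion can be sketched in a few lines. The example below assumes, purely for illustration, that each picture has already been reduced to a small set of features; the feature names and values (including the numeral counts) are hypothetical, and extracting them from pixels is itself an unsolved step here.

```python
# Hypothetical feature descriptions of the two pictures.
# The numeral counts are assumptions, not taken from the actual images.
first_clock = {"shape": "rectangle", "has_hands": True, "numerals": 12}
second_clock = {"shape": "round", "has_hands": True, "numerals": 4}

def is_clock_rigid(obj):
    """Rule 'taught' from the first picture only: match its exact features."""
    return (
        obj["shape"] == "rectangle"
        and obj["has_hands"]
        and obj["numerals"] == 12
    )

print("first picture is a clock: ", is_clock_rigid(first_clock))
print("second picture is a clock:", is_clock_rigid(second_clock))
```

The rule accepts the first clock and rejects the second, even though a person sees a clock in both. Patching the rule for round clocks only postpones the problem until the next variation appears.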

The Sensory Equation

The central problem in computer vision, and in fact the very purpose of the field, is to recover information about the world from sensory input. This can be written as a formula: $S = f(W)$, where $S$ is the sensory information and $W$ is the world; that is, our sensory information is a function of the world around us. What humans take for granted, and what the field of computer vision struggles to make machines do, is the reverse: $W = f^{-1}(S)$, understanding the world from sensory information.
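One way to see why the reverse direction is hard is to pick a concrete f. The sketch below assumes a simple pinhole-style projection with focal length 1, which the text does not specify; the function name project and the sample points are hypothetical.

```python
def project(point):
    """Forward function f: map a 3D world point (x, y, z) to a 2D image point.

    Assumes an idealized pinhole camera with focal length 1 and z > 0.
    """
    x, y, z = point
    return (x / z, y / z)

# Two different points in the world W...
near_point = (1.0, 2.0, 4.0)
far_point = (2.0, 4.0, 8.0)

# ...produce exactly the same sensory input S.
print(project(near_point))
print(project(far_point))
print(project(near_point) == project(far_point))
```

Because distinct worlds can yield the same image, the inverse of f is not unique under this assumption. Recovering the world from an image requires extra knowledge beyond the pixels, which humans supply without noticing.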