What Machines Don’t “See” … Yet.
As of 2020, we have come a long way in applying novel learning algorithms, through deep learning, to vision tasks, an ability that is otherwise a characteristic feature of most living beings that move.
In “narrow” tasks like object detection and classification, different labs have reported “better than human” performance from machines.
However, multiple bottlenecks have arisen in applying these algorithms to the “real world”, where humans still hold a massive advantage in generalization.
A rather simple research work on corrupting MNIST data, dubbed MNIST-C, highlights this difference quite well. The gist is that distorting or corrupting MNIST images (handwritten digits) while keeping the semantic structure of the digits intact confuses some of the best models that report high accuracy on the clean test data, but does not confuse humans.
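As a rough illustration (not the official MNIST-C corruption suite, just a minimal sketch of the same idea), here is the kind of semantic-preserving corruption involved: the digit stays perfectly legible to a human even after noise, a small shift, or a polarity flip.

```python
# A minimal sketch (not the official MNIST-C corruptions): apply simple
# semantic-preserving corruptions to a 28x28 digit image. A human still
# reads the digit; many classifiers trained on clean data degrade sharply.
import numpy as np

def corrupt(image: np.ndarray, noise_std: float = 0.2, shift: int = 2,
            invert: bool = False, seed: int = 0) -> np.ndarray:
    """Return a corrupted copy of a [0, 1] grayscale digit image."""
    rng = np.random.default_rng(seed)
    out = image.astype(np.float32).copy()

    # Additive Gaussian noise: speckles the image, digit stays legible.
    out += rng.normal(0.0, noise_std, size=out.shape)

    # Small translation: the digit's shape (its semantics) is unchanged.
    out = np.roll(out, shift=(shift, shift), axis=(0, 1))

    # Optional polarity flip: black-on-white instead of white-on-black.
    if invert:
        out = 1.0 - out

    return np.clip(out, 0.0, 1.0)

if __name__ == "__main__":
    # Stand-in for a real MNIST digit; in practice load one from the dataset.
    fake_digit = np.zeros((28, 28), dtype=np.float32)
    fake_digit[6:22, 13:15] = 1.0  # a crude vertical stroke, roughly a "1"
    corrupted = corrupt(fake_digit, noise_std=0.3, shift=3, invert=True)
    print(corrupted.shape, corrupted.min(), corrupted.max())
```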
One way forward, then, would be to create massive amounts of training data so that machines can generalize to the real world like us. These datasets would have to account for as many distortions and out-of-distribution events as possible. By today's standards this would be an enormous task with a seemingly infinite “long tail problem”, where we keep missing possible deviations from the training data.
Another alternative, not talked about as often, is to look at the biological underpinnings of vision and its evolution in living beings. One major drawback of today's computer vision datasets is that the vast majority consist of labelled 2D images. These images do not represent the real world as it has been appearing to living/moving/seeing beings. Even if we do collect massive image datasets that capture an extremely broad range of objects and their categories, learning models trained on them would still lack two key concepts that could be needed for effective generalization.
First, before we compare human vision to computer vision, I believe we need to represent motion to machines.
Most things we identify as single whole objects, when moved, move together as a single entity. For things that are fluid, like water and sand, we club the individual particles/molecules together and generalize the collection to, well, water and sand. When we look at a dog, we may have inbuilt mechanisms that identify separate aspects like the eyes, ears, nose, fur, etc., but the human mind has a bias to detect the dog as a whole and would “know” to expect a tail even when it is hidden from the eye. This is categorically different from the current computer vision abilities of deep neural networks, where the convolution layers try to identify each minute feature in isolation before collating them together as a whole, all while being prone to occlusion.
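To make the occlusion point concrete, here is a minimal sketch, assuming PyTorch and a toy stand-in classifier, of a standard occlusion-sensitivity test: slide a gray patch over the image and watch the model's confidence for the original class fall when a key part is covered.

```python
# A minimal sketch (assuming PyTorch; the classifier is a toy stand-in) of an
# occlusion-sensitivity test: cover parts of the image with a gray patch and
# record the predicted probability of the target class at each position.
import torch
import torch.nn as nn

def occlusion_sensitivity(model: nn.Module, image: torch.Tensor,
                          target: int, patch: int = 56, stride: int = 56):
    """Return a grid of class-`target` probabilities under a sliding occluder."""
    model.eval()
    _, h, w = image.shape
    scores = []
    with torch.no_grad():
        for top in range(0, h - patch + 1, stride):
            row = []
            for left in range(0, w - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = 0.5  # gray box
                prob = model(occluded.unsqueeze(0)).softmax(dim=1)[0, target]
                row.append(prob.item())
            scores.append(row)
    return scores

if __name__ == "__main__":
    # Tiny stand-in classifier; in practice use a pretrained network.
    toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
    img = torch.rand(3, 224, 224)
    print(occlusion_sensitivity(toy, img, target=3))
```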
There is also the aspect that real-world cause and effect is fluid, and biological neurons, by design, have “spiking” signals to model that fluidity. With computational limits, though, we find ourselves discretising signals and operating on instances called frames/images. Recurring through time, or paying attention to events as they unfold in time, even just to detect objects, would perhaps yield a better machine representation of objects. Mechanisms for time series have already been modeled in machines for non-vision tasks through Recurrent Neural Networks, LSTMs, and Transformers. At the expense of immense compute, vision algorithms could benefit from attention-like learning from videos. Short videos of dogs would consistently expose the beast in its totality to the machine, and from “You Only Look Once” we would shift to a “keep looking till you see the whole damn thing” paradigm.
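As a hedged sketch of what “keep looking” could mean in practice (assuming PyTorch; the layer sizes and 28×28 frames are illustrative), one simple recipe is to encode each frame with a small CNN and let an LSTM aggregate the encodings over time before classifying.

```python
# A minimal sketch (assuming PyTorch; sizes are illustrative) of a
# "keep looking" classifier: a small CNN encodes each frame, an LSTM
# aggregates the frame encodings over time, and the final hidden state
# is classified. The object is seen as it persists across frames.
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    def __init__(self, num_classes: int = 10, hidden: int = 128):
        super().__init__()
        # Per-frame encoder: 1x28x28 -> feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, hidden), nn.ReLU(),
        )
        # Temporal aggregation over the frame sequence.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 1, 28, 28)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))      # (b*t, hidden)
        feats = feats.view(b, t, -1)                   # (b, t, hidden)
        _, (h_n, _) = self.lstm(feats)                 # h_n: (1, b, hidden)
        return self.head(h_n[-1])                      # (b, num_classes)

if __name__ == "__main__":
    model = FrameSequenceClassifier()
    clip = torch.randn(4, 12, 1, 28, 28)  # 4 clips, 12 frames each
    print(model(clip).shape)              # torch.Size([4, 10])
```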
If, instead of MNIST, we had a dataset of videos that capture the digits while they are being written (like children see in kindergarten), learning models could recognize not only how numbers are, but also how numbers come to be. If a model knows what sequence of “something” makes up a number, then what that number is made of could be deemed irrelevant, be it white pixels on a black background, black pixels on white, or any other distortion that maintains semantics. Only training on such datasets and then testing can reveal how effective these methods turn out to be. Granted, by current standards even such video datasets would be massive, but efforts in that direction may just give machines a better “understanding” of the world we live in.
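As a minimal sketch of how such clips could be approximated (a real dataset would need pen-trajectory or stroke-order recordings; a progressive row-by-row reveal is only a crude stand-in), a static digit image can be turned into a short “being written” clip:

```python
# A minimal sketch of the idea: turn a static digit image into a short
# "being written" clip by revealing it progressively. Real stroke order
# would need pen-trajectory data; a top-to-bottom reveal is a crude proxy.
import numpy as np

def digit_to_clip(image: np.ndarray, num_frames: int = 12) -> np.ndarray:
    """Return a (num_frames, H, W) clip revealing `image` row by row."""
    h, w = image.shape
    frames = np.zeros((num_frames, h, w), dtype=image.dtype)
    for t in range(num_frames):
        rows_visible = int(round((t + 1) / num_frames * h))
        frames[t, :rows_visible, :] = image[:rows_visible, :]
    return frames

if __name__ == "__main__":
    digit = np.zeros((28, 28), dtype=np.float32)
    digit[6:22, 13:15] = 1.0                 # stand-in digit
    clip = digit_to_clip(digit, num_frames=12)
    print(clip.shape)                        # (12, 28, 28)
    # Such clips could feed a frame-sequence model like the one sketched above.
```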
Another aspect is that of depth in a 3-Dimensional spatial world. How far away a prey or predator is has always been vital information for the survival of living beings. By augmenting images to zoom in and out of an object for deep learning models, what we effectively tell them is that “the object can come in many sizes” when we actually want to say “the object is close or far away”. This misconstrues the concept of objects and their sizes (both absolute and relative) for machines. Robustness to occlusion could also improve if, instead of telling machines that dogs “may not always have a tail”, we imbue the bias that “a tail is probably hidden behind the dog”. Again, modeling 3-Dimensional images is already underway, but implementing it “just” for detecting hundreds of thousands of objects will test the best of hardware.
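A minimal sketch of the contrast, assuming PyTorch/torchvision (the depth values here are a stand-in for real range data, and the RGB-D channel is just one possible way to carry distance explicitly):

```python
# A minimal sketch contrasting the two framings. Standard scale augmentation
# (here torchvision's RandomResizedCrop) effectively says "this object comes
# in many sizes"; appending a depth channel (RGB-D input, with a stand-in
# depth value) would instead let a model learn "the object is near or far".
import torch
from torchvision import transforms

# Framing 1: scale augmentation on a plain RGB image.
scale_aug = transforms.RandomResizedCrop(size=224, scale=(0.5, 1.0))
rgb = torch.rand(3, 256, 256)
zoomed = scale_aug(rgb)                 # (3, 224, 224), apparent size changed

# Framing 2: keep apparent size but carry distance explicitly as a 4th channel.
depth = torch.full((1, 256, 256), 2.5)  # stand-in: 2.5 m to the object
rgbd = torch.cat([rgb, depth], dim=0)   # (4, 256, 256), size + distance known

print(zoomed.shape, rgbd.shape)
```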
Now imagine the technical challenges of representing a moving 3-Dimensional world to deep learning systems with 3D convolutions and transformers (or something). Thanks to billions of years of evolution, we humans are extremely good at generalizing concepts from a 3D (space) moving (time) world; so good, in fact, that for us it is enough to look at 2D images and infer not just the third missing spatial dimension but also a bit of the history, possible future, and abstract emotions! It would thus be unfair to show machines 2D, static representations and then expect them to generalize as well as we do to real-world scenarios. I understand this proposal seems far-fetched, yet I think learning is not about the limits of hardware, compute, or data. Effective modeling of the real world is surely going to be challenging and resource-hungry, but for now these approaches seem to be the better alternative from a purely hypothetical computer vision and machine learning standpoint.
