While the term “Big Data” has only recently come into vogue, it can be argued that many of the practical advances in fields like computer vision have been driven by increasing the amount of data used, rather than by any specific algorithmic innovation. The reason for this seems to be the sheer complexity of our visual world; it may simply be too rich to be well represented by compact parametric models. At the same time, with the abundance of data, one can move away from representations based on rigid categories and toward those based on unsupervised (or very-weakly-supervised) data association. That is, we would like the data to “speak its own mind”, instead of being beaten into submission by often-arbitrary “semantic” labels.
The main challenge in the case of high-dimensional visual data is establishing distance metrics that can capture our perception of visual similarity. Getting a handle on that will hopefully allow us to start making transitive correspondences between instances within our unordered heap of visual data, toward creating a coherent, correspondence-centric structure that we term the “Visual Memex”. Our motto is: “Ask not ‘What is this?’, ask ‘What is this like?’”, an interpretation of the world that naturally supports prediction — the holy grail of visual understanding. But even if we don’t manage to get all the way there, there is plenty of potential low-hanging fruit along the way, including visual data mining (in art, history, architecture, visual perception, etc.) and establishing inter-modal correspondences, such as visual-audio or visual-tactile.
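To make the association idea concrete, here is a minimal sketch (Python with NumPy; every name is hypothetical, and cosine similarity is a crude stand-in for the learned perceptual metric the paragraph calls for): embed each instance as a feature vector, link it to its nearest neighbors, then follow chains of such links transitively outward from a query.

```python
# A toy sketch of correspondence-by-association, not the Visual Memex itself.
# Assumption: images are already embedded as feature vectors somehow;
# cosine similarity substitutes for a true perceptual similarity metric.
import numpy as np
from collections import deque

def knn_graph(features: np.ndarray, k: int = 5) -> list[set[int]]:
    """Link each instance to its k nearest neighbors under cosine similarity."""
    # Normalize rows so dot products equal cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    edges = [set() for _ in range(len(features))]
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-k:]:      # indices of the k most similar
            edges[i].add(int(j))
            edges[int(j)].add(i)            # keep the graph symmetric
    return edges

def transitive_matches(edges: list[set[int]], seed: int) -> set[int]:
    """Follow chains of pairwise similarity outward from one instance."""
    seen, queue = {seed}, deque([seed])
    while queue:
        for nbr in edges[queue.popleft()]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen - {seed}

# Usage: 1000 fake 128-d embeddings; ask "what is instance 0 like?"
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
graph = knn_graph(feats, k=5)
print(sorted(transitive_matches(graph, seed=0))[:10])
```

The point of the sketch is the shift in the question being asked: nothing here assigns a category label; structure emerges purely from which instances are linked to which, which is exactly what a better perceptual metric would improve.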