What is the curse of dimensionality?

Answer by Yoshua Bengio:

The use of the term "curse of dimensionality" in machine learning is related to the fact that one can easily imagine a target function (to be learned) that is very non-smooth, for example having an exponential number of modes (ups and downs), with respect to dimensionality (the number of scalar input variables). Imagine that in order to produce a good prediction, our learner needs to distinguish (produce a substantially different answer) between 10 different values of each of n variables. Then it may need to distinguish between 10^n different configurations of the input n-dimensional vector. With n easily in the hundreds, thousands or more, this is much much more than the number of examples one can hope to gather (or even the number of atoms in the universe). With most learning algorithms, and in particular with classical non-parametric learning algorithms (e.g. nearest-neighbor, Parzen, Gaussian kernel SVM, Gaussian kernel Gaussian Process, etc.) the learner will need to see at least one example for each of these many configurations (at least as many as necessary to cover all the variations of configurations of interest), in order to produce a correct answer around each of these configurations, one that is different from the target value required for other nearby configurations.

However, the real difficulty is not with the dimensionality, but with the complexity (number of modes, number of ups and downs, number of different regions in input space that must be distinguished) of the target function to be learned. A digit image may have 32×32=1024 input grey-level pixels but it may be that the function one needs to learn (say a classifier to distinguish between 3's and 4's) can be very smooth (in fact a linear classifier will already do a reasonable job, in this case). If there is a so-called manifold (lower-dimensional region) near which the examples congregate, then the actual dimensionality of interest may be much smaller than the number of input variables. But even if the target function can be captured on a low-dimensional manifold, what really matters is how much it varies on that manifold, i.e., the number of different regions that need to be distinguished, because this will correspond to the number of training examples needed to achieve a good performance, when using a standard non-parametric local learner (such as the above).

On the other hand, even if the target function is apparently very complex (e.g., a huge number of neighboring configurations need to be distinguished), whether in high or low dimension, it may still be possible to generalize well from a reasonably small number of training examples, so long as this apparent complexity is actually organized. If these ups and downs have some structure (the simplest you can think of is just repetitions of the same pattern), then some learning algorithms can potentially catch that. For this to happen, the learning algorithm must be able to generalize non-locally, even to regions not "covered" by training examples, far from training examples. One interesting research question is therefore to ask to what extent some algorithms have or not the potential to generalize in this non-local way. Deep learning algorithms are believed to have that potential. See my paper with Yann LeCun entitled "Scaling Learning Algorithms towards AI", http://www.iro.umontreal.ca/~lisa/pointeurs/bengio+lecun_chapter2007.pdf.

View Answer on Quora

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s