If we scale the entire scene by some factor \(k\) and, at the same time, scale the camera matrices by the factor of \(1/k\), the projections of the scene points in the image remain exactly the same:
Core of ML: Making predictions or decisions from data
This overview will not go into depth about the statistical underpinnings of learning methods. We're looking at ML as a tool.
Take a machine learning course if you want to know more!
(COS 280, SYS 411)
impact of machine learning
Machine Learning is arguably the greatest export from computing to other scientific fields
machine learning applications
[ slide: Isabelle Guyon ]
machine learning problems
supervised learning
unsupervised learning
discrete
classification or categorization
clustering
continuous
regression
dimensionality reduction
dimensionality reduction
PCA, ICA, LLE, Isomap
PCA is the most important technique to know. It takes advantage of correlations in data dimensions to produce the best possible lower dimensional representation, according to reconstruction error.
PCA should be used for dimensionality reduction, not for discovering patterns or making predictions. Don't try to assign semantic meaning to the bases.
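To make the reconstruction-error view concrete, here is a minimal numpy sketch (not from the slides): PCA via the SVD of the centered data, where the top-\(k\) right singular vectors form the basis that minimizes squared reconstruction error.

```python
import numpy as np

def pca(X, k):
    """Project N x d data onto its top-k principal components.

    Returns the projection Z (N x k), the basis W (d x k), and the mean,
    chosen to minimize squared reconstruction error."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centered data; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T          # d x k basis
    Z = Xc @ W            # N x k low-dimensional representation
    return Z, W, mu

# Toy data: strongly correlated 2-D points compress well to 1-D
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
Z, W, mu = pca(X, k=1)
X_hat = Z @ W.T + mu      # reconstruction from one component
err = np.mean((X - X_hat) ** 2)
```

Because the two dimensions are almost perfectly correlated, a single component reconstructs the data nearly exactly, which is precisely the correlation-exploiting behavior described above.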
Assign each point to the closest center
\[\mathbf{\delta}^t = \argmin_{\mathbf{\delta}} \frac{1}{N} \sum_j^N \sum_i^K \delta_{ij} \left\| \mathbf{c}_i^{t-1} - \mathbf{x}_j \right\|^2\]
Update cluster centers as the mean of the points
\[\mathbf{c}^t = \argmin_{\mathbf{c}} \frac{1}{N} \sum_j^N \sum_i^K \delta_{ij}^{t} \left\| \mathbf{c}_i - \mathbf{x}_j \right\|^2\]
Repeat 2–3 until no points are re-assigned (\(t=t+1\))
k-means converges to a local minimum
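The alternating steps above can be sketched in a few lines of numpy (an illustrative implementation, not the slides' code), stopping when no points are re-assigned:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center updates until
    assignments stop changing (converges to a local minimum)."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly select K points as initial cluster centers
    centers = X[rng.choice(len(X), K, replace=False)].copy()
    assign = None
    for _ in range(iters):
        # Step 2: assign each point to the closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # N x K
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # no points re-assigned: converged
        assign = new_assign
        # Step 3: update each center as the mean of its assigned points
        for i in range(K):
            pts = X[assign == i]
            if len(pts):
                centers[i] = pts.mean(axis=0)
    return centers, assign
```

On two well-separated blobs this recovers the blobs exactly; with a bad initialization it can still get stuck in a local minimum, which is why multiple restarts are recommended below.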
k-means: design choices
Initialization
Randomly select \(K\) points as initial cluster centers
Or greedily choose \(K\) points to minimize residual
Distance measures
Traditionally Euclidean, could be others
Optimization
Will converge to a local minimum
May want to perform multiple restarts
k-means clustering using intensity or color
[ figure: image / intensity / color ]
how to evaluate clusters?
Generative
How well are points reconstructed from the clusters?
Discriminative
How well do the clusters correspond to labels?
Purity
Note: unsupervised clustering does not aim to be discriminative
[ slide: Derek Hoiem ]
how to choose the number of clusters?
Validation set
Try different numbers of clusters and look at performance
When building dictionaries (discussed later), more clusters typically work better
[ slide: Derek Hoiem ]
k-means pros and cons
Pros
Finds cluster centers that minimize conditional variance (good representation of data)
Simple and fast*
Easy to implement
Cons
Need to choose \(K\)
Sensitive to outliers
Prone to local minima
All clusters have the same parameters (e.g., distance measure is non-adaptive)
*Can be slow: each iteration is \(O(KNd)\) for \(N\) \(d\)-dimensional points
Usage
Rarely used for pixel segmentation
building visual dictionaries
Sample patches from a database
e.g., 128 dimensional SIFT vectors
Cluster the patches
Cluster centers are the dictionary
Assign a codeword (number) to each new patch according to the nearest cluster
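The assignment step might look like this in code (an illustrative numpy sketch; the `dictionary` array of \(K\) cluster centers is assumed to come from the clustering step, e.g. k-means over SIFT vectors):

```python
import numpy as np

def codeword(patch, dictionary):
    """Codeword of a patch: index of the nearest cluster center."""
    return int(((dictionary - patch) ** 2).sum(axis=1).argmin())

def bag_of_words(patches, dictionary):
    """Normalized histogram of codewords over all patches in an image."""
    hist = np.zeros(len(dictionary))
    for p in patches:
        hist[codeword(p, dictionary)] += 1
    return hist / hist.sum()
```

The resulting histogram is the bag-of-features representation used later for classification.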
examples of learned codewords
Most likely codewords for 4 learned "topics"
EM with multinomial (problem 3) to get topics
\(\mathbf{x}\): current location (cyan cross)
\(\frac{\sum_i K(\mathbf{x}_i - \mathbf{x})\,\mathbf{x}_i}{\sum_i K(\mathbf{x}_i - \mathbf{x})}\): next location, the kernel-weighted mean (orange cross)
\(\mathbf{m}(\mathbf{x})\): mean shift, the translation between the two (yellow arrow)
[ slide: Y.Ukrainitz and B.Sarel ]
attraction basin
Attraction Basin
the region for which all trajectories lead to the same mode
Cluster
all data points in the attraction basin of a mode
[ slide: Y.Ukrainitz and B.Sarel ]
mean shift clustering
The mean shift algorithm seeks modes of the given set of points
Choose kernel and bandwidth
For each point
Center a window on that point
Compute the mean of the data in the search window
Center the search window at the new mean location
Repeat 2–3 until convergence
Assign points that lead to nearby modes to the same cluster
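The procedure above can be sketched with a Gaussian kernel (an illustrative numpy implementation; kernel choice, bandwidth, and the mode-merging tolerance are the design choices noted in the steps):

```python
import numpy as np

def mean_shift_mode(X, start, bandwidth=1.0, tol=1e-5, iters=500):
    """Follow the mean shift trajectory from `start` to a mode of the
    point set X, using a Gaussian kernel with the given bandwidth."""
    x = start.astype(float)
    for _ in range(iters):
        # Kernel-weighted mean of the data around the current location
        w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
        new_x = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new_x - x) < tol:
            break
        x = new_x
    return x

def mean_shift_cluster(X, bandwidth=1.0, merge_tol=0.5):
    """Run mean shift from every point; points whose trajectories reach
    nearby modes are assigned to the same cluster."""
    modes, labels = [], []
    for p in X:
        m = mean_shift_mode(X, p, bandwidth)
        for i, existing in enumerate(modes):
            if np.linalg.norm(m - existing) < merge_tol:
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)
```

Note that, unlike k-means, the number of clusters is never specified: it emerges from the number of modes found at the chosen bandwidth.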
segmentation by mean shift
Compute features for each pixel (color, gradients, texture, etc.)
Set kernel size for features \(K_f\) and position \(K_s\)
Initialize windows at individual pixel locations
Perform mean shift for each window until convergence
Merge windows that are within width of \(K_f\) and \(K_s\)
mean shift segmentation results
[ Comaniciu and Meer 2002 ]
mean shift pros and cons
Pros
Good general-practice segmentation
Flexible in number and shape of regions
Robust to outliers
Cons
Have to choose kernel size in advance
Not suitable for high-dimensional features
When to use it
Oversegmentation
Multiple segmentations
Tracking, clustering, filtering applications
spectral clustering
Group points based on links in a graph
cuts in a graph
Normalized cut
a cut penalizes large segments
fix by normalizing for size of segments
\[\mathit{Ncut}(A,B) = \frac{\mathit{cut}(A,B)}{\mathit{volume}(A)} + \frac{\mathit{cut}(A,B)}{\mathit{volume}(B)}\]
\(\mathit{volume}(A) = \text{sum of costs of all edges that touch }A\)
[ source: S.Seitz ]
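For a small affinity matrix, the normalized cut value above is easy to compute directly (an illustrative numpy sketch; finding the minimizing cut in practice uses a spectral relaxation, not enumeration):

```python
import numpy as np

def ncut(W, A):
    """Normalized cut value for splitting a weighted graph (symmetric
    affinity matrix W) into node set A and its complement B."""
    n = len(W)
    A = np.asarray(A)
    B = np.setdiff1d(np.arange(n), A)
    cut = W[np.ix_(A, B)].sum()      # total weight of edges crossing the cut
    vol_A = W[A].sum()               # sum of costs of all edges that touch A
    vol_B = W[B].sum()
    return cut / vol_A + cut / vol_B

# Two tight pairs {0,1} and {2,3} joined by one weak edge (0-2)
W = np.array([[0, 1, 0.1, 0],
              [1, 0, 0,   0],
              [0.1, 0, 0, 1],
              [0, 0, 1,   0]], dtype=float)
```

Cutting along the weak edge gives a much smaller Ncut value than splitting a tight pair, which is exactly the behavior the normalization is designed to reward.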
normalized cuts for segmentation
which algorithm to use?
Quantization / Summarization: K-means
aims to preserve variance of original data
can easily assign new point to a cluster
Quantization for computing histograms
Summary of 20k photos of Rome using "greedy k-means"
\(y\): output
\(f\): prediction function
\(\mathbf{x}\): image feature
Training: given a training set of labeled examples \(\{(\mathbf{x}_1,y_1), \ldots, (\mathbf{x}_N, y_N)\}\), estimate the prediction function \(f\) by minimizing the prediction error on the training set
Testing: apply \(f\) to a never before seen test example \(\mathbf{x}\) and output the predicted value \(y = f(\mathbf{x})\)
[ slide: L.Lazebnik ]
learning a classifier
Given some set of features with corresponding labels, learn a function to predict the labels from the features
steps
[ slide: D.Hoiem and L.Lazebnik ]
features
Raw pixels
Histograms
GIST descriptors
...
[ slide: L.Lazebnik ]
one way to think about it
Training labels dictate that two examples are the same or different, in some sense
Features and distance measures define visual similarity
Classifiers try to learn weights or parameters for features and distance measures so that visual similarity predicts label similarity
many classifiers to choose from
SVM
Neural networks
Naive Bayes
Bayesian network
Logistic regression
Randomized Forests
Boosted Decision Trees
K-Nearest Neighbors
RBMs
etc.
Which one is best?
Claim:
The decision to use machine learning is more important than the choice of a particular learning method
Classifiers: Nearest neighbors
\[f(\mathbf{x}) = \text{label of the training example nearest to }\mathbf{x}\]
The distance measure \(D\) used to find the nearest example can be (inverse) L1 distance, Euclidean distance, \(\chi^2\) distance, etc.
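Sketched in numpy with Euclidean distance as \(D\) (an illustrative implementation; the \(k\)-nearest variant with majority voting is included for comparison):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """Label of the training example nearest to x (Euclidean distance)."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    return y_train[d2.argmin()]

def knn_classify(x, X_train, y_train, k=3):
    """Majority vote over the k nearest training examples."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    nearest = y_train[np.argsort(d2)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]
```

There is no training step at all; all the work happens at test time, which is why the choice of features and distance measure carries so much weight here.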
J.Zhang, M.Marszalek, S.Lazebnik, C.Schmid. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. IJCV 2007. link
summary: SVMs for image classification
Pick an image representation (in our case, bag of features)
Pick a kernel function for that representation
Compute the matrix of kernel values between every pair of training examples
Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
At test time: compute kernel values for your test example and each support vector, and combine them with the learned weights to get the value of the decision function
[ slide: L.Lazebnik ]
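Steps 3 and 5 of this recipe can be sketched as follows (illustrative numpy code; the support-vector weights `alpha_y`, i.e. \(\alpha_i y_i\), and the bias `b` are assumed to come from the SVM solver, and the \(\chi^2\) kernel is one common choice for histogram features):

```python
import numpy as np

def chi2_kernel(H1, H2, gamma=1.0):
    """Matrix of chi-squared kernel values between two sets of
    bag-of-features histograms (one histogram per row)."""
    K = np.zeros((len(H1), len(H2)))
    for i, h in enumerate(H1):
        d = 0.5 * (((h - H2) ** 2) / (h + H2 + 1e-12)).sum(axis=1)
        K[i] = np.exp(-gamma * d)
    return K

def decision_function(K_test_sv, alpha_y, b):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b over the support vectors.
    K_test_sv: kernel values between test examples and support vectors."""
    return K_test_sv @ alpha_y + b
```

At training time the kernel matrix between all pairs of training examples is handed to the solver; at test time only the kernel values against the support vectors are needed, as in `decision_function`.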
what about multi-class SVMs?
Unfortunately, there is no "definitive" multi-class SVM formulation
In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs
One vs. Others
Training: learn an SVM for each class vs. the others
Testing: apply each SVM to test example and assign to it the class of the SVM that returns the highest decision value
One vs. One
Training: learn an SVM for each pair of classes
Testing: each learned SVM "votes" for a class to assign to the test example
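Both schemes reduce to simple bookkeeping once the two-class decision values are available (an illustrative numpy sketch; the inputs are assumed to come from already-trained two-class SVMs):

```python
import numpy as np

def one_vs_others(decision_values):
    """decision_values[m, c] = decision value of the class-c-vs-others SVM
    on test example m; assign each example the highest-scoring class."""
    return np.argmax(decision_values, axis=1)

def one_vs_one(pairwise_winners, n_classes):
    """pairwise_winners: one array per pair of classes, holding the class
    that pairwise SVM voted for on each test example; most votes wins."""
    n_test = len(pairwise_winners[0])
    votes = np.zeros((n_test, n_classes), dtype=int)
    for winners in pairwise_winners:
        for m, c in enumerate(winners):
            votes[m, c] += 1
    return votes.argmax(axis=1)
```

One-vs-others trains \(C\) SVMs, while one-vs-one trains \(C(C-1)/2\) of them on smaller problems; the tie-breaking here (argmax picks the lowest index) is one arbitrary convention.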