Machine Learning Crash Course

COS 351 - Computer Vision

[ slides: Isabelle Guyon, Erik Sudderth, Mark Johnson, Derek Hoiem ]

recap: multiple views and motion

\[\nabla I \cdot \left[\begin{array}{cc} u & v \end{array}\right]^T + I_t = 0\]

Structure from motion (or SLAM)

Given a set of corresponding points in two or more images, compute the camera parameters and the 3D point coordinates

[ slide: Noah Snavely ]

structure from motion ambiguity

If we scale the entire scene by some factor \(k\) and, at the same time, scale the camera matrices by the factor of \(1/k\), the projections of the scene points in the image remain exactly the same:

\[\mathbf{x} = \mathbf{P}\mathbf{X} = \left(\frac{1}{k} \mathbf{P}\right)(k\mathbf{X})\]

It is impossible to recover the absolute scale of the scene!
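The ambiguity is easy to check numerically. A minimal NumPy sketch (the camera \([R\,|\,\mathbf{t}]\) and point are toy values): scaling the 3D points and the camera translation by \(k\), i.e. scaling the whole scene, leaves the dehomogenized projection unchanged (this matches the \((\frac{1}{k}\mathbf{P})(k\mathbf{X})\) form up to an irrelevant homogeneous scale factor).

```python
import numpy as np

# Toy camera: identity intrinsics and rotation, translation t.
R = np.eye(3)
t = np.array([0.5, -0.2, 3.0])
P = np.hstack([R, t[:, None]])        # 3x4 projection matrix [R | t]

X = np.array([1.0, 2.0, 5.0])         # a 3D scene point

def project(P, X3d):
    x = P @ np.append(X3d, 1.0)       # homogeneous projection
    return x[:2] / x[2]               # dehomogenize

k = 7.0
P_scaled = np.hstack([R, (k * t)[:, None]])   # camera translation scaled by k
x1 = project(P, X)
x2 = project(P_scaled, k * X)                 # scene points scaled by k
assert np.allclose(x1, x2)                    # identical image coordinates
```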

what is the scale of image content?

bundle adjustment

\[E(\mathbf{P}, \mathbf{X}) = \sum_{i=1}^m \sum_{j=1}^n D\left(\mathbf{x}_{ij}, \mathbf{P}_i \mathbf{X}_j\right)^2\]
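A direct transcription of this objective, with \(D\) taken to be Euclidean image distance; the camera, point, and observation values below are toy placeholders, not real data.

```python
import numpy as np

def project(P, X):
    """Project homogeneous 3D point X (4,) with 3x4 camera P to a 2D point."""
    x = P @ X
    return x[:2] / x[2]

def reprojection_error(cameras, points, observations):
    """Bundle-adjustment objective: sum over cameras i and points j of the
    squared distance between observed x_ij and predicted P_i X_j."""
    E = 0.0
    for (i, j), x_obs in observations.items():
        E += np.sum((x_obs - project(cameras[i], points[j])) ** 2)
    return E

# Toy instance: one camera, one point, one slightly perturbed observation.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])
X0 = np.array([0.0, 0.0, 2.0, 1.0])
obs = {(0, 0): np.array([0.1, 0.0])}   # observed slightly off from (0, 0)
E = reprojection_error([P0], [X0], obs)   # squared error of the one observation
```

Bundle adjustment minimizes this \(E\) jointly over all cameras and points, typically with a nonlinear least-squares solver.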

photo synth

Noah Snavely, Steven M. Seitz, Richard Szeliski. Photo tourism: Exploring photo collections in 3D. SIGGRAPH 2006

http://photosynth.net

machine learning: overview

impact of machine learning



Machine Learning is arguably the greatest export from computing to other scientific fields

machine learning applications

[ slide: Isabelle Guyon ]

machine learning problems



                 supervised learning                  unsupervised learning
  discrete      classification or categorization     clustering
  continuous    regression                           dimensionality reduction


dimensionality reduction

  • PCA, ICA, LLE, Isomap
  • PCA is the most important technique to know. It takes advantage of correlations in data dimensions to produce the best possible lower dimensional representation, according to reconstruction error.
  • PCA should be used for dimensionality reduction, not for discovering patterns or making predictions. Don't try to assign semantic meaning to the bases.
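A minimal PCA-by-SVD sketch illustrating the reconstruction-error view; the synthetic correlated data is illustrative.

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components.
    Returns the low-dimensional codes and the reconstruction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of centered data: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                  # d x k basis
    Z = Xc @ W                    # n x k low-dimensional representation
    X_rec = Z @ W.T + mu          # best rank-k reconstruction (in L2 sense)
    return Z, X_rec

# Strongly correlated 2D data: one component captures almost all variance.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t + 0.01 * rng.normal(size=(200, 1))])
Z, X_rec = pca(X, 1)
mse = np.mean((X - X_rec) ** 2)   # tiny: the data is nearly 1-dimensional
```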


clustering

even population

clustering example: image segmentation

Goal: break up the image into meaningful or perceptually similar regions

segmentation for feature support or efficiency

segmentation as a result

[ Rother et al. 2004 ]

types of segmentation

[ example image: horse ]

  • oversegmentation
  • undersegmentation
  • multiple segmentations

clustering

Clustering: group together similar points and represent them with a single token



Key challenges

  1. What makes two points/images/patches similar?
  2. How do we compute an overall grouping from pairwise similarities?
[ slide: Derek Hoiem ]

why do we cluster?

[ slide: Derek Hoiem ]

how do we cluster?

clustering for summarization

Goal: cluster to minimize variance in data given clusters

\[\mathbf{c}^*, \boldsymbol{\delta}^* = \mathop{\arg\min}_{\mathbf{c},\boldsymbol{\delta}} \frac{1}{N} \sum_{j=1}^N \sum_{i=1}^K \delta_{ij} \left\| \mathbf{c}_i - \mathbf{x}_j \right\|^2\]

\(\mathbf{c}_i\): cluster center
\(\mathbf{x}_j\): data
\(\delta_{ij}\): whether \(\mathbf{x}_j\) is assigned to \(\mathbf{c}_i\)

[ slide: Derek Hoiem ]

K-means algorithm

[ illustration: wikipedia: K-means clustering ]

K-means algorithm

  1. Initialize cluster centers: \(\mathbf{c}^0\); \(t=0\)
  2. Assign each point to the closest center \[\boldsymbol{\delta}^t = \mathop{\arg\min}_{\boldsymbol{\delta}} \frac{1}{N} \sum_{j=1}^N \sum_{i=1}^K \delta_{ij} \left\| \mathbf{c}_i^{t-1} - \mathbf{x}_j \right\|^2\]
  3. Update each cluster center as the mean of its assigned points \[\mathbf{c}^t = \mathop{\arg\min}_{\mathbf{c}} \frac{1}{N} \sum_{j=1}^N \sum_{i=1}^K \delta_{ij}^{t} \left\| \mathbf{c}_i - \mathbf{x}_j \right\|^2\]
  4. Increment \(t\) and repeat steps 2–3 until no points are re-assigned
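The assign/update loop above (Lloyd's algorithm) can be sketched in a few lines of NumPy; the two-blob data and the deterministic initialization are illustrative choices, not part of the algorithm.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain k-means: alternate the assign and mean-update steps."""
    centers = X[:: len(X) // k][:k].astype(float)       # step 1: initialize
    for _ in range(iters):
        # step 2: assign each point to its closest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):           # step 4: converged
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs; k-means recovers the blob means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])
centers, labels = kmeans(X, 2)
```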

k-means converges to a local minimum

k-means: design choices

k-means clustering using intensity or color



[ panels: image | intensity | color ]

how to evaluate clusters?

[ slide: Derek Hoiem ]

how to choose the number of clusters?

[ slide: Derek Hoiem ]

k-means pros and cons


building visual dictionaries

  1. Sample patches from a database
    • e.g., 128 dimensional SIFT vectors
  2. Cluster the patches
    • Cluster centers are the dictionary
  3. Assign a codeword (number) to each new patch according to the nearest cluster
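The three steps can be sketched as follows; the 8-D descriptors stand in for 128-D SIFT vectors, and the dictionary here is formed from group means for brevity where k-means clustering would be used in practice.

```python
import numpy as np

# Step 1: a toy "database" of 8-D descriptors sampled from three groups.
rng = np.random.default_rng(0)
patches = np.vstack([rng.normal(m, 0.05, (40, 8)) for m in (0.0, 1.0, 2.0)])

# Step 2: cluster the patches; the cluster centers are the dictionary.
# (Per-group means here; in practice these come from k-means.)
dictionary = np.array([patches[i * 40:(i + 1) * 40].mean(axis=0)
                       for i in range(3)])

def codeword(desc, dictionary):
    """Step 3: assign a new descriptor the index of its nearest center."""
    return int(np.argmin(np.linalg.norm(dictionary - desc, axis=1)))

new_patch = rng.normal(1.0, 0.05, 8)   # a new patch near the second center
code = codeword(new_patch, dictionary)  # → 1
```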

examples of learned codewords

Most likely codewords for 4 learned "topics"
EM with multinomial (problem 3) to get topics

agglomerative clustering


How to define cluster similarity?

How many clusters?

conclusions: agglomerative clustering

mean shift segmentation

[ D.Comaniciu and P.Meer. Mean Shift: A Robust Approach toward Feature Space Analysis. PAMI 2002 ]

Mean shift segmentation is a versatile technique for clustering-based segmentation

mean shift algorithm

Try to find modes of this non-parametric density

kernel density estimation

Kernel density estimation

\[\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K \left( \frac{x-x_i}{h} \right)\]

Gaussian kernel

\[K \left( \frac{x-x_i}{h} \right) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-x_i)^2}{2h^2}}\]
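Putting the two formulas together; the sample values and bandwidth below are chosen purely for illustration.

```python
import numpy as np

def kde(x, samples, h):
    """Gaussian kernel density estimate f_h(x) = (1/nh) * sum_i K((x - x_i)/h)."""
    u = (x - samples) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.sum() / (len(samples) * h)

# The estimate is a mixture of Gaussians of width h centered on the samples.
samples = np.array([-1.0, 0.0, 1.0])
density = kde(0.0, samples, h=0.5)
```

Smaller \(h\) gives a spikier estimate with more modes; larger \(h\) smooths the modes together, which is exactly the bandwidth choice mean shift depends on.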

mean shift

[ slide: Y.Ukrainitz and B.Sarel ]


computing the mean shift

Simple mean shift procedure

\[\mathbf{m}(\mathbf{x}) = \left[ \frac{\sum_{i=1}^n \mathbf{x}_i g\left( \frac{||\mathbf{x} - \mathbf{x}_i||^2}{h} \right)}{\sum_{i=1}^n g\left( \frac{||\mathbf{x}-\mathbf{x}_i||^2}{h} \right)} - \mathbf{x} \right]\]

\(\mathbf{x}\): current location (cyan cross)
\(\frac{\sum ...}{\sum ...}\): next location (orange cross)
\(\mathbf{m}(\mathbf{x})\): translation (yellow arrow)

[ slide: Y.Ukrainitz and B.Sarel ]

attraction basin

Attraction Basin
the region for which all trajectories lead to the same mode
Cluster
all data points in the attraction basin of a mode
[ slide: Y.Ukrainitz and B.Sarel ]

mean shift clustering

The mean shift algorithm seeks modes of the given set of points

  1. Choose kernel and bandwidth
  2. For each point
    1. Center a window on that point
    2. Compute the mean of the data in the search window
    3. Center the search window at the new mean location
    4. Repeat 2–3 until convergence
  3. Assign points that lead to nearby modes to the same cluster
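A sketch of step 2 for a single starting point, using Gaussian weights with an assumed bandwidth \(h\) (here \(\exp(-\|\mathbf{x}-\mathbf{x}_i\|^2/2h^2)\)); points whose trajectories reach nearby modes would then be grouped into one cluster (step 3). The two-blob data is illustrative.

```python
import numpy as np

def mean_shift_mode(x, data, h, iters=100, tol=1e-6):
    """Follow the mean-shift vector from x until it converges to a mode."""
    for _ in range(iters):
        # Gaussian weights on all data points, centered at the current x.
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * h ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()  # weighted mean
        if np.linalg.norm(x_new - x) < tol:                # converged to a mode
            break
        x = x_new
    return x

# Two well-separated blobs: starts in either blob reach different modes.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
                  rng.normal(4.0, 0.2, (50, 2))])
m0 = mean_shift_mode(data[0], data, h=0.5)
m1 = mean_shift_mode(data[50], data, h=0.5)
```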

segmentation by mean shift

  1. Compute features for each pixel (color, gradients, texture, etc.)
  2. Set kernel size for features \(K_f\) and position \(K_s\)
  3. Initialize windows at individual pixel locations
  4. Perform mean shift for each window until convergence
  5. Merge windows that are within width of \(K_f\) and \(K_s\)

mean shift segmentation results

[ Comaniciu and Meer 2002 ]


mean shift pros and cons

spectral clustering

Group points based on links in a graph

cuts in a graph

Normalized cut

[ source: S.Seitz ]

normalized cuts for segmentation

which algorithm to use?


Quantization for computing histograms


Summary of 20k photos of Rome using "greedy k-means"

clustering

Key algorithm: K-means

machine learning problems



                 supervised learning                  unsupervised learning
  discrete      classification or categorization     clustering
  continuous    regression                           dimensionality reduction

the machine learning framework

Apply a prediction function (\(f\)) to a feature representation of the image (input) to get the desired output:

[ slide: L.Lazebnik ]

the machine learning framework

\[y = f(\mathbf{x})\]

\(y\): output
\(f\): prediction function
\(\mathbf{x}\): image feature


[ slide: L.Lazebnik ]

learning a classifier

Given some set of features with corresponding labels, learn a function to predict the labels from the features


steps

[ slide: D.Hoiem and L.Lazebnik ]

features

[ slide: L.Lazebnik ]

one way to think about it

many classifiers to choose from

Which one is best?


Claim:



The decision to use machine learning is more important than the choice of a particular learning method

Classifiers: Nearest neighbors

\[f(\mathbf{x}) = \text{label of the training example nearest to }\mathbf{x}\]

[ slide: L.Lazebnik ]

Classifiers: Linear

\[f(\mathbf{x}) = \mathit{sgn}(\mathbf{w} \cdot \mathbf{x} + b)\]

[ slide: L.Lazebnik ]

recognition task and supervision

[ slide: L.Lazebnik ]

spectrum of supervision

Unsupervised → "Weakly" supervised → Fully supervised

Weakly vs Fully: definition depends on task

[ slide: L.Lazebnik ]

generalization

training set (labels known)
test set (labels unknown)

How well does a learned model generalize from the data it was trained on to a new test set?

[ slide: L.Lazebnik ]


bias-variance trade off

underfitting
overfitting
[ slide: D.Hoiem ]

bias-variance trade off

\[E(MSE) = \text{noise}^2 + \text{bias}^2 + \text{variance}\]

\(\text{noise}\): Unavoidable Error
\(\text{bias}\): Error due to incorrect assumptions
\(\text{variance}\): Error due to variance of training samples
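For squared loss with \(y = f(x) + \varepsilon\), \(\mathbb{E}[\varepsilon] = 0\), \(\mathrm{Var}(\varepsilon) = \sigma^2\), and expectation taken over training sets, the decomposition is:

```latex
\begin{align}
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  &= \mathbb{E}\!\left[(f(x) + \varepsilon - \hat{f}(x))^2\right] \\
  &= \underbrace{\sigma^2}_{\text{noise}^2}
   + \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
\end{align}
```

The cross terms vanish because \(\varepsilon\) is independent of \(\hat{f}\) and \(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\) has zero mean.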

See link for explanations of bias-variance (also Bishop's "Neural Networks" book)

[ slide: D.Hoiem ]


remember...

No classifier is inherently better than any other; you need to make assumptions to generalize



Three kinds of error:

[ slide: D.Hoiem ]

how to reduce variance?

Ways to reduce variance

[ slide: D.Hoiem ]

very brief tour of some classifiers

generative vs. discriminative classifiers

Generative Models

  • Represent both the data and the labels
  • Often, makes use of conditional independence and priors
  • Examples
    • Naive Bayes classifier
    • Bayesian network
  • Models of data may apply to future prediction problems

Discriminative Models

  • Learn to directly predict the labels from the data
  • Often, assume a simple boundary (e.g., linear)
  • Examples
    • Logistic regression
    • SVM
    • Boosted decision trees
  • Often easier to predict a label from the data than to model the data
[ slide: D.Hoiem ]

classification

Assign input vector to one of two or more classes

Any decision rule divides input space into decision regions separated by decision boundaries

[ slide: L.Lazebnik ]

classifiers: k-nearest neighbor

Assign label of nearest training data point to each test data point

Voronoi partitioning of feature space for two-category 2D and 3D data
[ source: D.Lowe ]
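A minimal k-NN classifier; the toy 2-D data and the name `knn_predict` are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=1):
    """Label x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)       # distances to all training data
    nearest = np.argsort(d)[:k]                   # indices of k closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Two toy classes in 2D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(np.array([0.05, 0.1]), X_train, y_train, k=3)  # → 0
```

With \(k=1\) this reproduces the Voronoi partitioning above; larger \(k\) smooths the decision boundary.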

classifiers: k-nearest neighbor

\(k=1\)

classifiers: k-nearest neighbor

\(k=3\)

classifiers: k-nearest neighbor

\(k=5\)

classifiers: k-nearest neighbor

Using k-NN

classifiers: Linear SVM

Find a linear function to separate the classes:

\[f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}\cdot\mathbf{x} + b)\]
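Evaluating the decision function is a one-liner; note the weights below are hand-set for illustration, whereas an SVM solver would choose \(\mathbf{w}\) and \(b\) to maximize the margin between the classes.

```python
import numpy as np

def linear_classify(X, w, b):
    """Evaluate f(x) = sgn(w . x + b) for each row of X."""
    return np.sign(X @ w + b)

# Hand-set hyperplane x0 + x1 = 1 separating two toy classes.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [2.0, 2.0], [0.2, 0.3]])
preds = linear_classify(X, w, b)   # → [-1.  1. -1.]
```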


classifiers: nonlinear SVMs

Datasets that are linearly separable work out great

But what if the dataset is just too hard?

We can map it to a higher-dimensional space!

[ slide: Andrew Moore ]

classifiers: nonlinear SVMs

General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

[ slide: Andrew Moore ]

classifiers: nonlinear SVMs

The Kernel Trick: instead of explicitly computing the lifting transformation \(\varphi(\mathbf{x})\), define a kernel function \(K\) such that

\[K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)\cdot\varphi(\mathbf{x}_j)\]

Note: to be valid, the kernel function must satisfy Mercer's condition

This gives a nonlinear decision boundary in the original feature space

\[\sum_i \alpha_i y_i \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}) + b = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\]

C.Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 1998. link

classifiers: nonlinear SVMs

Example: consider the mapping \(\varphi(x) = (x, x^2)\)

\[\varphi(x) \cdot \varphi(y) = (x, x^2) \cdot (y, y^2) = xy + x^2 y^2\] \[K(x,y) = xy + x^2 y^2\]
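The identity is easy to verify numerically for this mapping (sample values chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Explicit lifting phi(x) = (x, x^2)."""
    return np.array([x, x ** 2])

def K(x, y):
    """The same kernel evaluated directly, without forming phi."""
    return x * y + x ** 2 * y ** 2

x, y = 3.0, 0.5
lifted = phi(x) @ phi(y)   # dot product in the lifted space
direct = K(x, y)           # kernel evaluation; both equal 3.75
```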

Kernel for bags of features

Histogram intersection kernel:

\[I(h_1,h_2) = \sum_{i=1}^N \min(h_1(i), h_2(i))\]

Generalized Gaussian kernel:

\[K(h_1, h_2) = \exp\left( -\frac{1}{A} D(h_1, h_2)^2 \right)\]

\(D\) can be (inverse) L1 distance, Euclidean distance, \(\chi^2\) distance, etc.

J.Zhang, M.Marszalek, S.Lazebnik, C.Schmid. Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. IJCV 2007. link
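Both kernels are a few lines each; the L1 choice of \(D\), the value \(A = 1\), and the toy histograms are illustrative.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return float(np.minimum(h1, h2).sum())

def gaussian_kernel(h1, h2, A, D):
    """Generalized Gaussian kernel exp(-D(h1, h2)^2 / A) for a chosen distance D."""
    return float(np.exp(-D(h1, h2) ** 2 / A))

l1 = lambda a, b: float(np.abs(a - b).sum())   # one possible choice of D

h1 = np.array([0.5, 0.3, 0.2])   # normalized bag-of-features histograms
h2 = np.array([0.4, 0.4, 0.2])
inter = hist_intersection(h1, h2)          # 0.4 + 0.3 + 0.2 = 0.9
gauss = gaussian_kernel(h1, h2, A=1.0, D=l1)
```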

summary: SVMs for image classification

  1. Pick an image representation (in our case, bag of features)
  2. Pick a kernel function for that representation
  3. Compute the matrix of kernel values between every pair of training examples
  4. Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
  5. At test time: compute kernel values for your test example and each support vector, and combine them with the learned weights to get the value of the decision function
[ slide: L.Lazebnik ]
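Step 5 can be sketched directly from the decision-function formula; the support vectors, \(\alpha_i\), labels, and \(b\) below are hand-set placeholders standing in for the solver output of step 4.

```python
import numpy as np

def svm_decision(x, support_vecs, alphas, labels, b, K):
    """Decision function sum_i alpha_i y_i K(x_i, x) + b (step 5 above)."""
    return sum(a * y * K(sv, x)
               for a, y, sv in zip(alphas, labels, support_vecs)) + b

# Toy 1-D example with a linear kernel.
K_lin = lambda u, v: float(np.dot(u, v))
support_vecs = [np.array([1.0]), np.array([-1.0])]
alphas, labels, b = [0.5, 0.5], [+1, -1], 0.0

score = svm_decision(np.array([2.0]), support_vecs, alphas, labels, b, K_lin)
pred = np.sign(score)   # positive side of the boundary
```

Swapping `K_lin` for a histogram-intersection or Gaussian kernel changes only the kernel argument; the rest of the test-time computation is identical.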

what about multi-class SVMs?

[ slide: L.Lazebnik ]

SVMs: Pros and Cons

what to remember about classifiers

[ slide: D.Hoiem ]

making decisions about data
