Structure from Motion (SfM)

A Comprehensive Mathematical and Computational Guide

Overview

Structure from Motion (SfM) is a computer vision technique for reconstructing 3D structures from 2D image (sequencial and/or non-sequencial). This project demonstrates the end-to-end pipeline for SfM, including feature detection, camera pose estimation, triangulation, and 3D point cloud generation with code, its mathematical representation and some open source tools to generate this.

Point Cloud generated using SfM (using COLMAP)

Notation Legend

Before diving into the technical details, let’s establish our notation:

  • x: 2D image coordinates (homogeneous)
  • X: 3D world coordinates
  • K: Camera intrinsic matrix
  • R: Rotation matrix
  • t: Translation vector
  • π: Projection function

SfM Pipeline: Mathematical Foundations

1. Camera Model: Geometric Projection

The pinhole camera model mathematically represents how 3D world points are mapped to 2D image planes.

Projection Equation: x=K[R|t]X

Computational Insights:

  • Complexity: O(1) projection operation
  • Requires pre-calibration of intrinsic parameters
  • Assumes perfect pinhole camera geometry

Practical Implementations:

  • OpenCV Camera Calibration
  • COLMAP Camera Modeling
  • Bundler Toolkit

2. Feature Detection: Identifying Distinctive Image Landmarks

Feature detection creates a bridge between 2D images by identifying robust, invariant points.

Feature Representation: Let fi represent a feature point with descriptor D(fi)

Popular Algorithms:

  1. SIFT (Scale-Invariant Feature Transform)
    • Invariant to: Scale, Rotation, Illumination
    • Computational Complexity: O(nlogn)
Example representation of SIFT feature detection
  1. ORB (Oriented FAST and Rotated BRIEF)
    • Real-time performance
    • Computational Complexity: O(n)
Example representation of ORB feature detection

Descriptor Distance Metric: Similarity(D(fi),D(fj))=k=1D|dikdjk|D

3. Feature Matching: Establishing Correspondences

Matching features across images requires robust correspondence algorithms.

Example representation of feature matching

Matching Strategies:

  • K-Nearest Neighbor (KNN)
  • RANSAC for outlier rejection
  • Lowe’s ratio test for robust matching

Mathematical Formulation: Match(fi,fj)=argminmM|D(fi)D(fj)|2

4. Fundamental Matrix Estimation

The Fundamental Matrix captures geometric relationships between image pairs.

Epipolar Constraint: xjTFxi=0

Estimation Methods:

  • Eight-point Algorithm
  • Linear Least Squares
  • Non-linear Refinement

Computational Complexity: O(n³) for SVD-based estimation

5. Essential Matrix Decomposition

Bridges geometric relationships with camera motion.

Derivation: E=KTFK

Motion Decomposition: E=R[t]×

6. Triangulation: 3D Point Reconstruction

Converts 2D correspondences to 3D point clouds.

Example representation of triangulation

Linear Triangulation: AX=0

Where: A=[x1×P1x2×P2]

Triangulation Algorithms:

  • Direct Linear Transformation (DLT)
  • Iterative Refinement
  • Geometric Error Minimization

7. Bundle Adjustment: Optimization

Refines camera parameters and 3D point locations simultaneously.

Objective Function: minK,R,t,Xi,jxijπ(K,Rj,tj,Xi)2

Optimization Techniques:

  • Levenberg-Marquardt Algorithm
  • Gauss-Newton Method
  • Computational Complexity: O(n³)

8. Sparse Point Cloud Generation

Aggregates triangulated points into an initial 3D representation.

9. Dense Reconstruction

Enhances sparse point clouds with per-pixel depth estimation.

Depth Estimation: z=argminziIi(u,v)Ij(π(u,v,z))2

Advanced Techniques:

  • Multi-View Stereo (MVS)
  • Depth Fusion Algorithms
  • Probabilistic Depth Estimation

Practical Considerations

  • Choose feature detectors based on computational resources
  • Validate camera calibration
  • Handle image noise and computational constraints
  • Use robust optimization techniques
  • COLMAP
  • OpenSfM
  • VisualSfM
  • Bundler

Conclusion

Structure from Motion represents a powerful intersection of geometric algebra, optimization, and computer vision, transforming 2D image sequences into rich 3D representations.