Structure from Motion (SfM)
A Comprehensive Mathematical and Computational Guide
Overview
Structure from Motion (SfM) is a computer vision technique for reconstructing 3D structures from 2D image (sequencial and/or non-sequencial). This project demonstrates the end-to-end pipeline for SfM, including feature detection, camera pose estimation, triangulation, and 3D point cloud generation with code, its mathematical representation and some open source tools to generate this.
Notation Legend
Before diving into the technical details, let’s establish our notation:
: 2D image coordinates (homogeneous) : 3D world coordinates : Camera intrinsic matrix : Rotation matrix : Translation vector : Projection function
SfM Pipeline: Mathematical Foundations
1. Camera Model: Geometric Projection
The pinhole camera model mathematically represents how 3D world points are mapped to 2D image planes.
Projection Equation:
Computational Insights:
- Complexity:
projection operation - Requires pre-calibration of intrinsic parameters
- Assumes perfect pinhole camera geometry
Practical Implementations:
- OpenCV Camera Calibration
- COLMAP Camera Modeling
- Bundler Toolkit
2. Feature Detection: Identifying Distinctive Image Landmarks
Feature detection creates a bridge between 2D images by identifying robust, invariant points.
Feature Representation: Let
Popular Algorithms:
- SIFT (Scale-Invariant Feature Transform)
- Invariant to: Scale, Rotation, Illumination
- Computational Complexity:

- ORB (Oriented FAST and Rotated BRIEF)
- Real-time performance
- Computational Complexity:

Descriptor Distance Metric:
3. Feature Matching: Establishing Correspondences
Matching features across images requires robust correspondence algorithms.

Matching Strategies:
- K-Nearest Neighbor (KNN)
- RANSAC for outlier rejection
- Lowe’s ratio test for robust matching
Mathematical Formulation:
4. Fundamental Matrix Estimation
The Fundamental Matrix captures geometric relationships between image pairs.
Epipolar Constraint:
Estimation Methods:
- Eight-point Algorithm
- Linear Least Squares
- Non-linear Refinement
Computational Complexity:
5. Essential Matrix Decomposition
Bridges geometric relationships with camera motion.
Derivation:
Motion Decomposition:
6. Triangulation: 3D Point Reconstruction
Converts 2D correspondences to 3D point clouds.

Linear Triangulation:
Where:
Triangulation Algorithms:
- Direct Linear Transformation (DLT)
- Iterative Refinement
- Geometric Error Minimization
7. Bundle Adjustment: Optimization
Refines camera parameters and 3D point locations simultaneously.
Objective Function:
Optimization Techniques:
- Levenberg-Marquardt Algorithm
- Gauss-Newton Method
- Computational Complexity:
8. Sparse Point Cloud Generation
Aggregates triangulated points into an initial 3D representation.
9. Dense Reconstruction
Enhances sparse point clouds with per-pixel depth estimation.
Depth Estimation:
Advanced Techniques:
- Multi-View Stereo (MVS)
- Depth Fusion Algorithms
- Probabilistic Depth Estimation
Practical Considerations
- Choose feature detectors based on computational resources
- Validate camera calibration
- Handle image noise and computational constraints
- Use robust optimization techniques
Recommended Tools
- COLMAP
- OpenSfM
- VisualSfM
- Bundler
Conclusion
Structure from Motion represents a powerful intersection of geometric algebra, optimization, and computer vision, transforming 2D image sequences into rich 3D representations.