TL;DR: We present ParticleSfM, a structure-from-motion system for videos built on dense point trajectories, which generalizes well to in-the-wild sequences with complex foreground motion.
Estimating the pose of a moving camera from monocular video is a challenging problem, especially in dynamic environments, where existing camera pose estimation methods are susceptible to pixels that are not geometrically consistent. To tackle this challenge, we present a robust dense indirect structure-from-motion method for videos, based on dense correspondence initialized from pairwise optical flow. Our key idea is to optimize long-range video correspondence as dense point trajectories and use them to learn robust motion segmentation. A novel neural network architecture is proposed for processing irregular point trajectory data. Camera poses are then estimated and optimized with global bundle adjustment over the long-range point trajectories classified as static. Experiments on the MPI Sintel dataset show that our system produces significantly more accurate camera trajectories than existing state-of-the-art methods. Moreover, our method retains reasonable camera pose accuracy on fully static scenes and consistently outperforms strong state-of-the-art dense-correspondence methods based on end-to-end deep learning, demonstrating the potential of dense indirect methods built on optical flow and point trajectories. Since the point trajectory representation is general, we further present results and comparisons on in-the-wild monocular videos with complex motion of dynamic objects.
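To make the point trajectory representation concrete, below is a minimal sketch (not the paper's implementation) of how pairwise optical flow can be chained into long-range trajectories, with a forward-backward consistency check terminating tracks at occlusions. The nearest-neighbor flow lookup, the fb_thresh threshold, and all function names are illustrative assumptions.

import numpy as np

def chain_flow_to_trajectories(flows_fwd, flows_bwd, fb_thresh=1.5):
    """Chain per-frame optical flow into long-range point trajectories.

    flows_fwd[t]: (H, W, 2) flow from frame t to t+1.
    flows_bwd[t]: (H, W, 2) flow from frame t+1 back to t.
    Returns a list of trajectories, each a list of (t, x, y) tuples.
    Illustrative sketch only; sampling density and thresholds are assumptions.
    """
    H, W, _ = flows_fwd[0].shape
    # Seed a trajectory at every pixel of frame 0 (a full system would also
    # re-seed new trajectories in later frames; omitted here for brevity).
    ys, xs = np.mgrid[0:H, 0:W]
    seeds = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [[(0, float(x), float(y))] for x, y in seeds]
    alive = list(range(len(trajectories)))

    for t, (fwd, bwd) in enumerate(zip(flows_fwd, flows_bwd)):
        next_alive = []
        for i in alive:
            _, x, y = trajectories[i][-1]
            xi, yi = int(round(x)), int(round(y))
            if not (0 <= xi < W and 0 <= yi < H):
                continue
            # Step forward along the flow (nearest-neighbor lookup;
            # bilinear interpolation would be more accurate).
            x2, y2 = x + fwd[yi, xi, 0], y + fwd[yi, xi, 1]
            xi2, yi2 = int(round(x2)), int(round(y2))
            if not (0 <= xi2 < W and 0 <= yi2 < H):
                continue
            # Forward-backward check: chaining there and back should
            # return (approximately) to the starting point.
            xb, yb = x2 + bwd[yi2, xi2, 0], y2 + bwd[yi2, xi2, 1]
            if (xb - x) ** 2 + (yb - y) ** 2 > fb_thresh ** 2:
                continue  # occlusion or unreliable flow: terminate track
            trajectories[i].append((t + 1, float(x2), float(y2)))
            next_alive.append(i)
        alive = next_alive
    return trajectories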
Given an input video, we first accumulate and optimize over pairwise optical flow to obtain high-quality dense point trajectories. A specially designed network architecture then processes the irregular point trajectory data and predicts a per-trajectory motion label. Finally, the optimized dense point trajectories, together with the motion labels, are fed to global bundle adjustment (BA) to optimize the final camera poses.
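As a self-contained illustration of the last stage, the sketch below gates trajectories by a per-trajectory dynamic probability (as produced by a motion segmentation network) and runs a global bundle adjustment over the surviving static tracks. It uses a generic Rodrigues-rotation projection model and scipy's least_squares rather than the actual solver in our system; the threshold, initialization, and function names are assumptions.

import numpy as np
from scipy.optimize import least_squares

def rotate(points, rvecs):
    """Rotate points by axis-angle (Rodrigues) vectors, row-wise."""
    theta = np.linalg.norm(rvecs, axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        v = np.nan_to_num(rvecs / theta)  # unit axes; zero rotation -> zeros
    dot = np.sum(points * v, axis=1, keepdims=True)
    cos, sin = np.cos(theta), np.sin(theta)
    return cos * points + sin * np.cross(v, points) + dot * (1 - cos) * v

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_2d, K):
    """Residuals between projected 3D points and 2D trajectory observations."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)   # (rvec, tvec) per frame
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    p = rotate(pts[pt_idx], cams[cam_idx, :3]) + cams[cam_idx, 3:]
    p = p @ K.T                                      # K: (3, 3) intrinsics
    proj = p[:, :2] / p[:, 2:3]
    return (proj - obs_2d).ravel()

def bundle_adjust_static(trajectories, motion_prob, n_cams, K,
                         init_cams, init_pts, static_thresh=0.5):
    """Global BA over trajectories classified as static.

    trajectories: list of [(t, x, y), ...]; motion_prob: per-trajectory
    dynamic probability. Threshold and initialization are assumptions.
    """
    kept = [i for i, p in enumerate(motion_prob) if p < static_thresh]
    cam_idx, pt_idx, obs = [], [], []
    for new_id, i in enumerate(kept):
        for t, x, y in trajectories[i]:
            cam_idx.append(t); pt_idx.append(new_id); obs.append((x, y))
    x0 = np.hstack([init_cams.ravel(), init_pts[kept].ravel()])
    res = least_squares(
        reprojection_residuals, x0, loss="huber",  # robust to residual outliers
        args=(n_cams, len(kept), np.array(cam_idx), np.array(pt_idx),
              np.array(obs, dtype=np.float64), K))
    return res.x[:n_cams * 6].reshape(n_cams, 6)

Hard gating with static_thresh is the simplest choice here; softly weighting each trajectory's residuals by its static probability would be a natural alternative.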
Baseline methods compared in our experiments:
DynaSLAM. Bescos et al. DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes. IROS 2018.
TrianFlow. Zhao et al. Towards Better Generalization: Joint Depth-Pose Learning without PoseNet. CVPR 2020.
VOLDOR. Min et al. VOLDOR-SLAM: For the times when feature-based or direct methods are not good enough. ICRA 2021.
DROID-SLAM. Teed et al. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. NeurIPS 2021.
@inproceedings{zhao2022particlesfm,
  author    = {Zhao, Wang and Liu, Shaohui and Guo, Hengkai and Wang, Wenping and Liu, Yong-Jin},
  title     = {ParticleSfM: Exploiting Dense Point Trajectories for Localizing Moving Cameras in the Wild},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}