Obtaining accurate per-pixel ground truth for optical flow is non-trivial, especially when scene content moves independently of the camera. Models delivering state-of-the-art optical flow accuracy are therefore often trained on synthetically generated datasets. Cross-dataset generalization for these models is typically evaluated on the synthetic MPI Sintel dataset after training on datasets such as FlyingChairs and FlyingThings3D. We present qualitative comparisons with the following state-of-the-art models:
Here, we present the results of our ablation studies demonstrating the effectiveness of dynamic cost volumes over static ones. We begin with the RAFT architecture and modify it by applying our matching-probability-based lookups to cost volumes constructed once from the feature maps; this serves as the baseline model with static cost volumes. For the dynamic cost volumes proposed in our approach, we generate a sequence of enriched feature representations by applying transformer blocks with self- and cross-attention layers to the visual feature maps. This sequence is used to construct a series of cost volumes over time. We then interleave cost-volume lookups with iterative flow refinement steps, effectively making the cost volumes dynamic within the recurrent refinement module.
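To make the static/dynamic distinction concrete, below is a minimal PyTorch sketch of the idea. It is not the released DCV-Net code: all function and module names are our own, the attention block is a simplified stand-in, and a soft-argmax matching-probability readout replaces the actual recurrent update module. It only illustrates how a sequence of enriched feature maps yields a sequence of cost volumes, with a flow estimate produced after each one.

```python
import torch
import torch.nn as nn


def all_pairs_cost_volume(feat1, feat2):
    """All-pairs correlation volume (as in RAFT).
    feat1, feat2: (B, C, H, W) -> cost: (B, H, W, H, W)."""
    C = feat1.shape[1]
    return torch.einsum('bchw,bcuv->bhwuv', feat1, feat2) / C ** 0.5


def matching_probability_flow(cost):
    """Soft-argmax readout: a softmax over all target pixels gives a matching
    probability per source pixel; the expected target coordinate minus the
    source coordinate is the flow. Illustrative stand-in for the lookup."""
    B, H, W, _, _ = cost.shape
    prob = cost.reshape(B, H, W, H * W).softmax(dim=-1)
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=cost.dtype, device=cost.device),
        torch.arange(W, dtype=cost.dtype, device=cost.device),
        indexing='ij')
    coords = torch.stack([xs, ys], dim=-1)              # (H, W, 2) pixel coords
    expected = prob @ coords.reshape(H * W, 2)          # (B, H, W, 2)
    return (expected - coords).permute(0, 3, 1, 2)      # flow as (B, 2, H, W)


class EnrichmentBlock(nn.Module):
    """Simplified transformer-style block with self- and cross-attention
    between the two frames' feature maps (a stand-in, not DCV-Net's block)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f1, f2):
        B, C, H, W = f1.shape
        t1 = f1.flatten(2).transpose(1, 2)              # (B, H*W, C) tokens
        t2 = f2.flatten(2).transpose(1, 2)
        t1 = t1 + self.self_attn(t1, t1, t1, need_weights=False)[0]
        t2 = t2 + self.self_attn(t2, t2, t2, need_weights=False)[0]
        a1 = t1 + self.cross_attn(t1, t2, t2, need_weights=False)[0]
        a2 = t2 + self.cross_attn(t2, t1, t1, need_weights=False)[0]

        def to_map(t):
            return t.transpose(1, 2).reshape(B, C, H, W)
        return to_map(a1), to_map(a2)


def dynamic_cost_volume_flow(f1, f2, blocks):
    """A static baseline builds one cost volume from the initial features and
    reuses it for every refinement iteration. Here each block re-enriches the
    features, a fresh ("dynamic") cost volume is built, and the flow estimate
    is re-read from it; DCV-Net instead feeds such lookups into a recurrent
    update module, which this sketch omits."""
    flow = None
    for block in blocks:
        f1, f2 = block(f1, f2)
        cost = all_pairs_cost_volume(f1, f2)
        flow = matching_probability_flow(cost)
    return flow


# Toy usage on random features (feature dim must divide by the head count).
blocks = nn.ModuleList([EnrichmentBlock(dim=64) for _ in range(3)])
f1, f2 = torch.randn(1, 64, 32, 48), torch.randn(1, 64, 32, 48)
flow = dynamic_cost_volume_flow(f1, f2, blocks)   # (1, 2, 32, 48)
```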
We present qualitative comparisons on the Kubric-NK dataset for models trained on FlyingChairs and FlyingThings3D under identical settings. Dynamic cost volumes consistently yield optical flow estimates with lower End-Point Error (EPE) than models using static cost volumes, especially for pixels undergoing large displacements, where the EPE of the static-cost-volume model is significantly higher.
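For reference, End-Point Error is the per-pixel Euclidean distance between predicted and ground-truth flow vectors. A minimal computation might look like the following; the function names and the 40-pixel threshold are our own choices for illustration, not values from the paper.

```python
import torch


def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean per-pixel End-Point Error (EPE).
    flow_pred, flow_gt: (B, 2, H, W); valid: optional (B, H, W) boolean mask."""
    epe = torch.norm(flow_pred - flow_gt, p=2, dim=1)   # per-pixel L2 distance
    if valid is not None:
        epe = epe[valid]
    return epe.mean()


def epe_large_displacement(flow_pred, flow_gt, threshold=40.0):
    """EPE restricted to pixels whose ground-truth motion exceeds `threshold`
    pixels, where the gap between static and dynamic cost volumes is largest."""
    magnitude = torch.norm(flow_gt, p=2, dim=1)
    return end_point_error(flow_pred, flow_gt, valid=magnitude > threshold)
```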
Here we present visualizations of optical flow predictions from DCV-Net, demonstrating strong cross-dataset generalization on the TartanAir dataset. DCV-Net produces highly accurate optical flow estimates across diverse TartanAir scenes, though occasional visual artifacts may still appear. These scenes exhibit significant variation in lighting, weather, environmental layout, and the motion patterns of both the camera and dynamic objects, all of which differ substantially from the scenarios encountered during training on FlyingChairs and FlyingThings3D.
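Flow fields in such visualizations are typically rendered with a direction-to-hue colour coding. A simple HSV-based rendering is sketched below; it is an assumption for illustration and not necessarily the exact colour wheel used in our figures.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb


def flow_to_color(flow, max_magnitude=None):
    """Render a flow field (H, W, 2) as an RGB image:
    flow direction -> hue, flow magnitude -> saturation."""
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(u ** 2 + v ** 2)
    if max_magnitude is None:
        max_magnitude = magnitude.max() + 1e-8
    hsv = np.stack([
        (np.arctan2(v, u) + np.pi) / (2 * np.pi),        # hue from direction
        np.clip(magnitude / max_magnitude, 0.0, 1.0),    # saturation from magnitude
        np.ones_like(magnitude),                         # full value channel
    ], axis=-1)
    return (hsv_to_rgb(hsv) * 255).astype(np.uint8)
```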
@misc{yadav_2023_8253052,
author = {Yadav, Vemburaj and Pagani, Alain and Stricker, Didier},
title = {Dynamic Cost Volumes with Scalable Transformer Architecture for Optical Flow},
month = aug,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.8253052},
url = {https://doi.org/10.5281/zenodo.8253052},
}