DCV-Net: Dynamic Cost Volumes with Scalable Transformer Architecture for Optical Flow

IMVIP 2023

1German Research Center for Artificial Intelligence, 2RPTU Kaiserslautern
Optical flow estimation on highly dynamic scenes from the H203D and Bonn RGB-D Dynamic datasets.

Abstract

We introduce DCV-Net, a scalable transformer-based architecture for optical flow with dynamic cost volumes. Recently, FlowFormer, which applies transformers to the full 4D cost volumes instead of the visual feature maps, has shown significant improvements in flow estimation accuracy. However, it scales poorly to high-resolution input images: treating the N² entries of the 4D cost volume as tokens makes the complexity of its attention mechanism scale as O(N⁴), where N is the number of visual feature tokens. We propose a novel architecture that retains the FlowFormer-style enrichment of matching cost representations, but with lightweight attention applied to the visual feature maps, reducing the complexity to a quadratic O(N²). First, we generate sequential updates to the visual feature representations and the corresponding cost volumes using lightweight attention layers. Then, we interleave this sequence of cost volumes with iterative flow refinement steps, allowing our refinement module to operate explicitly over dynamic cost volumes. Our architecture, which is two orders of magnitude more efficient than FlowFormer in terms of attention complexity, demonstrates strong cross-domain generalization on both the Sintel and KITTI datasets. DCV-Net outperforms FlowFormer on the KITTI benchmark and achieves highly competitive flow estimation accuracy on Sintel.

Overview

Overview of DCV-Net. Given a pair of image frames as input, DCV-Net employs transformer blocks with self- and cross-attention layers to concurrently contextualize visual feature representations and refine optical flow using a GRU-based update operator. The architecture comprises three main components: 1) a sequence of visual feature maps generated by a transformer-based encoder; 2) a sequence of local cost volumes, i.e., dynamic cost volumes computed on the fly; 3) an iterative flow refinement module that recurrently estimates residual flow from the evolving dynamic cost volumes.
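To make the data flow concrete, the loop can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming assumptions (DynamicCostVolumeLoop, build_cost_volume, the soft cost lookup, and all shapes are hypothetical), not the released DCV-Net code; the cross-attention layers and the local windowed lookup of the actual model are simplified away.

import torch
import torch.nn as nn

def build_cost_volume(f1, f2):
    # All-pairs feature correlation: two (B, C, H, W) maps -> (B, N, N), N = H*W.
    B, C, H, W = f1.shape
    return torch.bmm(f1.flatten(2).transpose(1, 2), f2.flatten(2)) / C ** 0.5

class DynamicCostVolumeLoop(nn.Module):
    def __init__(self, dim=128, iters=4):
        super().__init__()
        # Lightweight attention on feature maps: quadratic in N = H*W tokens.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(iters))
        # Stand-ins for the GRU-based update operator and residual-flow head.
        self.gru = nn.GRUCell(input_size=dim, hidden_size=dim)
        self.flow_head = nn.Linear(dim, 2)

    def forward(self, f1, f2):
        # f1, f2: (B, dim, H, W) feature maps of the two input frames.
        B, C, H, W = f1.shape
        N = H * W
        t1 = f1.flatten(2).transpose(1, 2)              # (B, N, C) tokens
        t2 = f2.flatten(2).transpose(1, 2)
        hidden = torch.zeros(B * N, C, device=f1.device)
        flow = torch.zeros(B, N, 2, device=f1.device)
        for blk in self.blocks:
            # 1) Contextualize the visual features (self-attention only here).
            t1, t2 = blk(t1), blk(t2)
            # 2) Rebuild the cost volume from the updated features: this
            #    recomputation is what makes the cost volumes "dynamic".
            cost = build_cost_volume(
                t1.transpose(1, 2).view(B, C, H, W),
                t2.transpose(1, 2).view(B, C, H, W))    # (B, N, N)
            # 3) Soft lookup of matching costs, then a recurrent flow update.
            motion = torch.softmax(cost, dim=-1) @ t2   # (B, N, C)
            hidden = self.gru(motion.reshape(B * N, C), hidden)
            flow = flow + self.flow_head(hidden).view(B, N, 2)
        return flow.view(B, H, W, 2)

Note that attention is only ever applied to the N feature tokens; the cost volume itself is materialized by a cheap matrix product and is never attended over.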

Comparisons with the State-of-the-art

Obtaining accurate per-pixel ground truth for optical flow is non-trivial, especially when scene motion is independent of camera motion. Models delivering state-of-the-art optical flow estimation accuracy are therefore often trained on synthetically generated datasets. Cross-dataset generalization for these models is typically evaluated on the synthetic MPI Sintel dataset after training on datasets such as FlyingChairs and FlyingThings3D. We present qualitative comparisons with the following state-of-the-art models:

  • FlowFormer: FlowFormer is a transformer-based architecture for optical flow in which the cost volumes are dynamic in nature. 4D cost volumes generated from an image pair are tokenized and encoded into a cost memory in a novel latent space using transformer layers. These encoded cost maps are then iteratively decoded via a GRU-based recurrent transformer decoder, with dynamic positional cost queries derived from the current flow estimates. In terms of our formulation, FlowFormer resembles an approach with attention on cost volumes (see the complexity sketch after this list).
  • GMFlow: GMFlow formulates optical flow as a global matching problem and directly regresses the dense correspondence field based on encoded feature similarities in the 4D cost volumes. It leverages transformer layers with self- and cross-attention on the feature maps from a CNN backbone to facilitate matching across images for pixels with large displacements. In our formulation, GMFlow corresponds to an approach with attention on feature maps and static cost volumes.
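Following the complexity figures quoted in the abstract, the gap between the two families can be made concrete with a back-of-the-envelope calculation: attention over T tokens costs O(T²), so attending over the N² entries of a tokenized cost volume costs O(N⁴), while attending over the N feature tokens costs O(N²). The feature-map sizes below are illustrative only:

# Attention over T tokens scales as T^2. Treating the N^2 cost-volume
# entries as tokens therefore scales as N^4; attention on the N feature
# tokens scales as N^2.
for H, W in [(48, 64), (96, 128)]:   # illustrative 1/8-scale feature maps
    N = H * W
    print(f"N = {N:5d}   feature attention ~ N^2 = {N**2:.1e}   "
          f"cost-volume attention ~ N^4 = {N**4:.1e}")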
Cross-dataset generalization on the MPI Sintel dataset for models trained on FlyingChairs and FlyingThings3D

Static vs Dynamic Cost Volumes

Here, we present the results of our ablation studies demonstrating the effectiveness of dynamic cost volumes over static ones. We begin with the RAFT architecture and modify it by applying our matching-probability-based lookups to cost volumes constructed once from the feature maps. This setup serves as the baseline model with static cost volumes. For dynamic cost volumes, as proposed in our approach, we generate a sequence of enriched feature representations by applying transformer blocks with self- and cross-attention layers to the visual feature maps. This sequence is used to construct a series of cost volumes over time. We then interleave cost volume lookups with iterative flow refinement steps, effectively making the cost volumes dynamic within the recurrent refinement module.
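The contrast between the two ablation settings can be sketched as follows. The helpers are hypothetical placeholders (update_step stands in for the lookup-plus-GRU refinement step, feature_blocks for the transformer blocks), and build_cost_volume is the all-pairs correlation from the overview sketch above:

import torch

def refine_static(f1, f2, update_step, iters=8):
    # Static baseline: the cost volume is built once from the initial
    # feature maps and only re-indexed in every refinement iteration.
    B, _, H, W = f1.shape
    cost = build_cost_volume(f1, f2)
    flow = torch.zeros(B, H * W, 2, device=f1.device)
    for _ in range(iters):
        flow = update_step(cost, flow)
    return flow

def refine_dynamic(f1, f2, feature_blocks, update_step):
    # Our setting: features are re-contextualized before each step, so the
    # cost volume seen by the update operator evolves during refinement.
    B, _, H, W = f1.shape
    flow = torch.zeros(B, H * W, 2, device=f1.device)
    for blk in feature_blocks:
        f1, f2 = blk(f1), blk(f2)
        cost = build_cost_volume(f1, f2)   # recomputed from updated features
        flow = update_step(cost, flow)
    return flow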

We present qualitative comparisons on the Kubric-NK dataset for models trained on FlyingChairs and FlyingThings3D under identical settings. Dynamic cost volumes consistently yield optical flow estimates with lower End-Point Error (EPE) than models using static cost volumes, especially for pixels undergoing large displacements, where the EPE of the static-cost-volume model is significantly higher.
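For reference, End-Point Error is the per-pixel Euclidean distance between predicted and ground-truth flow vectors, averaged over (optionally valid) pixels; a minimal implementation:

import torch

def end_point_error(flow_pred, flow_gt, valid=None):
    # flow_pred, flow_gt: (B, 2, H, W); valid: optional (B, H, W) bool mask.
    epe = (flow_pred - flow_gt).pow(2).sum(dim=1).sqrt()   # (B, H, W)
    return epe[valid].mean() if valid is not None else epe.mean()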

Comparing optical flow estimation on the Kubric-NK dataset between models with static and dynamic cost volumes

More Visualizations

Here we present visualizations of optical flow predictions from DCV-Net, demonstrating strong cross-dataset generalization on the TartanAir dataset. DCV-Net produces highly accurate optical flow estimates across diverse scenes from TartanAir, though occasional visual artifacts may still appear. These scenes exhibit significant variation in lighting conditions, weather, environmental layout, and motion patterns of both the camera and dynamic objects, all of which differ substantially from the scenarios encountered during training on the FlyingChairs and FlyingThings3D datasets.

Optical flow estimation on sequences of TartanAir dataset

BibTeX

@misc{yadav_2023_8253052,
  author       = {Yadav, Vemburaj and Pagani, Alain and Stricker, Didier},
  title        = {Dynamic Cost Volumes with Scalable Transformer Architecture for Optical Flow},
  month        = aug,
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.8253052},
  url          = {https://doi.org/10.5281/zenodo.8253052},
}