OptSplat: Recurrent Optimization for Generalizable Reconstruction and Novel View Renderings

Under Review

1German Research Center for Artificial Intelligence, 2RPTU Kaiserslautern
Feed-forward high-resolution novel view synthesis (512×960) from OptSplat on unseen scenes of DL3DV from two input views.

Summary

  • We present OptSplat, an efficient and scalable architecture for zero-shot 3D reconstruction and high-resolution novel view synthesis from just two input views.
  • Our GRU-based recurrent optimization layers leverage lightweight local cost volumes instead of their global counterparts for sequential estimation and refinement of multi-view depth and 3D Gaussians (see the sketch after this list).
  • Our learning-to-optimize framework enables strong cross-dataset generalization while reducing GPU memory consumption by approximately 5×, making it suitable for deployment on resource-constrained hardware without compromising rendering quality.
  • Our model scales with respect to the number of input views, image resolution, and depth candidates, enabling efficient large-scale generalizable reconstruction and high-resolution novel view synthesis.
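To make the recurrent design concrete, here is a minimal PyTorch sketch of what a GRU-based update operator of this kind could look like. All module names, channel sizes, and the Gaussian parameterization are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell operating on 2D feature maps."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))   # update gate
        r = torch.sigmoid(self.convr(hx))   # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

class UpdateOperator(nn.Module):
    """One recurrent refinement step in the joint depth/Gaussian space
    (shapes and heads are assumptions for illustration)."""
    def __init__(self, hidden_dim=128, cost_dim=9, gauss_dim=8):
        super().__init__()
        # cost_dim: correlation values sampled from the local cost volume
        # gauss_dim: per-pixel Gaussian parameters (e.g., opacity, scale, rotation)
        self.encoder = nn.Sequential(
            nn.Conv2d(cost_dim + 1 + gauss_dim, hidden_dim, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gru = ConvGRU(hidden_dim, hidden_dim)
        self.depth_head = nn.Conv2d(hidden_dim, 1, 3, padding=1)          # predicts Δdepth
        self.gauss_head = nn.Conv2d(hidden_dim, gauss_dim, 3, padding=1)  # predicts ΔGaussians

    def forward(self, h, cost, depth, gauss):
        x = self.encoder(torch.cat([cost, depth, gauss], dim=1))
        h = self.gru(h, x)
        return h, depth + self.depth_head(h), gauss + self.gauss_head(h)
```

At inference, the operator is simply applied for a fixed number of iterations, re-indexing the local cost volume with the current depth estimate before each update.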

Abstract

We propose an efficient feed-forward model for novel view synthesis and 3D reconstruction based on Gaussian Splatting, featuring a scalable architecture that reliably predicts multi-view depth maps and 3D Gaussian primitives from as few as two input views. Existing multi-view depth estimation techniques typically depend on processing plane-swept cost volumes, which produce probability distributions over a discrete set of candidate depths. This approach limits scalability, especially when finer depth sampling or higher spatial resolution is required. To address this, we design an optimization-inspired architecture, OptSplat, that employs recurrent iterative updates to refine depth maps and pixel-aligned Gaussian primitives based on previous predictions. Our model leverages a unified update operator that iteratively indexes local cost volumes, progressively improving predictions in the joint space of depth and Gaussian parameters. Comprehensive evaluations on the real-world RealEstate10K, ACID, and DL3DV datasets show that our model demonstrates strong cross-dataset generalization and competitive rendering quality for novel views compared to existing works with plane-swept cost volumes, while offering up to a 5× reduction in GPU memory requirements, especially for reconstruction with high-resolution inputs.
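To see why discrete plane-sweep volumes scale poorly, consider a back-of-the-envelope comparison; the channel counts, lookup radius, pyramid levels, and fp16 storage below are assumptions chosen to illustrate the trend, not measured numbers from our experiments.

```python
# Illustrative memory arithmetic for cost volumes (all sizes are assumptions).
H, W = 512, 960        # input resolution
D, C = 128, 32         # depth candidates and feature channels of a global volume
r, levels = 4, 4       # local lookup radius and pyramid levels
bytes_fp16 = 2

# A global plane-sweep volume keeps C channels per pixel per depth candidate.
global_cv = H * W * D * C * bytes_fp16                 # ~4.03 GB
# A local lookup keeps only (2r+1) correlation values per pixel per level.
local_cv = H * W * (2 * r + 1) * levels * bytes_fp16   # ~35 MB

print(f"global: {global_cv / 1e9:.2f} GB, local: {local_cv / 1e6:.1f} MB")
```

The gap widens further as resolution or the number of depth candidates grows, since the global volume scales with D while the local lookup does not.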

Overview

Overview of OptSplat. Given posed multi-view RGB inputs, OptSplat constructs local cost volumes via plane sweep stereo and iteratively refines 3D-consistent depth maps and 3D Gaussians using a GRU-based update operator. The entire pipeline operates in a fully feed-forward, zero-shot manner, enabling efficient and scalable novel view synthesis with high geometric and visual fidelity.
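The plane-sweep step can be sketched as follows: source-view features are warped onto a few candidate depth planes around the current estimate and correlated with the reference features. The snippet assumes a pinhole camera model, inverse-depth offsets, and a single source view; all names and conventions are simplifications, not the actual code.

```python
import torch
import torch.nn.functional as F

def local_plane_sweep_cost(feat_ref, feat_src, depth, K_ref, K_src, R, t,
                           radius=4, step=0.05):
    """Correlate reference features with source features warped to candidate
    depth planes centered on the current depth estimate.
    feat_*: (B,C,H,W); depth: (B,1,H,W); K_*: (B,3,3); R: (B,3,3); t: (B,3,1)
    """
    B, C, H, W = feat_ref.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()   # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                    # (B,3,H*W)
    rays = torch.linalg.solve(K_ref, pix)                         # back-projected rays

    costs = []
    for k in range(-radius, radius + 1):
        # Candidate plane at an inverse-depth offset from the current estimate.
        d = (1.0 / depth.view(B, 1, -1) + k * step).reciprocal()
        pts = rays * d                               # 3D points in the reference frame
        cam = K_src @ (R @ pts + t)                  # project into the source view
        uv = cam[:, :2] / cam[:, 2:].clamp(min=1e-6)
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], -1).view(B, H, W, 2)
        warped = F.grid_sample(feat_src, grid, align_corners=True)
        costs.append((feat_ref * warped).mean(dim=1, keepdim=True))  # dot-product correlation
    return torch.cat(costs, dim=1)    # (B, 2*radius+1, H, W) local cost slice
```

Because only 2·radius+1 correlation maps are materialized per iteration, memory stays flat regardless of how finely depth is ultimately resolved.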

Comparisons with the State-of-the-art

We present qualitative comparisons with the following state-of-the-art models:

  • MVSplat: A feed-forward 3D Gaussian model that utilizes plane-swept cost volumes to encode global feature-matching information between frame pairs, and then infers multi-view depth and Gaussian parameters from the constructed global cost volumes. Construction and processing of global cost volumes scale poorly with input image resolution and the number of input views.
  • DepthSplat: A recent feed-forward multi-view depth and 3D Gaussian model that leverages both plane-swept cost volumes and foundation features from pre-trained monocular depth estimation models to predict highly accurate, 3D-consistent depth maps and 3D Gaussians.
Comparison on the RealEstate10K dataset

Here, we present qualitative comparisons for multi-view depth estimation and novel views rendered from viewpoints interpolated between the two input views. OptSplat produces 3D-consistent depth maps and photorealistic novel views, performing comparably to state-of-the-art models despite not leveraging monocular features from foundation models. Instead, it operates on localized cost volumes, resulting in a significantly reduced memory footprint: approximately 50% and 25% of the memory used by MVSplat and DepthSplat, respectively. Specifically, our model operates within a memory footprint of under 700 MB for 256 × 256 resolution inputs, compared to over 2600 MB required by DepthSplat, which offers only marginal improvements in rendering fidelity.

Multi-view depth estimation and novel view synthesis on the benchmark scenes of the RealEstate10K dataset

Recurrent Optimization for Iterative Refinement

We highlight the convergence behavior of our model as the number of recurrent refinement iterations increases. The results validate our design objective: the update operator effectively learns to perform optimization-like refinement in a feed-forward manner, progressively improving scene geometry and appearance reconstruction. Furthermore, as shown in the figure below, both the predicted depth maps and the synthesized novel views exhibit consistent refinement across iterations, capturing increasingly fine-grained details and producing sharper reconstructions over time.
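Iterative improvement of this kind is commonly encouraged during training by supervising every intermediate prediction with exponentially increasing weights, as popularized by RAFT; the sketch below assumes such a scheme with a simple L1 photometric term, which may differ from our exact objective.

```python
import torch

def sequence_loss(pred_renders, target, gamma=0.8):
    """RAFT-style sequence loss over recurrent iterations (assumed scheme).
    pred_renders: list of (B,3,H,W) renders, one per refinement iteration.
    Later iterations receive larger weights, pushing the update operator
    to keep improving rather than converge prematurely.
    """
    n = len(pred_renders)
    loss = 0.0
    for i, render in enumerate(pred_renders):
        weight = gamma ** (n - i - 1)   # exponentially increasing weight
        loss = loss + weight * (render - target).abs().mean()
    return loss
```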

Input view depth predictions and novel view renderings from OptSplat over the iterations of recurrent optimization.

Cross-Dataset Generalization on DL3DV and ACID

We conduct cross-dataset generalization studies in which models trained on RealEstate10K are evaluated on the ACID and DL3DV datasets. Despite relying on local cost volumes, OptSplat outperforms MVSplat in generalizing to out-of-distribution novel scenes, owing to the iterative optimization layers that refine 3D Gaussians during inference. On the DL3DV scenes, which feature more complex geometry and larger viewpoint variations, OptSplat demonstrates strong generalization with negligible drops in rendering quality, while using only a fraction of the memory required by DepthSplat. Notably, unlike DepthSplat, OptSplat does not rely on monocular foundation features, yet it generalizes robustly across datasets. This highlights the significance of our recurrent optimization module.

Method            GPU Memory (MB)   DL3DV                       ACID
                                    PSNR ↑   SSIM ↑   LPIPS ↓   PSNR ↑   SSIM ↑   LPIPS ↓
MVSplat           1217              25.55    0.833    0.119     28.15    0.841    0.147
DepthSplat        2638              27.99    0.897    0.084     28.37    0.847    0.141
OptSplat (Ours)    658              26.69    0.875    0.093     27.39    0.836    0.144
Cross-domain novel view synthesis: Zero-shot generalization on DL3DV and ACID datasets for models trained on RealEstate10K
Cross-dataset Generalization for OptSplat (Depth estimation on scenes of the DL3DV dataset)


Though trained on the RealEstate10K dataset, which primarily comprises indoor scenes, our model generalizes robustly to the ACID dataset, which features natural outdoor environments captured by aerial drones. The highly detailed depth maps and novel views demonstrate robustness across datasets, scales, lighting conditions, and viewpoint variations.

Comparison for Cross-Dataset Generalization (Novel view synthesis on scenes of the ACID dataset with models trained on the RealEstate10K dataset).