We present qualitative comparisons with the following state-of-the-art models: MVSplat and DepthSplat.
Here, we present qualitative comparisons of multi-view depth estimation and of novel views rendered from viewpoints interpolated between the two input views. OptSplat produces 3D-consistent depth maps and photorealistic novel views, performing comparably to state-of-the-art models despite not leveraging monocular features from foundation models. Instead, it operates on localized cost volumes, which substantially reduces its memory footprint: roughly 50% of the memory used by MVSplat and 25% of that used by DepthSplat. Specifically, our model runs in under 700 MB for 256 × 256 inputs, compared to over 2600 MB for DepthSplat, which offers only marginal improvements in rendering fidelity.
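For reference, peak GPU memory at inference can be profiled with a short PyTorch snippet. The sketch below is illustrative only: `peak_inference_memory_mb` and the dummy two-view input shape are assumptions, not the actual OptSplat interface or evaluation code.

```python
import torch

# Minimal sketch (assumption): measure peak GPU memory for one forward pass of a
# two-view reconstruction model at 256 x 256 resolution. `model` is a placeholder
# callable, not the actual OptSplat implementation.

@torch.no_grad()
def peak_inference_memory_mb(model: torch.nn.Module, device: str = "cuda") -> float:
    model = model.to(device).eval()
    views = torch.randn(1, 2, 3, 256, 256, device=device)  # (batch, views, channels, H, W)

    torch.cuda.reset_peak_memory_stats(device)
    model(views)
    return torch.cuda.max_memory_allocated(device) / 2**20  # bytes -> MiB
```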
We highlight the convergence behavior of our model as the number of recurrent refinement iterations increases. The results validate our design objective: the update operator effectively learns to perform optimization-like refinement in a feed-forward manner, progressively improving scene geometry and appearance reconstruction. As shown in the figure below, both the predicted depth maps and the synthesized novel views are refined consistently across iterations, capturing increasingly fine-grained details and producing sharper reconstructions.
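To make the refinement loop concrete, the following is a minimal sketch of a GRU-style update operator that applies residual depth updates over a fixed number of iterations. The `RecurrentRefiner` class, its shapes, and the residual-update form are illustrative assumptions, not the actual OptSplat architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumption): a GRU-style update operator that iteratively
# refines a per-pixel depth estimate in a feed-forward manner.

class RecurrentRefiner(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim + 1, hidden_dim)  # +1 for the current depth
        self.delta_head = nn.Linear(hidden_dim, 1)       # predicts a residual depth update

    def forward(self, cost_features: torch.Tensor, init_depth: torch.Tensor, num_iters: int = 8):
        # cost_features: (num_pixels, feat_dim) features sampled from local cost volumes
        # init_depth:    (num_pixels, 1) coarse initial depth estimate
        depth = init_depth
        hidden = torch.zeros(cost_features.size(0), self.gru.hidden_size,
                             device=cost_features.device)
        refined = []
        for _ in range(num_iters):
            inp = torch.cat([cost_features, depth], dim=-1)
            hidden = self.gru(inp, hidden)
            depth = depth + self.delta_head(hidden)  # optimization-like residual step
            refined.append(depth)
        return refined  # one progressively sharper estimate per iteration


# Example usage with random features and a constant initial depth:
refiner = RecurrentRefiner()
depths = refiner(torch.randn(1024, 128), torch.full((1024, 1), 2.0), num_iters=8)
```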
We conduct cross-dataset generalization studies in which models trained on RealEstate10K are evaluated on the ACID and DL3DV datasets. Despite relying on local cost volumes, OptSplat outperforms MVSplat when generalizing to out-of-distribution scenes, owing to the iterative optimization layers that refine the 3D Gaussians during inference. On the DL3DV scenes, which feature more complex geometry and larger viewpoint variations, OptSplat generalizes well with only minor drops in rendering quality, while using a fraction of the memory required by DepthSplat. Notably, unlike DepthSplat, OptSplat does not rely on monocular features from foundation models, yet it generalizes robustly across datasets, highlighting the importance of our recurrent optimization module.
| Method | GPU Memory (MB) | DL3DV PSNR ↑ | DL3DV SSIM ↑ | DL3DV LPIPS ↓ | ACID PSNR ↑ | ACID SSIM ↑ | ACID LPIPS ↓ |
|---|---|---|---|---|---|---|---|
| MVSplat | 1217 | 25.55 | 0.833 | 0.119 | 28.15 | 0.841 | 0.147 |
| DepthSplat | 2638 | 27.99 | 0.897 | 0.084 | 28.37 | 0.847 | 0.141 |
| OptSplat (Ours) | 658 | 26.69 | 0.875 | 0.093 | 27.39 | 0.836 | 0.144 |
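For completeness, PSNR in the table is the standard metric computed from the mean squared error between a rendered novel view and the ground-truth frame. The sketch below assumes images normalized to [0, 1]; SSIM and LPIPS require dedicated structural and perceptual metric implementations (e.g., torchmetrics or the lpips package) and are omitted here.

```python
import torch

# Minimal sketch (assumption): PSNR between a rendered view and the ground-truth
# frame, with both images normalized to [0, 1].

def psnr(rendered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """PSNR in dB; higher is better."""
    mse = torch.mean((rendered - target) ** 2)
    return -10.0 * torch.log10(mse.clamp(min=1e-10))
```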
Though trained on the RealEstate10K dataset, which primarily comprises indoor scenes, our model generalizes robustly to the ACID dataset, which features natural outdoor environments captured by aerial drones. The highly detailed depth maps and novel views demonstrate robustness across datasets, scales, lighting conditions, and viewpoint variations.