High-Fidelity 3D Gaussian Splatting From Sparse but Important Images Through Multi-View Diffusion with Global Frame Attention

Authors: Eric Rubin and Dmitriy Pinskiy

Date:  October 2025

Abstract

This paper presents a novel reconstruction workflow for generating complete, photorealistic 3D Gaussian Splatting (3DGS) models from only a few but highly important photographs. In scenarios such as e-commerce, jewelry visualization, and concept design, practitioners often capture only a limited set of high-quality “hero” views due to cost, time, or physical constraints. Traditional photogrammetry and 3DGS pipelines fail under such sparse input, producing incomplete or inconsistent geometry and textures.

We introduce a multi-view diffusion generation framework that treats the desired orbit of camera viewpoints as a structured video sequence. A modified video-like diffusion model, equipped with global frame attention and a shared object latent, jointly generates all views in a single denoising process. This ensures cross-view texture consistency, material stability, and identity preservation across the entire synthetic dataset.

Our approach prioritizes exact fidelity to the important real photos while enforcing learned global consistency on all synthesized views. This enables the production of a fully consistent 3DGS model from extremely sparse real data, without requiring large-scale retraining of diffusion models or reliance on post-hoc texture collapsing.


1. Introduction

Producing high-quality 3D assets typically requires dozens or hundreds of photographs taken from carefully calibrated angles. In practice, especially in product visualization, creators capture only a handful of critical photographs that showcase:

      Brand-defining visual features

      Material details (e.g., gold, silk, gemstones)

      Symmetry, silhouette, and scale

However, unseen or occluded regions of the product are left uncaptured.

Conventional 3D reconstruction pipelines struggle under these conditions, due to:

      Missing viewpoint diversity

      Incomplete texture information

      Inconsistent or inaccurate hallucination of unseen surfaces

      Flickering or non-stationary textures in rendered turntables

This paper introduces a joint multi-view diffusion approach that addresses these limitations directly.


2. Motivation and Requirements

Our design targets two key requirements:

2.1 Precision in Important Views

A small number of views contain important brand or material information. These hero views must be reconstructed with:

      Pixel-level accuracy

      Correct specular and shading characteristics

      Fidelity to distortions, reflections, and fine-grained texture

Any synthetic view must remain compatible with these ground-truth images.

2.2 Global Consistency Everywhere Else

For unseen surfaces:

      Absolute correctness is impossible

      Plausible consistency is essential

      Surfaces must not change appearance across view angles

      No pattern flickering, color drifting, or changing fine details

This requires a generative process capable of reasoning about all views simultaneously.


3. Overview of the Approach

Our method consists of:

  1. Initial Capture & Sparse 3DGS Construction

  2. Multi-View Diffusion Generation

      Video-like UNet

      Global frame attention

      Shared object latent

      Pose-aware conditioning

  3. Final 3DGS Training Using Synthetic + Real Views

Unlike Approach 1 (per-view diffusion followed by 3D texture collapse), consistency here is enforced inside the generative model itself. This allows sharper textures, less averaging, and more expressive creative control; a high-level sketch of the pipeline follows.
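The outline below wires the three stages together in Python. It is an illustrative sketch only: build_coarse_3dgs, generate_orbit_views, and train_3dgs are hypothetical stage functions supplied by the caller, not part of any published API.

```python
from typing import Callable, Sequence

def run_pipeline(
    hero_images: Sequence,           # the few real "hero" photographs
    hero_poses: Sequence,            # their camera poses (e.g., recovered with COLMAP)
    orbit_poses: Sequence,           # every pose wanted in the final training set
    build_coarse_3dgs: Callable,     # Stage 1: sparse real capture -> coarse 3DGS (assumed helper)
    generate_orbit_views: Callable,  # Stage 2: joint multi-view diffusion (assumed helper)
    train_3dgs: Callable,            # Stage 3: final 3DGS optimization (assumed helper)
):
    """Orchestrates the three stages described in Sections 4-6."""
    # Stage 1: coarse geometry and visibility from the sparse real views.
    coarse_model = build_coarse_3dgs(hero_images, hero_poses)

    # Stage 2: jointly denoise the full orbit, anchored on the real photographs.
    synthetic_views = generate_orbit_views(coarse_model, hero_images, hero_poses, orbit_poses)

    # Stage 3: train the final 3DGS model on the combined real + synthetic views.
    return train_3dgs(list(hero_images) + list(synthetic_views),
                      list(hero_poses) + list(orbit_poses))
```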


4. Stage 1 — Sparse Real Capture & Coarse 3DGS

A small set (e.g., 2–5) of important photographs is captured. These define:

      The object's true material

      The real observable details

      Accurate color and lighting behavior

A coarse 3DGS model is reconstructed using COLMAP poses and sparse optimization. Although incomplete, this provides:

      Approximated geometry

      Visibility information

      Condition inputs for diffusion models

This coarse model stabilizes the generative process and anchors synthetic views to reality.
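One plausible way to bootstrap the coarse model is to seed one Gaussian per COLMAP sparse point. The sketch below assumes the sparse point positions and colors have already been exported as NumPy arrays; the exact parameterization (log-scales, quaternions, opacity logits) differs between 3DGS implementations.

```python
import numpy as np
import torch

def init_coarse_gaussians(points_xyz: np.ndarray, points_rgb: np.ndarray) -> dict:
    """Seed one Gaussian per COLMAP sparse point (minimal sketch).

    points_xyz: (N, 3) sparse point positions from COLMAP.
    points_rgb: (N, 3) point colors in [0, 1].
    """
    n = points_xyz.shape[0]
    means = torch.tensor(points_xyz, dtype=torch.float32)

    # Isotropic initial scale from the nearest-neighbor spacing (rough heuristic, O(N^2)).
    dists = torch.cdist(means, means)
    dists.fill_diagonal_(float("inf"))
    nn_dist = dists.min(dim=1).values.clamp(min=1e-4)

    return {
        "means": torch.nn.Parameter(means),
        "log_scales": torch.nn.Parameter(nn_dist.log().unsqueeze(1).repeat(1, 3)),
        "rotations": torch.nn.Parameter(torch.zeros(n, 4) + torch.tensor([1.0, 0.0, 0.0, 0.0])),  # identity quaternions
        "opacities": torch.nn.Parameter(torch.full((n, 1), 0.1).logit()),
        "colors": torch.nn.Parameter(torch.tensor(points_rgb, dtype=torch.float32)),
    }
```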


5. Stage 2 — Multi-View Diffusion as a Joint Video Generation Problem

The key innovation of this approach is to treat all requested view angles as a coherent sequence, analogous to video frames. However, unlike true time-based video:

      Frames correspond to camera rotation, not temporal motion

      All frames share a fixed underlying object identity

      Frames must express the same materials and local texture behavior

Thus, a video-like diffusion architecture is ideal.
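Concretely, the "video" is just an ordered list of camera poses around the object, with the frames that coincide with real hero photographs flagged as anchors. A minimal sketch, assuming an evenly spaced azimuth sweep at a fixed elevation:

```python
from dataclasses import dataclass

@dataclass
class FramePose:
    azimuth_deg: float
    elevation_deg: float
    distance: float
    is_anchor: bool  # True if a real hero photo exists at (approximately) this pose

def build_orbit(num_frames: int, elevation_deg: float, distance: float,
                anchor_azimuths_deg: list[float], tol_deg: float = 5.0) -> list[FramePose]:
    """Treat the requested view orbit as an ordered frame sequence (camera rotation, not time)."""
    frames = []
    for i in range(num_frames):
        az = 360.0 * i / num_frames
        is_anchor = any(
            min(abs(az - a), 360.0 - abs(az - a)) <= tol_deg for a in anchor_azimuths_deg
        )
        frames.append(FramePose(az, elevation_deg, distance, is_anchor))
    return frames

# Example: a 36-frame turntable with hero photos near 0, 45, and 180 degrees.
orbit = build_orbit(36, elevation_deg=15.0, distance=1.5, anchor_azimuths_deg=[0.0, 45.0, 180.0])
```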


5.1 Network Architecture Overview

Global Frame Attention

Frames can attend to each other regardless of angular distance (a minimal sketch follows this list):

      Frame at 0° can directly communicate with 180°

      The model enforces global structural coherence

      Appearance features propagate across the entire view orbit
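The PyTorch sketch below applies self-attention across the frame axis at every spatial location, so a token at 0° can exchange appearance features with the corresponding token at 180° in a single layer. The tensor layout and residual wiring are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class GlobalFrameAttention(nn.Module):
    """Self-attention across the frame axis at every spatial location (sketch)."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, F, C, H, W) -- B objects, F frames on the view orbit.
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)  # (B*H*W, F, C)
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)        # every frame attends to every other frame
        tokens = tokens + y              # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```

Unlike the windowed temporal attention of many video models, the attention here is dense over all frames, which is what lets appearance features propagate across the entire orbit.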

Shared Object Latent

A learned latent vector z_obj represents:

      The canonical texture space of the object

      Material identity

      High-level shape and reflectance behavior

This latent is incorporated into each frame branch of the UNet, providing cross-frame alignment.
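One simple realization is FiLM-style modulation, where the same latent produces a per-channel scale and shift applied identically to every frame branch. In the sketch below z_obj is a free learnable parameter for clarity; it could equally be predicted from the hero images.

```python
import torch
import torch.nn as nn

class SharedObjectLatent(nn.Module):
    """Injects one shared object latent z_obj into every frame branch (FiLM-style sketch)."""
    def __init__(self, latent_dim: int, channels: int):
        super().__init__()
        self.z_obj = nn.Parameter(torch.randn(latent_dim) * 0.01)   # canonical object code
        self.to_scale_shift = nn.Linear(latent_dim, 2 * channels)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (B, F, C, H, W); the same z_obj modulates every frame.
        scale, shift = self.to_scale_shift(self.z_obj).chunk(2, dim=-1)  # each (C,)
        scale = scale.view(1, 1, -1, 1, 1)
        shift = shift.view(1, 1, -1, 1, 1)
        return frame_features * (1.0 + scale) + shift
```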

Pose-Aware Encoding

Each frame is conditioned on its camera pose:

      Azimuth

      Elevation

      Distance

      Optional lighting metadata

Thus the model learns how appearance changes predictably with viewing angle.
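One plausible implementation embeds each frame's pose with sinusoidal features followed by a small MLP, yielding one conditioning vector per frame. The frequency count and the handling of lighting metadata below are assumptions.

```python
import math
import torch
import torch.nn as nn

class PoseEncoding(nn.Module):
    """Embeds (azimuth, elevation, distance) into a per-frame conditioning vector (sketch)."""
    def __init__(self, out_dim: int, num_freqs: int = 4):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs  # sin/cos of each of the three pose scalars
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.SiLU(), nn.Linear(out_dim, out_dim))

    def forward(self, azimuth_deg, elevation_deg, distance):
        # Each argument: (B, F) tensor; angles in degrees, distance in scene units.
        pose = torch.stack(
            [torch.deg2rad(azimuth_deg), torch.deg2rad(elevation_deg), distance], dim=-1
        )                                                                    # (B, F, 3)
        freqs = (2.0 ** torch.arange(self.num_freqs, device=pose.device)) * math.pi
        angles = pose.unsqueeze(-1) * freqs                                  # (B, F, 3, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)  # (B, F, 6*num_freqs)
        return self.mlp(feats)                                               # (B, F, out_dim)
```

Optional lighting metadata could be concatenated to `feats` before the MLP in the same way.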


5.2 How the Diffusion Model Enforces Consistency

The core idea is:

All frames are denoised jointly, so the model must produce a set of images that form a coherent representation of a single object.

During training or fine-tuning:

      Correspondence between frames is learned through attention

      Shared latent ensures identity stability

      Pose conditioning maps global appearance into angle-specific views

This prevents contradictory patterns across views.
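The sketch below captures the essential point of the joint sampler: a single model call receives the entire frame stack at every step, so cross-frame attention and the shared latent act on all views at once. The model is assumed to predict the clean frames (x0-parameterization) and the noise schedule is simplified; a production sampler would use the schedule the model was trained with.

```python
import torch

@torch.no_grad()
def joint_denoise(model, pose_cond, num_frames, frame_shape, num_steps=50, device="cpu"):
    """Jointly denoise all orbit frames along one trajectory (minimal DDIM-like sketch)."""
    x = torch.randn((1, num_frames, *frame_shape), device=device)  # all frames start as noise
    # Simplified alpha-bar schedule from very noisy (~0) to clean (~1).
    alpha_bars = torch.linspace(0.001, 0.999, num_steps, device=device)

    for i in range(num_steps):
        a_cur = alpha_bars[i]
        a_next = alpha_bars[i + 1] if i + 1 < num_steps else torch.tensor(1.0, device=device)

        # One call sees every frame, so global frame attention and the shared latent
        # can keep the predicted object consistent across the whole orbit.
        x0_pred = model(x, a_cur, pose_cond)

        # Deterministic DDIM-style step toward the next (less noisy) level.
        eps = (x - a_cur.sqrt() * x0_pred) / (1 - a_cur).clamp(min=1e-5).sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).clamp(min=1e-5).sqrt() * eps

    return x  # (1, F, C, H, W): a mutually consistent set of views
```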


5.3 Ensuring Fidelity to Important Real Photos

Critical real images are injected as hard anchors:

      During training, frames corresponding to real poses are replaced with real data

      Their embeddings are injected strongly into the shared latent

      Loss terms penalize deviations from these real views

This ensures the diffusion model does not “rewrite” what is known.

The model remains flexible only in unseen areas.
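Two small mechanisms, sketched below, capture this anchoring: at sampling time the slots of real poses are overwritten with re-noised copies of the real photographs at the current noise level, and at training time a weighted loss penalizes deviation from real views more strongly. Function names and the weighting scheme are illustrative assumptions.

```python
import torch

def apply_anchors(x_t, real_frames, anchor_mask, alpha_bar, noise=None):
    """Overwrite anchor slots with re-noised real photos at the current noise level (sketch).

    x_t:         (1, F, C, H, W) current noisy frame stack.
    real_frames: (1, F, C, H, W) real photos placed in their slots (zeros elsewhere).
    anchor_mask: (F,) boolean, True where a real hero photo exists.
    alpha_bar:   scalar tensor, the sampler's current alpha-bar.
    """
    if noise is None:
        noise = torch.randn_like(real_frames)
    noised_real = alpha_bar.sqrt() * real_frames + (1 - alpha_bar).sqrt() * noise
    mask = anchor_mask.view(1, -1, 1, 1, 1).to(x_t.dtype)
    # Known views are never "rewritten"; only non-anchor frames remain free.
    return mask * noised_real + (1 - mask) * x_t

def anchored_loss(pred_frames, target_frames, anchor_mask, anchor_weight=10.0):
    """Training-time loss that penalizes deviation from real views more heavily (sketch)."""
    per_frame = (pred_frames - target_frames).abs().mean(dim=(0, 2, 3, 4))  # (F,)
    weights = torch.where(anchor_mask,
                          torch.full_like(per_frame, anchor_weight),
                          torch.ones_like(per_frame))
    return (weights * per_frame).mean()
```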


6. Stage 3 — Training 3DGS from Consistent Multi-View Output

Once diffusion produces a fully consistent multi-view dataset:

      Real photos provide high-precision constraints

      Synthetic views fill all missing angles

The resulting set is immediately suitable for 3DGS training.

This differs from Approach 1 because:

      No per-splat averaging is required

      No cross-view contradictions exist

      Textures remain sharper and more detailed

      Subtle reflectance and anisotropy can be expressed consistently

3DGS optimization then converges stably to a complete visual model.
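A minimal version of this final optimization might look like the following, with real views weighted more heavily than synthetic ones. The differentiable render function is assumed, and the loss is plain L1 for brevity (3DGS implementations typically add a D-SSIM term); the sketch only illustrates how the mixed dataset enters training.

```python
import torch

def train_final_3dgs(gaussians, render, views, num_iters=30_000, real_weight=2.0, lr=1e-3):
    """Final 3DGS optimization over the combined real + synthetic dataset (sketch).

    gaussians: dict of nn.Parameter tensors (means, log_scales, rotations, opacities, colors).
    render:    assumed differentiable rasterizer: render(gaussians, pose) -> (C, H, W) image.
    views:     list of dicts with keys "image" (C, H, W tensor), "pose", and "is_real".
    """
    optimizer = torch.optim.Adam(gaussians.values(), lr=lr)
    for it in range(num_iters):
        view = views[it % len(views)]
        rendered = render(gaussians, view["pose"])

        # Real hero views act as high-precision constraints; synthetic views fill the gaps.
        weight = real_weight if view["is_real"] else 1.0
        loss = weight * (rendered - view["image"]).abs().mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return gaussians
```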


7. Why This Works

7.1 Consistency Is Enforced During Generation

Unlike pipelines that rely on post-hoc consolidation:

      Here, the diffusion model generates all frames in one denoising pass

      The shared latent and global frame attention provide a global “object memory”

      The model produces a single, coherent identity across all views

7.2 Hero Views Remain Untouched

The model explicitly preserves:

      Fine-grained texture

      Distinctive optical features

      Material properties such as anisotropic reflection or gemstone sparkle

Synthetic views cannot override these.

7.3 Unseen Regions Become Plausible but Stable

Because generation is multi-view-aware:

      Backside textures are invented once, then applied consistently

      Symmetry is naturally preserved

      Sharpness and fine details are not averaged out

7.4 High Capacity for Creative Variation

Because hallucination occurs in latent space:

      Designers can experiment with material variants

      Diffusion priors allow stylistic or shape exploration

      The model respects real constraints but remains generative


8. Applications

E-commerce

      360° turntables from only a few studio shots

      Perfect fidelity in front-facing hero images

      Fully consistent back and bottom surfaces

Jewelry & Watches

      Specular materials require precise hero views

      Multi-view diffusion ensures stable reflections in synthesized angles

Fashion & Apparel

      Dresses on mannequins can be hallucinated from minimal photos, without texture tearing or inconsistent folds

Design Prototyping

      Create 3DGS models for concept designs without full photo shoots


9. Comparison to Traditional and Collapsing Approaches

| Method                                   | Real-View Fidelity | Global Consistency | Texture Sharpness  | Model Complexity |
|------------------------------------------|--------------------|--------------------|--------------------|------------------|
| This approach (multi-view diffusion)     | Excellent          | Excellent          | High               | Medium/High      |
| 3D Texture Collapse (Approach 1)         | Excellent          | Guaranteed         | Medium (averaging) | Very Low         |
| Per-view SD + 3DGS                       | Medium             | Poor               | High               | Low              |
| Multi-view diffusion training (research) | Good               | Good               | Very High          | High/Very High   |

This approach offers a unique blend of:

      High fidelity in important images

      High global consistency

      High texture sharpness

      Moderate implementation cost


10. Limitations

      Requires a multi-frame diffusion pipeline (heavier than Approach 1)

      Training or fine-tuning is necessary for best results

      Quality depends on the robustness of the initial 3DGS geometry

However, among the approaches compared, it produces the most coherent and detailed multi-view data prior to 3DGS training.


11. Conclusion

This paper introduces a powerful new pipeline for producing complete 3D Gaussian Splatting models from only a few, but highly important, photographs. By reconstructing all intermediate views with video-like multi-view diffusion, reinforced by global frame attention, a shared object latent, and pose-aware conditioning, we achieve:

      Pixel-level precision in important real photographs

      High-quality, globally consistent synthetic views

      Complete 3DGS reconstructions suitable for commercial-quality rendering

This approach stands as a major step toward democratizing 3D content creation from minimal photography.