Authors: Eric Rubin and Dmitriy Pinskiy
Date: October 2025
This paper presents a novel reconstruction workflow for generating complete, photorealistic 3D Gaussian Splatting (3DGS) models from a small number of highly important photographs. In scenarios such as e-commerce, jewelry visualization, and concept design, practitioners often capture only a limited set of high-quality “hero” views due to cost, time, or physical constraints. Traditional photogrammetry and 3DGS pipelines fail under such sparse input, producing incomplete or inconsistent geometry and textures.
We introduce a multi-view diffusion generation framework that treats the desired orbit of camera viewpoints as a structured video sequence. A modified video-like diffusion model, equipped with global frame attention and a shared object latent, jointly generates all views in a single denoising process. This ensures cross-view texture consistency, material stability, and identity preservation across the entire synthetic dataset.
Our approach prioritizes exact fidelity to the important real photos while enforcing learned global consistency on all synthesized views. This enables the production of a fully consistent 3DGS model from extremely sparse real data, without requiring large-scale retraining of diffusion models or reliance on post-hoc texture collapsing.
Producing high-quality 3D assets typically requires dozens or hundreds of photographs taken from carefully calibrated angles. In practice, especially in product visualization, creators capture only a handful of critical photographs that showcase:
● Brand-defining visual features
● Material details (e.g., gold, silk, gemstones)
● Symmetry, silhouette, and scale
However, unseen or occluded regions of the product are left uncaptured.
Conventional 3D reconstruction pipelines struggle under these conditions, due to:
● Missing viewpoint diversity
● Incomplete texture information
● Inconsistent or inaccurate hallucination of unseen surfaces
● Flickering or non-stationary textures in rendered turntables
This paper introduces a joint multi-view diffusion approach that addresses these limitations directly.
We assume two key constraints:
A small number of views contain important brand or material information. These hero views must be reconstructed with:
● Pixel-level accuracy
● Correct specular and shading characteristics
● Fidelity to distortions, reflections, and fine-grain texture
Any synthetic view must remain compatible with these ground-truth images.
For unseen surfaces:
● Absolute correctness is impossible
● Plausible consistency is essential
● Surfaces must not change appearance across view angles
● No pattern flickering, color drifting, or changing fine details
This requires a generative process capable of reasoning about all views simultaneously.
Our method consists of:
○ Video-like UNet
○ Global frame attention
○ Shared object latent
○ Pose-aware conditioning
Unlike the first approach (per-view diffusion then 3D collapse), here consistency is enforced inside the generative model itself. This allows sharper textures, less averaging, and more expressive creative control.
A small set (e.g., 2–5) of important photographs is captured. These define:
● The object's true material
● The real observable details
● Accurate color and lighting behavior
A coarse 3DGS model is reconstructed using COLMAP poses and sparse optimization. Although incomplete, this provides:
● Approximated geometry
● Visibility information
● Condition inputs for diffusion models
This coarse model stabilizes the generative process and anchors synthetic views to reality.
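As a concrete illustration of this conditioning step, the sketch below renders the coarse model from every target pose to produce condition images and visibility masks for the diffusion stage. It is an assumption about the interface, not the exact implementation; `render_gaussians` is a hypothetical helper standing in for whatever 3DGS renderer the pipeline uses.

```python
# Minimal sketch (not the authors' code): rendering the coarse 3DGS model from every
# target pose to obtain condition images and visibility masks for the diffusion stage.
# `render_gaussians` is a hypothetical helper standing in for the actual 3DGS renderer;
# it is assumed to return an RGB image and an alpha map for a given camera pose.
import numpy as np

def make_condition_inputs(coarse_model, target_poses, image_size=(512, 512)):
    """Render coarse RGB and visibility for every requested camera pose."""
    conditions = []
    for pose in target_poses:                          # pose: 4x4 camera-to-world matrix
        rgb, alpha = render_gaussians(coarse_model, pose, image_size)  # hypothetical call
        visibility = (alpha > 0.5).astype(np.float32)  # crude mask of observed geometry
        conditions.append({
            "pose": pose,
            "rgb": rgb,                # approximate appearance that anchors generation
            "visibility": visibility,  # tells the model where it is free to hallucinate
        })
    return conditions
```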
The key innovation of this approach is to treat all requested view angles as a coherent sequence, analogous to video frames. However, unlike true time-based video:
● Frames correspond to camera rotation, not temporal motion
● All frames share a fixed underlying object identity
● Frames must express the same materials and local texture behavior
Thus, a video-like diffusion architecture is ideal.
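For concreteness, the following minimal sketch lays out the requested viewpoints as an ordered frame sequence by sampling a full orbit of azimuth angles at fixed elevation and distance; the frame count and angles are illustrative choices, not values prescribed by the method.

```python
# Minimal sketch: the requested viewpoints laid out as an ordered "frame" sequence by
# sampling a full orbit of azimuth angles at fixed elevation and distance. Frame count
# and angles are illustrative choices, not values prescribed by the method.
import numpy as np

def orbit_frames(num_frames=24, elevation_deg=20.0, distance=2.5):
    """Return per-frame (azimuth, elevation, distance) tuples covering a full turntable."""
    azimuths = np.linspace(0.0, 360.0, num_frames, endpoint=False)
    return [(float(az), elevation_deg, distance) for az in azimuths]

frames = orbit_frames()
# frames[0]  -> (0.0, 20.0, 2.5)    first "frame" of the orbit
# frames[12] -> (180.0, 20.0, 2.5)  back view, still part of the same joint sequence
```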
Frames can attend to each other regardless of angular distance:
● Frame at 0° can directly communicate with 180°
● The model enforces global structural coherence
● Appearance features propagate across the entire view orbit
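A minimal PyTorch sketch of such a global frame attention block follows. It assumes frame features shaped (B, F, C, H, W) and lets each spatial location attend across all F frames of the orbit; restricting attention to the same spatial location across frames is a simplification of fuller cross-frame attention, used here to keep the example short.

```python
# Minimal sketch (assumed layer, not the exact architecture): global frame attention
# over features shaped (B, F, C, H, W). Each spatial location attends across all F
# frames of the orbit, so the frame at 0 degrees can exchange information with the
# frame at 180 degrees directly.
import torch
import torch.nn as nn

class GlobalFrameAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()                        # channels must be divisible by num_heads
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = x.shape                   # B objects, F frames (view angles)
        # Fold spatial positions into the batch so attention runs over the F frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended                # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```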
A learned latent vector $z_{\text{obj}}$ represents:
● The canonical texture space of the object
● Material identity
● High-level shape and reflectance behavior
This latent is incorporated into each frame branch of the UNet, providing cross-frame alignment.
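One way to realize this injection, sketched below under the assumption of FiLM-style scale/shift modulation, is to map $z_{\text{obj}}$ to per-channel parameters that are broadcast to every frame branch; the exact injection mechanism in the pipeline may differ.

```python
# Minimal sketch (assumed design): injecting the shared object latent z_obj into every
# frame branch via FiLM-style scale/shift modulation, so all views are conditioned on
# one common description of texture, material, and identity.
import torch
import torch.nn as nn

class SharedLatentInjection(nn.Module):
    def __init__(self, latent_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(latent_dim, 2 * channels)

    def forward(self, x: torch.Tensor, z_obj: torch.Tensor) -> torch.Tensor:
        # x: (B, F, C, H, W) frame features; z_obj: (B, latent_dim), one latent per object
        scale, shift = self.to_scale_shift(z_obj).chunk(2, dim=-1)   # (B, C) each
        scale = scale[:, None, :, None, None]     # broadcast over frames and space
        shift = shift[:, None, :, None, None]
        return x * (1.0 + scale) + shift
```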
Each frame is conditioned on its camera pose:
● Azimuth
● Elevation
● Distance
● Optional lighting metadata
Thus the model learns how appearance changes predictably with viewing angle.
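A simple realization, shown below as an assumption rather than the exact encoding, is a Fourier-style embedding of (azimuth, elevation, distance) that can be concatenated with or added to each frame's conditioning vector.

```python
# Minimal sketch (assumed encoding): a Fourier-style embedding of one frame's camera
# pose (azimuth, elevation, distance). Frequencies and dimensions are illustrative.
import math
import torch

def pose_embedding(azimuth_deg, elevation_deg, distance, dim_per_value=16):
    values = torch.tensor([
        math.radians(azimuth_deg),
        math.radians(elevation_deg),
        float(distance),
    ])
    freqs = 2.0 ** torch.arange(dim_per_value // 2, dtype=torch.float32)
    angles = values[:, None] * freqs[None, :]                         # (3, dim/2)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (3, dim)
    return emb.flatten()                                              # (3 * dim_per_value,)

# Example: embedding for the back view of a 20-degree-elevation orbit.
e = pose_embedding(180.0, 20.0, 2.5)    # shape: (48,)
```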
The core idea is:
All frames are denoised jointly, so the model must produce a set of images that form a coherent representation of a single object.
During training or fine-tuning:
● Correspondence between frames is learned through attention
● Shared latent ensures identity stability
● Pose conditioning maps global appearance into angle-specific views
This prevents contradictory patterns across views.
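The sketch below illustrates the joint structure of sampling: a single hypothetical multi-frame model is called once per denoising step on the entire stack of frame latents, so global attention and the shared latent see the whole view set at every step. A flow-matching-style Euler update stands in for whatever scheduler is actually used.

```python
# Minimal sketch (assumed sampler, not the authors' training code): one hypothetical
# multi-frame model is called once per step on the ENTIRE stack of frame latents, so
# global attention and the shared latent see every view at every denoising step.
import torch

@torch.no_grad()
def joint_denoise(model, poses, z_obj, num_steps=50, frame_shape=(4, 64, 64)):
    f = len(poses)
    x = torch.randn(1, f, *frame_shape)                # (B=1, F, C, H, W): all frames at once
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt                            # integrate from t = 1 (noise) to t = 0
        v = model(x, torch.tensor([t]), z_obj, poses)  # joint prediction for ALL frames
        x = x - v * dt                                 # Euler step; no frame is denoised alone
    return x                                           # mutually consistent multi-view latents
```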
Critical real images are injected as hard anchors:
● During training, frames corresponding to real poses are replaced with real data
● Their embeddings are injected strongly into the shared latent
● Loss terms penalize deviations from these real views
This ensures the diffusion model does not “rewrite” what is known.
The model remains flexible only in unseen areas.
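One way to implement such hard anchoring during training, sketched below as an assumption, is to substitute the encoded real photographs as the clean targets at anchor frames and to up-weight the loss there so the model cannot "rewrite" known views; the weighting scheme is illustrative.

```python
# Minimal sketch (assumed training detail): frames whose poses match real photographs
# get the encoded real photos as their clean targets, and their loss is up-weighted.
# The substitution and weighting scheme is an illustrative assumption.
import torch
import torch.nn.functional as F

def anchored_targets(x0_synthetic, real_latents, anchor_mask, anchor_weight=10.0):
    # x0_synthetic, real_latents: (B, F, C, H, W); anchor_mask: (B, F), 1.0 at real poses
    m = anchor_mask[:, :, None, None, None]
    x0 = torch.where(m.bool(), real_latents, x0_synthetic)   # hard anchor: real data wins
    weights = 1.0 + (anchor_weight - 1.0) * m                 # deviations at anchors cost more
    return x0, weights

def weighted_diffusion_loss(pred, target, weights):
    # pred/target: model prediction and its supervision signal (noise, velocity, or x0)
    per_element = F.mse_loss(pred, target, reduction="none")
    return (per_element * weights).mean()
```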
Once diffusion produces a fully consistent multi-view dataset:
● Real photos provide high-precision constraints
● Synthetic views fill all missing angles
The resulting set is immediately suitable for 3DGS training.
This differs from Approach 1 because:
● No per-splat averaging is required
● No cross-view contradictions exist
● Textures remain sharper and more detailed
● Subtle reflectance and anisotropy can be expressed consistently
3DGS optimization then converges stably to a complete visual model.
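As a small illustration of this hand-off, the sketch below combines the real hero photos and the synthesized views into a single posed image set for 3DGS optimization, with a per-view weight (an illustrative assumption) that keeps the real photographs as the dominant constraint.

```python
# Minimal sketch (assumed hand-off): real hero photos and synthesized views combined
# into one posed image set for 3DGS optimization. The per-view weight field is an
# illustrative way to keep the real photographs as the dominant constraint.
def build_3dgs_training_set(real_views, synthetic_views, real_weight=5.0):
    """Each view is a dict with an "image" array and a 4x4 "pose" matrix."""
    dataset = []
    for v in real_views:
        dataset.append({"image": v["image"], "pose": v["pose"], "weight": real_weight})
    for v in synthetic_views:
        dataset.append({"image": v["image"], "pose": v["pose"], "weight": 1.0})
    return dataset
```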
Unlike pipelines that rely on post-hoc consolidation:
● Here, the diffusion model generates all frames in one denoising pass
● Shared latent + global attention provides a global “object memory”
● The model produces a single, coherent identity across all views
The model explicitly preserves:
● Fine-grain texture
● Distinctive optical features
● Material properties such as anisotropic reflection or gemstone sparkle
Synthetic views cannot override these.
Because generation is multi-view-aware:
● Backside textures are invented once, then applied consistently
● Symmetry is naturally preserved
● Sharpness and fine details are not averaged out
Because hallucination occurs in latent space:
● Designers can experiment with material variants
● Diffusion priors allow stylistic or shape exploration
● The model respects real constraints but remains generative
Representative applications include:
● 360° turntables from only a few studio shots
● Perfect fidelity in front-facing hero images
● Fully consistent back and bottom surfaces
● Specular materials require precise hero views
● Multi-view diffusion ensures stable reflections in synthesized angles
● Dresses on mannequins can be hallucinated from minimal photos, without texture tearing or inconsistent folds
● Create 3DGS models for concept designs without full photo shoots
| Method | Real-View Fidelity | Global Consistency | Texture Sharpness | Model Complexity |
| --- | --- | --- | --- | --- |
| This approach (multi-view diffusion) | Excellent | Excellent | High | Medium/High |
| 3D Texture Collapse (Approach 1) | Excellent | Guaranteed | Medium (averaging) | Very Low |
| Per-view SD + 3DGS | Medium | Poor | High | Low |
| Multi-view diffusion training (research) | Good | Good | Very High | High/Very High |
This approach offers a unique blend of:
● High fidelity in important images
● High global consistency
● High texture sharpness
● Moderate implementation cost
Its main limitations are:
● Requires a multi-frame diffusion pipeline (heavier than Approach 1)
● Training or fine-tuning is necessary for best results
● Quality depends on the robustness of the initial 3DGS geometry
However, it produces the most coherent and detailed multi-view data before 3DGS training.
This paper introduces a powerful new pipeline for producing complete 3D Gaussian Splatting models from a small number of highly important photographs. By reconstructing all intermediate views using video-like multi-view diffusion, reinforced with global attention, a shared object latent, and pose-dependent conditioning, we achieve:
● Pixel-level precision in important real photographs
● High-quality, globally consistent synthetic views
● Complete 3DGS reconstructions suitable for commercial-quality rendering
This approach stands as a major step toward democratizing 3D content creation from minimal photography.