November 2025
This white paper introduces 3D Stable Diffusion over Gaussian Splatting (3D-SD-GS), a new generative modeling paradigm that extends diffusion models from 2D image space to true native 3D latent space. Unlike conventional diffusion pipelines that synthesize 2D imagery, 3D-SD-GS operates directly on Gaussian Splatting representations, enabling the model to generate, enhance, and manipulate 3D scenes with geometric and photometric consistency.
By training on datasets of multi-view images paired with optimized Gaussian Splatting reconstructions, the model learns a 3D distribution of shape, appearance, reflectance, and spatial structure. This allows it to synthesize novel 3D representations from text, image, or partial 3D inputs, and to enhance incomplete reconstructions with physically consistent detail.
The system addresses a major gap in generative AI: creating consistent, controllable, high-fidelity 3D assets without relying on implicit neural fields or low-resolution voxel grids. Gaussian Splatting provides an explicit, differentiable, scalable substrate, making it the ideal backbone for 3D diffusion.
Generative diffusion models—such as Stable Diffusion—have transformed 2D content creation by learning a powerful latent distribution of images. However, most real-world applications, including eCommerce, industrial design, robotics simulation, digital twins, and AR/VR content creation, require consistent and editable 3D assets, not just images.
Existing 3D generative approaches rely on one of the following:
● Text-to-image diffusion + Score Distillation Sampling into NeRF/3DGS
● Low-resolution voxel or SDF diffusion
● 2.5D approaches that synthesize multi-view images and then perform photogrammetry
● Tri-plane diffusion with limited geometric expressiveness
None of these operate natively in 3D.
With the emergence of 3D Gaussian Splatting (3DGS) as a fast, explicit, camera-agnostic representation, we can construct a diffusion model whose latent, noise process, and U-Net all operate directly over spatial Gaussian primitives instead of pixels.
This opens the door to:
● 3D-aware generative design
● Rapid asset creation for games and movies
● Product visualization and variation
● Spatially consistent AI-driven 3D editing
This paper introduces such a system.
3DGS represents a scene as a set of ellipsoidal Gaussian primitives with:
● Center position
● Covariance / scale
● Color
● Opacity
● Optional learned SH lighting
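To make this parameterization concrete, the following is a minimal sketch of a splat container in PyTorch. The field names, tensor shapes, and the quaternion-based covariance assembly are illustrative assumptions, not a reference implementation of any particular 3DGS codebase.

```python
# Minimal sketch of a batch of N Gaussian splats as dense tensors.
# Field names and shapes are illustrative assumptions, not a fixed spec.
from dataclasses import dataclass
import torch

@dataclass
class GaussianSplats:
    positions: torch.Tensor   # (N, 3)  splat centers in world space
    scales: torch.Tensor      # (N, 3)  per-axis extent of each ellipsoid
    rotations: torch.Tensor   # (N, 4)  unit quaternions defining orientation
    colors: torch.Tensor      # (N, 3)  base RGB (or the SH DC term)
    opacities: torch.Tensor   # (N, 1)  alpha in [0, 1]
    sh_coeffs: torch.Tensor   # (N, K, 3) optional spherical-harmonic lighting

    def covariances(self) -> torch.Tensor:
        """Assemble per-splat 3x3 covariances as R diag(s^2) R^T."""
        q = torch.nn.functional.normalize(self.rotations, dim=-1)
        w, x, y, z = q.unbind(-1)
        R = torch.stack([
            1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
            2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
            2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
        ], dim=-1).reshape(-1, 3, 3)
        S = torch.diag_embed(self.scales ** 2)
        return R @ S @ R.transpose(-1, -2)
```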
Advantages for diffusion:
● Each splat has an explicit 3D location, unlike NeRF's implicit volumetric field.
● Rendering is differentiable, allowing end-to-end training via view supervision.
● Millions of splats render quickly and scale better than voxel grids.
● Diffusion adds, removes, and reshapes particles; splats are particles.
Because of these traits, Gaussian Splatting is uniquely suited to serve as the latent representation of a generative diffusion process.
3D-SD-GS generalizes the architecture of 2D Stable Diffusion:
| 2D SD Component | 3D-SD-GS Equivalent |
| --- | --- |
| Latent image grid (W×H×C) | Latent field of Gaussian parameters (positions, scales, colors, SH) |
| Noise added to pixel latents | Noise added to splat properties or the spatial latent grid |
| 2D U-Net | 3D U-Net operating on sparse 3D volumes or splat tensors |
| VAE encoder | 3DGS encoder (splat field → 3D latent voxel or tri-plane tensor) |
| VAE decoder | 3DGS decoder (latent → optimized splat field) |
| Rendering used only for output | Rendering used for training supervision |
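The correspondence above can be read as a three-stage pipeline: encode splats to a 3D latent, denoise that latent, then decode and render for supervision. The skeleton below is a hypothetical sketch of how those stages compose; the module names, the denoise signature, and the render_splats callback are placeholders, not an existing API.

```python
# Hypothetical skeleton of the 2D-SD -> 3D-SD-GS correspondence.
# The encoder, 3D U-Net, decoder, and render_splats callback are placeholders.
import torch
import torch.nn as nn

class ThreeDSDGS(nn.Module):
    def __init__(self, encoder: nn.Module, unet3d: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder    # splats -> 3D latent, analogous to the SD VAE encoder
        self.unet3d = unet3d      # denoiser over the 3D latent, analogous to the SD 2D U-Net
        self.decoder = decoder    # 3D latent -> splat parameters, analogous to the SD VAE decoder

    def denoise(self, z_t, t, cond):
        # Predict the noise in the 3D latent, conditioned on text/image embeddings.
        return self.unet3d(z_t, t, cond)

    def decode_and_render(self, z, cameras, render_splats):
        # Rendering is used for training supervision, not only for final output.
        splats = self.decoder(z)
        return [render_splats(splats, cam) for cam in cameras]
```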
Two practical options for structuring the latent:
● Voxelized latent grid:
  ○ Store each splat's features in the nearest voxel
  ○ Apply sparse 3D convolutions
  ○ Decode voxels back to splats
● Tri-plane latent:
  ○ Used in EG3D
  ○ More memory-efficient
  ○ Good for high resolution
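As a concrete sketch of the first option, the snippet below scatters per-splat features into the nearest voxel of a dense latent grid. The grid resolution, scene bounds, and feature-averaging rule are assumptions made for illustration.

```python
# Sketch of the voxelized-latent option: scatter per-splat features into the
# nearest voxel of a dense D x D x D grid. Grid size and bounds are assumptions.
import torch

def splats_to_voxel_grid(positions, features, grid_size=64, bounds=1.0):
    """positions: (N, 3) in [-bounds, bounds]; features: (N, C). Returns (C, D, D, D)."""
    N, C = features.shape
    # Map continuous positions to integer voxel indices.
    idx = ((positions / (2 * bounds) + 0.5) * grid_size).long().clamp(0, grid_size - 1)
    flat = (idx[:, 0] * grid_size + idx[:, 1]) * grid_size + idx[:, 2]  # (N,)
    grid = torch.zeros(grid_size ** 3, C)
    count = torch.zeros(grid_size ** 3, 1)
    grid.index_add_(0, flat, features)             # accumulate features per voxel
    count.index_add_(0, flat, torch.ones(N, 1))    # count splats per voxel
    grid = grid / count.clamp(min=1)               # average where occupied
    return grid.t().reshape(C, grid_size, grid_size, grid_size)
```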
Noise is applied to:
● Splat positions
● Splat scales
● Splat colors
● Feature vectors in the latent grid
The noise schedule is identical to DDPM/DDIM.
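A minimal sketch of the forward noising step follows, using the standard closed-form DDPM expression q(x_t | x_0) = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps applied to any splat-property or latent tensor. The linear beta schedule and tensor shapes are assumptions.

```python
# Standard DDPM forward process applied to a splat/latent tensor x0.
import torch

def make_linear_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t = prod(1 - beta_s)

def add_noise(x0, t, alpha_bars):
    """x0: batch of splat properties or latent features; t: (B,) integer timesteps."""
    eps = torch.randn_like(x0)
    a = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over trailing dims
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return x_t, eps
```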
The denoising network mirrors SD's U-Net, but uses 3D convolutions and 3D cross-attention. It supports conditioning on:
● Text (via a CLIP text encoder)
● Reference images (via a CLIP image encoder)
● Partial 3DGS (inpainting in 3D)
● Meshes (encoded through a geometry encoder)
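As a sketch of how such conditioning could enter the 3D U-Net, the block below flattens the latent volume into voxel tokens and lets them cross-attend to CLIP-style embeddings. The dimensions and module layout are assumptions, not the exact architecture.

```python
# Sketch of cross-attention conditioning: voxel features act as queries,
# CLIP text/image embeddings act as keys and values. Dimensions are assumptions.
import torch
import torch.nn as nn

class CrossAttention3D(nn.Module):
    def __init__(self, latent_dim=256, cond_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, n_heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, z, cond):
        """z: (B, C, D, H, W) latent volume; cond: (B, L, cond_dim) conditioning tokens."""
        B, C, D, H, W = z.shape
        tokens = z.flatten(2).transpose(1, 2)          # (B, D*H*W, C) voxel tokens
        attended, _ = self.attn(tokens, cond, cond)    # voxels attend to the conditioning
        tokens = tokens + attended                     # residual connection
        return tokens.transpose(1, 2).reshape(B, C, D, H, W)
```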
3D-SD-GS requires paired (multi-view images → optimized 3DGS) datasets.
Suitable sources include:
● Industrial photogrammetry pipelines
● 3D capture of products, shoes, jewelry, apparel, and furniture
● Public datasets (Mip-NeRF 360, Tanks and Temples, Objaverse)
● Synthetic datasets generated from meshes using UE5/Maya/Blender
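For illustration, a single paired training sample might look like the structure below; the keys, shapes, splat count, and caption are placeholders rather than a fixed data format.

```python
# Sketch of one paired training sample: posed multi-view images plus the
# optimized 3DGS reconstruction. All keys and shapes are illustrative.
import torch

V, H, W, N = 24, 256, 256, 100_000  # views, image size, splat count (assumed)
sample = {
    "images":  torch.zeros(V, 3, H, W),   # posed RGB captures
    "cameras": torch.zeros(V, 4, 4),      # camera-to-world poses (intrinsics stored separately)
    "splats": {                           # optimized Gaussian Splatting reconstruction
        "positions": torch.zeros(N, 3),
        "scales":    torch.zeros(N, 3),
        "colors":    torch.zeros(N, 3),
        "opacities": torch.zeros(N, 1),
    },
    "caption": "a leather ankle boot on a neutral background",  # optional text condition
}
```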
For each object/scene, the dataset pairs posed multi-view images with an optimized 3DGS reconstruction, which the encoder maps into a 3D latent. The model is trained to denoise this latent under the DDPM/DDIM schedule, with rendered views of the decoded splats providing supervision.
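A minimal sketch of one training step under the standard epsilon-prediction objective is shown below, reusing the add_noise helper and the model interfaces sketched earlier; the encoder signature and the optional rendering loss are assumptions.

```python
# Sketch of a single training step with the standard epsilon-prediction loss.
# 'model', 'encoder', 'alpha_bars', and add_noise follow the earlier sketches.
import torch
import torch.nn.functional as F

def training_step(model, encoder, splats, cond, alpha_bars, T=1000):
    z0 = encoder(splats)                        # one coherent 3D latent per scene
    t = torch.randint(0, T, (z0.shape[0],))     # random diffusion timesteps
    z_t, eps = add_noise(z0, t, alpha_bars)     # forward noising (see sketch above)
    eps_pred = model.denoise(z_t, t, cond)      # 3D U-Net prediction
    loss = F.mse_loss(eps_pred, eps)            # a rendering loss on decoded splats can be added
    return loss
```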
Unlike 2D SD or multi-view diffusion, the model maintains one coherent 3D latent.
This prevents:
● view misalignments
● geometry hallucination
● inconsistent textures
3DGS renders in real time and compresses well.
Users can:
● move splats
● delete parts
● adjust shape
● refine materials
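Because the representation is explicit, such edits reduce to ordinary tensor operations. The sketch below assumes splats are stored as a dict of per-splat tensors, as in the sample above; the helper names are hypothetical.

```python
# Sketch of direct splat editing: splats are explicit tensors, so edits are
# plain indexing and arithmetic. 'splats' is a dict of per-splat tensors.
import torch

def translate_region(splats, region_mask, offset):
    """Move the selected splats by a constant offset (region_mask: (N,) bool)."""
    splats["positions"][region_mask] += offset
    return splats

def delete_region(splats, region_mask):
    """Remove the selected splats from every per-splat tensor."""
    keep = ~region_mask
    return {name: tensor[keep] for name, tensor in splats.items()}
```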
Given only:
● 1–3 photos
● partial 3D scans
● rough meshes
The model fills missing geometry and appearance.
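One way to realize this is inpainting-style sampling in the 3D latent, in the spirit of RePaint: at every reverse step, the region covered by the partial input is re-imposed while the model denoises the rest. The helper below is a hedged sketch; the ddim_step helper and the mask convention are assumptions.

```python
# Sketch of 3D completion via masked (inpainting-style) reverse sampling.
# known_mask is 1 where geometry was observed; ddim_step is an assumed helper.
import torch

def masked_reverse_step(model, z_t, t, cond, z_known, known_mask, alpha_bars, ddim_step):
    """z_known: latent of the partial scan; t: integer timestep."""
    eps = model.denoise(z_t, t, cond)
    z_prev = ddim_step(z_t, eps, t, alpha_bars)   # standard DDIM update for the unknown region
    if t > 0:
        noise = torch.randn_like(z_known)
        a = alpha_bars[t - 1]
        z_known_t = a.sqrt() * z_known + (1 - a).sqrt() * noise  # known region at level t-1
    else:
        z_known_t = z_known
    return known_mask * z_known_t + (1 - known_mask) * z_prev
```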
Millions of products or scenes can be processed.
WebGL / WebGPU / Unity / Unreal can display 3DGS directly.
● Create full 3D reconstructions from a few key photos
● Generate design variations (materials, colors, geometry tweaks)
● Create marketing content with consistent multi-view imagery
● Build interactive 3D viewers from minimal input data
This transforms product photography from manual workflows into generative automation.
A designer can:
● Roughly sketch an object
● Provide 2–3 reference photos
● Let 3D-SD-GS generate high-resolution 3D concepts consistent with physics and materials
This is useful for:
● jewelry
● apparel
● shoes
● furniture
● consumer electronics
Robots need accurate 3D models to:
● identify objects
● plan grasps
● simulate behavior
3D-SD-GS can reconstruct full objects from sparse or occluded cameras.
Because Gaussian splats are explicit geometry, the model can:
● reconstruct garments
● separate clothing from mannequins
● project clothing onto bodies
● maintain consistent lighting
● Rapid asset prototyping
● AI-driven scene dressing
● Direct import into Unreal, Unity, Maya, Blender
● Medical scans (3D volumes)
● Geospatial mapping
● Architecture / BIM
● Industrial inspection
The model can evolve scenes over time, which is useful for video and animation.
Extract camera trajectories and enforce 3D scene stabilization.
Combine explicit splats with implicit surfaces.
Billions of splats across product catalogs enable multi-category generalization.
3D Stable Diffusion over Gaussian Splatting represents the next step in generative foundation models: native 3D generation, not just multi-view 2D synthesis or implicit NeRF optimization. By diffusing directly in 3D latent space structured around Gaussian primitives, the model achieves coherence, controllability, and fidelity unattainable with traditional image-based approaches.
This technology enables new workflows in design, eCommerce, robotics, simulation, AR/VR, and high-end content creation — while dramatically lowering the cost and complexity of creating consistent, editable, high-quality 3D assets.
3D-SD-GS is positioned to become the core generative engine for the next decade of spatial computing.