3D Stable Diffusion Over Gaussian Splatting

 

Authors: Dmitriy Pinskiy and Eric Rubin

November 2025

 

A Generative Foundation Model Operating Directly in 3D Space


Abstract

This white paper introduces 3D Stable Diffusion over Gaussian Splatting (3D-SD-GS), a new generative modeling paradigm that extends diffusion models from 2D image space to true native 3D latent space. Unlike conventional diffusion pipelines that synthesize 2D imagery, 3D-SD-GS operates directly on Gaussian Splatting representations, enabling the model to generate, enhance, and manipulate 3D scenes with geometric and photometric consistency.

By training on datasets of multi-view images paired with optimized Gaussian Splatting reconstructions, the model learns a 3D distribution of shape, appearance, reflectance, and spatial structure. This allows it to synthesize novel 3D representations from text, image, or partial 3D inputs, and to enhance incomplete reconstructions with physically consistent detail.

The system addresses a major gap in generative AI: creating consistent, controllable, high-fidelity 3D assets without relying on implicit neural fields or low-resolution voxel grids. Gaussian Splatting provides an explicit, differentiable, scalable substrate, making it the ideal backbone for 3D diffusion.


1. Introduction

Generative diffusion models—such as Stable Diffusion—have transformed 2D content creation by learning a powerful latent distribution of images. However, most real-world applications, including eCommerce, industrial design, robotics simulation, digital twins, and AR/VR content creation, require consistent and editable 3D assets, not just images.

Existing 3D generative approaches rely on one of the following:

      Text-to-image diffusion + Score Distillation Sampling into NeRF/3DGS

      Low-resolution voxel or SDF diffusion

      2.5D approaches that synthesize multi-view images and then perform photogrammetry

      Tri-plane diffusion with limited geometric expressiveness

None of these operate natively in 3D.

With the emergence of 3D Gaussian Splatting (3DGS) as a fast, explicit, camera-agnostic representation, we can construct a diffusion model whose latent, noise process, and U-Net all operate directly over spatial Gaussian primitives instead of pixels.

This opens the door to:

      3D-aware generative design

      Rapid asset creation for games and movies

      Product visualization and variation

      Spatially consistent AI-driven 3D editing

This paper introduces such a system.


2. Background: Why Gaussian Splatting Enables True 3D Diffusion

3DGS represents a scene as a set of ellipsoidal Gaussian primitives, each defined by the following parameters (a code sketch follows the list):

      Center position

      Covariance / scale

      Color

      Opacity

      Optional learned SH lighting
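
As a concrete illustration, these per-splat parameters can be held in a plain tensor structure. The layout below is a minimal sketch, not the on-disk format of any particular 3DGS implementation; the field names and the degree-2 SH assumption are illustrative.

```python
from dataclasses import dataclass
import torch

@dataclass
class SplatField:
    """A batch of N Gaussian splats stored as dense tensors (illustrative layout)."""
    positions: torch.Tensor   # (N, 3)  center of each Gaussian
    scales: torch.Tensor      # (N, 3)  per-axis extent of the ellipsoid
    rotations: torch.Tensor   # (N, 4)  unit quaternion orienting the covariance
    colors: torch.Tensor      # (N, 3)  base RGB (or SH degree-0 coefficients)
    opacities: torch.Tensor   # (N, 1)  alpha in [0, 1]
    sh_coeffs: torch.Tensor   # (N, 9, 3) optional degree-2 spherical harmonics

def random_splats(n: int) -> SplatField:
    """Randomly initialized splat field, e.g. as a starting point for sampling."""
    return SplatField(
        positions=torch.randn(n, 3),
        scales=torch.rand(n, 3) * 0.05,
        rotations=torch.nn.functional.normalize(torch.randn(n, 4), dim=-1),
        colors=torch.rand(n, 3),
        opacities=torch.rand(n, 1),
        sh_coeffs=torch.zeros(n, 9, 3),
    )
```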

Advantages for diffusion:

2.1 Explicit geometry

Each splat has an explicit 3D location — unlike NeRF’s implicit volumetric field.

2.2 Continuous differentiable rasterization

Allows end-to-end training via view rendering.

2.3 Sparse + scalable

Millions of splats render fast and scale better than voxel grids.

2.4 Naturally matches diffusion's “particle evolution” metaphor

Diffusion adds/removes/reshapes particles; splats are particles.

Because of these traits, Gaussian Splatting is uniquely suited to serve as the latent representation of a generative diffusion process.


3. Model Overview: Stable Diffusion in 3D Latent Space

3D-SD-GS generalizes the architecture of 2D Stable Diffusion:

2D SD Component → 3D-SD-GS Equivalent

      Latent image grid (W×H×C) → Latent field of Gaussian parameters (positions, scales, colors, SH)

      Noise added to pixel latents → Noise added to splat properties or spatial latent grid

      2D U-Net → 3D U-Net operating on sparse 3D volumes or splat tensors

      VAE encoder → 3DGS encoder (splat → 3D latent voxel or tri-plane tensor)

      VAE decoder → 3DGS decoder (latent → optimized splat field)

      Rendering used only for output → Rendering used for training supervision

3.1 Latent Representation

Two practical options (option 1 is sketched in code after the list):

  1. Voxelized latent embedding of splat properties

      Store each splat’s features in the nearest voxel

      Apply sparse 3D convolutions

      Decode voxels back to splats

  2. Tri-plane latent (3 planes × feature channels)

      Used in EG3D

      More memory-efficient

      Good for high resolution
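
A minimal sketch of option 1, assuming splat centers normalized to the unit cube and a simple nearest-voxel mean-pooling scatter; a production encoder would likely use sparse 3D convolutions rather than the dense grid shown here.

```python
import torch

def voxelize_splats(positions: torch.Tensor,
                    features: torch.Tensor,
                    grid_res: int = 64) -> torch.Tensor:
    """Scatter per-splat feature vectors into their nearest voxel (mean-pooled).

    positions: (N, 3) splat centers, assumed normalized to [0, 1]^3
    features:  (N, C) per-splat features (e.g. concatenated scale, color, opacity)
    returns:   (C, R, R, R) dense latent grid
    """
    n, c = features.shape
    idx = (positions.clamp(0, 1 - 1e-6) * grid_res).long()   # (N, 3) voxel indices
    flat = idx[:, 0] * grid_res * grid_res + idx[:, 1] * grid_res + idx[:, 2]

    grid = torch.zeros(grid_res ** 3, c)
    count = torch.zeros(grid_res ** 3, 1)
    grid.index_add_(0, flat, features)                        # sum features per voxel
    count.index_add_(0, flat, torch.ones(n, 1))               # occupancy count per voxel
    grid = grid / count.clamp(min=1)                          # mean-pool occupied voxels

    return grid.t().reshape(c, grid_res, grid_res, grid_res)
```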

3.2 Noise Model

Noise is applied to:

      Splat positions

      Splat scales

      Splat colors

      Feature vectors in latent grid

The noise schedule is identical to DDPM/DDIM.
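
For concreteness, here is the standard DDPM forward step applied to a 3D latent (a per-splat tensor or a voxel grid); the schedule values are the usual linear-beta defaults and are not specific to this system.

```python
import torch

# Standard DDPM linear beta schedule (same as in 2D latent diffusion).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(latent: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """q(x_t | x_0): noise a clean 3D latent (per-splat tensor or voxel grid) at timestep t."""
    noise = torch.randn_like(latent)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (latent.dim() - 1)))  # broadcast over batch
    noisy = a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * noise
    return noisy, noise
```

The U-Net is then trained to predict `noise` from `noisy`, exactly as in 2D latent diffusion.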

3.3 3D U-Net

As in 2D SD, but with 3D convolutions and 3D cross-attention; a minimal block is sketched below.
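
A minimal sketch of one such block, assuming a voxelized latent and a generic conditioning token sequence; channel counts, normalization choices, and attention layout are illustrative, not the exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttnBlock3D(nn.Module):
    """One 3D U-Net block: conv over the voxel latent + cross-attention to conditioning tokens."""
    def __init__(self, channels: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, kdim=ctx_dim, vdim=ctx_dim,
                                          batch_first=True)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x:   (B, C, D, H, W) voxelized 3D latent
        # ctx: (B, L, ctx_dim) conditioning tokens (e.g. text/image embeddings)
        x = x + self.conv(x)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, D*H*W, C): voxels as tokens
        attn_out, _ = self.attn(self.norm(tokens), ctx, ctx)
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)
```

For example, CrossAttnBlock3D(channels=64, ctx_dim=768) attends a 64-channel voxel latent to 768-dimensional CLIP-style tokens.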

3.4 Conditioning

Supports the following conditioning inputs (see the sketch after this list):

      Text (via CLIP text encoder)

      Reference images (via CLIP image encoder)

      Partial 3DGS (inpainting in 3D)

      Meshes (encoded through geometry encoder)
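
One way to combine these signals is to map each to a shared token sequence that the 3D U-Net cross-attends to. In the sketch below the linear projections stand in for the real encoders (CLIP text/image, a geometry encoder for partial 3DGS or meshes); the input dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Maps heterogeneous conditions to one token sequence for 3D cross-attention."""
    def __init__(self, ctx_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(512, ctx_dim)    # stand-in for CLIP text token features
        self.image_proj = nn.Linear(512, ctx_dim)   # stand-in for CLIP image patch features
        self.geom_proj = nn.Linear(10, ctx_dim)     # stand-in for per-splat/mesh geometry features

    def forward(self, text_tokens=None, image_tokens=None, geom_tokens=None):
        tokens = []
        if text_tokens is not None:
            tokens.append(self.text_proj(text_tokens))
        if image_tokens is not None:
            tokens.append(self.image_proj(image_tokens))
        if geom_tokens is not None:
            tokens.append(self.geom_proj(geom_tokens))
        # Concatenate along the sequence axis; the U-Net cross-attends to this context.
        return torch.cat(tokens, dim=1)   # (B, L_total, ctx_dim)
```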


4. Training Data

3D-SD-GS requires paired (multi-view images → optimized 3DGS) datasets.

4.1 Sources of Training Data

      Industrial photogrammetry pipelines

      3D capture of products, shoes, jewelry, apparel, furniture

      Public datasets (Mip-NeRF 360, Tanks and Temples, Objaverse)

      Synthetic datasets generated from meshes using UE5/Maya/Blender

4.2 Required Components per Sample

For each object/scene (a sample-record sketch follows the list):

  1. Multi-view images

  2. Camera intrinsics & extrinsics

  3. Optimized 3DGS model

  4. Optional material maps (normal, roughness)

  5. Text captions (automatically generated using BLIP or GPT-4o)
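
These items can be organized as a single per-sample record; the field names and shapes below are an illustrative layout, not a required format.

```python
from dataclasses import dataclass, field
import torch

@dataclass
class TrainingSample:
    """One (multi-view images -> optimized 3DGS) training pair (illustrative layout)."""
    images: torch.Tensor           # (V, 3, H, W) multi-view RGB images
    intrinsics: torch.Tensor       # (V, 3, 3) per-view camera intrinsics
    extrinsics: torch.Tensor       # (V, 4, 4) per-view world-to-camera transforms
    splats: dict                   # optimized 3DGS parameters (positions, scales, colors, ...)
    caption: str = ""              # auto-generated text caption (e.g. from BLIP)
    material_maps: dict = field(default_factory=dict)   # optional normal/roughness maps
```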

4.3 Training Objectives

The model is trained to (a combined-loss sketch follows this list):

  1. Predict denoised 3D latent from noisy 3D latent

  2. Render splats into 2D camera viewpoints

  3. Match reconstructed images with ground-truth images (photometric loss)

  4. Learn semantic alignment via CLIP loss

  5. Learn geometric priors via regularization

  6. Maintain splat compactness via sparsification losses
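
A hedged sketch of how these objectives might be combined in one training step. It reuses `add_noise` and `alphas_cumprod` from the Section 3.2 sketch; `unet`, `decode_to_splats`, `render_views`, and `clip_similarity` are placeholders for the 3D U-Net, the latent-to-splat decoder, a differentiable 3DGS rasterizer, and a CLIP-based similarity, and the loss weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def training_step(unet, decode_to_splats, render_views, clip_similarity,
                  latent, ctx, cameras, gt_images, t,
                  w_photo=1.0, w_clip=0.1, w_geo=1e-2, w_sparse=1e-3):
    """One combined step over the objectives above (weights are placeholders)."""
    # 1. Predict the noise added to the 3D latent (standard epsilon-prediction).
    noisy, noise = add_noise(latent, t)                      # forward process from Section 3.2
    pred_noise = unet(noisy, t, ctx)
    loss_denoise = F.mse_loss(pred_noise, noise)

    # 2-3. Decode an x0 estimate to splats, render the training viewpoints,
    #      and compare against ground-truth images (photometric loss).
    a_bar = alphas_cumprod[t].view(-1, *([1] * (latent.dim() - 1)))
    x0_hat = (noisy - (1.0 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
    splats = decode_to_splats(x0_hat)                        # SplatField as in Section 2
    rendered = render_views(splats, cameras)                 # differentiable 3DGS rasterization
    loss_photo = F.l1_loss(rendered, gt_images)

    # 4. Semantic alignment between renders and the conditioning (CLIP-space similarity).
    loss_clip = 1.0 - clip_similarity(rendered, ctx).mean()

    # 5. Geometric prior: discourage degenerate, extremely elongated splats.
    loss_geo = splats.scales.pow(2).mean()

    # 6. Sparsification: keep the splat field compact by penalizing opacity mass.
    loss_sparse = splats.opacities.abs().mean()

    return (loss_denoise + w_photo * loss_photo + w_clip * loss_clip
            + w_geo * loss_geo + w_sparse * loss_sparse)
```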


5. Advantages of 3D Diffusion Over 3DGS

5.1 True 3D consistency

Unlike 2D SD or multi-view diffusion, the model maintains one coherent 3D latent.

This prevents:

      view misalignments

      geometry hallucination

      inconsistent textures

5.2 Faster and smaller than NeRF-based models

3DGS renders in real time and compresses well.

5.3 Editable explicit geometry

Users can:

      move splats

      delete parts

      adjust shape

      refine materials

5.4 Works with extremely sparse input

Given only:

      1–3 photos

      or partial 3D scans

      or rough meshes

the model fills in the missing geometry and appearance.

5.5 Ideal for large-scale datasets (eCommerce, robotics)

Millions of products or scenes can be processed.

5.6 Compatibility with existing rendering engines

WebGL / WebGPU / Unity / Unreal can display 3DGS directly.


6. Use Cases


6.1 eCommerce and Product Visualization

      Create a full 3D reconstruction from a few input photos

      Generate design variations (materials, colors, geometry tweaks)

      Create marketing content with consistent multi-view imagery

      Build interactive 3D viewers from minimal input data

This transforms product photography from manual workflows into generative automation.


6.2 Generative Product Design

A designer can:

      Roughly sketch an object

      Provide 2–3 reference photos

      Let 3D-SD-GS generate high-resolution 3D concepts consistent with physics and materials

This is useful for:

      jewelry

      apparel

      shoes

      furniture

      consumer electronics


6.3 Robotics, Simulation, and Digital Twins

Robots need accurate 3D models to:

      identify objects

      plan grasps

      simulate behavior

3D-SD-GS can reconstruct complete objects from sparse or occluded camera views.


6.4 Virtual Try-On and AR

Because Gaussian splats are explicit geometry, the model can:

      reconstruct garments

      separate clothing from mannequins

      project clothing onto bodies

      maintain consistent lighting


6.5 Film, Animation, and Gaming

      Rapid asset prototyping

      AI-driven scene dressing

      Direct import into Unreal, Unity, Maya, Blender


6.6 Scientific and Industrial Applications

      Medical scans (3D volumes)

      Geospatial mapping

      Architecture / BIM

      Industrial inspection


7. Future Directions

7.1 Temporal diffusion for 4D Gaussian Splatting

The model can evolve scenes over time, which is useful for video and animation.

7.2 Integration with Sora-like generative video models

Extract camera trajectories and enforce 3D scene stabilization.

7.3 Mesh/SDF hybrid diffusion

Combine explicit splats with implicit surfaces.

7.4 Industrial-scale training datasets

Billions of splats across product catalogs enable multi-category generalization.


8. Conclusion

3D Stable Diffusion over Gaussian Splatting represents the next step in generative foundation models: native 3D generation, not just multi-view 2D synthesis or implicit NeRF optimization. By diffusing directly in 3D latent space structured around Gaussian primitives, the model achieves coherence, controllability, and fidelity unattainable with traditional image-based approaches.

This technology enables new workflows in design, eCommerce, robotics, simulation, AR/VR, and high-end content creation — while dramatically lowering the cost and complexity of creating consistent, editable, high-quality 3D assets.

3D-SD-GS is positioned to become the core generative engine for the next decade of spatial computing.