3D Stable Diffusion Over Gaussian Splatting

 

Authors: Dmitriy Pinskiy and Eric Rubin

November 2025

 

A Generative Foundation Model Operating Directly in 3D Space


Abstract

This white paper introduces 3D Stable Diffusion over Gaussian Splatting (3D-SD-GS), a new generative modeling paradigm that extends diffusion models from 2D image space to true native 3D latent space. Unlike conventional diffusion pipelines that synthesize 2D imagery, 3D-SD-GS operates directly on Gaussian Splatting representations, enabling the model to generate, enhance, and manipulate 3D scenes with geometric and photometric consistency.

By training on datasets of multi-view images paired with optimized Gaussian Splatting reconstructions, the model learns a 3D distribution of shape, appearance, reflectance, and spatial structure. This allows it to synthesize novel 3D representations from text, image, or partial 3D inputs, and to enhance incomplete reconstructions with physically consistent detail.

The system addresses a major gap in generative AI: creating consistent, controllable, high-fidelity 3D assets without relying on implicit neural fields or low-resolution voxel grids. Gaussian Splatting provides an explicit, differentiable, scalable substrate, making it the ideal backbone for 3D diffusion.


1. Introduction

Generative diffusion models—such as Stable Diffusion—have transformed 2D content creation by learning a powerful latent distribution of images. However, most real-world applications, including eCommerce, industrial design, robotics simulation, digital twins, and AR/VR content creation, require consistent and editable 3D assets, not just images.

Existing 3D generative approaches rely on one of the following:

      Text-to-image diffusion + Score Distillation Sampling into NeRF/3DGS

      Low-resolution voxel or SDF diffusion

      2.5D approaches that synthesize multi-view images and then perform photogrammetry

      Tri-plane diffusion with limited geometric expressiveness

None of these operate natively in 3D.

With the emergence of 3D Gaussian Splatting (3DGS) as a fast, explicit, camera-agnostic representation, we can construct a diffusion model whose latent, noise process, and U-Net all operate directly over spatial Gaussian primitives instead of pixels.

This opens the door to:

      3D-aware generative design

      Rapid asset creation for games and movies

      Product visualization and variation

      Spatially consistent AI-driven 3D editing

This paper introduces such a system.


2. Background: Why Gaussian Splatting Enables True 3D Diffusion

3DGS represents a scene as a set of ellipsoidal Gaussian primitives, each defined by the following parameters (a code sketch follows the list):

      Center position

      Covariance / scale

      Color

      Opacity

      Optional learned SH lighting
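
As a concrete illustration, these per-splat parameters can be held in a plain tensor structure. The layout below is a minimal sketch, not the on-disk format of any particular 3DGS implementation; the field names and the degree-2 SH assumption are illustrative.

```python
from dataclasses import dataclass
import torch

@dataclass
class SplatField:
    """A batch of N Gaussian splats stored as dense tensors (illustrative layout)."""
    positions: torch.Tensor   # (N, 3)  center of each Gaussian
    scales: torch.Tensor      # (N, 3)  per-axis extent of the ellipsoid
    rotations: torch.Tensor   # (N, 4)  unit quaternion orienting the covariance
    colors: torch.Tensor      # (N, 3)  base RGB (or SH degree-0 coefficients)
    opacities: torch.Tensor   # (N, 1)  alpha in [0, 1]
    sh_coeffs: torch.Tensor   # (N, 9, 3) optional degree-2 spherical harmonics

def random_splats(n: int) -> SplatField:
    """Randomly initialized splat field, e.g. as a starting point for sampling."""
    return SplatField(
        positions=torch.randn(n, 3),
        scales=torch.rand(n, 3) * 0.05,
        rotations=torch.nn.functional.normalize(torch.randn(n, 4), dim=-1),
        colors=torch.rand(n, 3),
        opacities=torch.rand(n, 1),
        sh_coeffs=torch.zeros(n, 9, 3),
    )
```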

Advantages for diffusion:

2.1 Explicit geometry

Each splat has an explicit 3D location — unlike NeRF’s implicit volumetric field.

2.2 Continuous differentiable rasterization

Allows end-to-end training via view rendering.

2.3 Sparse + scalable

Millions of splats render fast and scale better than voxel grids.

2.4 Naturally matches diffusion's “particle evolution” metaphor

Diffusion adds/removes/reshapes particles; splats are particles.

Because of these traits, Gaussian Splatting is uniquely suited to serve as the latent representation of a generative diffusion process.


3. Model Overview: Stable Diffusion in 3D Latent Space

3D-SD-GS generalizes the architecture of 2D Stable Diffusion:

2D SD Component → 3D-SD-GS Equivalent

      Latent image grid (W×H×C) → Latent field of Gaussian parameters (positions, scales, colors, SH)

      Noise added to pixel latents → Noise added to splat properties or spatial latent grid

      2D U-Net → 3D U-Net operating on sparse 3D volumes or splat tensors

      VAE encoder → 3DGS encoder (splat → 3D latent voxel or tri-plane tensor)

      VAE decoder → 3DGS decoder (latent → optimized splat field)

      Rendering used only for output → Rendering used for training supervision

3.1 Latent Representation

Two practical options (option 1 is sketched in code after the list):

  1. Voxelized latent embedding of splat properties

      Store each splat’s features in the nearest voxel

      Apply sparse 3D convolutions

      Decode voxels back to splats

  2. Tri-plane latent (3 planes × feature channels)

      Used in EG3D

      More memory-efficient

      Good for high resolution
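
A minimal sketch of option 1, assuming splat centers normalized to the unit cube and a simple nearest-voxel mean-pooling scatter; a production encoder would likely use sparse 3D convolutions rather than the dense grid shown here.

```python
import torch

def voxelize_splats(positions: torch.Tensor,
                    features: torch.Tensor,
                    grid_res: int = 64) -> torch.Tensor:
    """Scatter per-splat feature vectors into their nearest voxel (mean-pooled).

    positions: (N, 3) splat centers, assumed normalized to [0, 1]^3
    features:  (N, C) per-splat features (e.g. concatenated scale, color, opacity)
    returns:   (C, R, R, R) dense latent grid
    """
    n, c = features.shape
    idx = (positions.clamp(0, 1 - 1e-6) * grid_res).long()   # (N, 3) voxel indices
    flat = idx[:, 0] * grid_res * grid_res + idx[:, 1] * grid_res + idx[:, 2]

    grid = torch.zeros(grid_res ** 3, c)
    count = torch.zeros(grid_res ** 3, 1)
    grid.index_add_(0, flat, features)                        # sum features per voxel
    count.index_add_(0, flat, torch.ones(n, 1))               # occupancy count per voxel
    grid = grid / count.clamp(min=1)                          # mean-pool occupied voxels

    return grid.t().reshape(c, grid_res, grid_res, grid_res)
```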

3.2 Noise Model

Noise is applied to:

      Splat positions

      Splat scales

      Splat colors

      Feature vectors in latent grid

The noise schedule is identical to DDPM/DDIM.
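
For concreteness, here is the standard DDPM forward step applied to a 3D latent (a per-splat tensor or a voxel grid); the schedule values are the usual linear-beta defaults and are not specific to this system.

```python
import torch

# Standard DDPM linear beta schedule (same as in 2D latent diffusion).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(latent: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """q(x_t | x_0): noise a clean 3D latent (per-splat tensor or voxel grid) at timestep t."""
    noise = torch.randn_like(latent)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (latent.dim() - 1)))  # broadcast over batch
    noisy = a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * noise
    return noisy, noise
```

The U-Net is then trained to predict `noise` from `noisy`, exactly as in 2D latent diffusion.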

3.3 3D U-Net

As in 2D SD, but with 3D convolutions and 3D cross-attention; a minimal block is sketched below.
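
A minimal sketch of one such block, assuming a voxelized latent and a generic conditioning token sequence; channel counts, normalization choices, and attention layout are illustrative, not the exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttnBlock3D(nn.Module):
    """One 3D U-Net block: conv over the voxel latent + cross-attention to conditioning tokens."""
    def __init__(self, channels: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, kdim=ctx_dim, vdim=ctx_dim,
                                          batch_first=True)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x:   (B, C, D, H, W) voxelized 3D latent
        # ctx: (B, L, ctx_dim) conditioning tokens (e.g. text/image embeddings)
        x = x + self.conv(x)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, D*H*W, C): voxels as tokens
        attn_out, _ = self.attn(self.norm(tokens), ctx, ctx)
        tokens = tokens + attn_out
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)
```

For example, CrossAttnBlock3D(channels=64, ctx_dim=768) attends a 64-channel voxel latent to 768-dimensional CLIP-style tokens.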

3.4 Conditioning

Supports the following conditioning inputs (see the sketch after this list):

      Text (via CLIP text encoder)

      Reference images (via CLIP image encoder)

      Partial 3DGS (inpainting in 3D)

      Meshes (encoded through geometry encoder)
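
One way to combine these signals is to map each to a shared token sequence that the 3D U-Net cross-attends to. In the sketch below the linear projections stand in for the real encoders (CLIP text/image, a geometry encoder for partial 3DGS or meshes); the input dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Maps heterogeneous conditions to one token sequence for 3D cross-attention."""
    def __init__(self, ctx_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(512, ctx_dim)    # stand-in for CLIP text token features
        self.image_proj = nn.Linear(512, ctx_dim)   # stand-in for CLIP image patch features
        self.geom_proj = nn.Linear(10, ctx_dim)     # stand-in for per-splat/mesh geometry features

    def forward(self, text_tokens=None, image_tokens=None, geom_tokens=None):
        tokens = []
        if text_tokens is not None:
            tokens.append(self.text_proj(text_tokens))
        if image_tokens is not None:
            tokens.append(self.image_proj(image_tokens))
        if geom_tokens is not None:
            tokens.append(self.geom_proj(geom_tokens))
        # Concatenate along the sequence axis; the U-Net cross-attends to this context.
        return torch.cat(tokens, dim=1)   # (B, L_total, ctx_dim)
```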


4. Training Data

3D-SD-GS requires paired (multi-view images → optimized 3DGS) datasets.

4.1 Sources of Training Data

      Industrial photogrammetry pipelines

      3D capture of products, shoes, jewelry, apparel, furniture

      Public datasets (Mip-NeRF 360, Tanks and Temples, Objaverse)

      Synthetic datasets generated from meshes using UE5/Maya/Blender

4.2 Required Components per Sample

For each object/scene (a sample-record sketch follows the list):

  1. Multi-view images

  2. Camera intrinsics & extrinsics

  3. Optimized 3DGS model

  4. Optional material maps (normal, roughness)

  5. Text captions (automatically generated using BLIP or GPT-4o)
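
These items can be organized as a single per-sample record; the field names and shapes below are an illustrative layout, not a required format.

```python
from dataclasses import dataclass, field
import torch

@dataclass
class TrainingSample:
    """One (multi-view images -> optimized 3DGS) training pair (illustrative layout)."""
    images: torch.Tensor           # (V, 3, H, W) multi-view RGB images
    intrinsics: torch.Tensor       # (V, 3, 3) per-view camera intrinsics
    extrinsics: torch.Tensor       # (V, 4, 4) per-view world-to-camera transforms
    splats: dict                   # optimized 3DGS parameters (positions, scales, colors, ...)
    caption: str = ""              # auto-generated text caption (e.g. from BLIP)
    material_maps: dict = field(default_factory=dict)   # optional normal/roughness maps
```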

4.3 Training Objectives

The model is trained to (a combined-loss sketch follows this list):

  1. Predict denoised 3D latent from noisy 3D latent

  2. Render splats into 2D camera viewpoints

  3. Match reconstructed images with ground-truth images (photometric loss)

  4. Learn semantic alignment via CLIP loss

  5. Learn geometric priors via regularization

  6. Maintain splat compactness via sparsification losses
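
A hedged sketch of how these objectives might be combined in one training step. It reuses `add_noise` and `alphas_cumprod` from the Section 3.2 sketch; `unet`, `decode_to_splats`, `render_views`, and `clip_similarity` are placeholders for the 3D U-Net, the latent-to-splat decoder, a differentiable 3DGS rasterizer, and a CLIP-based similarity, and the loss weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def training_step(unet, decode_to_splats, render_views, clip_similarity,
                  latent, ctx, cameras, gt_images, t,
                  w_photo=1.0, w_clip=0.1, w_geo=1e-2, w_sparse=1e-3):
    """One combined step over the objectives above (weights are placeholders)."""
    # 1. Predict the noise added to the 3D latent (standard epsilon-prediction).
    noisy, noise = add_noise(latent, t)                      # forward process from Section 3.2
    pred_noise = unet(noisy, t, ctx)
    loss_denoise = F.mse_loss(pred_noise, noise)

    # 2-3. Decode an x0 estimate to splats, render the training viewpoints,
    #      and compare against ground-truth images (photometric loss).
    a_bar = alphas_cumprod[t].view(-1, *([1] * (latent.dim() - 1)))
    x0_hat = (noisy - (1.0 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
    splats = decode_to_splats(x0_hat)                        # SplatField as in Section 2
    rendered = render_views(splats, cameras)                 # differentiable 3DGS rasterization
    loss_photo = F.l1_loss(rendered, gt_images)

    # 4. Semantic alignment between renders and the conditioning (CLIP-space similarity).
    loss_clip = 1.0 - clip_similarity(rendered, ctx).mean()

    # 5. Geometric prior: discourage degenerate, extremely elongated splats.
    loss_geo = splats.scales.pow(2).mean()

    # 6. Sparsification: keep the splat field compact by penalizing opacity mass.
    loss_sparse = splats.opacities.abs().mean()

    return (loss_denoise + w_photo * loss_photo + w_clip * loss_clip
            + w_geo * loss_geo + w_sparse * loss_sparse)
```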


5. Advantages of 3D Diffusion Over 3DGS

5.1 True 3D consistency

Unlike 2D SD or multi-view diffusion, the model maintains one coherent 3D latent.

This prevents:

      view misalignments

      geometry hallucination

      inconsistent textures

5.2 Faster and smaller than NeRF-based models

3DGS renders in real time and compresses well.

5.3 Editable explicit geometry

Users can:

      move splats

      delete parts

      adjust shape

      refine materials

5.4 Works with extremely sparse input

Given only:

      1–3 photos

      or partial 3D scans

      or rough meshes

the model fills in the missing geometry and appearance.

5.5 Ideal for large-scale datasets (eCommerce, robotics)

Millions of products or scenes can be processed.

5.6 Compatibility with existing rendering engines

WebGL / WebGPU / Unity / Unreal can display 3DGS directly.


6. Use Cases


6.1 eCommerce and Product Visualization

      Create a full 3D reconstruction from a few input photos

      Generate design variations (materials, colors, geometry tweaks)

      Create marketing content with consistent multi-view imagery

      Build interactive 3D viewers from minimal input data

This transforms product photography from manual workflows into generative automation.


6.2 Generative Product Design

A designer can:

      Roughly sketch an object

      Provide 2–3 reference photos

      Let 3D-SD-GS generate high-resolution 3D concepts consistent with physics and materials

This is useful for:

      jewelry

      apparel

      shoes

      furniture

      consumer electronics


6.3 Robotics, Simulation, and Digital Twins

Robots need accurate 3D models to:

      identify objects

      plan grasps

      simulate behavior

3D-SD-GS can reconstruct complete objects from sparse or occluded camera views.


6.4 Virtual Try-On and AR

Because Gaussian splats are explicit geometry, the model can:

      reconstruct garments

      separate clothing from mannequins

      project clothing onto bodies

      maintain consistent lighting


6.5 Film, Animation, and Gaming

      Rapid asset prototyping

      AI-driven scene dressing

      Direct import into Unreal, Unity, Maya, Blender


6.6 Scientific and Industrial Applications

      Medical scans (3D volumes)

      Geospatial mapping

      Architecture / BIM

      Industrial inspection


7. Future Directions

7.1 Temporal diffusion for 4D Gaussian Splatting

The model can evolve scenes over time, which is useful for video and animation.

7.2 Integration with Sora-like generative video models

Extract camera trajectories and enforce 3D scene stabilization.

7.3 Mesh/SDF hybrid diffusion

Combine explicit splats with implicit surfaces.

7.4 Industrial-scale training datasets

Billions of splats across product catalogs enable multi-category generalization.


8. Conclusion

3D Stable Diffusion over Gaussian Splatting represents the next step in generative foundation models: native 3D generation, not just multi-view 2D synthesis or implicit NeRF optimization. By diffusing directly in 3D latent space structured around Gaussian primitives, the model achieves coherence, controllability, and fidelity unattainable with traditional image-based approaches.

This technology enables new workflows in design, eCommerce, robotics, simulation, AR/VR, and high-end content creation — while dramatically lowering the cost and complexity of creating consistent, editable, high-quality 3D assets.

3D-SD-GS is positioned to become the core generative engine for the next decade of spatial computing.