fVDB Version History
Version 0.5.0 - In Development
Version 0.4.2 - March 25, 2026
Bug Fixes:
Build warning fix for GCC 14.
Version 0.4.1 - March 25, 2026
Bug Fixes:
Updated the nanovdb-editor dependency to use version 0.0.23.
Version 0.4.0 - March 12, 2026
140 commits, 300+ files changed, 10 contributors.
This release delivers major advances across the Gaussian splatting pipeline, sparse convolution, multi-GPU performance, and build/release infrastructure. fVDB now supports PyTorch 2.10 and CUDA 12.8/13.0, and ships its first formal release process with automated nightly builds.
Highlights:
Gaussian splatting gains a new rasterize-from-world path that renders directly from 3D Gaussians, Unscented Transform projection for non-pinhole camera models, full MCMC splatting support, sparse rendering, per-pixel/per-tile masking, and a composable camera model that decouples kernels from camera internals. Numerous gradient correctness fixes harden the backward pass.
Sparse convolution has been consolidated into a single GatherScatterDefault backend with full feature support including transposed convolution and arbitrary strides. A new PredGatherIGemm backend using CUTLASS/CuTe implicit-GEMM with TF32 tensor cores delivers significantly faster convolution on dense grids.
A new multi-axis dispatch framework provides flexible kernel execution across multiple dimensions with typed views and for_each iteration.
SampleGridTrilinear is roughly 2x faster via vectorized float4 loads and a fused stencil-plus-sample optimization. Morton and Hilbert space-filling curve ordering is now available for grid coordinates.
Multi-GPU scaling is significantly improved through batched prefetching, device-centric synchronization, and radix sort optimizations. All tensor index accessors are now 64-bit, enabling larger datasets.
A fully automated nightly wheel build and publish pipeline, a formal OneFlow release process with automation scripts, and GPU-validated publish workflows are all new in this release.
Contributors: @blackencino, @fwilliams, @harrism, @iYuqinL, @kmuseth, @matthewdcong, @phapalova, @areidmeyer, @swahtz, @zlalena
Gaussian Splatting & Rendering
New Features:
Added a new rasterization pathway that operates directly on 3D Gaussians (#444 - @fwilliams).
Added Gaussian projection via the Unscented Transform, providing an alternative to the EWA splatting approximation (#420 - @fwilliams).
Added full MCMC Gaussian Splatting support, including relocation (#374) and add-noise (#377) kernels, Python bindings (#394), and tunable min_opacity (#396) and k/t (#402) parameters (@harrism, @fwilliams).
Added end-to-end sparse Gaussian rendering with sparse rendering functions (#348) and sparse tile intersection (#401) (@fwilliams, @swahtz).
Rasterization can now render all contributing Gaussian IDs and weights per pixel (#340 - @swahtz).
Gaussian rasterization now supports background colors (#343 - @harrism).
All Gaussian render methods now accept masks and backgrounds (#480 - @swahtz).
The evaluate_spherical_harmonics function is now exposed in the Python API (#431 - @swahtz).
Refactored the rendering pipeline around composable camera operation classes that encapsulate camera-space transform and projection, decoupling kernels from camera internals (#485 - @fwilliams), with a CameraIntrinsics constructor fix for host/device compatibility (#489 - @blackencino).
Optimizations:
Switched the GaussianTileIntersection cumulative sum to use CUB for better performance (#427 - @swahtz).
Optimized the computeSparseInfo path to reduce overhead in sparse rendering (#428 - @swahtz).
Improved the contributing Gaussian ID kernels with shared-memory and loop optimizations (#429 - @swahtz).
Optimized tile intersection for multi-GPU execution with better prefetching (#446 - @matthewdcong).
Removed an unnecessary stream synchronization in GaussianTileIntersection (#370 - @harrism).
ProjectedGaussianSplats opacities now use an efficient expand/view instead of a per-element copy (#457 - @swahtz).
Bug Fixes:
Fixed a shared memory alignment issue in the Gaussian rasterization kernel (#342 - @swahtz).
Fixed inverted abs(gradient) logic in the backward rasterization pass that produced incorrect gradients (@harrism).
Fixed NaN outputs in the top-contributing Gaussian IDs weights computation (#400 - @swahtz).
Fixed camera data loading that could exceed blockDim when using many cameras (#345 - @swahtz).
Fixed incorrect derivation of the number of cameras in packed rasterization mode (#414 - @swahtz).
Fixed the chain rule for the log_scale gradient in the projection backward pass (#433 - @harrism).
Fixed a race condition in the spherical harmonics backward pass when using multiple cameras or large batch sizes (#437 - @swahtz).
Fixed the dLossDQuat quaternion gradient missing a warp-level reduction in the projection backward pass (#435, #533 - @swahtz, @matthewdcong).
Fixed a multi-GPU race condition in the multibatch spherical harmonics backward pass (#484 - @matthewdcong).
Fixed the ProjectionForward kernel double-initializing accessors, which caused correctness issues (#453 - @swahtz).
Fixed a crash when loading GaussianPly files to a CPU device (#417 - @swahtz).
Fixed handling of duplicate pixels in sparse pixel Gaussian rendering (#488 - @harrism).
Fixed an incorrect datatype in the backward projection test (#486 - @matthewdcong).
Sparse Convolution (Major)
Consolidated all legacy sparse convolution backends into a single GatherScatterDefault backend with full feature support, including transposed convolution, arbitrary strides, and all float types (#473 - @blackencino).
Added a new PredGatherIGemm sparse convolution backend using CUTLASS/CuTe implicit-GEMM with TF32 tensor cores, significantly faster than GatherScatterDefault for dense or near-dense grids (#508 - @blackencino).
Fixed the default convolution behavior and added extensive correctness tests (#321 - @blackencino).
Added gradient and backward pass tests to the convolution unit test suite (#358, #361 - @blackencino).
Removed unused legacy sparse convolution backends (ImplicitGEMM, CUTLASS, LGGS, ME), deleting approximately 22,500 lines of code (#454 - @blackencino).
Moved all op dispatch and precondition code into each op’s C++ implementation files, making ops self-contained and reducing compile-time interconnectivity (#492 - @blackencino).
Multi-Axis Dispatch Framework (New)
Introduced a new multi-axis dispatch framework for flexible kernel execution across multiple dimensions (#418 - @blackencino).
Extended the dispatch framework with for_each iteration, typed views, and tag canonicalization (#452 - @blackencino).
The framework ships as a full C++ library under src/dispatch/ with comprehensive tests and benchmarks.
Grid Operations & Spatial Indexing
Added Morton and Hilbert space-filling curve ordering for Grid and GridBatch ijk coordinates, with module-level standalone functions (#311, #316, #323 - @blackencino).
SampleGridTrilinear now uses vectorized float4 loads, yielding roughly a 2x throughput improvement (#430 - @swahtz).
SampleGridTrilinear received a second optimization pass using a fused stencil-plus-sample approach (#474 - @swahtz).
Cleaned up the active grid coordinate generation code for clarity and consistency (#318 - @blackencino).
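To illustrate what a Morton (Z-order) ordering of ijk coordinates means, here is a minimal pure-Python sketch. It is not the fVDB implementation or API: the function name morton_encode, the 21-bit default, and the restriction to non-negative coordinates are assumptions made for this example only.

```python
def morton_encode(i: int, j: int, k: int, bits: int = 21) -> int:
    """Interleave the low `bits` bits of i, j, k into one Z-order code.

    Hypothetical helper for illustration; assumes non-negative coordinates.
    Bit b of i lands at position 3*b, of j at 3*b + 1, of k at 3*b + 2.
    """
    code = 0
    for b in range(bits):
        code |= ((i >> b) & 1) << (3 * b)
        code |= ((j >> b) & 1) << (3 * b + 1)
        code |= ((k >> b) & 1) << (3 * b + 2)
    return code


# Sorting coordinates by their Morton code keeps spatially nearby
# voxels close together in memory order.
coords = [(1, 1, 1), (0, 0, 0), (2, 0, 0), (0, 1, 0)]
ordered = sorted(coords, key=lambda c: morton_encode(*c))
print(ordered)
```

A Hilbert ordering serves the same purpose with better locality between consecutive codes, at the cost of a more involved encoding.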
JaggedTensor
JaggedTensor reduce operators now support bfloat16 (#501 - @swahtz).
Fixed a binary search edge case in JIdxForJOffsets that returned incorrect indices when joffsets contained duplicate values (#325 - @iYuqinL).
Fixed from_*_and_list_ids producing incorrect results with ldim=2 (#357 - @swahtz).
Fixed concatenation errors in JaggedTensor.jcat (#352 - @blackencino).
Reduced the number of blocking GPU-to-CPU copies in the unbind* methods, improving throughput (#363 - @swahtz).
Fixed the single-element JaggedTensor constructor unconditionally initializing CUDA even for CPU tensors (#469 - @swahtz).
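Duplicate values in joffsets correspond to empty tensors in the batch, which is why the binary search mentioned above needs care. The following pure-Python sketch shows one way to map element indices to tensor indices that handles duplicates. It is an illustration only and not the fVDB code: jidx_for_joffsets is a hypothetical helper, and the actual bug and fix in JIdxForJOffsets may differ.

```python
from bisect import bisect_left, bisect_right


def jidx_for_joffsets(joffsets: list[int], num_elements: int) -> list[int]:
    """Return, for each element e, the tensor t with joffsets[t] <= e < joffsets[t + 1].

    Hypothetical helper for illustration. An upper-bound search
    (bisect_right) skips past runs of equal offsets, so empty tensors
    are never selected.
    """
    return [bisect_right(joffsets, e) - 1 for e in range(num_elements)]


# Three tensors with lengths 2, 0, 3: the middle one is empty,
# so the offset 2 appears twice.
joffsets = [0, 2, 2, 5]
jidx = jidx_for_joffsets(joffsets, joffsets[-1])
print(jidx)

# A lower-bound search stops at the first matching offset, which for
# element 2 points at the empty tensor 1 instead of tensor 2.
naive_hit = bisect_left(joffsets, 2)
print(naive_hit)
```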
Performance & Multi-GPU
Optimized joffsets construction by using pinned memory to overlap CPU/GPU transfers (#403 - @matthewdcong).
Significantly improved multi-GPU scaling through batched prefetching and sorting changes (#499 - @matthewdcong).
Switched to device-centric synchronization for the forEach multi-GPU codepath (#440 - @matthewdcong).
Fixed and improved radix sort synchronization over several rounds of changes (#315, #409, #415 - @matthewdcong).
Fused SSIM outputs now prefetch to avoid write page faults that degraded performance (#407 - @matthewdcong).
MCMC kernels now support PrivateUse1 for multi-GPU execution (#421 - @harrism).
Switched from torch.inverse to torch.linalg.inv_ex to avoid an unnecessary device synchronization (#487 - @matthewdcong).
All 32-bit tensor index accessors have been upgraded to 64-bit across every op, enabling support for larger datasets (#505 - @harrism).
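As a rough illustration of why 64-bit index accessors matter for larger datasets, the arithmetic below checks whether the largest flat element index of a feature tensor still fits in a signed 32-bit integer. The voxel and channel counts are made-up numbers for the example, and fits_in_int32 is a hypothetical helper, not part of fVDB.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def fits_in_int32(num_voxels: int, channels: int) -> bool:
    """True if the last flat index of a (num_voxels, channels) tensor fits in int32."""
    last_flat_index = num_voxels * channels - 1
    return last_flat_index <= INT32_MAX


# 10 million voxels with 32 channels: 320 million elements, still addressable.
small_ok = fits_in_int32(10_000_000, 32)

# 100 million voxels with 32 channels: 3.2 billion elements, which
# overflows a 32-bit index and requires 64-bit accessors.
large_ok = fits_in_int32(100_000_000, 32)

print(small_ok, large_ok)
```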
PyTorch & CUDA Compatibility
fVDB now builds and runs with PyTorch 2.10 (#423, #521 - @matthewdcong, @swahtz).
Added support for CUDA 12.8 and 13.0 toolkits (#521 - @swahtz).
Replaced the custom scaled dot-product attention implementation with PyTorch’s native torch.scaled_dot_product_attention (#364 - @swahtz).
Fixed the CCCL version check macro that could cause build failures with newer CUDA toolkits (#509 - @matthewdcong).
Improved PyTorch build configuration time by streamlining CMake detection (#441 - @matthewdcong).
NanoVDB
Updated the bundled NanoVDB dependency to version 32.9.1 (#475, #483, #493 - @swahtz).
Fixed voxel size and origin metadata not being preserved when serializing index grids (#490 - @swahtz).
Neural Network Modules
Fixed several bugs in SimpleUnet: NaN propagation from -inf values entering BatchNorm after max-pooling, incorrect ConvolutionPlan source/target grid assignments, and a crash on non-contiguous grad_output in the convolution backward pass (#496 - @swahtz).
Added dedicated unit tests for all fvdb.nn modules to improve coverage and prevent regressions (#497 - @swahtz).
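The NaN propagation described in the SimpleUnet fix can be reproduced with plain floating-point arithmetic. The sketch below is not SimpleUnet or fVDB code; it assumes, for illustration, that a max-pooling output with no contributing inputs is left at -inf, and then applies a hand-written batch normalization to show how a single -inf turns every output into NaN.

```python
import math

NEG_INF = float("-inf")

# Assumed scenario: the third pooled value had no contributing inputs
# and kept the identity value of max, which is -inf.
pooled = [0.5, 1.5, NEG_INF]

# Hand-written batch normalization over the three values.
mean = sum(pooled) / len(pooled)            # -inf
centered = [v - mean for v in pooled]       # inf, inf, and (-inf) - (-inf) = nan
var = sum(c * c for c in centered) / len(centered)  # nan
eps = 1e-5
normalized = [c / math.sqrt(var + eps) for c in centered]

# One -inf input is enough to make the whole normalized batch NaN.
print(normalized)
```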
Visualization / Viewer
The viewer now supports displaying multiple scenes simultaneously with a scene-switching UI (#308 - @phapalova).
Added viz bindings for wait and add_image to enable blocking display and image overlays (#332 - @phapalova).
Fixed the viewer so it works correctly inside Jupyter notebooks (#350 - @zlalena).
Build & Packaging
Renamed the Python extension binary from _Cpp to _fvdb_cpp for clarity and to avoid naming conflicts (#317, #322 - @harrism, @blackencino).
Improved build times with compilation speedups and added build tracing support (#443 - @blackencino).
Fixed potential oversubscription when nvcc and cmake parallelism combined to exceed available cores (#351 - @swahtz).
Added a lineinfo build option to include source-line debug info for GPU profiling (#367 - @harrism).
Added a getMaxSharedMemory utility to centralize shared memory limit queries across kernels (#368 - @harrism).
Added a Version class that provides structured version information at runtime (#507 - @swahtz).
Nightly Builds & Release Infrastructure (New)
Added a fully automated nightly wheel build and publish pipeline that builds across a matrix of Python, PyTorch, and CUDA versions and publishes to an S3 simple index (#477, #478 - @swahtz).
Established a formal release process based on the OneFlow branching model, with start-release.sh and finish-release.sh automation scripts (#512, #525 - @harrism, @swahtz).
The publish workflow now includes GPU validation with smoke tests and full unit tests on built wheels, an S3 staging index with automatic 30-day pruning, and automatic builds triggered by pushes to release branches.
CI / DevOps
Documentation-only PRs now auto-pass CI instead of showing a perpetual “waiting for status” indicator (#462 - @swahtz).
Draft PRs now skip test runs entirely, saving compute resources (#339 - @swahtz).
CI checkout references are pinned to immutable commit SHAs to prevent build/test skew between checkout and merge steps (#503 - @swahtz).
Nightly workflows are now restricted to the upstream openvdb/fvdb-core repository and no longer run on forks (#319 - @harrism).
Runner stop jobs are now skipped when the corresponding start job was skipped, avoiding spurious failures (#471, #472 - @harrism).
Unit tests now run only for the matrix entry matching the test_environment.yml configuration (#531 - @harrism).
Developer Tooling (New)
Added git worktree tools (fvdb-open, fvdb-close, fvdb-issue) that make it easy to work on multiple branches simultaneously (#445 - @harrism).
Added an unanswered external issues reporter with Slack output and a daily CI workflow to help the team stay on top of community questions (#510, #513 - @harrism).
Added an AGENTS.md file providing persistent coding guidelines for AI agents working on the codebase (#455 - @harrism).
Documentation
Added and updated introductory, neural network, and convolution notebooks (#504 - @swahtz).
Applied NVIDIA branding to the documentation site (#405 - @fwilliams).
Added documentation for installing nightly builds from the S3 package index (#481 - @swahtz).
Integrated Google Analytics into the documentation site for usage tracking (#312 - @fwilliams).