With all the current clamor around Stable Diffusion, Neural Radiance Fields (NeRF) is not getting a lot of love lately. Though hardly a ‘legacy’ technology (it only came on the scene in 2020), it’s arguably seen as a hamstrung approach to neural rendering, compared to the rich and explorable latent space of a latent diffusion network, or even a Generative Adversarial Network (GAN).
Resistant to editing, hard to include in a deepfakes pipeline, and more suited to urban scenes and static representations than to human synthesis, NeRF is nonetheless perhaps the most accurate neural representation technology currently available for the human form – but it’s not the most imaginative.
Unlike Stable Diffusion, you can’t ‘search’ the latent space of a NeRF for hidden treasures, since a NeRF representation is pretty much limited to whatever material was present in the photos from which it was trained. If that material is of a woman in a studio, you won’t be able to intervene in the trained model produced from those source images (at least not in any easy or meaningful way). To paraphrase John Hammond in Jurassic Park (1993), NeRF is ‘kind of a ride’.
However, a new collaboration between the University of Washington and Google Research offers hope that NeRF could be developed into a richer and more disentangled space. PersonNeRF offers a neural radiance field representation that’s trained not on several viewpoints of the same subject (usually captured at the same moment), as in the traditional NeRF pipeline, but rather on multiple disparate photos of the same person, taken from varying viewpoints and featuring various different types of clothing.
In the test case for PersonNeRF, the researchers gathered a small multi-year dataset of images of Swiss former professional tennis player Roger Federer, and trained it into a network capable of generalizing Federer’s appearance from the input images.
Unlike a system such as Stable Diffusion, which has millions of abstract and similar images from which to concoct new poses and configurations on a theme, PersonNeRF is limited to representations that were included in its dataset. But the proof of concept it offers signifies that Neural Radiance Fields are amenable to a generalized and explorable space, with the potential to train on a much larger and more varied set of images in order to increase the diversity of possible outputs.
The new paper is titled PersonNeRF: Personalized Reconstruction from Photo Collections, and comes from four researchers variously associated with the University of Washington and Google Research.
The starting point for PersonNeRF was the HumanNeRF project, mostly from the same research group. HumanNeRF was able to convert people depicted in YouTube videos into discrete and explorable neural representations:
According to the researchers, the new work is directly evolved from HumanNeRF, but with some limitations removed in order to allow the system to generalize a labeled individual from multiple and only semi-related photographs, similar to the way that latent diffusion and GAN systems broadly extract features from generic and wide-ranging source material.
The paper notes that the project’s central insight is that multiple and varied photos of a person can be resolved into a single canonical space, i.e., a single ‘reference entity’ from which desired ‘divergences’ (of pose, dress, etc.) can be made. This is arguably the closest NeRF has yet gotten to encoding truly ‘abstract’ concepts into a neural representation.
PersonNeRF removes the mapping of non-rigid components from HumanNeRF, and uses only skeletal motion in its novel regularization formula. The project borrows from ideas developed in the 2022 RegNeRF initiative, encouraging geometric smoothing through a ‘depth smoothness loss’ on rendered depth maps. The authors note that this method encourages the creation of ‘haze’ artifacts from transparent geometry (i.e., the space between depicted objects in the NeRF), which has to be remediated by a special opacity loss.
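RegNeRF’s actual formulation operates on rendered patches; purely as a rough illustration (not the paper’s exact loss), a depth smoothness penalty on neighboring depth values might look like this in NumPy:

```python
import numpy as np

def depth_smoothness_loss(depth):
    """Illustrative smoothness penalty on a rendered depth map (H x W):
    mean squared difference between horizontally and vertically adjacent
    depth values, encouraging locally smooth geometry."""
    dx = depth[:, 1:] - depth[:, :-1]   # horizontal neighbor differences
    dy = depth[1:, :] - depth[:-1, :]   # vertical neighbor differences
    return float(np.mean(dx ** 2) + np.mean(dy ** 2))

flat = np.ones((8, 8))                  # perfectly flat depth map
noisy = flat + 0.1 * np.random.RandomState(0).randn(8, 8)
print(depth_smoothness_loss(flat))      # 0.0 -- flat geometry incurs no loss
print(depth_smoothness_loss(noisy) > 0.0)  # True -- noise is penalized
```

A penalty of this kind indiscriminately rewards low-variance depth, which is one intuition for why hazy semi-transparent geometry can satisfy it, hence the need for the additional opacity loss.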
The collections of photos used in the Federer and other tests for the paper are subdivided into appearance sets that denote photos taken around the same period.
Instead of optimizing each desired facet (appearance consistency and pose consistency) in a separate network, the training centers on a single multi-layer perceptron (MLP) for canonical appearance, into which all the labeled material is passed, training on the entire set of body poses in a single workflow.
The canonical MLP is ‘inspired’ by the architecture of the 2021 NeRF in the Wild project, with each appearance set bound to a sole appearance embedding vector. This vector is concatenated with a novel pose embedding vector, which conditions the system’s pose correction module for each appearance set.
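In outline, that conditioning amounts to concatenating a per-set appearance vector with a pose vector and feeding the result to a small network. The sketch below is a toy stand-in, with hypothetical dimensions and a hypothetical two-layer module in place of the paper’s actual pose correction network:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical dimensions: one learned appearance vector per appearance
# set, plus a pose embedding vector for each set.
APPEARANCE_DIM, POSE_DIM, HIDDEN = 8, 8, 16
appearance_embeddings = rng.randn(3, APPEARANCE_DIM)  # 3 appearance sets
pose_embeddings = rng.randn(3, POSE_DIM)

W1 = rng.randn(APPEARANCE_DIM + POSE_DIM, HIDDEN)
W2 = rng.randn(HIDDEN, 6)   # toy output: a small pose-correction vector

def pose_correction(set_idx):
    """Condition a toy correction module on the concatenated
    appearance and pose embeddings for one appearance set."""
    z = np.concatenate([appearance_embeddings[set_idx],
                        pose_embeddings[set_idx]])
    h = np.maximum(z @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

print(pose_correction(0).shape)  # (6,)
```

The point of the concatenation is that the same module can produce set-specific corrections, since each appearance set carries its own embedding.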
Training and Development
For the central work on the Federer dataset, the researchers collected photos by searching for particular associated sporting events across a limited number of years spanning 2009 to 2020. Each event yielded 19-24 photos per year, and each set was labeled accordingly.
Body pose and camera pose for the dataset were estimated by SPIN, though the researchers had to intervene manually in cases of occlusion, such as where part of Federer’s body was obstructed by a racket or other non-intrinsic items. Had such items not been removed, they would have become essentially incorporated into the ‘Federer entity’.
The system was trained with the Adam optimizer, at varying learning rates for the canonical MLP and the rest of the network. Optimization, the paper notes, takes 200,000 iterations per game, or 600,000 iterations for all games, trained into a single network.
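In PyTorch terms, per-component learning rates of this kind are typically expressed as Adam parameter groups. The modules and rates below are illustrative placeholders, not the paper’s actual architecture or values:

```python
import torch

# Toy stand-ins for the canonical appearance MLP and the rest of the
# network; the learning rates here are placeholders for illustration.
canonical_mlp = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
rest_of_network = torch.nn.Linear(16, 16)

# One Adam optimizer, two parameter groups, each with its own rate.
optimizer = torch.optim.Adam([
    {'params': canonical_mlp.parameters(), 'lr': 5e-4},
    {'params': rest_of_network.parameters(), 'lr': 5e-5},
])

print([g['lr'] for g in optimizer.param_groups])  # [0.0005, 5e-05]
```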
The researchers compared PersonNeRF to their prior effort, HumanNeRF, running tests on the compiled Federer datasets. Because HumanNeRF operates under more restricted parameters, the authors limited PersonNeRF similarly, training each network for 200,000 iterations.
Since there is no strict ground truth for synthesized and truly novel images, the researchers resorted to Fréchet Inception Distance (FID) as an arbitrating metric for the purposes of quantitative comparison.
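FID compares the mean and covariance of two sets of feature vectors (normally Inception-v3 activations of real and synthesized images). A minimal NumPy/SciPy rendering of the standard formula, with random vectors standing in for real Inception features, looks like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Frechet Inception Distance between two sets of feature vectors:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
    Lower means the two distributions are more similar."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):     # discard numerical imaginary residue
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.RandomState(0)
real = rng.randn(256, 8)             # stand-ins for Inception features
close = rng.randn(256, 8)            # same distribution, different sample
far = rng.randn(256, 8) + 3.0        # shifted distribution
print(fid(real, real) < 1e-4)        # True: identical sets score ~0
print(fid(real, far) > fid(real, close))  # True: bigger shift, bigger FID
```

Because it measures distributional similarity rather than per-pixel agreement, FID is a reasonable arbiter where, as here, no pixel-aligned ground truth exists for the novel renderings.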
Of the results, the authors state:
‘[Our] method outperforms HumanNeRF on all datasets by comfortable margins. The performance gain is particularly significant when visualizing the [results]. Our method is able to create consistent geometry, sharp details, and nice renderings, while HumanNeRF tends to produce irregular shapes, distorted textures, and noisy images, due to insufficient inputs.’
PersonNeRF is the first Neural Radiance Field system I’ve seen with the capacity to generalize a subject in the same way that a GAN or latent diffusion representation can. Unlike most comparable GAN or LDM systems, the subject matter in the network is not trained alongside vast swathes of related and unrelated data, so explorability is limited to the data chosen for training.
Nonetheless, it’s easy to imagine that later NeRF architectures adopting this approach could increase the amount of data and the scope of the labels, forming systems where ‘editing’ (NeRF’s primary disadvantage) is enabled simply by accessing a parameter of the trained system.