Reviewer Corner

Clarity of writing: High

Math and Deep Learning knowledge level: Medium. A background in rendering 3D scenes and 3D deep learning is useful.

Practical applications: Generate views of a 3D scene from novel camera positions, which can be useful in various domains.

Article Details:

Article link: https://arxiv.org/abs/2003.08934.

Code link: https://github.com/bmild/nerf.

Published on: 19 Mar 2020 (arXiv).

Presented at: ECCV 2020.

Article Domains:

  • Computer Vision.
  • Novel View Synthesis.
  • 3D Reconstruction.

Problem Description

This paper proposes a new approach to a longstanding problem in computer vision and graphics: novel view synthesis.

Given a collection of

  • input images, and
  • corresponding camera models

generate new views of the same scene from camera positions which do not appear in the input data. The new images should be consistent with the input data; in other words, they should appear to faithfully depict the same scene.

Summary

Many solutions have been proposed for this problem, but obtaining high-quality results in challenging scenes is difficult. The NeRF solution is noteworthy because its results are of stunningly high quality, a fact which drew much attention and led to a line of follow-up works which generalize or improve different aspects of the original method.

The NeRF method learns a "radiance field" representation of the scene through differentiable volumetric rendering, optimizing a pixel-level photometric loss. A neural network is used to represent the radiance field.

In-Depth

Scene Representation Approach

Intuitively, in order to generate new images, we need to capture the essence of the three-dimensional scene; only then can we generalize to novel views. There are many possible ways to describe a 3D scene, which consists of geometry (shapes), textures, lighting and so forth.

A "classic" approach is to estimate the 3D geometry using a technique such as structure-from-motion. This produces a 3D model of the objects in the scene, which can then be rendered from any camera position. The problem with this approach is that full 3D reconstruction is a difficult problem and the results are often lacking in quality and may fail altogether in some cases.

If the goal is generating new views, full 3D reconstruction solves a harder problem than the one we actually care about. NeRF doesn't attempt to recover the full 3D scene geometry. Instead, it estimates a "radiance field": a volumetric representation which assigns a color and a density to each point in 3D space. The color is also allowed to vary depending on the viewing direction.

Volume Rendering

Volumetric scenes and volume rendering are familiar concepts in computer graphics. An example is rendering a cloud of dust or smoke. Such phenomena lack a well-defined shape, so it can be more appropriate to represent them as a distribution of matter dispersed over some three-dimensional region. Each point in the volume has a particular density, corresponding to the amount of matter, or number of particles, at that location. A color value can also be associated with each point. Imagine looking at a particular spot on a dust cloud, and consider the ray emanating from your eye, intersecting the dust cloud and passing through that same spot. The color your eye sees depends on the cloud density along the ray: denser areas appear more opaque, absorbing more light than sparse areas. Now let's see how rendering is actually performed.

When rendering such a scene from a given camera, we shoot rays out from the camera in all directions. For each pixel, we need to compute a color value using the ray which passes through that pixel. We do this using the following formula:

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)  \sigma\left( \mathbf{r} (t)\right) \mathbf{c}\left( \mathbf{r} (t), \mathbf{d}\right) dt
$$

Where $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is a ray emanating from the camera center $\mathbf{o}$ in the direction $\mathbf{d}$. $C(\mathbf{r})$ is the RGB color value assigned to ray $\mathbf{r}$, $\sigma(\mathbf{x})$ is the density (opacity) value at the 3D spatial coordinate $\mathbf{x}$, and $\mathbf{c}\left( \mathbf{x}, \mathbf{d}\right)$ is the RGB color value associated with the 3D spatial coordinate $\mathbf{x}$ and the viewing direction $\mathbf{d}$. $T(t) = \exp\left( -\int_{t_n}^{t} \sigma\left( \mathbf{r}(s) \right) ds \right)$ is the accumulated transmittance: the fraction of light that travels along the ray from $t_n$ to $t$ without being absorbed, so points lying behind dense regions contribute less to the final color (Eq. (1) in the paper).

The integral assigns an RGB color to a ray by computing a (continuous) weighted sum of RGB values at points along the ray, such that denser regions contribute more to the final result, while regions hidden behind them contribute less. In practice, a discrete summation is performed, using a set of points sampled along the ray with a stratified, binned approach (Eqs. (2, 3) in the paper).
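To make the discrete approximation concrete, here is a minimal sketch (in PyTorch, with illustrative names that are not taken from the authors' code) of how the per-ray quadrature can be computed, assuming the densities and colors at the sampled points are already available:

```python
import torch

def render_ray_colors(sigma, rgb, t_vals):
    """Discrete approximation of the volume rendering integral (Eqs. (2, 3) in the paper).

    sigma:  (num_rays, num_samples)     density at each sampled point
    rgb:    (num_rays, num_samples, 3)  color at each sampled point
    t_vals: (num_rays, num_samples)     depth of each sample along its ray
    """
    # Distances between adjacent samples; the last interval is treated as effectively infinite.
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by segment i.
    alpha = 1.0 - torch.exp(-sigma * deltas)

    # T_i = prod_{j<i} (1 - alpha_j): accumulated transmittance up to sample i,
    # the discrete counterpart of T(t) above.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)

    weights = alpha * trans                          # per-sample contribution to the pixel
    color = (weights[..., None] * rgb).sum(dim=-2)   # (num_rays, 3)
    return color, weights
```

The per-sample weights returned here are also what the hierarchical sampling scheme described below reuses to decide where to place additional samples.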

We refer to the pair $\left( \sigma, \mathbf{c} \right)$ as a radiance field. An important property of the radiance field is that the opacity/density $\sigma \left( \mathbf{x} \right)$ depends only on the spatial coordinate $\mathbf{x}$, while the RGB color value $\mathbf{c}$ depends on both the spatial coordinate and the 2D viewing direction $\mathbf{d} = \left( \theta, \phi \right)$. This allows the scene to look different from different angles in terms of lighting, reflectance, etc., but constrains the density of the "cloud" to be the same from all angles, which is a reasonable assumption from physical considerations.

A useful property of volume rendering is that it is differentiable. This means that the partial derivatives of $C(\mathbf{r})$ with respect to the components $\mathbf{c}, \sigma$ exist. This is useful for performing optimization using gradient-based methods.

Radiance Fields and Neural Networks

As we have seen, the radiance field $\left(\sigma, \mathbf{c} \right)$ is described as a function which maps 5D inputs to 4D outputs. We need a way to represent such a function in computer memory, as well as a method to fit it to the input data, which consists of 2D images taken from different viewpoints together with their camera parameters. We rely on deep neural networks, which serve as powerful general-purpose nonlinear function approximators, along with the standard machinery for training them using gradients and back-propagation. The idea is to estimate the network parameters such that the resulting network represents the radiance field which best explains the input images. This involves rendering the radiance field from the corresponding cameras and comparing the results to the input images.

The authors propose a neural network based on a deep fully-connected architecture - i.e., a multilayer perceptron, or MLP. The network accepts as input a 3D location and viewing direction $\left( \mathbf{x}, \mathbf{d} \right)$ and outputs a color value and a density value $\left(\mathbf{c}, \sigma \right)$. To train the network, the authors propose a photometric loss, which in this case is the squared Euclidean distance between rendered pixel values and the corresponding input image pixel values. These per-pixel losses are summed over a set of rays which are sampled per batch from the input cameras.

The details of the MLP architecture are described in Section 3 of the paper. One noteworthy aspect of the architecture is that the density value $\sigma$ is restricted to be a function of the 3D location $\mathbf{x}$ alone, while the color can also depend on the viewing direction. This encourages the learned representation to be consistent across multiple viewing directions.
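As a rough illustration of this constraint, here is a simplified PyTorch sketch of such an MLP. The layer count, widths and skip connection of the actual architecture are not reproduced here; the point is only that $\sigma$ is computed from the encoded position alone, while the color head also sees the encoded viewing direction. The input dimensions are placeholders matching the positional encodings discussed below.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Simplified radiance-field MLP: (encoded position, encoded direction) -> (rgb, sigma)."""

    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        self.trunk = nn.Sequential(              # processes the encoded position only
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)    # density depends on position alone
        self.feature = nn.Linear(width, width)
        self.rgb_head = nn.Sequential(           # color also sees the viewing direction
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)   # keep density nonnegative
        rgb = torch.sigmoid(self.rgb_head(torch.cat([self.feature(h), d_enc], dim=-1)))
        return rgb, sigma
```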

Conceptually, NeRF training proceeds as follows. In each iteration, a batch of rays is sampled from the input cameras, giving a set of target pixel values. In the forward pass, each ray in the batch is rendered using a discrete approximation of the volume rendering integral and the neural network with its current parameter values. This produces a set of estimated pixel values, which are compared with the target pixel values using the photometric loss, resulting in a scalar nonnegative error value. The backward pass propagates the error gradient back through the network, which allows us to adjust the network parameters in the direction that decreases the photometric error. This is repeated until convergence or termination.
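Putting the pieces together, here is a rough sketch of one such training iteration, reusing the rendering function and MLP sketched above. `sample_points_along_rays` and `positional_encoding` are assumed helpers (the latter is sketched in the positional-encoding section below), and all names are illustrative rather than taken from the authors' code. For brevity this uses a single network and uniform samples; the paper's hierarchical scheme, discussed below, adds a coarse/fine pair.

```python
import torch

model = NeRFMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def train_step(rays_o, rays_d, target_rgb):
    """rays_o, rays_d: (batch, 3) ray origins/directions; target_rgb: (batch, 3) ground-truth pixels."""
    # Sample points along each ray and query the network at those points.
    t_vals = sample_points_along_rays(rays_o, rays_d)                 # (batch, num_samples)
    points = rays_o[:, None, :] + t_vals[..., None] * rays_d[:, None, :]
    dirs = rays_d[:, None, :].expand_as(points)
    rgb, sigma = model(positional_encoding(points, L=10),
                       positional_encoding(dirs, L=4))

    # Forward pass: render the batch of rays and compare to the target pixels.
    pred_rgb, _ = render_ray_colors(sigma, rgb, t_vals)
    loss = ((pred_rgb - target_rgb) ** 2).sum()                       # photometric loss

    # Backward pass: propagate gradients and update the network parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```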

Tricks and Enhancements

The methods we discussed are a good start, but they don't produce good enough results on their own. The authors propose several interesting techniques which dramatically improve the quality of the results.

Positional Encoding

Positional encoding refers to a transformation applied to the inputs $\left( \mathbf{x}, \mathbf{d}\right)$ before feeding them into the network. The transformation, given in Equation (4) of the paper, applies a collection of sines and cosines at increasing frequencies to each input coordinate, mapping it to a vector of $2L$ values (where $L$ is a chosen hyperparameter). It turns out that this greatly increases the quality of the resulting images, specifically in areas containing high-frequency detail. The transformation is applied separately to the location and to the viewing direction.
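A minimal sketch of such an encoding function, following the convention of Equation (4), where each scalar coordinate is mapped to $2L$ sine and cosine values (the function name and exact conventions here are illustrative):

```python
import math
import torch

def positional_encoding(p, L):
    """Map each coordinate of p to (sin(2^0*pi*p), cos(2^0*pi*p), ..., sin(2^(L-1)*pi*p), cos(2^(L-1)*pi*p))."""
    freqs = (2.0 ** torch.arange(L, dtype=torch.float32)) * math.pi   # 2^k * pi, k = 0 .. L-1
    angles = p[..., None] * freqs                                     # (..., dim, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (..., dim, 2L)
    return enc.flatten(start_dim=-2)                                  # (..., dim * 2L)
```

The paper uses $L = 10$ for the spatial location and $L = 4$ for the viewing direction, so a 3D point is mapped to a 60-dimensional vector and a direction to a 24-dimensional one.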

It may seem a little surprising that such a step is needed. Why can't the MLP learn the radiance field properly from the original inputs? After all, they contain all the information. It turns out that neural networks can perform poorly at learning high-frequency functions from low-dimensional inputs. This point is stated in the paper without much explanation, but the authors cite related work and explore it more deeply in a follow-up paper, "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains", which connects the phenomenon to existing theory on neural networks. The term "positional encoding" is a bit of a misnomer, since it plays a role quite different from positional encoding in its original context of Transformer networks.

Hierarchical Sampling

This technique is designed to lower the computational complexity and runtime of training without compromising visual quality. The problem is that the volume-rendering integral, when approximated naively, is inefficient to compute: uniformly sampling each ray requires many points, and most points sampled this way contribute very little to the resulting pixel value, since the "true" radiance field typically has low density in most regions and the local color contributions are weighted by density. We would really like to concentrate samples in the high-density regions and ignore the rest, but we don't know the radiance field in advance, since that is what we need to learn to begin with!

To overcome this chicken-and-egg problem, the solution is to "bootstrap" it by learning two neural networks simultaneously. One, termed the coarse network, only needs to provide a rough approximation of the radiance field. This is sufficient to indicate where the high-density regions are and allows for more efficient sampling along rays. The coarse network is evaluated at fewer sample points along each ray, so it is much cheaper to query.

The fine network is evaluated at a larger number of sample points, concentrated in the regions that the coarse pass found to contribute most, and provides the final radiance field representation. The two networks are trained simultaneously using a sum of two photometric losses; see Equation (6) in the paper.
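One way to realize this, mirroring the inverse transform sampling described in Section 5.2 of the paper, is to treat the coarse pass's per-sample weights as a piecewise-constant PDF along the ray and draw the fine samples from it. A hedged sketch, with illustrative names, reusing the weights returned by the rendering function above:

```python
import torch

def sample_fine_points(t_bins, weights, num_fine):
    """Draw extra samples along each ray in proportion to the coarse pass's weights,
    via inverse transform sampling of the resulting piecewise-constant PDF.

    t_bins:  (num_rays, num_coarse + 1) bin edges along each ray
    weights: (num_rays, num_coarse)     per-bin weights from the coarse pass
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)    # (num_rays, num_coarse + 1)

    # Uniform samples in [0, 1), then find the CDF bin each one falls into.
    u = torch.rand(cdf.shape[0], num_fine)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # Linearly interpolate the sample depth within the selected bin.
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    t_lo, t_hi = torch.gather(t_bins, -1, idx - 1), torch.gather(t_bins, -1, idx)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)
    return t_lo + frac * (t_hi - t_lo)                                 # (num_rays, num_fine)
```

The fine network is then evaluated at the union of the coarse and fine samples, so that computation is concentrated in the parts of the volume that actually contribute to the rendered pixel.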

Results

The authors perform a detailed comparison of NeRF with other state-of-the-art methods, such as Neural Volumes, Scene Representation Networks and Local Light Field Fusion. They use both real and synthetic evaluation datasets, reporting the metrics PSNR, SSIM and LPIPS. In nearly all cases, NeRF outperforms the competition.

The visual comparison results are striking. They clearly demonstrate the superiority of NeRF, showing sharper details and fewer visual artifacts, even for complex real-world scenes.

It should be noted that the method has high computational cost. The authors report a training time of 1-2 days on an NVIDIA V100 GPU. Rendering a single image of a real scene requires 150-200 million network queries and takes about 30 seconds on a V100.

Conclusion and Follow-up Works

The NeRF method produces terrific results. It taps into the recent trend of learning and inferring 3D scenes using neural networks. The paper stirred up much interest in a short time and exposed fertile ground for improvements and new ideas that develop the field further. Here is a partial list of interesting papers which have appeared since:

BARF: Bundle-Adjusting Neural Radiance Fields - [April 2021] Relaxes the requirement to know the camera poses.

Deformable Neural Radiance Fields - [November 2020] Estimates time-varying (deforming) scenes.

FastNeRF: High-Fidelity Neural Rendering at 200FPS - [April 2021] Vastly improves rendering speed.

NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections - [August 2020] Estimates scenes from uncurated photo collections containing visual artifacts.

NeRF-VAE: A Geometry Aware 3D Scene Generative Model - [April 2021] Learns a distribution over radiance fields and renders novel scenes.

ShaRF: Shape-conditioned Radiance Fields from a Single View - [February 2021] Generates a radiance field from a single image using a learned object class.