TensorFlow 3D: A Quick Introduction

MD SAIF UDDIN
8 min read · Jun 16, 2021

Overview

  • In this article, we will talk about the newest release from the open-source Google AI ecosystem: the TensorFlow 3D library.
  • This article walks you through the 3D scene-capturing methods and optimization techniques used by Google to develop the TensorFlow 3D library.
  • We also discuss the limitations of current computer vision tooling for 3D scene understanding.

Introduction

As technologies grow and evolve, companies like Samsung, Google, and Apple are working hard on their technologies, trying different things with their existing products to deliver cutting-edge features to customers. Samsung used a time-of-flight sensor in its Galaxy Note 10 and Galaxy S10 5G, though it has ditched the sensor in its current-generation models. Radar made a brief cameo via Project Soli in the Google Pixel 4. More recently, Apple implemented LiDAR sensors in the iPhone 12 Pro and iPad Pro lineups after breaking through with the TrueDepth front-facing camera that ushered in the era of the notch.

Autonomous vehicle companies use 3D sensors such as LiDAR, radar, and depth-sensing cameras so their vehicles can understand the scene around them and navigate and operate in the real world. You may call it a self-driving car.

The growing use of sensors like LiDAR, depth-sensing cameras, and radar over the last few years has created a need for technologies that can understand the scenes these devices capture and process their data in a highly optimized way.

This is where TensorFlow 3D comes into play…

The team at Google AI has open-sourced TensorFlow 3D, a framework built on top of TensorFlow 2 and Keras that makes it easy to construct, train, and deploy 3D object detection, 3D semantic segmentation, and 3D instance segmentation models. This release will make the domain of 3D scene understanding much easier for the community to tackle.

Why do we need TensorFlow 3D? Is computer vision not sufficient?

3D scene understanding is critical for applications such as object detection, human-centric understanding, and graphics. The field of computer vision has made good progress in 3D scene understanding, including models for mobile 3D object detection, transparent object detection, and more, but entry into the field can be challenging due to the limited availability of tools and resources that can be applied to 3D data.

It is also quite difficult to process 3D scenes efficiently with pre-existing tools and technologies.

TensorFlow 3D

In order to further improve 3D scene understanding and reduce barriers to entry for interested researchers, Google AI has released TensorFlow 3D (TF 3D), a highly modular and efficient library designed to bring 3D deep learning capabilities into TensorFlow.

TF 3D provides a set of popular operations, loss functions, data processing tools, models, and metrics that enable the broader research community to develop, train and deploy state-of-the-art 3D scene understanding models.

TF 3D contains training and evaluation pipelines for state-of-the-art 3D semantic segmentation, 3D object detection, and 3D instance segmentation, with support for distributed training. It also enables other potential applications like 3D object shape prediction, point cloud registration, and point cloud densification.

What about the dataset?

As we all know, deep learning needs lots of data to make a model robust and perform well in the real world. To that end, TensorFlow 3D offers a unified dataset specification and configuration for training and evaluation on the standard 3D scene understanding datasets.

All the supported datasets share the following recurring concepts:

  • Frame: each entry contains frame-level data like color and depth camera images, point cloud, camera intrinsics, and ground-truth semantic and instance segmentation annotations.
  • Scene: each entry contains point-cloud/mesh data of the whole scene and lightweight information of all frames in the scene.

It currently supports the Waymo Open, ScanNet, and Rio datasets.

However, users can freely convert other popular datasets, such as NuScenes and KITTI, into a similar format, use them in the pre-existing or custom pipelines, and leverage TF 3D for a wide variety of 3D deep learning research and applications, from quickly prototyping new ideas to deploying a real-time inference system.
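To make the Frame concept above concrete, here is a minimal sketch of what parsing frame-level records could look like with plain tf.data. The TFRecord path and the feature keys (point_positions, semantic_labels) are hypothetical placeholders for illustration, not TF 3D's actual dataset specification; the tf3d repository defines the real schema and configuration.

import tensorflow as tf

# Hypothetical feature keys for a frame-level record; TF 3D's real dataset
# specification defines its own schema.
feature_spec = {
    'point_positions': tf.io.VarLenFeature(tf.float32),  # flattened (N, 3) xyz
    'semantic_labels': tf.io.VarLenFeature(tf.int64),    # per-point class ids
}

def parse_frame(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    points = tf.reshape(tf.sparse.to_dense(parsed['point_positions']), [-1, 3])
    labels = tf.sparse.to_dense(parsed['semantic_labels'])
    return points, labels

# 'frames.tfrecord' is a placeholder path.
dataset = (tf.data.TFRecordDataset('frames.tfrecord')
           .map(parse_frame, num_parallel_calls=tf.data.AUTOTUNE))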

Architecture

A 3D sparse voxel U-Net architecture.

Now, let’s understand the architecture.

TF 3D uses an efficient and configurable sparse convolutional backbone, which is the key to achieving state-of-the-art results on various 3D scene understanding tasks.

The 3D data captured by sensors often consists of a scene that contains a set of objects of interest (e.g. cars, pedestrians, etc.) surrounded mostly by open space, which is of limited (or no) interest.

As such, 3D data is inherently sparse. Directly applying a dense convolution wastes a lot of computing resources on invalid calculations in the empty space. Also, after a traditional convolution, the extracted features are no longer sparse.

So, TF 3D uses submanifold sparse convolution and pooling operations, which are designed to process 3D sparse data more efficiently.
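To see what "submanifold" buys us, here is a small NumPy sketch of the idea (not TF 3D's CUDA implementation): features are stored only for occupied voxels, the kernel is applied only at occupied sites, and the output occupancy equals the input occupancy, so sparsity does not dilate layer after layer.

import numpy as np

def submanifold_sparse_conv3d(active, weights, bias):
    # active:  dict mapping (x, y, z) voxel index -> feature vector (f_in,)
    # weights: array of shape (3, 3, 3, f_in, f_out)
    # bias:    array of shape (f_out,)
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
               for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    out = {}
    for (x, y, z) in active:  # only occupied sites produce output
        acc = bias.copy()
        for (dx, dy, dz) in offsets:
            feat = active.get((x + dx, y + dy, z + dz))
            if feat is not None:  # empty space contributes nothing
                acc += feat @ weights[dx + 1, dy + 1, dz + 1]
        out[(x, y, z)] = np.maximum(acc, 0.0)  # ReLU
    return out

rng = np.random.default_rng(0)
active = {(0, 0, 0): rng.normal(size=4), (0, 0, 1): rng.normal(size=4)}
out = submanifold_sparse_conv3d(active, rng.normal(size=(3, 3, 3, 4, 8)), np.zeros(8))
print(len(out))  # still 2 occupied voxels, not a dilated neighborhood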

Sparse convolutional models are core to the state-of-the-art methods applied in most outdoor self-driving (e.g. Waymo, NuScenes) and indoor benchmarks (e.g. ScanNet).

TF 3D also uses various CUDA techniques to speed up the computation (e.g., hashing and partitioning).

Currently, TF 3D supports three pipelines: 3D semantic segmentation, 3D object detection, and 3D instance segmentation.

3D semantic segmentation: 3D semantic segmentation involves segmenting 3D objects or scenes, represented as point clouds, into their constituent parts, where each point in the input space must be assigned a part label. The 3D semantic segmentation model has only one output head for predicting the per-voxel semantic scores, which are mapped back to points to predict a semantic label per point.
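That voxel-to-point mapping step can be expressed as a simple gather, as in the following sketch. The tensor names and shapes here are illustrative assumptions, not TF 3D's actual op names: each point carries the index of the voxel it was assigned to during voxelization.

import tensorflow as tf

voxel_scores = tf.random.normal([4, 5])            # [num_voxels, num_classes]
point_to_voxel = tf.constant([0, 0, 1, 2, 3, 3])   # [num_points], voxel id per point

point_scores = tf.gather(voxel_scores, point_to_voxel)  # [num_points, num_classes]
point_labels = tf.argmax(point_scores, axis=-1)         # a semantic label per point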

3D semantic segmentation of an indoor scene from the ScanNet dataset.

The U-Net uses submanifold sparse convolutional networks because they can process low-dimensional data living in a space of higher dimensionality. Since most 3D data is sparse, applying a standard implementation of convolutions is computationally intensive and requires a large amount of memory.

The 3D semantic segmentation model enables apps to differentiate between foreground objects and the background of a scene, as with the virtual backgrounds in Zoom. Google has implemented similar technology with virtual video backgrounds for YouTube.

3D Instance Segmentation: In 3D instance segmentation, the goal is to group the voxels that belong to the same object together, in addition to predicting semantics. The model used by TF 3D predicts a per-voxel instance embedding vector as well as a semantic score for each voxel. The instance embedding vectors map the voxels to an embedding space where voxels that correspond to the same object instance are close together, while those that correspond to different objects are far apart.

Simply put, the 3D Instance Segmentation model identifies a group of objects as individual objects, as in Snapchat Lenses that can put virtual masks on more than one person in the camera view.

By default, TF 3D uses the U-Net network as the backbone and has two output heads for predicting per-voxel semantic logits and instance embeddings. At inference time, a greedy algorithm picks one instance seed at a time and uses the distance between the voxel embeddings to group them into segments.
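Here is a toy NumPy version of that greedy grouping step, just to illustrate the idea; TF 3D's actual inference op differs in how it scores and selects seeds, and the threshold value and seed order below are assumptions for illustration.

import numpy as np

def greedy_group(embeddings, semantics, threshold=0.5):
    # embeddings: (n, d) per-voxel instance embeddings
    # semantics:  (n,) per-voxel predicted semantic labels
    instance_ids = -np.ones(len(embeddings), dtype=int)
    next_id = 0
    for seed in range(len(embeddings)):
        if instance_ids[seed] != -1:
            continue  # already claimed by an earlier seed
        dists = np.linalg.norm(embeddings - embeddings[seed], axis=1)
        mask = (instance_ids == -1) & (dists < threshold) & \
               (semantics == semantics[seed])
        instance_ids[mask] = next_id  # claim all nearby, same-class voxels
        next_id += 1
    return instance_ids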

3D Object Detection: The 3D object detection model takes instance segmentation a step further by also classifying the objects in view. It predicts per-voxel size, center, and rotation matrices, as well as object semantic scores.

At training time, box prediction and classification losses are applied to the per-voxel predictions. At inference time, a box proposal mechanism reduces the hundreds of thousands of per-voxel box predictions into a few accurate box proposals.

It uses a dynamic box classification loss that classifies a box that strongly overlaps with the ground truth as positive and non-overlapping boxes as negative.
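The sketch below shows the flavor of that labeling rule, using axis-aligned 3D boxes and made-up thresholds for simplicity; TF 3D's actual loss works with rotated boxes, which makes the overlap computation considerably more involved.

import numpy as np

def axis_aligned_iou_3d(box_a, box_b):
    # Each box is a (min_xyz, max_xyz) pair of length-3 arrays.
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)

def label_boxes(pred_boxes, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    # Strongly overlapping predictions become positives, barely overlapping
    # ones negatives; boxes in between are ignored. Thresholds are illustrative.
    ious = np.array([axis_aligned_iou_3d(b, gt_box) for b in pred_boxes])
    labels = np.full(len(pred_boxes), -1)  # -1 = ignored by the loss
    labels[ious >= pos_thresh] = 1
    labels[ious <= neg_thresh] = 0
    return labels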

TensorFlow 3D supports two types of deep networks: the U-Net network and the HourGlass network.

The U-Net network consists of three parts:

  • An encoder that downsamples the input sparse voxels.
  • A bottleneck.
  • A decoder with skip connections that up-samples the sparse voxel features back to the original resolution.

Each part consists of a number of sparse convolution blocks, with possible pooling or un-pooling operations, which are used to extract a feature for each voxel. A voxel represents a value on a regular grid in three-dimensional space.
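For intuition, here is a toy NumPy voxelization of a point cloud: bucket each point into a grid cell and average the features of the points sharing a cell. TF 3D ships its own configurable voxelization ops; this sketch only illustrates the idea, and all names in it are assumptions.

import numpy as np

def voxelize(points, features, voxel_size=0.1):
    indices = np.floor(points / voxel_size).astype(np.int32)  # (N, 3) cell ids
    voxels, inverse = np.unique(indices, axis=0, return_inverse=True)
    pooled = np.zeros((len(voxels), features.shape[1]))
    counts = np.zeros(len(voxels))
    np.add.at(pooled, inverse, features)  # sum features per occupied voxel
    np.add.at(counts, inverse, 1.0)
    return voxels, pooled / counts[:, None], inverse

points = np.random.rand(1000, 3) * 2.0  # toy point cloud in a 2 m cube
feats = np.random.rand(1000, 4)
voxel_indices, voxel_features, point_to_voxel = voxelize(points, feats)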

3D Submanifold Sparse Convolution Block

The convolution block receives a set of sparse voxel indices and their features, performs a series of 3D submanifold sparse convolutions on them, and returns the computed voxel features.

# voxel_features: A tf.float32 tensor of size [b, n, f] where b is the batch
#   size, n is the number of voxels, and f is the feature size.
# voxel_xyz_indices: A tf.int32 tensor of size [b, n, 3].
# num_valid_voxels: A tf.int32 tensor of size [b] containing the number of
#   valid voxels in each of the batch examples.
from tf3d.layers import sparse_voxel_net_utils

conv_block = sparse_voxel_net_utils.SparseConvBlock3D(
    num_convolution_channels_list=[32, 48, 64],
    apply_relu_to_last_conv=True)
convolved_features = conv_block(voxel_features, voxel_xyz_indices, num_valid_voxels)

3D Sparse Voxel U-Net

The U-Net figure can be described as follows:

  • The horizontal arrows take in the voxel features and apply a submanifold sparse convolution to them.
  • The arrow moving downwards performs a submanifold sparse pooling.
  • The arrow moving upwards will gather back the pooled features, concatenate them with the features coming from the horizontal arrow, and perform a submanifold sparse convolution on the concatenated features.

This model is utilized as the backbone architecture by TF 3D for its 3D object detection, 3D semantic segmentation, and 3D instance segmentation models. TF 3D allows configuring the U-Net network by changing the number of encoder/decoder layers and the number of convolutions in each layer, and by modifying the convolution filter sizes based on the use case.

Here is the code to create the U-Net model displayed in the above diagram.

import tensorflow as tf
from tf3d.layers import sparse_voxel_unet

task_names_to_num_output_channels = {'semantics': 5, 'embedding': 64}
task_names_to_use_relu_last_conv = {'semantics': False, 'embedding': False}
task_names_to_use_batch_norm_in_last_layer = {'semantics': False, 'embedding': False}

unet = sparse_voxel_unet.SparseConvUNet(
    task_names_to_num_output_channels,
    task_names_to_use_relu_last_conv,
    task_names_to_use_batch_norm_in_last_layer,
    encoder_dimensions=((32, 48), (64, 80)),
    bottleneck_dimensions=(96, 96),
    decoder_dimensions=((80, 80), (64, 64)),
    network_pooling_segment_func=tf.math.unsorted_segment_max)
outputs = unet(voxel_features, voxel_xyz_indices, num_valid_voxels)
semantics = outputs['semantics']
embedding = outputs['embedding']

HourGlass Network:

The HourGlass network is one or more stacked U-Net networks.
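A conceptual sketch of that stacking, reusing the SparseConvUNet signature from the snippet above, might look like the following. The hypothetical 'features' head on the first stage, the layer dimensions, and the way the two stages are wired are all assumptions for illustration; see the tf3d repository for how the HourGlass network is actually assembled.

import tensorflow as tf
from tf3d.layers import sparse_voxel_unet

# Stage 1: a U-Net whose only (hypothetical) head emits intermediate features.
unet_1 = sparse_voxel_unet.SparseConvUNet(
    {'features': 64}, {'features': True}, {'features': False},
    encoder_dimensions=((32, 48),),
    bottleneck_dimensions=(64,),
    decoder_dimensions=((48, 48),),
    network_pooling_segment_func=tf.math.unsorted_segment_max)

# Stage 2: a second U-Net refines those features into the task output.
unet_2 = sparse_voxel_unet.SparseConvUNet(
    {'semantics': 5}, {'semantics': False}, {'semantics': False},
    encoder_dimensions=((32, 48),),
    bottleneck_dimensions=(64,),
    decoder_dimensions=((48, 48),),
    network_pooling_segment_func=tf.math.unsorted_segment_max)

# voxel_features, voxel_xyz_indices, num_valid_voxels as in the U-Net snippet.
stage_1 = unet_1(voxel_features, voxel_xyz_indices, num_valid_voxels)
stage_2 = unet_2(stage_1['features'], voxel_xyz_indices, num_valid_voxels)
semantics = stage_2['semantics']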

To Sum Up…

TensorFlow 3D is an amazing library for 3D scene understanding: experiments on these datasets show that the TF 3D implementation is around 20x faster than a well-designed implementation using pre-existing TensorFlow operations, which makes it much faster than existing computer vision solutions for 3D scenes.

TensorFlow 3D is just one of the 3D deep learning extensions in the market. Facebook launched PyTorch3D in 2020, more dedicated to 3D rendering and virtual reality.

Another player in the market is Kaolin from NVIDIA, a modular differentiable renderer for applications like high-resolution simulation environments. From this overview, it seems that TensorFlow 3D is more dedicated to robotics perception and mapping, while the other options are more dedicated to 3D simulation and rendering. For 3D rendering, Google has TensorFlow Graphics.

Thanks for reading!


MD SAIF UDDIN

I am enthusiastic about work related to machine learning and deep learning. I am also a Kaggle Notebooks Master.