Due to the COVID-19 pandemic and the large-scale transition of employees to working from home, the popularity of audio/video messaging platforms and streaming services has grown very rapidly, since they help organize remote work efficiently. Employees have to learn new technology and adapt it to remote work. Working from home or in a public place such as a coworking space or café can be disruptive, which hurts your efficiency and causes extra stress.
There is usually no quality lighting (for video streaming) and no real green-screen background, and people walking in the background can get into the frame, which distracts you and keeps you from focusing on work.
Virtual background technology (also known as background replacement, background blur, or even smart chroma key, since it does not require a real monochromatic background) has been developed recently to address these issues. It allows you to replace or blur the background of your video feed in real time, thereby hiding your environment.
The first companies to deliver this feature were Zoom, Skype, and Google.
The popularity of virtual background replacement continued to grow, and it quickly became a requirement for any video calling software. For most companies, building such functionality in-house is a time-consuming and expensive task that also requires a high level of expertise from the development team. There were no out-of-the-box solutions on the market at that time.
In this article, we will describe how we created a cross-platform solution for background removal/replacement in real-time webcam video feeds.
One of the most popular approaches to background removal is the so-called “green screen” technique: the user places a solid single-color canvas behind them, and the software removes all pixels of that color and, optionally, some pixels in the color space around it to prevent the color-spill effect caused by lighting imperfections on the canvas and camera sensor noise. This solution provides good quality but requires some effort and a proper setup. That is not a big problem for people who stream regularly; for casual use or video conferences, however, it is preferable to remove the background without any preparation at all.
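To make the chroma-key idea concrete, here is a minimal sketch in plain NumPy. The key color and distance threshold are illustrative values, not anything from our product; real keyers usually work in a chroma-oriented color space (e.g. YCbCr) and add spill suppression.

```python
import numpy as np

def chroma_key(frame: np.ndarray, key_color=(0, 255, 0), threshold=100.0) -> np.ndarray:
    """Return a boolean foreground mask: True where the pixel is kept.

    frame: H x W x 3 uint8 RGB image. Pixels whose Euclidean distance to
    key_color falls below the threshold are treated as background.
    """
    diff = frame.astype(np.float32) - np.array(key_color, dtype=np.float32)
    dist = np.linalg.norm(diff, axis=-1)  # distance to the key color per pixel
    return dist > threshold

def composite(frame: np.ndarray, background: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Paste the foreground pixels over a replacement background."""
    out = background.copy()
    out[mask] = frame[mask]
    return out
```

This binary keying is exactly what the “smart” ML-based approach replaces: instead of thresholding a physical key color, the network predicts the mask itself.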
To remove the background, we need to classify each pixel as either “person” or “background”. This is exactly the formulation of a semantic segmentation task, which is where convolutional neural networks come into play. They have conquered many computer vision tasks since their historic debut at the ImageNet Large Scale Visual Recognition Challenge. Today, deep convolutional models rank at the top of every semantic segmentation benchmark you can imagine, which makes them an obvious choice for background removal.
Here, we faced two major obstacles. The first is to prepare a dataset that would address our goal. The second is to develop an architecture that will run on low-grade and mobile hardware such as power-limited laptop CPUs and mobile devices running Android and iOS.
To better understand the requirements for data training, we have examined the existing datasets intended for similar tasks.
The standard dataset used for benchmarking object detection and semantic segmentation models. It contains many images with people, but unfortunately they don't resemble those typically seen in webcam streams and video conferences. The quality of the annotated masks also leaves much to be desired, especially at the mask edges.
This dataset was created specifically for person segmentation; its images were collected from photo stocks. Although the annotation quality is quite high, the original images are too far from what is usually seen on a webcam: the background is blurred, the focus is on the person, the scene is well lit, the resolution is high, and the images are almost noise-free. Models trained on this dataset did not suit our needs.
This dataset was designed for a problem very close to ours: portrait matting. However, that problem is formulated differently from ours in two major respects. First, instead of classifying each pixel into two classes (“person” and “background”), we predict a per-pixel transparency value in the [0, 1] range. This improves the quality of background removal in complicated cases such as hair segmentation or ambiguities between the background and the person's figure. Second, the authors of the dataset provided only portrait images, so there was a lack of diversity in poses and almost no arms visible in the whole dataset. Models trained on this dataset were not able to segment arms correctly.
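The benefit of a continuous matte over a binary mask is easiest to see in the compositing step. A minimal sketch (names and shapes are illustrative):

```python
import numpy as np

def composite_alpha(fg: np.ndarray, bg: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend foreground over background with a per-pixel alpha matte.

    fg, bg: H x W x 3 float arrays in [0, 1]; alpha: H x W matte in [0, 1].
    With a binary mask alpha is only 0 or 1, so edge pixels (hair, motion
    blur) are all-or-nothing and look jagged; a continuous matte blends
    them smoothly with the new background.
    """
    a = alpha[..., None]              # broadcast the matte over color channels
    return a * fg + (1.0 - a) * bg
```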
Upon examination of these datasets and several unsuccessful attempts to train a model of acceptable quality on them, we decided that the easiest way to achieve our goal would be to train the model on images from the target domain. We wanted to collect a dataset directly from webcams, so that the visual characteristics of the training images would be similar to those the model would encounter at inference time. We asked employees of our company to record short demo videos in front of a webcam while taking various positions (hand, head, and upper-body movements).
The quantity and diversity of our dataset were not enough to significantly improve the results. Discouraged by the fact that it was not possible to expand the dataset by collecting data directly from end-users, we decided to turn to the largest collection of videos on Earth — YouTube.
We have retrieved about 300 videos from video blogs and stream recordings that were shot in conditions similar to our target domain. We mainly focused on videos shot in bad conditions with low-quality webcams and realized that professional streaming setups weren’t very common. Data collection and labeling proceeded iteratively as the model improved, and each new round of labeling improved the result even without modifying the model. At the moment, our dataset consists of about 12,000 images with high-quality masks.
The augmentations we applied did not affect the metrics significantly, but they improved the stability of the model's predictions in natural environments.
Although the dataset was a key component of high-quality background removal, the model itself still played an important role. If you need to squeeze every bit of performance out of your solution, you cannot simply take any common deep architecture from your favorite article.
All our models were based on the “encoder-decoder” scheme. The first part of the network (encoder) extracts features from the input image in a stage-wise manner, reducing the spatial resolution of the feature map by a factor of 2 at each stage of processing. The resulting feature map resolution can be 32 times lower than the original image, so it is hard to predict fine details from such a small image. That’s why we needed the second part of the network — the decoder. It was responsible for restoring the original resolution of the image using features obtained by the encoder part.
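The encoder-decoder scheme described above can be sketched in a few lines of PyTorch. This toy network is an assumption for illustration only, not our production architecture: the encoder halves the spatial resolution at each stage, and the decoder restores it with bilinear upsampling followed by convolution.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal encoder-decoder sketch for person segmentation."""

    def __init__(self):
        super().__init__()
        # Encoder: each stage reduces resolution by a factor of 2.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())   # 1/2
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())  # 1/4
        # Decoder: upsample back to the input resolution.
        self.dec1 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())                                  # 1/2
        self.dec2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 1, 3, padding=1))                                              # 1/1, person logit map

    def forward(self, x):
        return self.dec2(self.dec1(self.enc2(self.enc1(x))))
```

A real encoder goes much deeper (down to 1/32 resolution, as noted above) and usually passes skip connections from encoder stages to the decoder to recover fine detail.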
The first version of the model used a miniaturized variant of DenseNet as the encoder and a sequence of bilinear interpolations mixed with convolutions as the decoder. Although the model's performance was good, it could predict only a rough outline of the true person mask. Complicated backgrounds and visible arms confused the model so much that it was not usable.
Next, we tried a plain old UNet model. It was very accurate, but too heavy even for real-time GPU inference. We tried different options for scaling the model down naively: decreasing the input image resolution, decreasing the feature map depth, doing fewer decoding stages. Unfortunately, the quality of predictions dropped faster than the inference speed increased. Finally, we replaced regular convolutions with depthwise-separable ones. This made the inference speed acceptable without a significant loss in segmentation quality, and the resulting model was good enough for demonstration and deployment.
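A depthwise-separable convolution splits a regular convolution into a per-channel spatial convolution followed by a 1x1 pointwise mix, trading a small accuracy loss for a large reduction in parameters and FLOPs. A sketch of the drop-in replacement (channel counts are illustrative):

```python
import torch.nn as nn

def separable_conv(in_ch: int, out_ch: int, kernel: int = 3) -> nn.Sequential:
    """Depthwise-separable replacement for a regular conv layer."""
    return nn.Sequential(
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch),
        # Pointwise: 1x1 convolution mixes information across channels.
        nn.Conv2d(in_ch, out_ch, 1),
    )
```

For a 64-to-64-channel 3x3 layer this cuts the parameter count from 36,928 (regular conv, biases included) to 4,800, which is where most of the speedup comes from.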
At the same time, we evaluated lightweight models based on “dilated” convolution. However, we were unable to get fast models with good segmentation quality quickly enough, so we abandoned this path.
As an inspiration for subsequent iterations of the model, we used the UNet-like architecture proposed in the monocular depth estimation paper. The authors proposed asymmetric architecture with a light encoder and a heavy decoder. Another important idea from this paper was doing prediction at a lower resolution compared to the input. We took these ideas and built our own model using lightweight bottlenecks everywhere and a simplified decoder. We discarded the multi-scale loss evaluation since we didn’t observe any positive effect on prediction quality from it. Instead, we added one more branch that used the last feature map to estimate only the boundaries of the mask, which helped with the stability of prediction at the boundaries.
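The target for such a boundary branch can be derived from the ground-truth mask itself. A sketch of one simple way to do it in NumPy (an assumption for illustration; the actual target construction, e.g. any dilation of the boundary, is not specified in the text):

```python
import numpy as np

def mask_boundary(mask: np.ndarray) -> np.ndarray:
    """Boolean boundary map: pixels whose 4-neighborhood mixes both classes.

    mask: H x W binary person mask. The result can serve as the target
    for an auxiliary branch that predicts only the mask boundaries.
    """
    m = mask.astype(bool)
    b = np.zeros_like(m)
    # A pixel is a boundary pixel if it differs from any 4-neighbor.
    b[:-1] |= m[:-1] ^ m[1:]       # compare with the pixel below
    b[1:]  |= m[1:] ^ m[:-1]       # compare with the pixel above
    b[:, :-1] |= m[:, :-1] ^ m[:, 1:]   # compare with the right neighbor
    b[:, 1:]  |= m[:, 1:] ^ m[:, :-1]   # compare with the left neighbor
    return b
```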
Once training was completed, the model needed to be prepared for deployment to a production environment. At the very beginning of SDK development, we planned to deploy the model as part of a native Windows application able to run on both CPU and GPU. Unfortunately, deploying the model with PyTorch wasn't an option: inference on the CPU was too slow, only Nvidia GPUs were supported on Windows, and the framework's library binaries were about 3 GB in size. Moreover, CuDNN had to be installed by the user manually, since its license did not permit distributing it as a dependency within the application.
Having accepted this sad truth, we turned to other solutions designed specifically for neural network inference. Fortunately, there were plenty of mature frameworks readily available for this task.
First of all, we noticed WinML, Microsoft's framework for neural network inference, which has been integrated into Windows 10 since version 1803. It lets you run models on CPUs and GPUs from any vendor. The framework works only with models serialized in the ONNX format; fortunately, ONNX serialization is supported natively in PyTorch.
Using WinML we were able to get the prototypes quickly and continue the examination of other tools for model inference.
OpenVINO is an Intel framework designed for the optimization and inference of neural networks on x86 CPUs and Intel iGPUs. It implements a large collection of popular operators needed for computer vision networks, so there is almost zero chance of running into issues during model conversion. It is accompanied by excellent documentation, tools for model testing and benchmarking, and a huge collection of ready-to-use models. You cannot export a model directly from PyTorch; first, you need to convert it to ONNX. Of course, integration into C++ projects is also supported.
After the initial beta of the desktop SDK, we received a request to add the ability to run the model on mobile devices and web browsers.
We started with the MediaPipe library created by Google for building streaming video and audio processing pipelines in multimedia applications. You specify the data sources (camera, microphone) and a graph describing the sequence of data manipulations, and the framework takes care of processing the video and audio frames. Since modern applications are likely to use neural networks for fancy stuff, MediaPipe includes TensorFlow Lite for neural network inference. It supports desktop and mobile platforms, but with caveats, which we will discuss below.
Since the library didn't support importing from PyTorch or ONNX natively, we needed to decide how to convert the model into a format that TensorFlow Lite accepts. After studying the experience of other developers, we took not the simplest but the most predictable path: re-implementing the model in TensorFlow from scratch and importing the weights from PyTorch. The process was simple and mostly mechanical. The most annoying part was that TensorFlow uses the NHWC tensor layout while PyTorch uses NCHW by default; this incompatibility can be fixed by simply transposing the weights of the convolutional layers during import. We verified that the model was correct (by comparing the outputs of the PyTorch and TensorFlow models), converted the TensorFlow graph into the TensorFlow Lite format, and launched the model through MediaPipe.
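The weight transposition mentioned above is a pure axis permutation: PyTorch stores a `Conv2d` weight as `(out_channels, in_channels, kH, kW)`, while a TensorFlow `Conv2D` kernel is `(kH, kW, in_channels, out_channels)`. A sketch in NumPy:

```python
import numpy as np

def torch_conv_weight_to_tf(w: np.ndarray) -> np.ndarray:
    """Convert a PyTorch Conv2d weight (out, in, kH, kW) to the
    TensorFlow Conv2D kernel layout (kH, kW, in, out)."""
    return np.transpose(w, (2, 3, 1, 0))
```

Activations need the analogous NCHW-to-NHWC permutation `(0, 2, 3, 1)` when comparing intermediate tensors between the two frameworks.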
For Android, we had to implement several operations using OpenGL shaders, such as concatenating the previously predicted mask with the current camera frame, but overall the model porting took about 2 days. On mid-range devices we got a smooth 25-30 FPS, which made us happy. By that time, we also had new ideas for further optimizing the model.
Unfortunately, things went wrong on iOS. The main problem was Apple’s decision not to implement OpenGL compute shaders on mobile devices so we had to implement texture conversion between Metal and OpenGL and vice versa ourselves. We weren't able to implement this quickly enough, so we switched from MediaPipe to CoreML on iOS.
CoreML is Apple's native framework for neural network inference. It supports all recent Apple hardware, including GPUs and the specialized SoC blocks for neural network acceleration (NPUs). Porting the model turned out to be smooth, as Apple implemented direct PyTorch-to-CoreML conversion in their coremltools support library. The model runs at 30 FPS on the mobile GPU of an iPhone X.
Today many applications are developed as web-oriented from the very start, so we don’t want to miss the opportunity to deliver our model to web environments. To accomplish this task we tried three options: ONNX.js, TensorFlow.js, and TensorFlow Lite. The experience, unfortunately, was quite painful and only one solution is working satisfactorily at the moment.
The easiest way to launch a model in the browser (in theory, at least) is ONNX.js, which allows you to load models serialized to ONNX and execute them on the GPU or CPU. Sadly, the library does not seem to be actively maintained: it supports only a limited set of operators and only older versions of the ONNX IR, which can cause problems when exporting models from newer versions of PyTorch. In our case, the problems were the transposed convolution and upsampling layers. Again, we didn't manage to overcome this quickly enough, so we went on to explore other options.
The last option we looked at was TensorFlow Lite compiled to WebAssembly using Emscripten. WebAssembly is a standard that defines a platform-independent bytecode that can be executed in a sandbox; it was created primarily to deliver high-performance code to the browser environment. The Emscripten toolchain allows you to compile C/C++ code to WebAssembly, and luckily for us TensorFlow Lite is written in C++, which allowed us to run it directly in the browser. Compiling TensorFlow Lite to WASM wasn't particularly painful, and the result showed performance sufficient for our needs. This method also opens up the possibility of using other libraries, for example OpenCV for image pre- and post-processing, which we happily did.
The major showstopper for integration was the lack of built-in model encryption capabilities in all frameworks we had examined. That’s why we opted to compile TensorFlow Lite from sources and implement model encryption ourselves.
Currently, we have four SDK versions prepared for all major platforms with fast, stable, and high-quality background segmentation. Already, there are integrations with commercial projects, and we will certainly continue to develop this product.
Stunning virtual background functionality for your software.