Trajectory Extraction and 3D Pose Estimation using DeepPose and TrailerDetector

Motivation

Trajectory extraction from real-world video footage is a challenging task. It is, however, crucial for developing and testing more robust methods for path planning and autonomous driving. In this work, we extend the existing video post-processing library, video trajectory extraction, in three ways. First, we integrate an alternative deep learning-based method for 3D pose estimation. Second, we evaluate it and compare it to the other approaches (the Box center, Box fit and Linear fit estimators). Third, we integrate a car trailer detector that cleans up persistent trajectory artifacts. These artifacts can result from trailers detected as separate vehicles, multiple detections of the same entity, or random objects in the scene wrongly detected as vehicles. We also add an evaluation module that uses a third-party tool, Supervisely, to support labeling test data and evaluating the performance of the pose estimation module.

Research on autonomous driving has grown markedly. The task is particularly challenging because of the highly dynamic nature of the environment and the interaction between traffic agents. Furthermore, since real-world testing is costly and dangerous, there has been a surge of interest in simulation environments for testing self-driving algorithms and training deep-learning or reinforcement-learning based models. One such simulation environment is CommonRoad. In this work, we add two additional layers of computation to the CommonRoad Trajectory Extraction pipeline:

DeepPose: A CNN-based approach for local orientation and dimensions estimation. Furthermore, it includes functions to estimate the observation angle (global orientation) which is then used to compute a 3D Bounding Box based on a geometrical method presented in this paper.
TrailerDetector: A module for removing certain types of trajectory estimation errors, namely car trailers which can be detected as separate entities and double detections of vehicles where a vehicle is assigned two overlapping trajectories.

Finally, we provide an evaluation of the four 3D pose estimators currently implemented in the pipeline: BoxCenter, BoxFit, LinearFit and DeepPose.

The Munich Highlight Tower Dataset

The Munich Highlight Tower Dataset is a collection of traffic footage. It differs from many traffic datasets in that it captures the vehicles from a higher altitude. It is divided into three sub-datasets that differ in perspective: Ring ost, Ring west and Zip merge.

Samples from the Munich Highlight Tower Dataset, from left to right, ring_ost, ring_west and zip_merge

Additions to the CommonRoad Pipeline

In this work, we introduce additional modules to the CommonRoad pipeline mentioned above. We integrate a deep pose estimation module, which uses a CNN to predict local orientation and dimensions. We also add a pose evaluation module in order to compare different pose estimation approaches. Finally, we implement a car trailer detector, which helps clear some of the trajectory extraction artifacts, namely car trailers and multiple detections of the same vehicle.

The existing CommonRoad Pipeline

DeepPose

The DeepPose module follows an implementation proposed by this paper, which in turn uses an approach to 3D pose estimation introduced by another work.

In the proposed approach, a MultiBin architecture is used for orientation estimation. The orientation angle is discretized into \(n\) overlapping bins. For each bin, the neural network estimates the probability \(c_i\) that the output angle lies inside the \(i\)th bin, together with the residual rotation correction that must be applied to the orientation of the center ray of that bin to obtain the output angle. The figure below shows the architecture of the convolutional neural network. It consists of three branches.

Left: Estimation of the dimensions of the object of interest.
Center: Estimation of \(\cos(\delta \theta)\) and \(\sin(\delta \theta)\) for each bin.
Right: Estimation of the confidence for each bin.

The DeepPose CNN architecture
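
To make the decoding step concrete, the following is a minimal sketch of how the outputs of the center and right branches could be combined into a single angle. It is not the code used in the pipeline. The function name decode_multibin, the evenly spaced bin centers starting at 0, and the wrapping of the result to \([-\pi, \pi)\) are all assumptions made for illustration; actual implementations may place the bin centers differently.

```python
import math
from typing import Sequence, Tuple


def decode_multibin(confidences: Sequence[float],
                    residuals: Sequence[Tuple[float, float]]) -> float:
    """Turn MultiBin outputs into one orientation angle in radians.

    confidences[i] is the confidence c_i of bin i.
    residuals[i] is the pair (cos(delta_theta), sin(delta_theta)) of bin i.
    Assumption: bin centers are evenly spaced over [0, 2*pi), starting at 0.
    """
    n_bins = len(confidences)
    if n_bins == 0 or len(residuals) != n_bins:
        raise ValueError("need one (cos, sin) residual pair per bin")

    # Pick the bin the network is most confident about.
    best = max(range(n_bins), key=lambda i: confidences[i])

    # Angle of the center ray of that bin (hypothetical layout).
    center = best * 2.0 * math.pi / n_bins

    # Recover the residual correction from its cosine and sine.
    cos_d, sin_d = residuals[best]
    delta = math.atan2(sin_d, cos_d)

    # Apply the correction and wrap the result to [-pi, pi).
    angle = center + delta
    return (angle + math.pi) % (2.0 * math.pi) - math.pi
```

Using atan2 on the (cos, sin) pair means the residual does not have to be exactly normalized, which is convenient when the values come straight from a network output.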

The neural network used for orientation and dimension estimation is implemented in TensorFlow. It uses the pretrained weights provided by this paper. The model was trained on the KITTI dataset.

Trailer Detection

This section describes an additional post-processing layer that has been integrated into the current pipeline to solve one recurring issue: car trailers being detected as separate entities throughout the post-processing pipeline.

Compared to other traffic participants, car trailers exhibit unique characteristics relative to the vehicle they are attached to: they move at the same speed as the car, in the same direction, and stay close to it at all times. These three rather evident characteristics are used to detect such trailers after the smoothing/trajectory generation step has been executed. The thresholds for the velocity and the distance are estimated empirically, through trial and error.
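
The three criteria above can be sketched as a simple pairwise check between two trajectories. This is an illustrative sketch, not the TrailerDetector implementation. It assumes both tracks are lists of (x, y) positions sampled at the same frames with a fixed time step, and the function name is_trailer_pair and all threshold values are hypothetical placeholders rather than the empirically tuned ones.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def velocities(track: List[Point], dt: float) -> List[Point]:
    """Finite-difference velocity between consecutive positions."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


def is_trailer_pair(track_a: List[Point],
                    track_b: List[Point],
                    dt: float = 0.1,
                    max_distance: float = 8.0,       # hypothetical threshold
                    max_speed_diff: float = 1.0,     # hypothetical threshold
                    max_heading_diff: float = math.radians(10.0)) -> bool:
    """Return True if the two tracks stay close, equally fast and parallel."""
    n = min(len(track_a), len(track_b))
    if n < 2:
        return False  # not enough samples to estimate a velocity
    a, b = track_a[:n], track_b[:n]
    va, vb = velocities(a, dt), velocities(b, dt)

    for i in range(n - 1):
        # Criterion 1: the two objects are close to each other.
        if math.hypot(a[i][0] - b[i][0], a[i][1] - b[i][1]) > max_distance:
            return False

        # Criterion 2: they move at (almost) the same speed.
        speed_a = math.hypot(va[i][0], va[i][1])
        speed_b = math.hypot(vb[i][0], vb[i][1])
        if abs(speed_a - speed_b) > max_speed_diff:
            return False

        # Criterion 3: they move in the same direction.
        # Heading is undefined for a standing object, so skip it then.
        if speed_a > 1e-6 and speed_b > 1e-6:
            heading_a = math.atan2(va[i][1], va[i][0])
            heading_b = math.atan2(vb[i][1], vb[i][0])
            diff = (heading_a - heading_b + math.pi) % (2.0 * math.pi) - math.pi
            if abs(diff) > max_heading_diff:
                return False
    return True
```

A pair that passes the check could then be merged into a single trajectory or have the trailer track dropped. Requiring the criteria to hold at every frame is a strict choice; a tolerant variant might instead require them for a large fraction of frames.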

Trailer Detection