Improving 3D Multi-View Object Detection via Explicit Query Supervision / D'Addeo, Filippo; Zinelli, Andrea; Bertozzi, Massimo. - (2025), pp. 1415-1420. (IEEE Intelligent Vehicles Symposium) [10.1109/IV64158.2025.11097467].
Improving 3D Multi-View Object Detection via Explicit Query Supervision
D'Addeo, Filippo; Zinelli, Andrea; Bertozzi, Massimo
2025-01-01
Abstract
Perception is a crucial aspect of an autonomous driving system. One essential task is multi-camera 3D object detection, which allows an intelligent vehicle to detect surrounding obstacles using a camera-only setup. Many current approaches to this task are transformer-based. Specifically, most of them use object queries instead of a Bird's Eye View plane to directly represent the set of possible detections and avoid any post-processing operation, such as non-maximum suppression. However, the ambiguous supervision caused by the bipartite matching loss typically leads to training instability. To overcome this limitation, we propose an additional module that 'pushes' the object queries toward the locations most likely to contain obstacles, both giving the detection module better insight into obstacle positions and stabilizing the bipartite matching during training. We evaluate our proposal against different object query-based baselines on both the test and validation sets of the nuScenes dataset. Specifically, compared to the lightweight PETR architecture, we report an increase of 1.6% in both NDS and mAP under the same configuration settings.
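The bipartite matching the abstract refers to is, in DETR-style detectors such as PETR, typically solved with the Hungarian algorithm over a query-to-ground-truth cost matrix. A minimal sketch of that matching step, with an invented cost matrix purely for illustration (the paper's actual cost terms are not reproduced here):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching cost between 6 object queries (rows) and
# 2 ground-truth boxes (columns), e.g. a weighted sum of
# classification and box-regression costs. Values are made up.
cost = np.array([
    [0.9, 0.2],
    [0.8, 0.7],
    [0.1, 0.9],
    [0.6, 0.5],
    [0.7, 0.6],
    [0.4, 0.8],
])

# Hungarian algorithm: each ground-truth box is assigned to exactly
# one query; all unmatched queries are supervised as "no object".
query_idx, gt_idx = linear_sum_assignment(cost)
print(query_idx, gt_idx)  # -> [0 2] [1 0]: query 0 matches GT 1, query 2 matches GT 0
```

Because small fluctuations in near-tied costs can flip which query each ground truth is assigned to between training iterations, the supervision target of a given query can change abruptly — the instability that the proposed query-supervision module aims to reduce.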


