US20250356667
2025-11-20
Physics
G06V20/64
The method and apparatus described here perform three-dimensional (3D) object detection through a sequence of image-processing steps. First, two-dimensional (2D) image features are extracted from multiple images by an image backbone. A view transformer, designed for domain generalization, then converts these features into a 3D feature map that incorporates depth prediction information.
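The lifting step above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the common approach of predicting a per-pixel distribution over discrete depth bins and taking its outer product with the 2D features to form a 3D frustum volume. All shapes and names are hypothetical.

```python
import numpy as np

# Hypothetical shapes: C feature channels, D depth bins, HxW feature map.
C, D, H, W = 8, 4, 5, 6
rng = np.random.default_rng(0)

feats_2d = rng.standard_normal((C, H, W))      # 2D features from the image backbone
depth_logits = rng.standard_normal((D, H, W))  # per-pixel depth-bin logits

# Softmax over the depth axis turns logits into a depth distribution per pixel.
depth_prob = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
depth_prob /= depth_prob.sum(axis=0, keepdims=True)

# Outer product lifts the 2D features into a 3D frustum volume (C, D, H, W):
# each depth bin receives the pixel's features weighted by its depth probability.
frustum = feats_2d[:, None, :, :] * depth_prob[None, :, :, :]
print(frustum.shape)  # (8, 4, 5, 6)
```

The resulting frustum volume is what a subsequent pooling step can accumulate into a 3D (or bird's-eye-view) feature map.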
A BEV encoder then transforms the 3D feature map into a bird's eye view (BEV) feature, from which a detection head predicts the position and class of each detected object. Within the view transformer, a DepthNet predicts per-pixel depth outputs, which are fed together with the 2D features into a BEV pooling operation, enabling accurate 3D mapping.
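BEV pooling is commonly realized as a scatter-add: each lifted feature point is assigned to the BEV grid cell its 3D position falls into, and features landing in the same cell are summed. The sketch below assumes that formulation; the point-to-cell assignment is randomized purely for illustration.

```python
import numpy as np

# Toy BEV pooling: sum frustum features into a flat BEV grid by cell index.
rng = np.random.default_rng(1)
C, N = 4, 10                    # feature channels, number of frustum points
bev_x, bev_y = 3, 3             # BEV grid resolution (assumed)

feats = rng.standard_normal((N, C))           # lifted features per frustum point
cell_idx = rng.integers(0, bev_x * bev_y, N)  # BEV cell each point falls in

bev_flat = np.zeros((bev_x * bev_y, C))
np.add.at(bev_flat, cell_idx, feats)          # unbuffered scatter-add ("BEV pool")
bev = bev_flat.reshape(bev_x, bev_y, C)       # (X, Y, C) BEV feature map
```

`np.add.at` is used rather than plain indexing so that repeated cell indices accumulate instead of overwriting each other.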
A key aspect of the method is a relative depth normalization technique, which reduces errors in depth and position prediction caused by differences in camera parameters. It involves computing a transformation matrix describing the geometric transformation between each camera pair. The method further refines alignment between images using photometric matching based on the depth predictions.
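Two pieces of this step can be illustrated concretely. The normalization below assumes one common form of camera-aware depth normalization (rescaling metric depth by a focal-length ratio so targets are comparable across cameras); the patent's exact formulation may differ. The relative transform between a camera pair follows directly from their extrinsic matrices.

```python
import numpy as np

# Assumed form of relative depth normalization: rescale metric depth by the
# ratio of a reference focal length to the camera's focal length, so depth
# targets are comparable across cameras with different intrinsics.
def normalize_depth(depth, focal, ref_focal=1000.0):
    return depth * (ref_focal / focal)

# Relative pose between a camera pair from their 4x4 world-to-camera extrinsics:
# T_rel maps points expressed in camera a's frame into camera b's frame.
def relative_transform(T_a, T_b):
    return T_b @ np.linalg.inv(T_a)

# Example: camera b is camera a translated by 1 along x.
T_a = np.eye(4)
T_b = np.eye(4)
T_b[0, 3] = 1.0
T_rel = relative_transform(T_a, T_b)
```

With `T_rel`, depth predictions from one camera can be reprojected into the other view, which is what the photometric matching step compares against.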
The system incorporates domain adaptation adapters in components such as the image backbone and the view transformer. These adapters expose a small set of fine-tunable parameters and use operations such as skip connections so that gradients propagate and update effectively during adaptation. This adaptability allows the system to generalize across various domains.
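A common adapter design matching this description is a small bottleneck module with a residual skip connection, inserted alongside frozen backbone layers. The sketch below is a generic example of that pattern, not the patent's specific module; the zero-initialized up-projection makes the adapter start as an identity function, so inserting it does not disturb the pretrained network.

```python
import numpy as np

class Adapter:
    """Hypothetical bottleneck adapter with a residual skip connection."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02  # down-project
        self.w_up = np.zeros((bottleneck, dim))  # zero-init: identity at start

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # down-project + ReLU
        return x + h @ self.w_up              # skip connection preserves input path

adapter = Adapter(dim=6, bottleneck=3)
x = np.ones((2, 6))
y = adapter(x)  # initially identical to x, thanks to the zero-init up-projection
```

During fine-tuning only `w_down` and `w_up` would be updated, while the surrounding backbone stays frozen; the skip connection gives gradients a direct path to those parameters.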
The method also augments the 3D feature map via decoupling-based image depth estimation, enhancing its robustness. The described system can be implemented in electronic devices with a memory and a processor capable of executing the corresponding instructions, making it applicable to fields such as autonomous vehicle navigation and robotics.
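The summary does not spell out what is decoupled. One plausible reading, consistent with the relative depth normalization above, is factoring metric depth into a camera-independent relative depth map and a separate scale; the sketch below illustrates only that assumed factorization, with purely illustrative names and values.

```python
import numpy as np

# Assumed decoupling: metric depth = per-image scale x relative depth map.
# The relative map carries scene structure; the scale carries camera-dependent
# magnitude, so the two can be predicted (and supervised) independently.
def recombine_depth(relative_depth, scale):
    return scale * relative_depth

rel = np.array([[0.5, 1.0],
                [2.0, 4.0]])           # camera-independent relative depth
metric = recombine_depth(rel, scale=3.0)  # metric depth for this camera
```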