SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

ECCV 2024


1 Fraunhofer Heinrich Hertz Institute, HHI
2 Humboldt University of Berlin

Abstract

In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image while requiring minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which comprises only approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and the query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to state-of-the-art methods while also estimating more degrees of freedom of the camera pose.

Examples

SPVLoc estimates the 6D camera pose of a query image relative to an untextured 3D reference model with semantic annotations. It establishes the perspective relation between the rendered panorama and the query image by predicting the viewport of the camera within a set of rendered panoramas.

The visualization includes the input image in the top row, the model rendered with the estimated top-1 pose in the second row, the selected reference panorama with the estimated viewport in the third row, and a map showing the estimated position (green circle) alongside the ground truth position (gray circles with radii of 50 cm and 100 cm) in the bottom row.

Example results of SPVLoc

Method

The content of the query image is searched inside the semantic panorama rendering via cross-domain image-to-panorama matching. First, both inputs are encoded with CNN backbones, then the feature tensors are interwoven via depth-wise correlations and multiple tensor operations. This matching is inspired by DTOID, a generic 2D object instance detection approach.
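The depth-wise correlation at the heart of this matching step can be illustrated with a minimal sketch: each channel of the query feature tensor is used as a per-channel kernel that is slid over the matching channel of the panorama feature tensor. The function name and the numpy implementation below are illustrative assumptions, not the paper's actual code (which would typically use a batched GPU convolution).

```python
import numpy as np

def depthwise_correlation(pano_feat, query_feat):
    """Correlate each channel of the query feature map with the matching
    channel of the panorama feature map (valid padding, no stride).

    pano_feat:  (C, H, W) panorama feature tensor
    query_feat: (C, h, w) query-image feature tensor, used as a kernel
    returns:    (C, H-h+1, W-w+1) per-channel correlation map
    """
    C, H, W = pano_feat.shape
    _, h, w = query_feat.shape
    out = np.empty((C, H - h + 1, W - w + 1), dtype=pano_feat.dtype)
    for c in range(C):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                # Dot product between the query kernel and the
                # panorama patch at this spatial offset.
                out[c, i, j] = np.sum(
                    pano_feat[c, i:i + h, j:j + w] * query_feat[c]
                )
    return out

# Toy feature tensors: 8 channels, panorama wider than the query.
pano = np.random.rand(8, 16, 32).astype(np.float32)
query = np.random.rand(8, 4, 4).astype(np.float32)
R = depthwise_correlation(pano, query)
print(R.shape)  # (8, 13, 29)
```

High responses in the resulting map indicate spatial offsets where the query content is likely located inside the panorama; the subsequent tensor operations in the network then fuse these per-channel responses into the correlated features R*.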

The correlated features R* are fed into multiple network heads, which estimate the binary mask corresponding to the viewport of the camera in the panorama, the 2D bounding box enclosing the estimated viewport mask, and the relative 6D pose offset between the camera image and the panorama. The key point is that each match is given a score via the bounding box estimation, which can be used to select the best match from a large number of reference panoramas. The absolute pose is determined from the estimated relative pose between this panorama and the camera image.
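The selection and pose-composition logic described above can be sketched as follows. This is a simplified illustration under assumed conventions (per-panorama scalar scores, 4x4 homogeneous pose matrices); the function and variable names are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def select_best_match(scores):
    """Return the index of the reference panorama whose estimated
    viewport bounding box received the highest matching score."""
    return int(np.argmax(scores))

def absolute_camera_pose(T_world_pano, T_pano_cam):
    """Chain the known pose of the selected panorama (4x4 homogeneous
    matrix, panorama-to-world) with the network's estimated relative
    pose offset (camera-to-panorama) to get the absolute camera pose."""
    return T_world_pano @ T_pano_cam

# Toy example: three candidate panoramas, the second scores highest.
scores = [0.12, 0.87, 0.43]
best = select_best_match(scores)

# Assumed panorama pose: placed at (2.0, 0.0, 1.5) with no rotation.
T_world_pano = np.eye(4)
T_world_pano[:3, 3] = [2.0, 0.0, 1.5]

# Identity relative pose for illustration (camera coincides with
# the panorama center); the network would estimate this offset.
T_pano_cam = np.eye(4)

T_world_cam = absolute_camera_pose(T_world_pano, T_pano_cam)
print(best, T_world_cam[:3, 3])  # 1 [2.  0.  1.5]
```

Because the score is produced per panorama, the same query can be matched against many reference renderings and only the single best candidate is carried forward to the relative pose estimation.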

Matching procedure of SPVLoc

BibTeX

If you use our method in your research, please cite our paper. You can use the following BibTeX entry:


@article{Gard2024_SPVLOC,
  title={SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments},
  author={Niklas Gard and Anna Hilsmann and Peter Eisert},
  journal={arXiv preprint arXiv:2404.10527},
  year={2024}
}