RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI
🌊🌺ICCV 2025🌺🌊

TL;DR

RIPE demonstrates that keypoint detection and description can be learned with Reinforcement Learning from image pairs only - no depth, no pose, no artificial augmentation required.

RIPE: A short walkthrough of its main components.

Contributions

  1. RIPE introduces a weakly-supervised training framework based on Reinforcement Learning that uses only binary-labeled image pairs (same scene or not), removing the need for depth or pose information.
  2. The method enables training on diverse, unconstrained datasets, improving generalization to real-world conditions.
  3. Rewards are derived from the epipolar constraint, ensuring consistency with core principles of multi-view geometry.
  4. RIPE uses multi-scale hypercolumn features to enhance the connection between keypoint locations and descriptors.
  5. A contrastive loss further improves descriptor quality, making them more robust and discriminative.
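To make the epipolar-constraint reward concrete, here is a minimal sketch in numpy. The function name, the Sampson-distance measure, and the +1/-1 thresholding scheme are illustrative assumptions, not RIPE's exact formulation: a putative match is rewarded when its epipolar residual under a known fundamental matrix F is small.

```python
import numpy as np

def epipolar_reward(F, pts1, pts2, threshold=1.0):
    """Toy epipolar-constraint reward (hypothetical, not RIPE's exact code):
    a match (x1, x2) is rewarded when x2^T F x1 is near zero, measured via
    the first-order Sampson distance."""
    # lift to homogeneous coordinates
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # (N, 3)
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    Fx1 = x1 @ F.T   # (N, 3) epipolar lines of pts1 in image 2
    Ftx2 = x2 @ F    # (N, 3) epipolar lines of pts2 in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    sampson = num / np.maximum(den, 1e-12)
    # binary reward: +1 for geometrically consistent matches, -1 otherwise
    return np.where(sampson < threshold, 1.0, -1.0)
```

Because the reward only checks consistency with two-view geometry, it needs no depth maps or ground-truth correspondences, only the relative geometry implied by the pair.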

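The contrastive descriptor objective can be sketched as an InfoNCE-style loss: matched descriptor pairs are pulled together while all other descriptors in the batch act as negatives. This is a minimal numpy illustration under assumed names and a hypothetical temperature value, not the paper's implementation.

```python
import numpy as np

def info_nce(desc1, desc2, temperature=0.1):
    """Illustrative contrastive (InfoNCE-style) descriptor loss: row i of
    desc1 and row i of desc2 describe the same keypoint (positives); all
    other rows serve as negatives."""
    # L2-normalize so the dot product is cosine similarity
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    sim = d1 @ d2.T / temperature            # (N, N) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))    # maximize matched-pair likelihood
</imports>```

Minimizing this loss makes descriptors of the same 3D point similar and descriptors of different points dissimilar, i.e. more robust and discriminative.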
Key Insight

RIPE demonstrates that keypoint detection and description can be learned using only image pairs. A positively labeled image pair, i.e. one in which a sufficient number of the underlying 3D scene points appear in both images, contains enough implicit information to guide the learning process. Leveraging the epipolar constraint prevents collapse to trivial solutions and provides a strong reward signal. Despite relying on this much weaker training signal, RIPE performs on par with fully supervised extractors.
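Because the epipolar reward is non-differentiable with respect to which keypoints get selected, the detector can be trained with a policy-gradient (REINFORCE-style) update. The toy 1-D sketch below, with hypothetical names and a simplified Bernoulli keep/drop policy per candidate keypoint, shows the idea; it is not RIPE's actual training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-candidate keypoint scores (a 1-D toy stand-in for a score map)
logits = rng.normal(size=8)

def reinforce_step(logits, reward_fn, lr=0.1):
    """One REINFORCE update: sample a binary keep/drop decision per candidate,
    score the sampled set with a reward (e.g. the epipolar reward), and push
    the policy toward decisions that earned high reward."""
    p = 1.0 / (1.0 + np.exp(-logits))               # Bernoulli keep-probabilities
    sample = (rng.random(p.shape) < p).astype(float)
    r = reward_fn(sample)                           # scalar reward for this sample
    # gradient of log Bernoulli(sample | p) w.r.t. logits is (sample - p)
    grad = r * (sample - p)
    return logits + lr * grad, r
```

Repeating this update raises the selection probability of keypoints that tend to participate in geometrically consistent matches, without ever needing per-keypoint ground truth.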

Poster

TBD

BibTeX


@article{ripe2025,
  title   = {{RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction}},
  author  = {Künzel, Johannes and Hilsmann, Anna and Eisert, Peter},
  journal = {arXiv},
  eprint  = {2507.04839},
  year    = {2025},
}