AppleGrowthVision

A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards


1 Fraunhofer Heinrich Hertz Institute, HHI
2 Humboldt University of Berlin
3 Fraunhofer Institute for Transportation and Infrastructure Systems, IVI

Agriculture-Vision Workshop at CVPR 2025

TL;DR

We present a large-scale stereo dataset of six validated apple growth stages, enabling accurate classification, 3D reconstruction, and improved fruit detection, with growth stage prediction accuracy of over 95%.

Contributions

  1. We propose a large-scale publicly available resource comprising over 9,300 high-resolution stereo images and 1,125 densely annotated images. It captures a full growth cycle of apple trees, covering six BBCH phenological stages that have been validated by agricultural experts. The dataset adds to existing apple orchard datasets by contributing additional geographic context through imagery from two German orchards (Brandenburg and Saxony).
  2. The dataset includes BBCH growth stage metadata, specifying which image sets correspond to which phenological stages, thereby enabling temporal and phenological analysis. When used to train classification models such as VGG16, ResNet152, DenseNet201, and MobileNetv2, it achieved over 95% accuracy, supporting precision agriculture applications such as fungicide application and yield estimation.
  3. When the AppleGrowthVision dataset was used to augment existing datasets such as MinneApple and MAD for fruit detection, it led to a 7.69% improvement in YOLOv8's F1-score and a substantial 31.06% improvement in Faster R-CNN's F1-score, demonstrating that increased dataset diversity and phenological richness significantly enhance fruit detection accuracy in complex and variable orchard environments.
  4. The paper introduces a calibrated stereo imaging setup and a novel 3D reconstruction pipeline that combines DISK features, LightGlue, and COLMAP, enabling efficient multi-view stereo reconstructions of orchard scenes. Remarkably, the method requires only 18 images per tree, an order of magnitude fewer than previous approaches such as CherryPicker (250 images), making it significantly more practical for large-scale applications. This advancement paves the way for creating detailed digital twins of orchards, supporting precise monitoring and targeted agricultural interventions.
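The F1 gains reported in contribution 3 refer to the harmonic mean of precision and recall over detected fruit. As a quick reference for how such scores are computed from raw detection counts, a minimal sketch (the function name and example counts are illustrative, not from the paper's codebase):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from detection counts: true positives, false positives, false negatives."""
    precision = tp / (tp + fp)  # fraction of predicted boxes that are correct
    recall = tp / (tp + fn)    # fraction of ground-truth fruits that were found
    return 2 * precision * recall / (precision + recall)

# Example: 80 correctly detected apples, 20 spurious boxes, 20 missed apples
print(round(f1_score(80, 20, 20), 2))  # → 0.8
```

Because F1 penalizes both spurious and missed detections, it is a stricter measure than precision or recall alone in densely clustered orchard scenes, where occlusion inflates both error types.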

Remaining challenges

  1. Incomplete Annotation of Growth Stages: While the dataset includes expert-validated BBCH growth stages, annotations are currently limited to primary stages. Many agricultural decisions (e.g., fungicide application, precise yield estimation, optimal harvest timing) require fine-grained annotations of secondary BBCH stages, which are not yet comprehensively labeled in the dataset.
  2. Limitations in Automatic Labeling: The semi-automated annotation approach used (AI-assisted YOLOv8 with human verification) is a step forward but still insufficient for reliable, fully automated crop monitoring and large-scale yield prediction. Enhancing the accuracy and consistency of automatic annotation tools remains a critical challenge.
  3. Reconstruction in Complex Environments: Although the paper presents a novel reconstruction pipeline for large orchard scenes, traditional 3D reconstruction tools like COLMAP struggle in these environments due to dense vegetation, occlusion, and wind-induced motion. The current pipeline relies on manual calibration inputs and is not yet fully robust or scalable for all conditions.
  4. Handling Occlusions and Dense Clusters: Fruit detection in highly occluded and densely clustered scenes remains a challenge. Models still face difficulty distinguishing overlapping fruits and dealing with variable lighting and scale changes, especially in real-world orchard settings.
  5. Real-Time Inference and Model Efficiency: The models evaluated, including YOLOv8 and Faster R-CNN, show promising accuracy but may not be optimized for real-time inference on edge devices used in agricultural robotics. Further testing with lightweight models (e.g., YOLO-NAS, MobileNetV3, EfficientDet) is necessary for deployment in constrained environments.
  6. Multimodal Data Integration: The dataset currently focuses on stereo RGB imagery. Integrating additional data modalities such as multispectral, thermal, or LiDAR could significantly improve robustness and utility for broader phenotyping tasks, but this is not yet explored in this work.
  7. Benchmarking with Advanced Architectures: The study primarily evaluates CNN-based models and does not include transformer-based architectures like DETR or Vision Transformers, which may offer better performance in capturing complex spatial relationships and occlusions in orchard imagery.

Abstract

Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints: the lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high-resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 777 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 21,934 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 5% in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN's F1-score by 28%. Additionally, six BBCH stages were predicted with over 95% accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.

BibTeX

If you use our method in your research, please cite our paper. The paper was presented at the Agriculture-Vision Workshop at CVPR 2025 and published in the official proceedings.

@InProceedings{von_Hirschhausen_2025_CVPR,
    author    = {von Hirschhausen, Laura-Sophia and Magnusson, Jannes S. and Kovalenko, Mykyta and Boye, Fredrik and Rawat, Tanay and Eisert, Peter and Hilsmann, Anna and Pretzsch, Sebastian and Bosse, Sebastian},
    title     = {AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchards},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {5443-5450}
}