Enhancing Navigational Scene Understanding using Integrated Language Models in Maritime Environments

Korea Advanced Institute of Science and Technology (KAIST)
IROS 2025 review

*Corresponding Author

Abstract

In this study, we introduce an innovative algorithm for enhanced navigational scene understanding in complex canal environments by utilizing large language models (LLM) and visual language models (VLM) to achieve autonomous maritime situational awareness. The proposed algorithm interprets the meanings of various features and marks on detected objects in maritime contexts. By combining this information with radar and camera data, the algorithm generates cost maps for safe navigation. This approach offers two key benefits: (1) the ability to identify navigable areas considering obstacles, maritime marks, rules, and ship intentions, and (2) decision-making support based on reasoning, bridging the information gap between human operators and perception results. The performance of the proposed approach is demonstrated using a real-world dataset.

Flowchart of the proposed algorithm. The algorithm has three stages: Stage I detects objects with a predefined class, followed by coordinate transformation and tracking. Stage II detects maritime marks using VLM and applies LLM for scene understanding. In Stage III, the data is integrated to create a Scene Understanding Cost Map for safe path planning and logical inferences.

Methodology

Perception of Extrinsic Features

Image Description
In maritime environments, where proactive avoidance is essential, detecting distant objects is crucial. We used the RT-DETR and YOSO models for precise detection. RT-DETR excels in detecting distant boats, while YOSO provides pixel-level segmentation of land and bridge structures. The proposed algorithm integrates image data into the radar coordinate system using extrinsic and intrinsic parameters of the camera. Fig. 2(a) shows accurate boat detection, while (b) demonstrates the correct generation of land and bridge information. The results are displayed in Fig. 2(c), where boat detection and segmentation of bridges and land are accurately aligned with the radar coordinates.

Perception of Intrinsic Features

VLM Prompt Detection

Image Description
To detect complex objects in marine situations, we employed the VLM model, specifically Grounding DINO. This model uses free-form prompts to perform detection for additional object classes.

LLM Navigational Scene Understanding

Image Description
In the LLM Navigational Scene Understanding module, object detection results from Stage I and VLM prompt detection from Stage II are used for maritime scene understanding. GPT-4o was chosen for its ability to handle visual referring prompting and complex environments.

Scene Understanding Cost Map

Image Description
The results of the VLM and LLM in Stage II are reflected as follows: Maritime marks, such as bridge and buoy marks, and ship intentions are incorporated into obstacle zones with the probability. An occupancy grid map was generated based on the results from Stages I and II using the occupancy grid map model. As shown in Fig. 5(b), the bridge pillars visible from the camera are precisely detected as obstacle areas, and the area between the bridge pillars is designated as a navigable zone due to the camera’s ability to detect height. This allows for flexible application according to the given situation. Fig. 5(c) illustrates this, where maritime marks, ship intentions, and rules are integrated, marking the left-side route as unnavigable.

Experiment Results of the proposed algorithm.

BibTeX

@inproceedings{shin2024llmship,
  title={Enhancing Navigational Scene Understanding using Integrated Language Models in Maritime Environments},
  author={Shin, Yeongha and Kim, Jinwhan},
  booktitle={},
  year={2025},
  organization={},
  note={Robotics Program, Korea Advanced Institute of Science and Technology (KAIST)}
}