Where Unmanned Aerial Vehicles Take Off and Large Language Models Unfold!
🏡 About
This repository accompanies the work: UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility
This is an actively maintained repository; watch it to follow the latest advances.
If you find it useful, please star ⭐ this repo and cite the paper.
🔥 News
[2025-03-25] 🎉 Our paper "UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility" has been accepted by Information Fusion! Stay tuned for the camera-ready version.
[2024-12-28] This repository is newly launched to explore the synergy between Unmanned Aerial Vehicles (UAVs) and Large Language Models (LLMs). We will continually update it with fresh papers, demos, and insights.
[2024-12-27] Fei Lin and Yonglin Tian curated this list and published the first version.
If you have any questions or suggestions, please feel free to open an issue or contact us via email.
Introduction
This repository accompanies our work on "UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility".
Here, we primarily store various tables referenced in the survey/overview paper. These tables focus on:
Summaries of typical LLMs, VLMs, and VFMs
Awesome works on Foundation Model-based UAV systems
UAV-oriented Datasets across multiple application domains
Note: The goal is to provide a structured, easy-to-navigate resource for researchers interested in the intersection of UAVs and Large Language Models.
Mapping images are provided as 24-bit PNG files at a resolution of 5280x3956. Video images are provided as JPG files at a resolution of 3840x2160. 16 possible class labels are detailed.
2,864 videos, each with 5 descriptions, totaling 14,320 texts. Each video lasts 5 seconds and is captured at 30 frames/second with a resolution of 640 × 640 pixels.
A total of 2,864 videos spanning 25 categories, including disaster events, traffic accidents, and sports competitions. Each video lasts 5 seconds at 24 frames/second.
Nearly 1.7 million well-aligned visible-thermal (RGB-T) image pairs across 500 sequences for RGB-T tracking, covering 13 sub-classes and 15 scenes across 2 cities.
19,000+ target tracks containing 6 types of targets, about 20,000 target interactions and 40,000 target-environment interactions, covering 100+ scenes on a university campus.
67,428 videos (155 action types, 119 subjects), 22,476 frames with annotated key points (17 key points), 41,290 frames for person re-identification (1,144 identities), and 22,263 frames for attribute recognition (e.g., gender, hat, backpack).
The dataset consists of 2,200 annotated thermal-infrared and sRGB image pairs, plus video data from 7 traffic scenes with a total duration of approximately 240 minutes. Each scene includes a high-precision map providing detailed layout and topological information.
263 videos with 179,264 frames, 10,209 still images, and more than 2,500,000 object instance annotations. The data spans 14 different cities and covers a wide range of weather and lighting conditions.
A total of 173 aerial images were collected: 135 in the training set with 23,543 vehicles and 38 in the test set with 5,545 vehicles. The images have 60% regional overlap, but there is no overlap between the training and test sets.
32,823 video frames at 1920x1080 resolution and 30 FPS, divided into 30,000 training/validation samples and 2,823 test samples. The 8 videos total about 2 hours and contain 132,034 instances distributed across 8 categories.
More than 1 million targets across 60 categories, including vehicles, buildings, facilities, and boats, organized into seven parent categories and several sub-categories.
800 high-resolution images, of which 650 contain targets and 150 are background images, covering 10 categories (e.g., aircraft, ships, bridges) and totaling more than 3,000 targets.
984 high-resolution RGB images (5472 × 3648 pixels), 93 of which have detailed polygonal annotations, divided into 3 to 4 categories (small, medium, large, and background).
10,607 UAV images containing 17 classes of power assets with a total of 28,933 labeled instances, and defect labels for 5 assets with a total of 402 defect samples classified into 6 defect types.
The full dataset contains 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets. The semantic segmentation labels include: Background, Building Flooded, Building Non-Flooded, Road Flooded, Road Non-Flooded, Water, Tree, Vehicle, Pool, Grass (see the label-map sketch after this list).
It includes 24 types of UAV signals (9 types acquired outdoors and 15 types acquired indoors) and 1 type of background signal, covering 3 ISM frequency bands.
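To illustrate how the ten flood-segmentation class names listed above might be consumed in practice, here is a minimal Python sketch that maps them to integer indices before training a segmentation model. The ordering and the `CLASS_NAMES` / `CLASS_TO_INDEX` names are assumptions for illustration only, not the dataset's official label map.

```python
# Hypothetical sketch: map the semantic segmentation class names listed
# above to integer indices. The ordering below is an assumption for
# illustration, not the dataset's official label map.
CLASS_NAMES = [
    "Background",
    "Building Flooded",
    "Building Non-Flooded",
    "Road Flooded",
    "Road Non-Flooded",
    "Water",
    "Tree",
    "Vehicle",
    "Pool",
    "Grass",
]

# Build a name -> index lookup for converting label names to class ids.
CLASS_TO_INDEX = {name: idx for idx, name in enumerate(CLASS_NAMES)}

if __name__ == "__main__":
    print(CLASS_TO_INDEX["Water"])  # -> 5 under this assumed ordering
```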
Zhong et al. (A Safer Vision-Based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs)
We want to thank the following contributors for creating, maintaining, and curating the tables in this repository:
Yonglin Tian
Fei Lin
Yiduo Li
Tengchao Zhang
Xuan Fu
If you have any questions about this repository, feel free to get in touch with Yonglin Tian📧 or Fei Lin📧.
(If you would like to contribute to this repo, please open an Issue or Pull Request.)
Star History
Citation
If you find this repository useful, please consider citing this paper:
@misc{tian2025uavsmeetllmsoverviews,
title={UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility},
author={Yonglin Tian and Fei Lin and Yiduo Li and Tengchao Zhang and Qiyao Zhang and Xuan Fu and Jun Huang and Xingyuan Dai and Yutong Wang and Chunwei Tian and Bai Li and Yisheng Lv and Levente Kovács and Fei-Yue Wang},
year={2025},
eprint={2501.02341},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.02341},
}