Where Unmanned Aerial Vehicles Take Off and Large Language Models Unfold!
🏡 About
This repository accompanies the work: UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility
This is an actively maintained repository; watch it to follow the latest advances.
If you find it useful, please star ⭐ this repo and cite the paper.
🔥 News
[2025-03-25] 🎉 Our paper "UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility" has been accepted by Information Fusion! Stay tuned for the camera-ready version.
[2024-12-28] This repository is newly launched to explore the synergy between Unmanned Aerial Vehicles (UAVs) and Large Language Models (LLMs). We will continually update it with fresh papers, demos, and insights.
[2024-12-27] Fei Lin and Yonglin Tian curated this list and published the first version.
If you have any questions or suggestions, please feel free to open an issue or contact us via email.
Introduction
This repository accompanies our work on "UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility".
Here, we primarily store various tables referenced in the survey/overview paper. These tables focus on:
Summaries of typical LLMs, VLMs, and VFMs
Awesome works on Foundation Model-based UAV systems
UAV-oriented Datasets across multiple application domains
Note: The goal is to provide a structured, easy-to-navigate resource for researchers interested in the intersection of UAVs and Large Language Models.
Mapping images are provided as 24-bit PNG files at a resolution of 5280x3956. Video images are provided as JPG files at a resolution of 3840x2160. 16 possible class labels are detailed.
2,864 videos, each with 5 descriptions, totaling 14,320 texts. Each video lasts 5 seconds and is captured at 30 frames/second with a resolution of 640 × 640 pixels.
A total of 2,864 videos spanning 25 categories, including disaster events, traffic accidents, and sports competitions. Each video lasts 5 seconds at 24 frames/second.
Nearly 1.7 million well-aligned visible-thermal (RGB-T) image pairs across 500 sequences for RGB-T tracking, covering 13 sub-classes and 15 scenes across 2 cities.
19,000+ target tracks containing 6 types of targets, about 20,000 target interactions and 40,000 target-environment interactions, covering 100+ scenes on a university campus.
67,428 videos (155 action types, 119 subjects), 22,476 frames with annotated key points (17 key points), 41,290 frames for person re-identification (1,144 identities), and 22,263 frames for attribute recognition (e.g., gender, hat, backpack).
The dataset consists of 2,200 annotated thermal-infrared and sRGB image pairs, plus video data from 7 traffic scenes with a total duration of approximately 240 minutes. Each scene includes a high-precision map providing detailed layout and topological information.
263 videos with 179,264 frames, 10,209 still images, and more than 2,500,000 object instance annotations. The data spans 14 different cities and covers a wide range of weather and lighting conditions.
A total of 173 aerial images were collected: 135 in the training set with 23,543 vehicles and 38 in the test set with 5,545 vehicles. The images have 60% regional overlap, but there is no overlap between the training and test sets.
32,823 video frames at 1920x1080 resolution and 30 FPS, divided into 30,000 training/validation samples and 2,823 test samples. The 8 videos total about 2 hours and contain 132,034 instances distributed across 8 categories.
More than 1 million targets across 60 categories, including vehicles, buildings, facilities, and boats, organized into seven parent categories and several sub-categories.
800 high-resolution images, of which 650 contain targets and 150 are background images, covering 10 categories (e.g., aircraft, ships, bridges) and totaling more than 3,000 targets.
984 high-resolution RGB images (5472 × 3648 pixels), 93 of which have detailed polygonal annotations, divided into 3 to 4 categories (small, medium, large, and background).
10,607 UAV images containing 17 classes of power assets with a total of 28,933 labeled instances, and defect labels for 5 assets with a total of 402 defect samples classified into 6 defect types.
The full dataset contains 2,343 images, divided into training (~60%), validation (~20%), and test (~20%) sets. The semantic segmentation labels include: Background, Building Flooded, Building Non-Flooded, Road Flooded, Road Non-Flooded, Water, Tree, Vehicle, Pool, Grass (see the label-map sketch after this list).
It includes 24 types of UAV signals (9 types acquired outdoors and 15 types acquired indoors) and 1 type of background signal, covering 3 ISM frequency bands.
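To illustrate how the ten flood-segmentation class names listed above might be consumed in practice, here is a minimal Python sketch that maps them to integer indices before training a segmentation model. The ordering and the `CLASS_NAMES` / `CLASS_TO_INDEX` names are assumptions for illustration only, not the dataset's official label map.

```python
# Hypothetical sketch: map the semantic segmentation class names listed
# above to integer indices. The ordering below is an assumption for
# illustration, not the dataset's official label map.
CLASS_NAMES = [
    "Background",
    "Building Flooded",
    "Building Non-Flooded",
    "Road Flooded",
    "Road Non-Flooded",
    "Water",
    "Tree",
    "Vehicle",
    "Pool",
    "Grass",
]

# Build a name -> index lookup for converting label names to class ids.
CLASS_TO_INDEX = {name: idx for idx, name in enumerate(CLASS_NAMES)}

if __name__ == "__main__":
    print(CLASS_TO_INDEX["Water"])  # -> 5 under this assumed ordering
```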
Zhong et al. (A Safer Vision-Based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs)
We want to thank the following contributors for creating, maintaining, and curating the tables in this repository:
Yonglin Tian
Fei Lin
Yiduo Li
Tengchao Zhang
Xuan Fu
If you have any questions about this repository, feel free to get in touch with Yonglin Tian📧 or Fei Lin📧.
(If you would like to contribute to this repo, please open an Issue or Pull Request.)
Star History
Citation
If you find this repository useful, please consider citing this paper:
@misc{tian2025uavsmeetllmsoverviews,
title={UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility},
author={Yonglin Tian and Fei Lin and Yiduo Li and Tengchao Zhang and Qiyao Zhang and Xuan Fu and Jun Huang and Xingyuan Dai and Yutong Wang and Chunwei Tian and Bai Li and Yisheng Lv and Levente Kovács and Fei-Yue Wang},
year={2025},
eprint={2501.02341},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.02341},
}