A Benchmark for UAV-View Natural Language Guided Tracking
We propose UAVNLT, a new benchmark for the UAV-view natural language guided tracking task. UAVNLT consists of videos captured by UAV cameras of vehicles on city roads in four cities. For each video, the vehicles' bounding boxes, trajectories, and natural language descriptions are carefully annotated. Compared with existing datasets, which are annotated only with bounding boxes, the natural language sentences in our dataset make it better suited to applications where humans are part of the system: language is not only friendlier for human-computer interaction, but can also overcome the low uniqueness of appearance features for tracking. We evaluate several existing methods on our new benchmark and find that their performance is unsatisfactory. To pave the way for future work, we propose a baseline method suitable for this task, achieving state-of-the-art performance. We believe our new dataset and proposed baseline method will be helpful in many fields, such as smart cities, smart transportation, and vehicle management.
Coming soon.