SkyData Version 1 information and downloadables are given in the tables below.
Downloadables are divided into three parts:

- Labeled data (.png)
- Annotations (.json)
- Unlabeled data (.png)

Labeled data contains image data for object detection/segmentation and object tracking.
The images are frames from seven video recordings (Scene1-7).
Each scene is divided into parts, and the frame indices of each part are shown in the table.
The relations between scenes, parts, and frames are also shown in the table.
Training and test images are identified by the frame indices of each scene.
Each column of the labeled data table is explained below:
SCENE NAME:
SkyData images come from seven different videos, named Scene1 through Scene7.
This column shows the name of the scene.
TOTAL FRAMES:
This column shows the total number of frames in each scene.
TRAINING FRAME INDICES:
This column shows which frames of each scene are used as training data.
The format is [from, to], and both limits are included (see the sketch after this list).
TEST FRAME INDICES:
This column shows which frames of each scene are used as test data.
The format is [from, to], and both limits are included.
FOLDER NAME:
Each scene is divided into parts in SkyData. This column shows the names of the parts.
Since the image data is stored in folders named after their respective parts, this column is called folder name.
FRAME INDICES:
This column shows which frames of each scene belong to that part.
The format is [from, to], and both limits are included.
DOWNLOADS:
Due to the large size of SkyData, the image data can be downloaded in multiple parts instead of all at once.
Downloadables are separated according to their part names and train/test splits.
The expected folder structure is also provided on this page.
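As referenced above, here is a minimal sketch of expanding an inclusive [from, to] range into frame indices; the zero-padded ".png" filename pattern is a hypothetical example, not the dataset's confirmed naming scheme.

```python
def expand_range(start: int, end: int) -> list[int]:
    """Expand an inclusive [from, to] range into the list of frame indices."""
    return list(range(start, end + 1))  # end + 1 because both limits are included

# Example: a part covering frames [1, 300] yields exactly 300 frames.
frames = expand_range(1, 300)
filenames = [f"{i:06d}.png" for i in frames]  # hypothetical naming scheme
assert len(frames) == 300
```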
SkyData Version 1 contains annotations for object detection/segmentation and object tracking.
These annotations are split into training and test/challenge sets.
Training annotations are shared, but test/challenge annotations are withheld for benchmarking purposes.
Annotations are in JSON format, and the details are explained on this page.
SkyData Version 1 also contains unlabeled image data that can be used for unsupervised learning.
Note
The dataset is not currently available due to an ongoing review (publication) process. The data files and their corresponding annotations will be released once the publication process is completed.
Overview
Currently, SkyData (the dataset proposed in this paper) consists of 9 sequences of frames, and the total number of frames in each sequence varies between 200 and 1650. SkyData has 9 classes in total: people, bicycle, motor, pickup, car, van, truck, bus, and boat. The dataset was collected via an on-board UAV camera in a public area with dense pedestrian and vehicle traffic, from different angles and at different altitudes, during the Covid pandemic (therefore most people are wearing face masks). There are 7 scenes, and each scene contains multiple sequences.
| Sequence Name | Duration (mm:ss.ss) | Altitude | Total Frames | Labeled Frames | Annotation Rate |
|---|---|---|---|---|---|
| Scene1 | 08:08.89 | N/A | 14652 | 300 | 2% |
| Scene2 | 04:58.23 | 119 m | 8933 | 4000 | 45% |
| Scene3 | 05:47.61 | 69 m | 10413 | 2960 | 28% |
| Scene4 | 05:02.00 | 69 m | 9051 | 1450 | 16% |
| Scene5 | 00:54.58 | 48 m | 1650 | 1650 | 100% |
| Scene6 | 05:10.01 | N/A | 9289 | 0 | 0% |
| Scene7 | 02:03.42 | 48 m | 3693 | 0 | 0% |
| Dataset | Images | FPS | Max Resolution | Categories | Sequences | Tracks | Labels | Density | Segmentation |
|---|---|---|---|---|---|---|---|---|---|
| SkyDataV1 | 10360 | 30 | 1920x1080 | 9 | 9 | 5447 | 3695245 | 356.68 | Yes |
| VisDrone-MOT [1] | 33682 | 24 | 1360x765 | 13 | 79 | 10689 | 1530288 | 45.46 | No |
| MOTS [2] | 2862 | 30 | 1920x1080 | 3 | 8 | 228 | 26892 | 9.40 | Yes |
| KAIST [3] | 95324 | 30 | 512x640 | 3 | 41 | 0 | 108132 | 1.34 | No |
| VisDrone-DET [1] | 7019 | N/A | 1360x765 | 10 | N/A | 0 | 381964 | 54.41 | No |
| DOTA [4] | 1869 | N/A | 12029x5014 | 15 | N/A | 0 | 127698 | 68.32 | No |
| VHR-10 [5] | 650 | N/A | 1920x1080 | 10 | N/A | 0 | 3921 | 6.03 | Yes |
| VEDAI [6] | 1268 | N/A | 1024x1024 | 10 | N/A | 0 | 10210 | 8.05 | No |
Segmentation masks on sample frames from the dataset are shown below.
Folder Structure
The expected folder structure and file sizes for SkyDataV1 are shown below.
A table and a figure for the number of labels and tracks per class in the training data are given below.
| Class ID | Class Name | Number of Labels in Training Data | Number of Tracks in Training Data |
|---|---|---|---|
| 0 | People | 1,918,679 | 3,065 |
| 1 | Bicycle | 552 | 1 |
| 2 | Motor | 87,312 | 79 |
| 3 | Pickup | 10,665 | 8 |
| 4 | Car | 594,261 | 622 |
| 5 | Van | 140,363 | 100 |
| 6 | Truck | 18,618 | 17 |
| 7 | Bus | 44,098 | 47 |
| 8 | Boat | 4,211 | 4 |
| TOTAL | | 2,818,759 | 3,943 |
In SkyData, most of the objects are small. The distribution of object sizes is given in the table below. As in the COCO dataset [7], area is measured as the number of pixels in the segmentation mask (segmentation area).
In COCO, an object is considered small if area < 32², medium if 32² ≤ area < 96², and large if area ≥ 96².
In addition to the COCO categories, we split small objects into micro (area < 12²), tiny (12² ≤ area < 22²), and small (22² ≤ area < 32²) due to the large number of small objects in SkyData. A sketch implementing these bins is given after the table.
| Object Size | Number of Labels in Training Data | Percentage of Labels in Training Data |
|---|---|---|
| Micro (area < 12²) | 1,669,426 | 59.23% |
| Tiny (12² ≤ area < 22²) | 770,670 | 27.34% |
| Small (22² ≤ area < 32²) | 212,684 | 7.55% |
| Medium (32² ≤ area < 96²) | 154,069 | 5.47% |
| Large (area ≥ 96²) | 11,910 | 0.42% |
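As referenced above, here is a minimal sketch of the size bins; the thresholds come directly from the table, while the function name is only illustrative.

```python
def size_bin(area: float) -> str:
    """Map a segmentation area (pixel count) to a SkyData size bin."""
    if area < 12 ** 2:    # 144
        return "micro"
    if area < 22 ** 2:    # 484
        return "tiny"
    if area < 32 ** 2:    # 1024 (COCO's small/medium boundary)
        return "small"
    if area < 96 ** 2:    # 9216 (COCO's medium/large boundary)
        return "medium"
    return "large"

print(size_bin(100))    # micro
print(size_bin(1000))   # small
print(size_bin(5000))   # medium
print(size_bin(10000))  # large
```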
The comparison between SkyDataV1 and COCO 2017 [7] in terms of the number of labels and object sizes is shown in the figure below.
Data Format
The SkyData format is similar to the COCO format [7]; the format details for detection and tracking are given below.
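Until the format details are published, here is a minimal sketch of reading a COCO-like annotation file with the standard library; the file name and the "track_id" key are assumptions, and any field beyond the basic COCO schema ("images", "annotations", "categories") may differ in the final release.

```python
import json
from collections import Counter

# Hypothetical file name; use the actual training annotation file once released.
with open("skydata_train.json") as f:
    data = json.load(f)

# Map category ids to class names (standard COCO layout).
id_to_name = {c["id"]: c["name"] for c in data["categories"]}

# Count labels per class, as in the per-class table above.
labels_per_class = Counter(id_to_name[a["category_id"]] for a in data["annotations"])
print(labels_per_class.most_common())

# If tracking annotations carry a per-object track id (assumed key: "track_id"),
# distinct tracks can be counted as (class, track id) pairs.
tracks = {(a["category_id"], a["track_id"]) for a in data["annotations"] if "track_id" in a}
print(len(tracks))
```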
[1] Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–1 (2021). https://doi.org/10.1109/TPAMI.2021.3119563
[2] Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: MOTS: Multi-object tracking and segmentation. arXiv:1902.03604 [cs] (2019), http://arxiv.org/abs/1902.03604
[3] Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.S.: Multispectral pedestrian detection: Benchmark dataset and baselines. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
[4] Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
[5] Cheng, G., Zhou, P., Han, J.: Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54(12), 7405–7415 (2016)
[6] Razakarivony, S., Jurie, F.: Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation 34, 187–203 (2016)
[7] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014)