Dataset

Data and annotations
SkyData Version 1
DOWNLOADABLE LABELED DATA (.png)

SCENES                                                                PARTS                        DOWNLOADS
SCENE NAME  TOTAL FRAMES  TRAINING FRAME INDICES  TEST FRAME INDICES  FOLDER NAME   FRAME INDICES
                          ([from, to])            ([from, to])                      ([from, to])
Scene1      14652         [2000, 2299]            N/A                 scene1_part1  [2000, 2299]   Train  N/A
Scene2      8933          [1, 3000]               [3001, 4000]        scene2_part1  [1, 2000]      Train  N/A
                                                                      scene2_part2  [2001, 4000]   Train  Test
Scene3      10413         [1, 2470]               [4001, 4490]        scene3_part1  [1, 2500]      Train  N/A
                                                                      scene3_part2  [4001, 4500]   N/A    Test
Scene4      9051          [1, 560], [1000, 1689]  [2000, 2199]        scene4_part1  [1, 1000]      Train  N/A
                                                                      scene4_part2  [1001, 2000]   Train  N/A
                                                                      scene4_part3  [2001, 3000]   N/A    Test
Scene5      1650          [1, 1065]               [1066, 1650]        scene5_part1  [1, 1650]      Train  Test
Scene6      9289          N/A                     N/A                 N/A           N/A            N/A    N/A
Scene7      3693          N/A                     N/A                 N/A           N/A            N/A    N/A

ANNOTATIONS (.json)

OBJECT DETECTION / SEGMENTATION       OBJECT TRACKING
TRAINING   TEST / CHALLENGE           TRAINING   TEST / CHALLENGE
train_DET  test_DET (NOT RELEASED)    train_VID  test_VID (NOT RELEASED)

DOWNLOADABLE UNLABELED DATA (.png)

TO BE ADDED

The table above summarizes SkyData Version 1 and its downloadable files. The downloads are divided into three parts:

  • Labeled data (.png)
  • Annotations (.json)
  • Unlabeled data (.png)
Labeled data contains the images for object detection/segmentation and object tracking. The images are frames from multiple video recordings (Scene1 to Scene7). Each scene is divided into parts, and the table shows how scenes, parts and frames relate, including the frame indices of each part. Training and test images are identified by their frame indices within each scene. Each column of the labeled data table is explained below:
  • SCENE NAME:
    SkyData images come from seven different videos, which are named Scene1 to Scene7. This column shows the name of each scene.

  • TOTAL FRAMES:
    This column shows the total number of frames in each scene.

  • TRAINING FRAME INDICES:
    This column shows which frames of each scene are used for training data. The format is [from, to] and limits are included.

  • TEST FRAME INDICES:
    This column shows which frames of each scene are used for test data. The format is [from, to] and limits are included.

  • FOLDER NAME:
    Each scene is divided into parts in SkyData. This column shows the names of the parts. Because the images of each part are stored in a folder named after that part, this column is called FOLDER NAME.

  • FRAME INDICES:
    This column shows which frames of each scene are in that part. The format is [from, to] and limits are included.

  • DOWNLOADS:
    Because SkyData is large, the image data can be downloaded in multiple parts instead of all at once. The downloads are separated by part name and by train/test split. The expected folder structure is also provided on this page.
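
The inclusive [from, to] ranges in the table can be turned into a small lookup that tells which split a given frame belongs to. The following is only a sketch: the ranges are copied from the table above, while the names `TRAIN_RANGES`, `TEST_RANGES`, `split_of` and `count_frames` are hypothetical helpers and not part of any SkyData tooling.

```python
# Inclusive [from, to] frame ranges, copied from the labeled data table.
TRAIN_RANGES = {
    "Scene1": [(2000, 2299)],
    "Scene2": [(1, 3000)],
    "Scene3": [(1, 2470)],
    "Scene4": [(1, 560), (1000, 1689)],
    "Scene5": [(1, 1065)],
}
TEST_RANGES = {
    "Scene2": [(3001, 4000)],
    "Scene3": [(4001, 4490)],
    "Scene4": [(2000, 2199)],
    "Scene5": [(1066, 1650)],
}


def in_ranges(frame, ranges):
    """True if frame lies in any range; both limits are included."""
    return any(lo <= frame <= hi for lo, hi in ranges)


def split_of(scene, frame):
    """Return 'train', 'test', or None for an unlabeled frame."""
    if in_ranges(frame, TRAIN_RANGES.get(scene, [])):
        return "train"
    if in_ranges(frame, TEST_RANGES.get(scene, [])):
        return "test"
    return None


def count_frames(ranges_by_scene):
    """Total number of frames covered by a dict of inclusive ranges."""
    return sum(hi - lo + 1
               for ranges in ranges_by_scene.values()
               for lo, hi in ranges)


print(split_of("Scene2", 3000), split_of("Scene2", 3001))
print(count_frames(TRAIN_RANGES), count_frames(TEST_RANGES))
```

Because the limits are inclusive, each range [from, to] covers to - from + 1 frames, so the totals can be compared against the image counts listed in the folder structure.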

SkyData Version 1 contains annotations for object detection/segmentation and object tracking. These annotations are split into training and test/challenge sets. Training annotations are shared, but test/challenge annotations are kept hidden for benchmarking purposes. Annotations are in JSON format, and the details are explained on this page.

SkyData Version 1 also contains unlabeled image data that can be used for unsupervised learning.


Note

The dataset is not currently available because the associated publication is under review. The data files and their corresponding annotations will be released once the publication process is completed.

Overview

Currently, SkyData (our proposed dataset in this paper) consists of 9 sequences of frames, and the total number of frames in each sequence varies between 200 and 1650. SkyData has 9 classes in total: people, bicycle, motor, pickup, car, van, truck, bus and boat. The dataset was collected with an on-board UAV camera over a public area with dense pedestrian and vehicle traffic, from different angles and at different altitudes, during the Covid pandemic (therefore most people are wearing face masks). There are 7 scenes, and each scene contains multiple sequences.


Sequence Name  Duration  Altitude  Total Frames  Labeled Frames  Annotation Rate
Scene1         08:08.89  N/A       14652         300             2%
Scene2         04:58.23  119m      8933          4000            45%
Scene3         05:47.61  69m       10413         2960            28%
Scene4         05:02.00  69m       9051          1450            16%
Scene5         00:54.58  48m       1650          1650            100%
Scene6         05:10.01  N/A       9289          0               0%
Scene7         02:03.42  48m       3693          0               0%

Dataset           Images  FPS  Max Resolution  Category  Sequences  Tracks  Labels   Density  Segmentation
SkyDataV1         10360   30   1920x1080       9         9          5447    3695245  356.68   Yes
VisDrone-MOT [1]  33682   24   1360x765        13        79         10689   1530288  45.46    No
MOTS [2]          2862    30   1920x1080       3         8          228     26892    9.40     Yes
KAIST [3]         95324   30   512x640         3         41         0       108132   1.34     No
VisDrone-DET [1]  7019    N/A  1360x765        10        N/A        0       381964   54.41    No
DOTA [4]          1869    N/A  12029x5014      15        N/A        0       127698   68.32    No
VHR-10 [5]        650     N/A  1920x1080       10        N/A        0       3921     6.03     Yes
VEDAI [6]         1268    N/A  1024x1024       10        N/A        0       10210    8.05     No
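
The two derived columns in the tables above can be reproduced with simple arithmetic. The sketch below assumes that Annotation Rate is the number of labeled frames divided by the total number of frames, and that Density is the number of labels divided by the number of images. Neither definition is stated on this page; they are inferred from the numbers and reproduce the SkyDataV1 values. The function names are hypothetical.

```python
def annotation_rate(labeled_frames, total_frames):
    """Percentage of frames in a scene that carry labels (assumed definition)."""
    return 100.0 * labeled_frames / total_frames


def density(labels, images):
    """Average number of labels per image (assumed definition)."""
    return labels / images


# (labeled frames, total frames) per scene, copied from the scene table.
scenes = {
    "Scene1": (300, 14652),
    "Scene2": (4000, 8933),
    "Scene3": (2960, 10413),
    "Scene4": (1450, 9051),
    "Scene5": (1650, 1650),
}

for name, (labeled, total) in scenes.items():
    print(name, f"{annotation_rate(labeled, total):.0f}%")

# SkyDataV1 row of the comparison table: 3695245 labels over 10360 images.
print(f"{density(3695245, 10360):.2f}")
```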

Sample segmentation masks on sample frames from the dataset are shown below.


Folder Structure

The expected folder structure and file sizes for SkyDataV1 are shown below.

 SkyDataV1/                 (2 annotations, 2 folders, 42.0 GB)
     ├── test/                      (2275 images, 4 folders, 8.82 GB)
     │   ├── scene2_part2/              (1000 images, 3.89 GB)
     │   ├── scene3_part2/              (490 images, 2.03 GB)
     │   ├── scene4_part3/              (200 images, 690 MB)
     │   └── scene5_part1/              (585 images, 2.21 GB)
     ├── train/                     (8085 images, 7 folders, 30.0 GB)
     │   ├── scene1_part1/              (300 images, 1.08 GB)
     │   ├── scene2_part1/              (2000 images, 7.69 GB)
     │   ├── scene2_part2/              (1000 images, 3.87 GB)
     │   ├── scene3_part1/              (2470 images, 10.8 GB)
     │   ├── scene4_part1/              (560 images, 1.21 GB)
     │   ├── scene4_part2/              (690 images, 1.5 GB)
     │   └── scene5_part1/              (1065 images, 3.89 GB)
     ├── train_DET.json             (1.47 GB)
     └── train_VID.json             (1.68 GB)
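
After downloading, the layout above can be checked with a short script. This is a minimal sketch, not an official tool: the image counts are copied from the tree above, and it assumes that the .png files sit directly inside each part folder. The names `EXPECTED_IMAGE_COUNTS`, `ANNOTATION_FILES` and `check_layout` are hypothetical.

```python
from pathlib import Path

# Expected number of images per folder, copied from the tree above.
EXPECTED_IMAGE_COUNTS = {
    "test/scene2_part2": 1000,
    "test/scene3_part2": 490,
    "test/scene4_part3": 200,
    "test/scene5_part1": 585,
    "train/scene1_part1": 300,
    "train/scene2_part1": 2000,
    "train/scene2_part2": 1000,
    "train/scene3_part1": 2470,
    "train/scene4_part1": 560,
    "train/scene4_part2": 690,
    "train/scene5_part1": 1065,
}
ANNOTATION_FILES = ["train_DET.json", "train_VID.json"]


def check_layout(root):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(root)
    problems = []
    for rel, expected in EXPECTED_IMAGE_COUNTS.items():
        folder = root / rel
        if not folder.is_dir():
            problems.append(f"missing folder: {rel}")
            continue
        # Assumption: images are stored directly in the part folder as .png.
        found = sum(1 for _ in folder.glob("*.png"))
        if found != expected:
            problems.append(f"{rel}: expected {expected} images, found {found}")
    for name in ANNOTATION_FILES:
        if not (root / name).is_file():
            problems.append(f"missing annotation file: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_layout("SkyDataV1"):
        print(problem)
```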


The table and figure below give the number of labels and tracks per class in the training data.

Class ID  Class Name  Number of Labels in Training Data  Number of Tracks in Training Data
0         People      1,918,679                          3,065
1         Bicycle     552                                1
2         Motor       87,312                             79
3         Pickup      10,665                             8
4         Car         594,261                            622
5         Van         140,363                            100
6         Truck       18,618                             17
7         Bus         44,098                             47
8         Boat        4,211                              4
TOTAL                 2,818,759                          3,943

In SkyData, most of the objects are small. The distribution of object sizes is given in the table below. As in the COCO dataset [7], area is measured as the number of pixels in the segmentation mask (segmentation area). In COCO, an object is considered small if area < 32^2, medium if 32^2 < area < 96^2 and large if area > 96^2. Because SkyData contains a very large number of small objects, we further split the small category into micro (area < 12^2), tiny (12^2 < area < 22^2) and small (22^2 < area < 32^2).

Object Size                  Number of Labels in Training Data  Percentage of Labels in Training Data
Micro (area < 12^2)          1,669,426                          59.23%
Tiny (12^2 < area < 22^2)    770,670                            27.34%
Small (22^2 < area < 32^2)   212,684                            7.55%
Medium (32^2 < area < 96^2)  154,069                            5.47%
Large (area > 96^2)          11,910                             0.42%
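
The size categories can be expressed as a small function that maps a segmentation area to its bin. This is only a sketch: the thresholds (12^2, 22^2, 32^2 and 96^2 pixels) come from the text above, but the text uses strict inequalities on both sides, so the handling of an area exactly on a threshold is an assumption here (it is placed in the larger bin). The names `SIZE_BINS` and `size_category` are hypothetical.

```python
# (category name, exclusive upper bound on area in pixels)
SIZE_BINS = [
    ("micro", 12 ** 2),
    ("tiny", 22 ** 2),
    ("small", 32 ** 2),
    ("medium", 96 ** 2),
]


def size_category(area):
    """Map a segmentation area (in pixels) to a size category.

    Assumption: an area exactly equal to a threshold goes to the larger bin.
    """
    for name, upper in SIZE_BINS:
        if area < upper:
            return name
    return "large"


for area in (100, 200, 1000, 5000, 10000):
    print(area, size_category(area))
```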

You can see the comparison between SkyDataV1 and COCO 2017 for the number of labels and object size in the figure below.

Data Format

The SkyData format is similar to the COCO format. The format details for detection and tracking are given below.


Data Format For Detection

 {           
        'info'          : info, 
        'licenses'      : [license], 
        'categories'    : [category], 
        'images'        : [image], 
        'annotations'   : [annotation] 
     } 
    
     info{           
        'description'   : NoneType, 
        'url'           : NoneType, 
        'version'       : NoneType, 
        'year'          : int, 
        'contributor'   : NoneType, 
        'date_created'  : str 
     } 
    
     license{           
        'url'           : NoneType, 
        'id'            : int, 
        'name'          : NoneType 
     } 
    
     category{           
        'supercategory' : NoneType, 
        'id'            : int,
        'name'          : str 
     } 
    
     image{           
        'license'       : int, 
        'url'           : NoneType, 
        'file_name'     : str, 
        'height'        : int, 
        'width'         : int, 
        'date_captured' : NoneType, 
        'id'            : int 
     } 
    
     annotation{           
        'segmentation'  : [polygon], 
        'iscrowd'       : int (0 or 1), 
        'image_id'      : int, 
        'category_id'   : int, 
        'id'            : int, 
        'bbox'          : [x,y,width,height], 
        'area'          : float, 
        'track_id'      : str, 
        'frame_id'      : int 
     }  
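
A detection annotation file in this format could be read and indexed roughly as follows. This is a sketch under stated assumptions: the field names come from the schema above, but the toy record values are made up for illustration, and the bounding box origin (x, y) is assumed to be the top-left corner as in COCO, which this page does not confirm. The function names are hypothetical. Note that `json.load` reads the whole file into memory, which may matter for a file of the size of train_DET.json.

```python
import json
from collections import defaultdict


def load_annotations(path):
    """Read a detection annotation file such as train_DET.json."""
    with open(path, "r") as f:
        return json.load(f)


def annotations_by_image(dataset):
    """Group annotation records by the image they belong to."""
    grouped = defaultdict(list)
    for ann in dataset["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return grouped


def bbox_to_corners(bbox):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max).

    Assumption: (x, y) is the top-left corner, as in COCO.
    """
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


# Toy data following the schema; all values are invented for illustration.
toy = {
    "categories": [{"supercategory": None, "id": 4, "name": "Car"}],
    "images": [{"license": 1, "url": None, "file_name": "example.png",
                "height": 1080, "width": 1920, "date_captured": None,
                "id": 1}],
    "annotations": [{"segmentation": [[10.0, 20.0, 40.0, 20.0,
                                       40.0, 35.0, 10.0, 35.0]],
                     "iscrowd": 0, "image_id": 1, "category_id": 4,
                     "id": 1, "bbox": [10.0, 20.0, 30.0, 15.0],
                     "area": 450.0, "track_id": "7", "frame_id": 1}],
}

for image_id, anns in annotations_by_image(toy).items():
    for ann in anns:
        print(image_id, ann["category_id"], bbox_to_corners(ann["bbox"]))
```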


Data Format For Tracking

 {           
        'info'          : info, 
        'licenses'      : [license], 
        'categories'    : [category], 
        'images'        : [image], 
        'annotations'   : [annotation],
        'videos'        : [video]
     } 
    
     info{           
        'description'   : NoneType, 
        'url'           : NoneType, 
        'version'       : NoneType, 
        'year'          : int, 
        'contributor'   : NoneType, 
        'date_created'  : str 
     } 
    
     license{           
        'url'           : NoneType, 
        'id'            : int, 
        'name'          : NoneType 
     } 
    
     category{           
        'supercategory' : NoneType, 
        'id'            : int,
        'name'          : str 
     } 
    
     image{           
        'license'       : int, 
        'url'           : NoneType, 
        'file_name'     : str, 
        'height'        : int, 
        'width'         : int, 
        'date_captured' : NoneType, 
        'id'            : int,
        'video_id'      : int,
        'frame_id'      : int,
        'mot_frame_id'  : int
     } 
    
     annotation{           
        'segmentation'  : [polygon], 
        'iscrowd'       : int (0 or 1), 
        'image_id'      : int, 
        'category_id'   : int, 
        'id'            : int, 
        'bbox'          : [x,y,width,height], 
        'area'          : float, 
        'track_id'      : str, 
        'frame_id'      : int,
        'mot_conf'      : float,
        'visibility'    : float,
        'mot_class_id'  : int,
        'instance_id'   : int
     }
     
     video{           
        'id'            : int,
        'name'          : str,
        'fps'           : int,
        'width'         : int,
        'height'        : int
     }
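
For tracking, the per-frame annotations can be grouped into trajectories using `track_id`. The sketch below is one possible way to do this, not the authors' tooling. It keys tracks by (`video_id`, `track_id`) because this page does not say whether `track_id` values are unique across videos; that choice is an assumption. The toy records are invented, and `build_tracks` is a hypothetical name.

```python
from collections import defaultdict


def build_tracks(dataset):
    """Map (video_id, track_id) to a list of (frame_id, bbox) sorted by frame."""
    # Each image record carries the video it was extracted from.
    video_of_image = {img["id"]: img["video_id"] for img in dataset["images"]}
    tracks = defaultdict(list)
    for ann in dataset["annotations"]:
        key = (video_of_image[ann["image_id"]], ann["track_id"])
        tracks[key].append((ann["frame_id"], ann["bbox"]))
    for observations in tracks.values():
        observations.sort(key=lambda item: item[0])
    return dict(tracks)


# Toy data using only the fields needed here; values are invented.
toy = {
    "videos": [{"id": 1, "name": "example_video", "fps": 30,
                "width": 1920, "height": 1080}],
    "images": [
        {"id": 10, "video_id": 1, "frame_id": 1, "file_name": "a.png"},
        {"id": 11, "video_id": 1, "frame_id": 2, "file_name": "b.png"},
    ],
    "annotations": [
        {"id": 2, "image_id": 11, "track_id": "7", "frame_id": 2,
         "bbox": [12.0, 20.0, 30.0, 15.0]},
        {"id": 1, "image_id": 10, "track_id": "7", "frame_id": 1,
         "bbox": [10.0, 20.0, 30.0, 15.0]},
    ],
}

for key, observations in build_tracks(toy).items():
    print(key, [frame for frame, _ in observations])
```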

[1] Zhu, P., Wen, L., Du, D., Bian, X., Fan, H., Hu, Q., Ling, H.: Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence pp. 1–1 (2021). https://doi.org/10.1109/TPAMI.2021.3119563

[2] Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: MOTS: Multi-object tracking and segmentation. arXiv:1902.03604 [cs] (2019), http://arxiv.org/abs/1902.03604

[3] Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.S.: Multispectral pedestrian detection: Benchmark dataset and baselines. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

[4] Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)

[5] Cheng, G., Zhou, P., Han, J.: Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54(12), 7405–7415 (2016)

[6] Razakarivony, S., Jurie, F.: Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation 34, 187–203 (2016)

[7] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014)