Cityscapes. Urban scenes segmentation

    • Welcome!

      The Cityscapes Dataset focuses on semantic understanding of urban street scenes. In the following, we give an overview of the design choices that were made to target the dataset’s focus.


      Type of annotations

      • Semantic
      • Instance-wise
      • Dense pixel annotations



      • 50 cities
      • Several months (spring, summer, fall)
      • Daytime
      • Good/medium weather conditions
      • Manually selected frames
        • Large number of dynamic objects
        • Varying scene layout
        • Varying background


      • 5 000 annotated images with fine annotations
      • 20 000 annotated images with coarse annotations

      Benchmark suite and evaluation server

      • Pixel-level semantic labeling
      • Instance-level semantic labeling
    • Evaluation Criteria

      Pixel-Level Semantic Labeling Task

      The first Cityscapes task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information.


      To assess performance, we rely on the standard Jaccard Index, commonly known as the PASCAL VOC intersection-over-union metric IoU = TP / (TP + FP + FN) [1], where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. Owing to the two semantic granularities, i.e. classes and categories, we report two separate mean performance scores: IoU_category and IoU_class. In either case, pixels labeled as void do not contribute to the score.
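      The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script: the void label id 255, the flat per-pixel label lists, and the function names `accumulate`, `iou_per_class`, and `mean_iou` are all assumptions made for this example.

```python
from collections import Counter

VOID = 255  # assumed void label id (hypothetical, chosen for this sketch)

def accumulate(counts, gt_pixels, pred_pixels, void_label=VOID):
    """Add the TP/FP/FN pixel counts of one image to the running totals."""
    for g, p in zip(gt_pixels, pred_pixels):
        if g == void_label:
            continue  # void pixels do not contribute to the score
        if g == p:
            counts["tp", g] += 1
        else:
            counts["fn", g] += 1  # ground truth class g was missed
            counts["fp", p] += 1  # class p was predicted where it is absent
    return counts

def iou_per_class(counts, classes):
    """IoU = TP / (TP + FP + FN) for every class that occurs at all."""
    scores = {}
    for c in classes:
        tp, fp, fn = counts["tp", c], counts["fp", c], counts["fn", c]
        denom = tp + fp + fn
        if denom:
            scores[c] = tp / denom
    return scores

def mean_iou(counts, classes):
    """Mean of the per-class scores (classes or categories, same formula)."""
    scores = iou_per_class(counts, classes)
    return sum(scores.values()) / len(scores) if scores else float("nan")
```

      Because the counts are determined over the whole test set, `accumulate` would be called once per image with the same `Counter`, and the mean would be taken only at the end.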

      It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes with their strong scale variation this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we additionally evaluate the semantic labeling using an instance-level intersection-over-union metric iIoU = iTP / (iTP + FP + iFN). Here iTP, FP, and iFN again denote the numbers of true positive, false positive, and false negative pixels, respectively. However, in contrast to the standard IoU measure, iTP and iFN are computed by weighting the contribution of each pixel by the ratio of the class's average instance size to the size of the respective ground truth instance. Note that, unlike in the instance-level task below, we assume that the methods only yield a standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and thus do not require normalization. The final scores, iIoU_category and iIoU_class, are obtained as the means for the two semantic granularities.
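      The weighting can be illustrated with a small sketch for a single class on a single image. The function name `iiou_for_class`, the flat per-pixel lists, the void id 255, and the caller-supplied `avg_size` (the class's average instance size in pixels) are assumptions for this example; a real evaluation would accumulate the three terms over the whole test set.

```python
from collections import Counter

def iiou_for_class(cls, gt_classes, gt_instances, pred_classes,
                   avg_size, void_label=255):
    """iIoU = iTP / (iTP + FP + iFN) for one class (illustrative sketch)."""
    # Size in pixels of every ground truth instance of this class.
    sizes = Counter(i for g, i in zip(gt_classes, gt_instances) if g == cls)
    itp = ifn = 0.0
    fp = 0
    for g, i, p in zip(gt_classes, gt_instances, pred_classes):
        if g == void_label:
            continue  # void pixels are ignored
        if g == cls:
            weight = avg_size / sizes[i]  # small instances weigh more per pixel
            if p == cls:
                itp += weight
            else:
                ifn += weight
        elif p == cls:
            fp += 1  # false positives belong to no instance: no weighting
    denom = itp + fp + ifn
    return itp / denom if denom else float("nan")
```

      For example, with one 4-pixel instance labeled correctly, one 1-pixel instance missed entirely, and one false positive pixel, the plain IoU is 4 / 6, while the weighted score with `avg_size=2.5` drops to 2.5 / 6 because the missed small instance counts as much as the large one.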

      Instance-Level Semantic Labeling Task

      In the second Cityscapes task we focus on simultaneously detecting objects and segmenting them. This is an extension to both traditional object detection, since per-instance segments must be provided, and pixel-level semantic labeling, since each instance is treated as a separate label. Therefore, algorithms are required to deliver a set of detections of traffic participants in the scene, each associated with a confidence score and a per-instance segmentation mask.


      To assess instance-level performance, we compute the average precision on the region level (AP [2]) for each class and average it across a range of overlap thresholds to avoid a bias towards a specific value. Specifically, we follow [3] and use 10 different overlaps ranging from 0.5 to 0.95 in steps of 0.05. The overlap is computed at the region level, making it equivalent to the IoU of a single instance. We penalize multiple predictions of the same ground truth instance as false positives. To obtain a single, easy-to-compare compound score, we report the mean average precision AP, obtained by also averaging over the class label set. As minor scores, we add AP_50% for an overlap value of 50 %, as well as AP_100m and AP_50m, where the evaluation is restricted to objects within 100 m and 50 m distance, respectively.
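      A rough sketch of the threshold averaging for one class follows. It is an illustration under stated assumptions, not the official evaluation code: masks are represented as sets of pixel indices, the names `mask_iou`, `average_precision`, and `ap_over_thresholds` are hypothetical, and the precision-recall curve is summarized with a simple non-interpolated AP, which may differ from the integration used by the benchmark. The distance-restricted variants are not covered.

```python
def mask_iou(a, b):
    """Region-level overlap of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def average_precision(preds, gts, threshold):
    """preds: list of (confidence, pixel_set); gts: list of pixel_set."""
    matched = set()
    hits = []
    # Visit predictions from most to least confident.
    for _, mask in sorted(preds, key=lambda p: p[0], reverse=True):
        best, best_iou = None, threshold
        for k, gt in enumerate(gts):
            iou = mask_iou(mask, gt)
            if k not in matched and iou >= best_iou:
                best, best_iou = k, iou
        if best is None:
            hits.append(False)  # no free ground truth: false positive
        else:
            matched.add(best)   # each ground truth can be matched only once
            hits.append(True)
    ap, tp = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank     # precision at this true positive
    return ap / len(gts) if gts else float("nan")

# 10 overlaps from 0.5 to 0.95 in steps of 0.05.
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

def ap_over_thresholds(preds, gts):
    """Average the per-threshold AP values for one class."""
    return sum(average_precision(preds, gts, t) for t in THRESHOLDS) / len(THRESHOLDS)
```

      A second prediction that overlaps an already matched ground truth instance finds no free match and is therefore counted as a false positive, which mirrors the duplicate penalty described above. The final score would additionally average `ap_over_thresholds` over all classes.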

    • Terms and Conditions

      Submissions for Phase 1 must be made before 2019-03-14 23:59:00 Moscow time, and submissions for Phase 2 before 2019-04-09 23:59:00 Moscow time. You may make up to 20 submissions per day and 200 in total.

    • This challenge relies on the Cityscapes dataset. The training data can be downloaded here.

      Here is an example of a code submission with a custom model written in TensorFlow. It contains three files:

      • metadata
      • .hdf5 pre-trained NN

      You do not need to change anything in metadata. The ingestion program on the server needs it to process your submission.

      The script takes the pre-trained model, predicts on the test data, which is given as input in metadata, and writes the predictions to the output.

      There are 500 images in the test dataset in total. You may train the model locally on a GPU, but to submit you need to switch it to CPU.
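      One common way to make a script run on CPU only is to hide the CUDA devices before the deep learning library is imported. The sketch below assumes a CUDA-based TensorFlow installation; the helper name `force_cpu` is hypothetical, and the example submission may handle this differently.

```python
import os

def force_cpu():
    """Hide all CUDA devices from libraries that are initialized afterwards."""
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

force_cpu()
# Import TensorFlow only after this point, so that it sees no GPU and
# places every operation on the CPU.
```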

      Submit the files through your repository on GitHub. Instructions for integrating GitHub with CodaLab are here.

      You will also likely need to look through the cityscapes-dataset repository for pre-processing scripts that will help you work with the data.
