Multimodal Brain Tumor Segmentation Challenge 2019: Evaluation
In this year's challenge, three reference standards are used for the three tasks of the challenge: 1) manual segmentation labels of tumor sub-regions, 2) clinical data of overall survival, and 3) uncertainty estimation for the predicted tumor sub-regions.
For the segmentation task, and for consistency with the configuration of the previous BraTS challenges, we will use the "Dice score" and the "Hausdorff distance (95%)". Expanding upon this evaluation scheme, since BraTS'17 we also use the metrics of "Sensitivity" and "Specificity", which make it possible to identify potential over- or under-segmentation of the tumor sub-regions by participating methods. Since the BraTS'12-'13 data are subsets of the BraTS'19 test data, we will also calculate performance on the '12-'13 data to allow for a comparison against the performances reported in the BraTS TMI reference paper.
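As a rough illustration of the overlap metrics named above, the following minimal sketch computes the Dice score, sensitivity, and specificity for one binary sub-region mask. It is not the challenge's evaluation code: the function names, the flattened-list mask representation, and the convention of returning 1.0 when a denominator is zero are all assumptions made for this example. The Hausdorff distance (95%) is left out because it additionally requires surface extraction and voxel spacing.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equal-length binary sequences."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(num, den, empty=1.0):
    # Assumed convention: an empty denominator counts as a perfect score.
    return num / den if den else empty


def dice_score(pred, truth):
    """Dice = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    return _ratio(2 * tp, 2 * tp + fp + fn)


def sensitivity(pred, truth):
    """Sensitivity = TP / (TP + FN); low values suggest under-segmentation."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return _ratio(tp, tp + fn)


def specificity(pred, truth):
    """Specificity = TN / (TN + FP); low values suggest over-segmentation."""
    _, fp, _, tn = confusion_counts(pred, truth)
    return _ratio(tn, tn + fp)


# Hypothetical flattened masks for a single tumor sub-region.
example_pred = [1, 1, 1, 0, 0, 0, 0, 0]
example_truth = [1, 1, 0, 1, 0, 0, 0, 0]
print(dice_score(example_pred, example_truth),
      sensitivity(example_pred, example_truth),
      specificity(example_pred, example_truth))
```

In practice each metric would be computed per tumor sub-region on the full 3D volume; the flat lists here only keep the example self-contained.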
For the task of survival prediction, two evaluation schemes are considered. First, for ranking the participating teams, evaluation will be based on the classification of subjects as long-survivors (e.g., >15 months), short-survivors (e.g., <10 months), and mid-survivors (e.g., between 10 and 15 months). Predictions of the participating teams will be assessed based on accuracy (i.e., the number of correctly classified patients) with respect to this grouping. Note that participants are expected to provide predicted survival status only for subjects with available age and a resection status of GTR (i.e., Gross Total Resection). Second, for post-challenge analyses, we will also compare both the mean and the median squared error of the survival time predictions.
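The two survival evaluation schemes can be sketched as follows. This is only an illustration under stated assumptions: the 10- and 15-month cut-offs are the example values quoted above, survival times are taken to be in months, accuracy is reported as a fraction of subjects, and all function names are hypothetical.

```python
from statistics import mean, median


def survival_class(months, short_below=10.0, long_above=15.0):
    """Map a survival time in months to 'short', 'mid', or 'long'."""
    if months < short_below:
        return "short"
    if months > long_above:
        return "long"
    return "mid"


def classification_accuracy(predicted, actual):
    """Fraction of subjects whose predicted class matches the actual class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        survival_class(p) == survival_class(a)
        for p, a in zip(predicted, actual)
    )
    return hits / len(actual)


def squared_error_summary(predicted, actual):
    """Return (mean squared error, median squared error)."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return mean(errors), median(errors)


# Hypothetical predicted and actual survival times in months.
example_predicted = [8.0, 12.0, 20.0, 16.0]
example_actual = [9.0, 14.0, 13.0, 18.0]
print(classification_accuracy(example_predicted, example_actual))
print(squared_error_summary(example_predicted, example_actual))
```

The median squared error is less sensitive than the mean to a few subjects with very large prediction errors, which is one reason to report both.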
For the task of estimating uncertainty, uncertain voxels will be filtered out at N predetermined uncertainty thresholds ("Thr"), and the model performance will be assessed based on the "Dice score" of the remaining voxels at each Thr. For example, Thr:75 implies that all voxels with uncertainty values >75 will be marked as uncertain, and the associated predictions will be filtered out and not considered in the subsequent Dice calculation. Dice values will only be calculated for the remaining predictions at the unfiltered voxels. This evaluation will reward approaches whose confidence is high for correct assertions (True Positives - TPs) and low for incorrect assertions (False Positives - FPs, and False Negatives - FNs). For these approaches, it is expected that as more uncertain voxels are filtered out, the Dice score on the remaining predictions will increase. A second evaluation will keep track of the ratio of TPs that are filtered out at each Thr, relative to the initial/baseline number of TPs (the TPs at threshold 100). This evaluation will essentially penalize approaches that filter out a large percentage of TP voxels in order to attain the reported Dice value, and thereby reward approaches with a lower percentage of uncertain TPs.
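The two uncertainty evaluations described above can be sketched as follows. This is a minimal illustration rather than the official scoring code: it assumes per-voxel uncertainty values on a 0-100 scale, uses 100, 75, 50, and 25 as example thresholds, returns a Dice of 1.0 when no foreground voxels remain, and uses hypothetical function names.

```python
def filtered_dice_and_tp(pred, truth, uncertainty, thr):
    """Dice and TP count over voxels whose uncertainty is <= thr."""
    tp = fp = fn = 0
    for p, t, u in zip(pred, truth, uncertainty):
        if u > thr:
            continue  # voxel marked as uncertain: prediction is filtered out
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: Dice is 1.0 when nothing remains to compare.
    dice = 2 * tp / denom if denom else 1.0
    return dice, tp


def uncertainty_evaluation(pred, truth, uncertainty,
                           thresholds=(100, 75, 50, 25)):
    """Map each Thr to (Dice, ratio of filtered TPs = 1 - TP_x/TP_baseline)."""
    _, tp_baseline = filtered_dice_and_tp(pred, truth, uncertainty, 100)
    results = {}
    for thr in thresholds:
        dice, tp = filtered_dice_and_tp(pred, truth, uncertainty, thr)
        ratio = 1 - tp / tp_baseline if tp_baseline else 0.0
        results[thr] = (dice, ratio)
    return results


# Hypothetical flattened prediction, ground truth, and uncertainty values.
example_pred = [1, 1, 1, 1, 0, 0, 0, 0]
example_truth = [1, 1, 1, 0, 1, 0, 0, 0]
example_unc = [5, 10, 60, 80, 90, 5, 5, 30]
for thr, (dice, ratio) in uncertainty_evaluation(
        example_pred, example_truth, example_unc).items():
    print(thr, round(dice, 3), round(ratio, 3))
```

In this toy case the false positive and the false negative carry high uncertainty, so Dice rises once they are filtered at Thr 75; lowering Thr further to 50 also discards a true positive, which shows up as a non-zero ratio of filtered TPs.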
Visual and quantitative examples are given in the figure and table below, where tightening the threshold (i.e., lowering the Thr) filters out voxels with incorrect assertions. This, in turn, leads to an increase in the Dice value for the remaining voxels. Example 2 yields a marginally better Dice value than example 1 at uncertainty thresholds (Thr) 50 and 25. However, the Ratio of Filtered TPs indicates that this comes at the expense of marking more TPs as uncertain.
Table columns: Dice score; Ratio of Filtered TP, computed as 1 - (TP_x / TP_baseline), where TP_x is the number of TPs remaining at threshold x.
For all tasks, we will announce a 2-week evaluation period (26 August-7 September), during which the participants will be able to request the date on which the test data is released to them. Note that each team should analyze the test data using their local computing infrastructure and submit their results 48 hours later through CBICA's Image Processing Portal (IPP).
Feel free to send any communication related to the BraTS challenge to email@example.com.