Dermofit 10-class – differences in ISBI and MLMI accuracy explained

I just got a great question asking why there is a discrepancy between the accuracies reported in our two works:

[ISBI paper, we report 81.8% accuracy over 10 classes]
Kawahara, J., BenTaieb, A., & Hamarneh, G. (2016). Deep features to classify skin lesions. In IEEE ISBI (pp. 1397–1400). Summary and slides here.

[MICCAI MLMI paper, we report 74.1% accuracy over 10 classes]
Kawahara, J., & Hamarneh, G. (2016). Multi-Resolution-Tract CNN with Hybrid Pretrained and Skin-Lesion Trained Layers. In MLMI. Summary and slides here.

We use the same Dermofit dataset in both papers, so it may seem surprising that the reported accuracies differ. So I thought I would elaborate on why here.

The reason for the discrepancy is that in the MLMI paper, we changed the experimental setup. The two experimental setups, and our reasons for the change, are as follows:

In ISBI, we used 3 folds, where 2/3 of the data was used to train a linear model and 1/3 to test. This was repeated across all 3 folds. We used this setup since we wanted to compare our accuracy with other works that used a similar setup. Also, in the ISBI work, we did not really tune many hyper-parameters, so we left out a separate validation set.
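To make the ISBI-style protocol concrete, here is a minimal sketch (not our actual code) of 3-fold cross-validation with a linear classifier trained on precomputed deep features; the `features` and `labels` arrays are hypothetical placeholders:

```python
# Sketch of the ISBI-style protocol: 3 folds, train a linear model on 2/3, test on 1/3.
# The arrays below are random placeholders standing in for extracted CNN features
# and the 10 Dermofit class labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
features = rng.normal(size=(1300, 4096))   # placeholder deep features (one row per image)
labels = rng.integers(0, 10, size=1300)    # placeholder labels for the 10 classes

fold_accuracies = []
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(features, labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[train_idx], labels[train_idx])                          # train on 2/3
    fold_accuracies.append(clf.score(features[test_idx], labels[test_idx]))  # test on 1/3

print("mean accuracy over the 3 folds:", np.mean(fold_accuracies))
```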

In the MLMI work, we wanted to tune many more hyper-parameters (e.g., how long to fine-tune, the number of nodes, etc.), so we split the data to have a separate validation set. Thus, in the MLMI work, we used 1/3 of the data to train a model, 1/3 as a validation set, and 1/3 to test. We then re-ran the ISBI method using this experimental setup. I think the main reason for the drop in accuracy is that we used half the amount of training data in the MLMI paper compared to the ISBI paper.
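In code, the MLMI-style split might look something like the following sketch; the hyper-parameter grid and the `features`/`labels` placeholders are hypothetical and only illustrate the train/validation/test protocol, not our actual settings:

```python
# Sketch of the MLMI-style protocol: 1/3 train, 1/3 validation (for hyper-parameter
# tuning), 1/3 test. The arrays and the tuned hyper-parameter (regularization C)
# are placeholders, not the actual settings from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1300, 4096))   # placeholder deep features
labels = rng.integers(0, 10, size=1300)    # placeholder labels for the 10 classes

# Hold out 1/3 for testing, then split the remaining 2/3 into equal train/validation halves.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=1/3, stratify=labels, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.5, stratify=y_trainval, random_state=0)

# Pick the hyper-parameter that does best on the validation set, then test once.
best_C = max([0.01, 0.1, 1.0, 10.0],
             key=lambda C: LogisticRegression(C=C, max_iter=1000)
                           .fit(X_train, y_train).score(X_val, y_val))
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final_clf.score(X_test, y_test))
```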

Another reason for a dip in accuracy is that in the MLMI results, we did not augment the feature vectors when we re-ran the ISBI method. The reason for this is that we wanted to keep the focus on how extracting multi-resolution information affected things, rather than adding data augmentation as a possible confounding factor.
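For context, one generic way to augment feature vectors (not necessarily what we did in the ISBI paper) is to also extract features from transformed copies of each training image and stack them with the originals; `extract_features` below is a hypothetical placeholder:

```python
# Purely illustrative sketch of feature-level augmentation: extract features from
# horizontally flipped copies of the training images and add them as extra rows.
# `extract_features` is a hypothetical function mapping images (N, H, W, C) to features.
import numpy as np

def augment_training_features(images, labels, extract_features):
    """Return features/labels for the original images plus their horizontal flips."""
    feats_orig = extract_features(images)
    feats_flip = extract_features(images[:, :, ::-1, :])    # flip along the width axis
    X = np.concatenate([feats_orig, feats_flip], axis=0)    # doubled training set
    y = np.concatenate([labels, labels], axis=0)
    return X, y
```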

I tried to highlight these subtle changes in the MLMI work with sentences like,
“note this experimental setup only uses half of the training images that [8] did”
“the experiments in rows a-i did not use data augmentation”

but it’s easy to miss, and the justifications were not made clear in the paper (mainly due to limited space).

Hopefully this helps explain this oddity in the results 🙂

Questions/comments? If you just want to say thanks, consider sharing this article or following me on Twitter!