Analysis of the ISIC image datasets: Usage, benchmarks and recommendations

The International Skin Imaging Collaboration (ISIC) datasets have become a leading repository for researchers in machine learning for medical image analysis, especially in the field of skin cancer detection and malignancy assessment. They contain tens of thousands of dermoscopic photographs together with gold-standard lesion diagnosis metadata. The associated yearly challenges have resulted in major contributions to the field, with papers reporting measures well in excess of human experts. Skin cancers can be divided into two major groups: melanoma and non-melanoma. Although less prevalent, melanoma is considered to be more serious as it can quickly spread to other organs if not treated at an early stage. In this paper, we summarise the usage of the ISIC dataset images and present an analysis of yearly releases over the period 2016 - 2020. Our analysis found a significant number of duplicate images, both within and between the datasets. Additionally, we noted duplicates spread across testing and training sets. Due to these irregularities, we propose a duplicate removal strategy and recommend a curated dataset for researchers to use when working on ISIC datasets.

Given that ISIC 2020 focused on melanoma classification, we conduct experiments to provide benchmark results on the ISIC 2020 test set, with additional analysis on the smaller ISIC 2017 test set. Testing was completed following the application of our duplicate removal strategy and an additional data balancing step. After removing 14,310 duplicate images from the training set, our benchmark results show good levels of melanoma prediction, with an AUC of 0.80 for the best performing model. As our aim was not to maximise network performance, we did not include additional steps in our experiments. Finally, we provide recommendations for future research by highlighting irregularities that may present research challenges. A list of image files with reference to the original ISIC dataset sources for the recommended curated training set will be shared on our GitHub repository (available at www.github.com/mmu-dermatology-research/isic_duplicate_removal_strategy).

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Skin cancer is the most common of all cancers, with more people being diagnosed with the condition each year than all other cancers combined. In the US, 9500 new cases are diagnosed every day (Skin Cancer Foundation, 2017). Melanoma, the deadliest form of skin cancer, is projected to reach almost half a million cases by 2040. This represents a 62% increase since 2018. One person dies from skin cancer every 4 minutes. The rise of skin cancer incidence is therefore seen by many dermatologists as a global epidemic (Melanoma UK, 2020). Early intervention for skin cancer, melanoma in particular, is essential to ensure high survival rates in the face of an ever growing number of cases (Thörn et al., 1994; Cormier et al., 2015). The main identifiable cause of skin cancer is excessive exposure to ultraviolet (UV) radiation (NHS, 2020a). This may be due to exposure to natural sunlight (Cancer Research UK), or to other UV sources such as indoor tanning devices (World Health Organization, 2017). Depleted ozone levels lead to a rise in ground-level UV radiation, which can increase the risk of exposure in natural sunlight (Department for Environment Food & Rural Affairs, 2020). There is also evidence of increased incidence of non-melanoma skin cancer in populations living in lower latitude regions where UV radiation levels are high (Henriksen et al., 1989). Other modifiable risk factors may include poor diet (Sarnoff and Gerome, 2017), alcohol consumption (Ruiz, 2018; American Institute for Cancer Research, 2018) and smoking (De Hertog et al., 2001).

Dermoscopy is a widely used imaging technique that enables the skin surface to be visualised by light amplification using immersion fluid (Kittler et al., 2002). However, its diagnostic accuracy is highly dependent on the experience of dermatologists (Brinker et al., 2019b; 2019c; Haenssle et al., 2018). Scarcity of expert resources in poorer countries can significantly impact timely treatment for skin cancers. Many of the publicly available statistics relating to skin cancer are thought to be underestimates, due to issues such as non-melanoma cases not being tracked by cancer registries, incomplete registrations due to successful treatment, or poorer countries not having cancer registries (American Institute for Cancer Research, 2018).

Due to the increasing demands that skin cancer cases place on global healthcare services, the need for remote automated diagnosis solutions is becoming increasingly important. This is particularly pertinent in poorer countries where patients do not have access to the latest medical equipment and the expertise required for accurate diagnosis. Skin lesion classification has become a popular field of research in recent years, following the growing adoption of deep learning techniques in medical image analysis. However, as the majority of the state-of-the-art solutions are data-driven, the reliability and consistency of open datasets are key factors for algorithm development. Therefore, in this paper we analyse images from the largest dermoscopic open datasets released over the past five years: the International Skin Imaging Collaboration (ISIC) datasets. The main contributions of this paper are as follows:

1. We analyse the usage of ISIC image datasets with a selection of well-cited research papers from the past 3–4 years and identify related issues.

2. We propose a duplicate removal strategy to curate the datasets. By removing the duplicate images (overlapping images between and within the test and training sets), we produce a cleaned (non-duplicate) dataset and a balanced dataset.

3. We benchmark the curated balanced training set using 19 state-of-the-art deep learning architectures for melanoma recognition. We evaluate the performance of our benchmark algorithms on the ISIC 2020 testing set (on Kaggle) for binary classification (melanoma and non-melanoma), with additional analysis on the ISIC 2017 testing set.

4. We provide recommendations for future research and share our research findings on our GitHub repository (available at www.github.com/mmu-dermatology-research/isic_duplicate_removal_strategy).

2. Related work

This section outlines the usage of dermoscopic datasets, focusing on the ISIC image datasets and issues relating to their use, including research discussing duplicate images, class imbalance, image resolution and label noise. A growing number of studies have demonstrated that CNNs are just as capable as human experts in the diagnosis of malignant and benign skin lesions, and in some cases are able to outperform them (Esteva et al., 2017b; 2017a; Brinker et al., 2019b; 2019a; Fujisawa et al., 2019; Pham et al., 2020; Jinnai et al., 2020). A shortage of experienced dermatologists in some countries, combined with high observer variability, presents an opportunity to develop solutions that address this problem.

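2.1. Usage of ISIC datasets

We conducted a survey on the usage of ISIC datasets for research purposes. As ISIC datasets are widely used, it is not possible to provide an exhaustive list. However, we have selected for analysis some of the well-cited papers from the past 3–4 years that show significant contributions to the field. Table 1 shows a summary of the 35 papers we surveyed. Of these publications, only 5 implemented some form of duplicate removal. The remaining 30 papers did not mention any form of duplicate removal. Furthermore, we observed that the majority of recent studies used multiple datasets: 13 papers indicated the use of a single ISIC dataset, while 22 papers indicated the use of multiple ISIC datasets. Given the large number of duplicates present in the ISIC datasets, experiments that use multiple ISIC datasets without some form of duplicate removal may exhibit bias in their results. We therefore suggest that an analysis be conducted to verify the existence of this bias.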

The usage of ISIC datasets is broad, with the majority of tasks focusing on classification and segmentation. The most popular research involves binary classification, as these challenges provide more images to train the algorithms. With the introduction of ISIC 2018 and ISIC 2019, researchers started to explore multi-class classification, the majority of which used the ISIC 2020 dataset. However, the ISIC 2020 challenge focused on melanoma detection, therefore further binary classification papers are expected. Segmentation tasks appear to be less popular than lesion diagnosis, as ISIC did not continue this challenge type beyond 2019. Only the ISIC 2016–2018 datasets provided delineated segmentation masks, and these are relatively few in number compared to the images found in the classification tasks. Other uses of ISIC datasets include a study of the effect of colour constancy (Ng et al., 2019) and data augmentation using generative adversarial networks (Kendrick et al., 2020).

2.2. Related issues

In this section we present the issues related to the usage of ISIC dataset images, supported by the findings of recent state-of-the-art research.

Image duplication and images exhibiting high similarity within datasets used to train CNNs may introduce unwanted bias in the resulting models. To address this problem, researchers have investigated methods of identifying visually similar images within large datasets. Hu et al. (2018) proposed the use of a deep constrained siamese hash coding network with binary constrained regularization to detect near-duplicate images. They tested their network on three datasets and demonstrated an additional load balancing method that was shown to further increase performance in terms of accuracy and speed. Zhang (2018) adopted a different approach to test for image similarity, implementing a deep CNN with a double-channel architecture. Such architectures might prove useful in the balancing of deep learning datasets, especially those where a high number of samples have been sourced from a low number of participants.

More recently, a number of researchers (Sucholutsky and Schonlau, 2020) have suggested that the number of unique features contributes more to network performance than simply increasing the number of input images. Although data augmentation is widely used in the majority of previous research (as shown in Table 1), it is unclear whether this helps to increase the number of unique features.

Brinker et al. (2018) and Gessert et al. (2020) showed that there was a clear benefit to network performance from the inclusion of patient metadata when training CNNs. Rezvantalab et al. (2018) conducted experiments on two public dermoscopy skin cancer datasets using four CNNs pretrained on ImageNet in the classification of eight skin lesion types, including melanoma. These networks were trained using the HAM10000 and PH2 datasets, the former comprising a large part of the ISIC datasets, namely ISIC 2018 - 2020. This work concluded

Table 1

A summary of a subset of research papers that use the ISIC image datasets (non-exhaustive list). ∗Multi-class classification divided into seven classes: (1) actinic keratosis, (2) basal cell carcinoma, (3) benign keratosis, (4) dermatofibroma, (5) melanocytic nevi, (6) melanoma, (7) vascular skin lesion. ∗∗Multi-class classification into eight classes: (1) melanoma, (2) melanocytic nevi, (3) basal cell carcinoma, (4) benign keratosis, (5) actinic keratosis and intraepithelial carcinoma, (6) dermatofibroma, (7) vascular lesions, (8) atypical nevi. ∗∗∗Keratinocyte carcinomas vs benign seborrheic keratosis, and malignant melanomas vs benign nevi. ∗∗∗∗Three classes: (1) melanoma, (2) nevi, and (3) benign.

Dupl. removal - authors mention the removal of duplicate images; DA - data augmentation; ISIC (unspecified) - year not stated in the paper.

Class imbalance has been shown to significantly impact model performance (Tschandl et al., 2019), with data augmentation being used in the training of CNNs as a means of addressing this problem. Hosny et al. (2019) conducted six classification experiments using AlexNet to achieve >95% accuracy. They performed two sets of experiments on three datasets: ISIC 2017, MED-NODE and the Dermatology Information System (DermIS). This work showed that various data augmentation techniques combined with adjustments to Softmax contributed to significant improvements in output measures for networks trained on all three datasets. This work was also notable for using a dataset with known low quality images, found in the DermIS dataset, which may have contributed to the robustness of the classification model.

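Le et al. (2020) devised an ensemble of ResNet50 networks that utilised class-weighting with a focal loss function to mitigate the inherent class imbalance in the HAM10000 dataset, which they used as training data. They experimented with a pre-processing stage in which lesions were segmented. However, this approach resulted in reduced accuracy, which suggests that the skin regions surrounding the lesion make an important contribution to the discriminative features learned by the CNN. Following the development of EfficientNet (Tan and Le, 2020), Gessert et al. (2020) found that EfficientNet models trained on high resolution images from the ISIC 2019 dataset improved network performance. This may be due to the scaling inherent in the EfficientNet architecture, where the width and depth of the model are scaled uniformly with the size of the input. They also found that using loss balancing to address class imbalance improved network performance.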

Hekler et al. (2020) investigated the effects of label noise on CNNs for skin cancer classification. This research noted that many skin cancer classification studies used non-biopsy-verified training images, and that such imperfect ground truth could introduce systematic error. They observed a correlation between models trained with diagnoses from several dermatologists and high quality results on a test set whose labels had been produced by dermatologists. They found that CNNs could identify the features that dermatologists also identified, but that the CNNs also learned sources of error in dermatological decisions. They also observed that if a CNN trained with majority decisions was tested on a biopsy-verified ground truth, there was a significant decrease in performance, with accuracy dropping from 75.03% to 64.24%. However, there were several limitations in this study, namely: (1) they used only 804 test and training images; (2) they tested only one deep learning architecture (ResNet50 pretrained on ImageNet); (3) all lesions assessed were biopsied, which are naturally more difficult to classify and therefore represent edge cases, with the authors noting that the introduction of simpler cases would likely increase network accuracy.

Table 2

Summary of the ISIC 2016 - 2020 datasets. Note that image counts do not include mask and superpixel images.

Dataset      Train     Test      Total
ISIC 2016    900       379       1279
ISIC 2017    2000      600       2600
ISIC 2018    10,015    1512      11,527
ISIC 2019    25,331    8238      33,569
ISIC 2020    33,126    10,982    44,108

Researchers in medical image analysis of skin cancer who use dermoscopic image datasets for the early detection of skin cancer and malignancy assessment are focused on developing new computer algorithms. However, issues inherent within the datasets used are often overlooked or under-researched. In the following section we analyse the largest and most widely used dermoscopic datasets, namely the ISIC datasets.

3. Datasets

The ISIC challenges have become a driving force for research into melanoma classification. They provide biopsy-proven digital high resolution skin lesion image datasets, with expert annotations and metadata from around the world. The aim is to promote research in the field, which will lead to the development of automated Computer Aided Diagnosis (CAD) solutions for the diagnosis of melanoma and other cancers. This community also organises yearly skin lesion challenges to attract wider participation of researchers, to improve the diagnostic performance of CAD algorithms, and to spread awareness of the growing problem that skin cancer represents (Codella et al., 2018b). Table 2 shows a summary of the number of images within the ISIC datasets (2016–2020). We note that the number of images has increased substantially every year since the first release.

The ISIC 2016 dataset (Gutman et al., 2016) contains 900 training images and 379 test images, a total of 1279 images. Ground truth data is provided for both the training and testing sets, indicating whether each lesion is malignant or benign. This dataset is of limited future use, since in clinical practice dermatologists usually identify the specific type of malignant or benign lesion. The ISIC 2017 dataset (Codella et al., 2017) contains 2000 training images and 600 test images, a total of 2600 images. Ground truth data and patient metadata are provided for both the training and testing sets, indicating whether the lesion belongs to one of four groups: (1) melanoma; (2) nevus or seborrheic keratosis; (3) seborrheic keratosis; or (4) melanoma or nevus. The approximate age and gender of the patient are also provided as additional metadata. A detailed breakdown of the class distribution for ISIC 2017 - 2020 is shown in Table 3.

In 2018, ISIC shared a more substantial dataset (Codella et al., 2018a; Tschandl, 2018) which contains 10,015 training images and 1512 test images, a total of 11,527 images. Ground truth data is provided for the training set only, which includes more detailed lesion type labels, including melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma and vascular lesions. In the following year, the ISIC 2019 dataset (Tschandl, 2018; Codella et al., 2017; Combalia et al., 2019) was released. This dataset contains 25,331 training images and 8238 test images, a total of 33,569 images. Similar to ISIC 2018, ground truth data is provided for the training set only, indicating the following classes: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, vascular lesions and squamous cell carcinoma. The testing set consists of 9 classes: the 8 classes of the training set plus an additional unknown class. Patient metadata is provided for both training and testing sets. The training metadata indicates the patient’s approximate age, anatomical site, lesion ID and gender. Lesion ID is specified for 23,247 images and unspecified for 2084 images, with 11,848 unique IDs from a total of 25,331 images. The testing metadata indicates the patient’s approximate age, anatomical site and gender. The ISIC 2019 dataset is also notable for including multiplets of single lesions, which show the same lesion at different zoom levels and may therefore provide important unique features at different levels of magnification.

Table 3

Class distribution within the ISIC 2017 - 2020 training sets. Note that all unknown cases for ISIC 2020 are diagnosed as benign.

In 2020, the largest ISIC dataset was released (Rotemberg et al., 2020), which contains 33,126 training images and 10,982 test images, a total of 44,108 images. Similar to the previous year, ground truth data is provided for the training set only, indicating patient ID, lesion ID, gender, approximate age, anatomical site, diagnosis (see Table 3) and benign or malignant status. Of the 33,126 images in the training set, there are 2056 unique patient IDs and 32,701 unique lesion IDs. This would suggest that a large number of lesion images have been sourced from a relatively small pool of patients at different intervals. The test set also includes patient metadata indicating patient ID, approximate age, anatomical site and gender.

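The ISIC datasets (2016–2020) are composed of 18 underlying sub-datasets. A summary of these sub-datasets is shown in Table 4. We obtained this data from the ISIC Archive gallery (ISIC, 2020). The first observation of overlapping images can be seen in the ISIC 2018 - 2020 datasets, which include the HAM10000 dataset, comprising 10,015 training images and 1511 test images, a total of 11,526 images. Additionally, the ISIC 2019 and 2020 datasets include the BCN20000 dataset, a total of 19,424 images, which includes lesions found in hard-to-diagnose locations such as nails and mucosa. Note that we excluded segmentation masks and superpixel images from our experiments and analysis. As shown in Table 3, the ISIC training sets for 2017 - 2020 contain a total of 16 classes and 70,472 images. We note that although the total number of images increased from 2019 to 2020, the datasets remain imbalanced, with actinic keratosis, dermatofibroma, vascular lesions and squamous cell carcinoma under-represented. We also note a significant reduction in the number of melanoma cases between the 2019 and 2020 training sets, and a large number of unknown cases in 2020. To analyse and compare these datasets, we downloaded the ISIC datasets from 2017 to 2020. The following section describes the methods we used to analyse them.

Table 4

Composition of the ISIC datasets (2016–2020)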


4. Method

This section details the following: (1) the implementation of a proposed duplicate removal strategy to address class imbalance within and across the ISIC datasets; (2) the curation of a new cleaned and balanced dataset (henceforth the curated dataset), produced by applying the proposed duplicate removal strategy to images from the ISIC 2017 - 2020 datasets (ISIC 2016 was excluded due to missing labels of the type melanoma and non-melanoma); and (3) the training of a selection of the most widely used pretrained deep CNNs on our curated dataset, together with the benchmark results.


4.1. Duplicate removal strategy

As an initial preprocessing stage, we removed all 2000 superpixel images contained in the ISIC 2017 training dataset and all 600 superpixel images contained in the ISIC 2017 test dataset. A summary of all datasets following the removal of all superpixel image files is shown in Table 5. Task number refers to the task number category on the ISIC dataset website, as some datasets are split into tasks for each year. We only used datasets from classification tasks, where comma-separated value (CSV) ground truth labelling was available for the corresponding training set. Table 6 shows a summary of the binary identical image files occurring within each dataset. The total number of binary identical duplicate image files found across all training sets is 12,039. This includes duplicates found within individual training sets and across training sets. The total number of binary identical duplicate image files across all testing sets is 1592, which includes duplicates found within individual testing sets and across testing sets. The total number of binary identical image files across all training and testing sets is 13,976, which includes duplicates found within individual training and testing sets, and across training and testing sets.

Table 6

Summary of binary identical image files within individual ISIC datasets. Note that these figures do not include downsampled duplicates.

Dataset    Train    Test    Train and test
2016       1        0       3
2017       0        2       2
2018       2        0       0
2019       50       0       0
2020       433      78      0

Table 7

Summary of downsampled duplicate image files, where the ISIC code in the downsampled file name is the same as in a non-downsampled file name. Note that all downsampled image files are part of the 2019 training set.
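One way to count binary identical image files is to group files by a cryptographic hash of their raw bytes. The following is a minimal sketch of that idea, not the tooling used in the paper: the function names and the choice of SHA-256 are assumptions made for illustration, and the folder layout is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_binary_duplicates(folders):
    """Group files with identical content across the given folders.

    Returns a list of groups; each group is a sorted list of paths
    whose contents hash to the same digest.
    """
    by_digest = defaultdict(list)
    for folder in folders:
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Passing a single folder reports duplicates within one set, while passing several folders also reports duplicates across sets. A check of this kind cannot detect resized copies, because downsampling changes the file's bytes.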



The main aim of our experiments was to remove duplicates from the training sets only, as we would be evaluating our baseline results on the ISIC 2020 challenge website. Duplicates were removed in year order, e.g. remove all 2016 training set duplicates, then remove all 2017 training set duplicates, etc. The 2019 training set contains a subset of 2074 downsampled image files, denoted by the filename suffix “_downsampled”. These are images that have been reduced in size (height and width), so they would not be identified by algorithms that check for identical binary data. There were no images containing the “_downsampled” suffix in any other training or testing set. No formal description of the downsampled images is provided on the ISIC 2019 challenge website or in the associated challenge papers.

We identified a total of 2263 duplicate image files in the downsampled set where the ISIC code in the downsampled image file name is the same as the ISIC code in a non-downsampled image file name. A summary of the downsampled duplicates is shown in Table 7. Given that duplicates may exist within training sets, within testing sets, and across training and testing sets, we devised a duplicate removal strategy comprising the following stages:

1. Delete all image files from the 2019 training set where the following criteria are satisfied: (i) the filename contains the suffix “_downsampled”; and (ii) the filename contains the ISIC code found in any other image file in any of the testing sets.

2. Delete all image files from the 2019 training set where the following criteria are satisfied: (i) the filename contains the suffix “_downsampled”; and (ii) the filename contains the ISIC code found in any other image file in any of the training sets.

3. Delete all duplicate binary identical image files across all training sets (2016–2020).

4. Delete all image files from each individual training set where a duplicate is found in any of the test sets.
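The four stages can be sketched on in-memory records as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the `ImageFile` record, its `digest` field, the `ISIC_<digits>` filename pattern and all function names are hypothetical. The stage list does not say which copy of a binary identical pair is retained, so the sketch assumes the copy from the most recent release is kept, which is consistent with the 2018 training set being removed entirely.

```python
import re
from dataclasses import dataclass

# Assumed file naming, e.g. "ISIC_0000001.jpg" or "ISIC_0000001_downsampled.jpg".
ISIC_CODE = re.compile(r"ISIC_\d+")


@dataclass(frozen=True)
class ImageFile:
    year: int     # release year of the set the file belongs to
    name: str     # file name
    digest: str   # content hash of the raw bytes (hypothetical field)


def isic_code(name):
    """Extract the ISIC code from a file name, or None if absent."""
    match = ISIC_CODE.search(name)
    return match.group(0) if match else None


def is_downsampled(image):
    """True for 2019 files carrying the '_downsampled' suffix."""
    return image.year == 2019 and "_downsampled" in image.name


def remove_duplicates(train, test):
    """Apply stages 1-4 and return the surviving training files."""
    test_codes = {isic_code(f.name) for f in test} - {None}
    test_digests = {f.digest for f in test}

    # Stage 1: downsampled 2019 files whose ISIC code appears in a testing set.
    kept = [f for f in train
            if not (is_downsampled(f) and isic_code(f.name) in test_codes)]

    # Stage 2: downsampled 2019 files whose ISIC code appears in a
    # non-downsampled file of any training set.
    train_codes = {isic_code(f.name) for f in kept if not is_downsampled(f)} - {None}
    kept = [f for f in kept
            if not (is_downsampled(f) and isic_code(f.name) in train_codes)]

    # Stage 3: binary identical files across training sets.
    # Assumption: the copy from the most recent release is retained.
    seen, unique = set(), []
    for f in sorted(kept, key=lambda f: (-f.year, f.name)):
        if f.digest not in seen:
            seen.add(f.digest)
            unique.append(f)

    # Stage 4: training files with a binary identical copy in any testing set.
    return [f for f in unique if f.digest not in test_digests]
```

Stages 1 and 2 rely only on file names, which is why they can catch resized copies that a byte-level comparison in stages 3 and 4 would miss.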

Table 8

Number of image files deleted from each ISIC training set after applying our duplicate removal strategy. Note that figures include both binary identical duplicates and downsampled duplicates in the 2019 training set.

1 Note: this total differs from the total number of images we used in our combined training set as we did not use the 2016 dataset.

Fig. 1. Illustration of images identified by ImageHash with high similarity, cutoff = 5 (a and b); cutoff = 1 (c and d).

The total number of duplicate image files deleted from all training sets is 14,310, of which 1927 were downsampled duplicate image files deleted from the 2019 training set. Table 8 shows a summary of the image files deleted from each training set, and the number remaining, after applying our duplicate removal strategy. We note that the deletion of the entire 2018 training set is due to the 2018 training data comprising the HAM10000 dataset, which was used in its entirety in subsequent ISIC datasets. Additionally, we do not count multiplets as duplicates, given that they represent lesions at different levels of magnification with slight variations in angle and lighting. As a final checking stage, we counted the number of binary identical files across all training sets and found zero duplicate image files.

Following the completion of stages 1–4 of our duplicate removal strategy, we conducted experiments using four image similarity algorithms to determine if there were any other examples of downsampled images that had not yet been identified. First, we tested the ImageHash Python library, which uses multiple image hash algorithms (average, perceptual, difference and wavelet) to analyse the image structure on luminance without colour information. The colour hash algorithm analyses the colour distribution and black and gray fractions without position information (Buchner, 2020). We tested this method with cutoff values set to 5 and then to 1 on a random selection of training images for 72 hours. No exact matches were found within this time frame; however, we report two examples of the false positives identified by the algorithm (see Fig. 1). Note that lower values indicate closer similarity.
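To make the role of the cutoff concrete, the following is a minimal sketch of the average-hash idea in plain Python. It is not the ImageHash library's implementation: it assumes each image has already been reduced to a small greyscale grid of equal size, and treating the cutoff as inclusive is an assumption.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 where the luminance exceeds the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count differing bits; lower values indicate closer similarity."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_candidate_duplicate(pixels_a, pixels_b, cutoff=5):
    """Flag a pair whose hash distance does not exceed the cutoff."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= cutoff
```

Because the hash only records whether each cell is brighter than the mean, two different lesions with a similar light and dark layout can fall within the cutoff, which is one way false positives such as those in Fig. 1 can arise.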

The second image similarity method we tested was mean squared error (MSE), which tests for image similarity by calculating the mean of the squared differences between the pixel values of two images, resulting in an estimate of the perceived error. An MSE value of zero indicates perfect similarity, with larger values indicating reduced similarity. Although this was faster than the ImageHash algorithm, the results were all false positives when tested over a 72 hour period on a random selection of training images. See Fig. 2 for two such examples. Note that the closer the value is to zero, the closer the similarity.
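A minimal sketch of the MSE comparison, assuming both images are greyscale grids with the same dimensions (the function name is illustrative):

```python
def mean_squared_error(image_a, image_b):
    """Mean of squared pixel differences; 0.0 means the images are identical."""
    flat_a = [value for row in image_a for value in row]
    flat_b = [value for row in image_b for value in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    squared = ((a - b) ** 2 for a, b in zip(flat_a, flat_b))
    return sum(squared) / len(flat_a)
```

Since the measure compares pixels position by position, a downsampled copy must first be brought to the same size as its original before the two can be compared.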

The third image similarity method we tested was the Structural Similarity Index Measure (SSIM), which models the perceived change in the structural information of the image (Zhou Wang et al., 2004). After testing for 72 hours on a random selection of the training images, this method resulted in only false positive results (see Fig. 3). Note that a value of 1 indicates perfect similarity.

For the fourth image similarity method, we tested cosine similarity. This method measures the similarity between two vectors of an inner product space using the cosine of the angle between the two vectors, and determines whether the two vectors are pointing in roughly the same direction (Han et al., 2012). After testing for 72 hours on a random selection of training images, this method also resulted in only false positive results (see Fig. 4). Note that the closer the value is to 1, the closer the similarity.
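Cosine similarity can be sketched as follows, assuming each image has been flattened into a vector of pixel values (the function name is illustrative):

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The measure ignores vector magnitude, so a uniformly brighter copy of an image scores 1 even though its pixel values differ.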

Although none of the image similarity methods we tested were able to identify any identical images, image similarity techniques might be employed in future studies in order to reduce the over-representation of features within datasets. This may help to improve a network’s ability to generalise. However, such an approach would need to be carefully considered, as the removal of too many images may have the opposite of the desired effect, causing a reduction in the model’s ability to generalise.

Fig. 5. Illustration of (a) cropped lesion; (b) obfuscation of lesion by dermoscope measurement overlay; (c) lesion obfuscated by hair; (d) presence of clinical markings; (e) presence of circular size reference stickers; (f) presence of physical ruler; (g) presence of immersion fluid causing distortion of lesion; (h) presence of immersion fluid air pocket.

All stages of our duplicate removal strategy were completed using the Linux application FSlint, created by Brady (2014), which uses rigorous file comparison techniques that compare file size, hardlinks, Message Digest 5 (MD5) and Secure Hash Algorithm 1 (SHA-1). MD5 is used to check both the first 4 kilobytes of a file and the entire file, whereas SHA-1 is used to check the entire file.
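The staged comparison described above can be approximated in Python. This is a sketch of the idea only, not FSlint's own code: the hardlink check is omitted, and the function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def _digest(path, algorithm, limit=None):
    """Hash a whole file, or only its first `limit` bytes when a limit is given."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        if limit is not None:
            hasher.update(handle.read(limit))
        else:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


def _split(groups, key):
    """Refine groups of paths by `key`, discarding groups with a single member."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        refined.extend(b for b in buckets.values() if len(b) > 1)
    return refined


def find_binary_duplicates(paths):
    """Return groups of byte-identical files, using cheap checks first."""
    groups = [list(paths)]
    groups = _split(groups, os.path.getsize)                          # file size
    groups = _split(groups, lambda p: _digest(p, "md5", limit=4096))  # first 4 KB
    groups = _split(groups, lambda p: _digest(p, "md5"))              # whole file
    groups = _split(groups, lambda p: _digest(p, "sha1"))             # whole file
    return [sorted(group) for group in groups]
```

Ordering the checks from cheapest to most expensive means most files are ruled out by size or by their first 4 kilobytes before any full-file hash is computed.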

Other notable observations of images found within the training sets which may impede model performance include:

• Images may be heavily cropped, removing large amounts of the lesion and/or normal skin boundary regions (see Fig. 5(a)).
• Images may exhibit dermoscope measurement overlays, sometimes obfuscating the lesion or lesion boundary (see Fig. 5(b)).
• Images may contain varying amounts of hair, which has been shown to impede model performance (Le et al., 2020) (see Fig. 5(c)).


• Images may contain clinical markings surrounding the lesion (see Fig. 5(d)).
• Images may contain size reference stickers near the lesion (see Fig. 5(e)).

Fig. 6. Illustration of (a) duplicate image with different filenames (ISIC_0016018.jpg and ISIC_0012271.jpg) found in the 2017 training and testing sets; (b) duplicate image with the same filename (ISIC_0029847.jpg) found in the 2018 and 2019 training sets; (c) duplicate image with the same filename (ISIC_0011132.jpg) found in two training sets (2017, 2019) and one testing set (2016); (d) duplicate image with different filenames (ISIC_5448850.jpg and ISIC_9881235.jpg) found in the 2020 training set.



Fig. 6 shows examples of images with duplicated file names and duplicate images with different file names found both within individual datasets and across multiple datasets. The ISIC 2019 training set contains 2074 image files with the suffix “_downsampled”. We observed that although the image dimensions for many of these files had been reduced compared to the non-resized originals found in other training sets, the file sizes were often more than double those of the original non-resized images. This is most likely a side-effect of using a lower compression rate when the images were resized in order to avoid introducing additional compression artefacts. Fig. 7 shows two examples of downsampled images that fall into this category.

We observe that there may be edge cases where our duplicate removal strategy has missed some duplicates. For example, an image may have been resized without the “_downsampled” suffix being added to the filename, or duplicates of a different image size may also have different filenames. However, we believe that our strategy will at least provide a basis for removing a large number of duplicates, which could help to eliminate bias and to enable networks trained on the ISIC datasets to better generalise to new data.

Fig. 7. Illustration of downsampled image files that have a larger file size compared to the non-resized original files: (a) duplicate image found in the 2016 (ISIC_0000019.jpg; 1504x1129; 107.6KB), 2017 (ISIC_0000019.jpg; 1504x1129; 107.6KB) and 2019 (ISIC_0000019_downsampled.jpg; 1024x768; 211.7KB) training sets; (b) duplicate image found in the 2016 (ISIC_0000030.jpg; 1503x1129; 95.1KB), 2017 (ISIC_0000030.jpg; 1503x1129; 95.1KB) and 2019 (ISIC_0000030_downsampled.jpg; 1024x769; 195.6KB) training sets. Both examples exhibit a file size more than double that of the original image file.

We note that for the ISIC 2016 dataset, the ground truth data does not indicate if a lesion is of melanoma type. Only the malignant and benign status is defined. Given that not all malignant skin cancers are melanomas (NHS, 2020b), we did not include the ISIC 2016 data in any of our experiments reported in this paper.

For researchers who do not intend to upload their results to the ISIC competition website, we recommend an additional step for duplicate removal. This step would involve the removal of all duplicates found across all test sets. For our study, we did not perform this final step as our intention was to submit our results to the ISIC competition website. Our initial experiments showed that the significant class imbalance in the curated dataset resulted in model over-fitting, despite the use of categorical cross-entropy as a means to address the imbalance. For our subsequent experiments, we balanced the dataset using undersampling by removing images from the majority class (non-melanoma). We refer to this dataset as the curated balanced dataset.
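Undersampling of the majority class can be sketched as below. The text does not state how the removed images were chosen, so the random selection, the fixed seed and the function name `undersample` are all assumptions made for illustration.

```python
import random


def undersample(samples, label_of, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(label_of(sample), []).append(sample)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

For a two-class problem this leaves twice as many images as the minority class contains, at the cost of discarding most of the majority class.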

The ISIC 2020 dataset includes a patient ID for each image. As a further dataset cleaning step to make the data more heterogeneous, it might be appropriate to include only images from unique patient IDs. We observed that from a total of 33,126 images in the ISIC 2020 training set, there are 3078 unique patient IDs and 30,048 duplicates. With duplicate patient IDs removed, there are a total of 428 melanoma and 2650 non-melanoma cases. In the ISIC 2020 test set, there are a total of 10,982 images with 690 unique patient IDs and 10,292 duplicates. With such a limited number of patient examples, and the very low number of unique melanoma cases, this may represent a bias in the dataset which could affect the robustness of models trained exclusively on this dataset, as limited unique cases might not represent the general public across different skin types. However, patients may have presented the same lesion over a period of time during different clinical visits; thus, the lesions may provide important unique indicators at different stages of development. Additionally, a patient may present more than one lesion.
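The counts above follow from treating every image beyond the first for a given patient as a duplicate. A minimal sketch of that bookkeeping, assuming the metadata is available as (image name, patient ID) pairs; the function names and the example IDs in the usage note are hypothetical:

```python
from collections import Counter


def patient_summary(patient_ids):
    """Count images, unique patients, and images beyond the first per patient."""
    unique = len(Counter(patient_ids))
    return {
        "images": len(patient_ids),
        "unique_patients": unique,
        "repeat_images": len(patient_ids) - unique,
    }


def first_image_per_patient(records):
    """Keep only the first (image_name, patient_id) record seen for each patient."""
    seen = set()
    kept = []
    for image_name, patient_id in records:
        if patient_id not in seen:
            seen.add(patient_id)
            kept.append((image_name, patient_id))
    return kept
```

For example, `patient_summary(["P1", "P1", "P2"])` reports three images, two unique patients and one repeat image. Keeping the first image per patient is only one possible rule; as noted above, later images of the same patient may show a different lesion or a later stage of the same lesion.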

To study the distribution of the curated balanced dataset of 9810 training images, we perform statistical analysis and analyse its distribution using Uniform Manifold Approximation and Projection (UMAP), devised by McInnes et al. (2018).

UMAP is a dimensionality reduction tool based on manifold learning. McInnes et al. (2018) demonstrated that UMAP is competitive with t-SNE while being less computationally expensive. Fig. 8 shows the UMAP visualisation of the data feature distributions (input, EfficientNetB0, top dropout layer, dense layer and output), where blue regions represent non-melanoma and orange regions represent melanoma. It is noted that after training with EfficientNetB0, the two classes become separable, which is further supported by the statistical metrics in Table 9. On the input distribution, the intra-class and inter-class values show high similarity.

We can observe that, moving from the input dataset through the feature vector of EfficientNetB0 to the dense layer, both intra-class and inter-class values increase, which indicates a better separability of the dataset. This is further confirmed by higher values of the silhouette score and Calinski-Harabasz index. We observe that the feature distribution of EfficientNetB0 shows that the inter-class distance (18.6252) is larger than the intra-class distances (7.6235 for melanoma and 8.6028 for non-melanoma).
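The intra-class and inter-class distances can be sketched as average pairwise distances between feature vectors. The use of Euclidean distance is an assumption here, as the text does not name the metric, and the function names are illustrative.

```python
import math
from itertools import combinations, product


def intra_class_distance(points):
    """Average distance between all pairs of samples within one class."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def inter_class_distance(class_a, class_b):
    """Average distance between samples drawn from two different classes."""
    pairs = list(product(class_a, class_b))
    if not pairs:
        raise ValueError("both classes must contain samples")
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)
```

A feature space separates the classes well when the inter-class value is clearly larger than both intra-class values, as reported for the EfficientNetB0 features.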

Recent developments in the field have shown a growing emphasis on models capable of multi-class predictions (Codella et al., 2018b; Kassem et al., 2020). Due to the growing importance of multi-class CNNs in this domain, we conduct further multi-class analysis using our curated dataset, and evaluate the performance on the ISIC 2018 test set (Task 3: multi-class lesion diagnosis) (Codella et al., 2018b). Task 3 of ISIC 2018 consists of 7 different types of skin lesions: Melanoma (MEL), Nevi (NV), Basal cell carcinoma (BCC), Actinic keratosis / Bowen’s disease (intraepithelial carcinoma) (AKIEC), Benign keratosis (BKL), Dermatofibroma (DF) and Vascular (VASC). The curated dataset consists of 4905 MEL, 11,421 NV, 3316 BCC, 859 AKIEC, 2520 BKL, 239 DF and 253 VASC images. Fig. 9 illustrates the UMAP visualisation of the multi-class data feature distribution. Due to factors such as class imbalance and image similarity, the composition of the dataset presents a significant challenge for skin lesion diagnosis, with overlaps at every stage during training. It is noted that the values of the Calinski-Harabasz index, silhouette score and Davies-Bouldin index indicate low separability of the distribution, with some improvement from the input distribution (411.8463, -0.0613 and 12.3489) after training with EfficientNetB0 (928.5820, -0.1254 and 3.9409).

To allow for reproducibility of this work, the list of image files and corresponding labels can be downloaded from our GitHub repository (available at www.github.com/mmu-dermatology-research/isic_duplicate_removal_strategy).

Fig. 8. UMAP visualisation of the curated balanced training set, where orange regions represent melanoma and blue regions represent others. From left to right: input, feature distributions extracted with EfficientNetB0, the top dropout layer, dense layer and output. These graphs visually illustrate an increase in the separability of melanoma versus non-melanoma in the deep learning architecture. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 9

Statistical analysis on the separability of melanoma versus non-melanoma on the curated balanced dataset. Intra: average distance between samples of the same class; Inter-class: average distance between samples of different classes; Si: silhouette score; CH: Calinski-Harabasz index; DB: Davies-Bouldin index.

Fig. 9. UMAP visualisation of the curated multi-class training set, where orange regions represent MEL, blue regions represent NV, green regions represent BCC, red regions represent AKIEC, purple regions represent BKL, brown regions represent DF and pink regions represent VASC. From left to right: input, feature distributions extracted with EfficientNetB0, the top dropout layer, dense layer and output. These graphs visually illustrate the low separability of multi-class skin lesions within the deep learning architecture. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

4.3. Benchmarks

For baseline experiments, we trained 19 of the most widely used deep learning architectures: DenseNet121, DenseNet169, DenseNet201, EfficientNetB0-B4, InceptionResNetV2, InceptionV3, ResNet50, ResNet50V2, ResNet101, ResNet101V2, ResNet152, ResNet152V2, VGG16, VGG19 and Xception. For training data, we used an 80:20 split for training and validation using our curated balanced dataset based on images from the ISIC 2017-2020 datasets. Transfer learning was not used for any of the experiments, as the purpose of this paper is to provide baseline results without additional strategies. We trained each of the 19 networks for 50 epochs with a batch size of 32 using stochastic gradient descent with an initial learning rate of 0.01 and momentum of 0.9. We implemented early stopping with a patience of 10 epochs to determine when each network had converged. For pre-processing, all images were resized, with the shortest side reduced to 224 pixels using the high-quality downsampling filter found in the Python Imaging Library, created by Lund and Clark (2013), and then center-cropped to 224 × 224 pixels. To address the limited size of the training set, we applied several data augmentation techniques, including random rotations, random zooms, random width and height shifts, shearing, and horizontal and vertical flipping.
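The geometry of the resize and center crop can be worked through without any imaging library. The sketch below only computes the target size and crop box; the function name and the rounding rule are assumptions, and the resampling itself is left to an imaging library.

```python
def resize_and_crop_box(width, height, target=224):
    """Return the resized dimensions (shortest side == target) and the
    (left, top, right, bottom) box of the centered target x target crop."""
    scale = target / min(width, height)
    new_width = max(target, round(width * scale))
    new_height = max(target, round(height * scale))
    left = (new_width - target) // 2
    top = (new_height - target) // 2
    return (new_width, new_height), (left, top, left + target, top + target)
```

For a 1504 × 1129 image, such as the originals listed in Fig. 7, this gives a resized image of 298 × 224 pixels and a crop box of (37, 0, 261, 224), so 37 columns are discarded on each side.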

The hardware configuration used to train the networks was an Intel i7-7700 Quad Core 3.60GHz CPU with 64GB DDR4 2400MHz Dual Channel RAM and a GTX1080 Ti 11GB GPU. The software configuration used TensorFlow GPU 2.4.1 and Keras 2.3.1 running on Windows 10.

We evaluate the performance of the baseline models on Kaggle. However, given that we do not have access to the ground truth labels for the ISIC 2020 test set, we cannot perform further analysis on this dataset. In order to discuss the performance of the baseline models, we use the ISIC 2017 test set, which has no overlap with our curated balanced training set.

To produce baselines for our curated multi-class dataset, we use four popular deep learning models (VGG19, DenseNet121, ResNet101 and EfficientNetB0) and evaluate the results on the ISIC 2020 live leaderboard. Our experiments set the maximum number of epochs to 200 and adopt an early stopping strategy. We save the best model that maximised validation accuracy, with early stopping implemented when the categorical accuracy of validation did not increase after 8 epochs. The initial learning rate was set to 0.001, which was reduced by a factor of 0.1 when the validation score did not increase after 5 epochs.
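The interaction between the two patience values can be illustrated with a small simulation over a list of validation accuracies. This is a simplified sketch of the logic only: the exact counter semantics of the training framework may differ, and the function name is hypothetical.

```python
def simulate_training_schedule(val_accuracies, initial_lr=0.001, lr_factor=0.1,
                               lr_patience=5, stop_patience=8):
    """Return (best_epoch, last_epoch_run, final_learning_rate)."""
    if not val_accuracies:
        raise ValueError("at least one epoch is required")
    best_accuracy = float("-inf")
    best_epoch = 0
    learning_rate = initial_lr
    epochs_since_best = 0
    epochs_since_lr_change = 0
    last_epoch = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        last_epoch = epoch
        if accuracy > best_accuracy:
            # New best model: this is the checkpoint that would be saved.
            best_accuracy, best_epoch = accuracy, epoch
            epochs_since_best = 0
            epochs_since_lr_change = 0
        else:
            epochs_since_best += 1
            epochs_since_lr_change += 1
            if epochs_since_lr_change >= lr_patience:
                learning_rate *= lr_factor
                epochs_since_lr_change = 0
        if epochs_since_best >= stop_patience:
            break
    return best_epoch, last_epoch, learning_rate
```

With these settings the learning rate is reduced once before training stops, since 5 epochs without improvement pass before the 8-epoch limit is reached.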

5. Results

Table 10 presents benchmark results, using our curated balanced dataset, for 19 deep learning architectures at their best epochs on the ISIC 2020 test set. Note that the ground truth for ISIC 2020 is not publicly available; therefore, the metric provided by the organiser on Kaggle is used: area under the Receiver Operating Characteristic curve (AUC).
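For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen melanoma image receives a higher score than a randomly chosen non-melanoma image. The sketch below computes it directly from that definition; it is an illustration, not the Kaggle scoring code, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct pair."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Under this definition a model that assigns the same score to every image obtains an AUC of 0.5, which is the chance level.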

For the ISIC 2020 test results, the highest performing network was VGG19 with an AUC of 0.80, indicating an increase of 0.03 over the next best performing networks (VGG16 and DenseNet121). The next best performing networks were ResNet101, ResNet50, ResNet50V2, EfficientNetB2, EfficientNetB3, Xception, EfficientNetB0, EfficientNetB1, ResNet101V2, VGG16 and DenseNet121, reporting an AUC in the range of 0.70-0.77. InceptionV3 was shown to be the lowest performing network with an AUC of 0.5. In addition to InceptionV3, the other lowest performing networks were DenseNet201, InceptionResNetV2, ResNet152V2, EfficientNetB4, DenseNet169 and ResNet152, all reporting an AUC in the range of 0.63 to 0.67. For all reported networks, the mean AUC was 0.704, with a standard deviation of 0.071. The poor performance of EfficientNetB4 compared to EfficientNetB0-B3 may be due to a disparity between the large size of the network architecture and the relatively small size of the training set images. We note that the best performing network (VGG19) also has the highest number of parameters; however, the poorest performing network (InceptionV3) did not have the lowest number of parameters.

Table 10

A performance comparison of the baseline models on the ISIC 2020 testing set without the use of a pre-trained model. Results are reported on their best epoch.

Table 11

More detailed performance measures on the ISIC 2017 test set. Note that these results cannot be compared with the ISIC 2017 leaderboard as these results are based on binary classification, while the ISIC 2017 leaderboard is based on 3-class classification.

Table 11 shows the benchmark results, using our curated balanced dataset, for 19 deep learning architectures at their best epochs on the ISIC 2017 test set. For the ISIC 2017 test results, VGG19 demonstrated the highest accuracy at 0.56, with InceptionV3 having the lowest accuracy of 0.30. For precision, EfficientNetB3 showed the highest result at 0.22, while DenseNet169 reported the lowest result at 0.16. InceptionV3 showed the highest recall, with a result of 0.94, indicating that the network overclassified melanoma cases. Conversely, EfficientNetB4 showed the lowest recall at 0.39, which is comparable to its low performance on the 2020 test set. For AUC, DenseNet201 showed the highest result at 0.56, with DenseNet169 reporting the lowest result of 0.46. Measures for all networks using the ISIC 2017 test set demonstrated poor performance compared to those returned by the ISIC 2020 test set experiment.

F1-score is the best indicator of overall network performance, representing the harmonic mean of precision and recall. Fig. 10 shows six examples of test images from the ISIC 2017 test set where noise affected the performance of three predictions. Fig. 11 shows a selection of heatmaps for test images from the ISIC 2020 test set, compared against the original test images. Given that the ground truth data is not publicly available for the ISIC 2020 test set, we present these results to demonstrate that the trained networks are clearly focusing on noise present within the dataset. However, in the case of the ISIC 2017 test results, noise would appear to not always affect the accuracy of the prediction.
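As a reminder of how the harmonic mean behaves, the sketch below derives precision, recall and F1-score from confusion-matrix counts. The counts used in the usage note are hypothetical and are not taken from the tables.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, `precision_recall_f1(94, 400, 6)` describes a network that labels most images as melanoma: recall is 0.94, but precision is below 0.2 and the F1-score stays low, because the harmonic mean is dominated by the smaller of the two values.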


Table 13 shows the benchmark results of multi-class classification evaluated on Task 3 of the ISIC 2018 testing set. As the ground truth of this dataset is not publicly available, we evaluate our results on the live leaderboard and report the Balanced Multi-class Accuracy (Codella et al., 2018b). The deep learning models achieve better accuracy with pretrained models based on ImageNet. We observe that the best baseline accuracy of our curated dataset is 0.621, achieved by EfficientNetB0 with a pretrained model.
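Balanced multi-class accuracy is commonly defined as the mean of the per-class recall values, so that every lesion type contributes equally regardless of how many images it has. The sketch below assumes that definition; the function name is illustrative.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return sum(correct[label] / total[label] for label in total) / len(total)
```

With 8 NV and 2 MEL images, a model that predicts NV for everything has a plain accuracy of 0.8 but a balanced accuracy of 0.5, which is why the balanced form is preferred for imbalanced lesion classes.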

Since we do not have access to ground truth data for the ISIC 2018 test set, we conduct further analysis of our best baseline model on the ISIC 2017 classification dataset, which consists of 3 classes: MEL, NV and seborrheic keratosis (SK). Fig. 13 illustrates the Grad-CAM heatmap visualisation of correct predictions (on the left: a), c) and e)) and incorrect predictions (on the right: b), d) and f)). We note that although for correct predictions the network mostly focused on regions of the skin lesion, some cropped regions and areas of noise were also included. Our findings on multi-class classification are consistent with our binary classification results, where noise appears to not always affect the accuracy of the prediction.

We further analyse the curated dataset by annotating features that enable the network to further inflate its results, e.g. clinical pen markings. We categorise these non-lesion features into 7 separate class labels: (1) dermoscope ruler; (2) light and dark hair; (3) clinical pen marking; (4) size reference sticker; (5) air pocket; (6) dermoscope borders; (7) other. The last category contains artefact types for which there were too few examples to warrant creating new categories, including images with dates printed onto them and images that were extremely blurry. Next, we train the network with each of the non-lesion classes removed in turn, similar to a cross-fold validation. This experiment shows how non-lesion features (noise) affect model accuracy. We train each model by removing all images with the given non-lesion feature from the dataset, then rebalance by removing images from the majority class, with 20% reserved for validation.

Table 14 demonstrates the diversity of images within the ISIC datasets, including within our curated dataset. Furthermore, it highlights how certain features could introduce bias towards melanoma, such as the dermoscope borders, which are much less present in melanoma cases and whose removal gives a minor accuracy increase. Similarly, dermoscope ruler overlays have a slight bias towards non-melanoma. However, the best performing network uses the full dataset (None removed), which is in contrast to Table 15, where none of the best scores come from networks trained on the full dataset. With dermoscope ruler artefacts removed from training, DenseNet201 and InceptionV3 achieved improved accuracy. Similarly, VGG19 with air pocket artefacts removed had the best performance overall. These results show that some models are susceptible to noise from the artefacts that disrupt performance, in some cases significantly. Future work could involve the use of ensemble networks trained on the removed class to overcome these obstacles.

Fig. 10. Grad-CAM heatmap visualisation for the ISIC 2017 test results: a) network focused only on areas surrounding the lesion, including immersion fluid (DenseNet201) - prediction: melanoma; ground truth: other, b) network focused only on areas around the lesion, including clinical pen markings and dermoscope measurement overlay (DenseNet201) - prediction: melanoma; ground truth: melanoma, c) network focused mainly on areas surrounding the lesion, including clinical pen markings and dermoscope measurement overlay (DenseNet201) - prediction: melanoma; ground truth: melanoma, d) network focused mainly on immersion fluid (VGG16) - prediction: melanoma; ground truth: melanoma, e) network focused mainly on clinical pen markings and hair (VGG19) - prediction: melanoma; ground truth: other, f) network focused on lesion and cropped image area (VGG19) - prediction: melanoma; ground truth: other.

Fig. 11. Grad-CAM heatmap visualisation for the ISIC 2020 test results using DenseNet121: a) network focused on both lesion and clinical pen markings, b) network focused primarily on dermoscope measurement overlay, c) network focused on surrounding skin and wound dressing, d) network focused primarily on clinical pen markings.

Fig. 12. UMAP visualisation of EfficientNetB0 on the ISIC 2017 test set, where orange regions represent melanoma and blue regions represent others. From left to right: input, feature distributions extracted with EfficientNetB0, the top dropout layer, dense layer and output. These graphs visually illustrate the separability of melanoma versus non-melanoma. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Table 12

Statistical analysis on the separability of melanoma versus non-melanoma on the ISIC 2017 testing set. Intra: average distance between samples of the same class; Inter-class: average distance between samples of different classes; Si: silhouette score; CH: Calinski-Harabasz index; DB: Davies-Bouldin index.


6. Discussion

All network architectures we trained using our curated balanced dataset provided comparable results. However, we note that the ISIC 2020 testing set we used to evaluate our models is significantly larger than our training set: 7,848 training examples vs 10,982 test examples. Performance may be further impacted by possible class imbalance in the ISIC 2020 testing set. We also note that the ISIC 2020 test set contains 78 duplicate image files, identified using FSlint. Exact details of the ISIC 2020 test set are not currently publicly available, so we can only speculate on its composition and its possible effects on the evaluation metrics we obtained.
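Byte-identical duplicate files of the kind reported above can be found by comparing content digests. The following is a minimal sketch under stated assumptions: it groups files by SHA-256 digest, the helper names `file_digest` and `find_duplicate_files` and the example file names are hypothetical, and it is not a description of how FSlint works internally.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Group files under `root` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for folder, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            by_digest[file_digest(path)].append(path)
    # Keep only groups with more than one member, i.e. actual duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


# Small self-contained demonstration with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    contents = {
        "ISIC_a.jpg": b"lesion-1",
        "ISIC_b.jpg": b"lesion-1",  # same bytes as ISIC_a.jpg
        "ISIC_c.jpg": b"lesion-2",
    }
    for name, data in contents.items():
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    duplicate_groups = [
        [os.path.basename(path) for path in group]
        for group in find_duplicate_files(root)
    ]
```

A digest comparison only catches exact copies. Images that were re-encoded, resized or cropped would need a similarity measure rather than a hash of the raw bytes.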

Given the comparatively small size of our curated balanced dataset, and the lack of any additional fine-tuning of the trained networks, we regard the test results from the ISIC 2020 test set as good for the best performing networks. However, the results for the experiment performed on the ISIC 2017 test set were poor for all measures for all networks. As shown in the UMAP visualisation (using the same settings as for the balanced training set) in Fig. 12 and the statistical analysis in Table 12, the ISIC 2017 test set is less separable, with inter-class similarities and intra-class dissimilarities between the two classes. We identify four possible causes for this: (1) the number of duplicates within the ISIC 2017 test set; (2) class imbalance in the ISIC 2017 test set; (3) the amount of noise present in both the training and testing sets; and (4) the relatively small size of our curated balanced training set. We identify noise as cases where lesions may be obfuscated by hair follicles, hair, air pockets resulting from the application of immersion fluid, size reference stickers, rulers, dermoscope measurement overlays and clinical pen markings. Ju et al. (2021) noted that medical datasets tend to have asymmetric (class-dependent) noise and suffer from high observer variability. Rolnick et al. (2018) showed that deep learning models trained on large supervised datasets are capable of generalising from training data where true labels are massively outnumbered by incorrect labels. However, this was only demonstrated on the MNIST, CIFAR and ImageNet datasets and requires a significant increase in dataset size that is related to the factor by which correct labels have been diluted. Our results may indicate the importance of transfer learning and dataset size in this domain.
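To make the separability indicators of Table 12 more concrete, the sketch below computes the mean intra-class distance, the mean inter-class distance and the silhouette score for a handful of 2-D points. The function names and the toy coordinates are assumptions for illustration. The paper's figures were computed on network features, not on data like this.

```python
import math
from itertools import combinations


def mean_intra_inter(points, labels):
    """Mean pairwise distance within classes (intra) and across classes (inter)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


def silhouette_score(points, labels):
    """Mean silhouette coefficient; values near 1 mean well-separated classes."""
    scores = []
    for i, (p, lp) in enumerate(zip(points, labels)):
        same = [
            math.dist(p, q)
            for j, (q, lq) in enumerate(zip(points, labels))
            if lq == lp and j != i
        ]
        if not same:  # a class with a single sample scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # cohesion: mean distance to own class
        # Separation: smallest mean distance to the samples of another class.
        b = min(
            sum(math.dist(p, q) for q, lq in zip(points, labels) if lq == other)
            / labels.count(other)
            for other in set(labels)
            if other != lp
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


labels = [0, 0, 1, 1]
separated = [(0, 0), (0, 1), (10, 0), (10, 1)]    # classes far apart
interleaved = [(0, 0), (10, 0), (0, 1), (10, 1)]  # classes mixed together

intra_sep, inter_sep = mean_intra_inter(separated, labels)
score_sep = silhouette_score(separated, labels)
score_mixed = silhouette_score(interleaved, labels)
```

A large inter-class distance relative to the intra-class distance and a silhouette score close to 1 indicate separable classes, while a score near or below 0 corresponds to the overlapping behaviour described for the ISIC 2017 test set. For real feature vectors, scikit-learn offers `silhouette_score`, `calinski_harabasz_score` and `davies_bouldin_score` in `sklearn.metrics`.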

Table 13

Benchmarking the performance of the curated dataset on multi-class classification, evaluated on the ISIC 2018 test set. The primary metric for the live leaderboard is Balanced Multi-class Accuracy. "Pretrained" indicates that the model uses a model pretrained on ImageNet.
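Balanced multi-class accuracy is commonly defined as the mean of the per-class recall, so that rare classes count as much as frequent ones. The sketch below assumes that definition. The function name and the toy labels are hypothetical and are not taken from the leaderboard.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    recalls = [correct[cls] / total[cls] for cls in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 nevi (NV) and 2 melanomas (MEL), one melanoma missed.
y_true = ["NV"] * 8 + ["MEL"] * 2
y_pred = ["NV"] * 8 + ["NV", "MEL"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 9 out of 10, while the balanced value is the average of a recall of 1.0 for NV and 0.5 for MEL, which shows why the balanced metric is the more informative one on an imbalanced test set.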

We balanced our curated dataset using undersampling, which involved the removal of images from the majority class (non-melanoma). However, it may be useful for future research to compare the results of this approach with other balancing techniques such as image augmentation of the minority class (melanoma), or weight balancing such as the implementation of a focal loss function, as per Lin et al. (2017).
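The two balancing options mentioned here can be sketched as follows: random undersampling to the size of the smallest class, and the binary focal loss of Lin et al. (2017), FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The helper names, the seed, the toy file names and the defaults gamma = 2 and alpha = 0.25 are assumptions for illustration, not the settings used in the paper.

```python
import math
import random


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    return balanced


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical imbalanced set: 10 non-melanoma images and 3 melanoma images.
samples = [f"img_{i}.jpg" for i in range(13)]
labels = ["non-melanoma"] * 10 + ["melanoma"] * 3
balanced = undersample(samples, labels)
class_counts = {
    name: sum(1 for _, label in balanced if label == name)
    for name in ("melanoma", "non-melanoma")
}

easy = focal_loss(0.9, 1)  # confident, correct prediction: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident, wrong prediction: dominates the loss
```

Undersampling discards majority-class images, whereas the focal loss keeps all images and instead reduces the contribution of examples the network already classifies well. With gamma set to 0 the focal loss reduces to alpha-weighted cross-entropy.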

For multi-class classification, we provide baseline results with four popular deep learning models on the ISIC 2018 Task 3 lesion diagnosis test set. The curated dataset is imbalanced and requires additional strategies to improve the performance of networks trained using this dataset. For future improvement, we recommend the use of data augmentation methods and/or the inclusion of external non-ISIC datasets to balance the classes, particularly on AKIEC, DF, VASC and SCC.

Fig. 13. Grad-CAM heatmap visualisation for multi-class classification on the ISIC 2017 test results: a) network focused mainly on the lesion - prediction: melanoma; ground truth: melanoma, b) network focused on the lesion and clinical pen marking - prediction: nevus; ground truth: melanoma, c) network focused on lesion and cropped area - prediction: nevus; ground truth: nevus, d) network focused mainly on clinical pen markings and hair regions - prediction: melanoma; ground truth: nevus, e) network focused on the lesion and clinical pen markings - prediction: seborrheic keratosis; ground truth: seborrheic keratosis, f) network focused entirely on the surrounding skin - prediction: seborrheic keratosis; ground truth: melanoma.

Table 14

Results on the validation set for artefacts removed. Note: when removing artefacts we re-balance the dataset to the class with the lowest number of remaining images, then take 20% from both classes for validation.

Table 15

Results of individual artefact class removal on the ISIC 2020 testing set.

Future work might focus on the effect of the large number of visually similar images on trained models that use the ISIC datasets. We tested four image similarity methods on a limited set of data over a short period of time. However, other techniques, such as those employing feature extraction, may be worth investigating given that recent works, such as Sucholutsky and Schonlau (2020), suggest that unique features are more important than a large number of training images when training deep CNNs. We also note the importance of colour space in the processing of medical image data (Barata et al., 2014). This could contribute to future improvements to algorithm design for skin lesion diagnosis, and will be explored further in future work. As our paper focuses on skin lesion classification, we did not include duplicate analysis on skin lesion segmentation datasets.

Whilst classification tasks provide the diagnosis of the lesions, lesion segmentation, such as in ISIC 2018 Task 1 on lesion boundary segmentation, provides better localisation of the lesions. This could be used in future studies to assess the overlap between the computer-generated heatmap and the dermatologist's annotation.
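One way such an overlap could be quantified is to threshold the heatmap into a binary mask and compare it with the annotated lesion mask using intersection over union (IoU) and the Dice coefficient. The sketch below is illustrative only: the threshold of 0.5, the function names and the 3 x 3 arrays are assumptions, and the source does not specify which overlap measure would be used.

```python
def threshold_mask(heatmap, threshold=0.5):
    """Turn a heatmap with values in [0, 1] into a binary mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def iou_and_dice(pred_mask, true_mask):
    """Return (IoU, Dice) for two binary masks of equal shape."""
    intersection = pred_total = true_total = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            intersection += pred & true
            pred_total += pred
            true_total += true
    union = pred_total + true_total - intersection
    if union == 0:  # both masks empty: no region to compare
        return 0.0, 0.0
    return intersection / union, 2 * intersection / (pred_total + true_total)


# Hypothetical heatmap and annotation (1 = lesion pixel).
heatmap = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
    [0.1, 0.0, 0.0],
]
annotation = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
iou, dice = iou_and_dice(threshold_mask(heatmap), annotation)
```

Here the thresholded heatmap and the annotation each cover four pixels and share two of them, giving an IoU of 2/6 and a Dice coefficient of 4/8. A low value would indicate that the network attends to regions outside the annotated lesion.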

7. Conclusion

In this work, we propose a strategy for removing duplicate image files from the ISIC 2017 - 2020 datasets as a means of reducing bias in deep learning models trained on these datasets. We present results from a variety of commonly used CNN architectures trained on a curated balanced dataset, which indicate excellent class distribution and good performance measures. The aim of this work is to highlight the potential biases arising from the use of duplicate images in the ISIC datasets, and numerous other issues, such as noise present within the ISIC datasets, and to better understand their effects on deep CNNs. This work is not intended to maximise the performance of the CNNs; we therefore did not include any additional steps such as transfer learning with different pretrained models, fine-tuning or adjustments to network configurations. The effects of noise inherent within the ISIC datasets, in addition to a relatively small training set size, were shown to contribute to a significant reduction in network performance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Bill Cassidy: Conceptualization, Data curation, Formal analysis, Writing – original draft. Connah Kendrick: Conceptualization, Formal analysis, Writing – original draft. Andrzej Brodzicki: Conceptualization, Formal analysis, Writing – original draft. Joanna Jaworek-Korjakowska: Conceptualization, Formal analysis, Writing – original draft. Moi Hoon Yap: Conceptualization, Data curation, Formal analysis, Writing – original draft.

Acknowledgment

We gratefully acknowledge the funding support of EPSRC (EP/N02700/1) and FAST Healthcare NetworksPlus. This research project was partly supported by the “Excellence Initiative - Research University” programme for the AGH University of Science and Technology.



