Histopathologic Cancer Detection (densenet169) Study Notes

1. Kaggle competition page

Histopathologic Cancer Detection | Kaggle

2. Overview

This is a binary image classification problem. Let's first look at how the official dataset is organized.

train: the training set. Histopathology images as .tif files, 96 x 96 px each. (Each sample is called a patch.)

test: the test set. Images in the same format as the training set.

train_labels.csv: labels for the training set. Each training sample's file name serves as its id, with 0 or 1 as the label. 0 marks a negative sample (normal, no cancer) and 1 a positive sample (cancer present). A positive sample is defined as one whose center 32 x 32 px region contains at least one pixel of tumor tissue; tumor tissue outside the center region does not affect the label. The outer region is provided so that no padding is needed when applying convolutions, which is convenient for training.

sample_submission.csv: lists the id (file name) of every sample in the test set; the label column is initially all 0.

Task: detect metastases in 96 x 96 px digital histopathology images. That is, for each id (patch) in the test set, predict the probability that the center 32 x 32 px region of the patch contains at least one pixel of tumor tissue.
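To make the label definition concrete, here is a minimal sketch of reading a patch and slicing out the 32 x 32 px center region that determines the label (the file name is hypothetical):

import cv2

img = cv2.imread('some_patch.tif')  # 96 x 96 x 3 patch, BGR order
center = img[32:64, 32:64]          # only this center 32 x 32 region determines the label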


3. Dataset background

The dataset used in this Kaggle competition is a subset of the PCam dataset, with PCam's duplicate images removed. PCam (Patch Camelyon) in turn comes from the Camelyon16 challenge dataset: the original whole-slide images are enormous, and slicing them into patches (effectively a preprocessing step) yields Patch Camelyon. The three links below introduce the Camelyon datasets and give a feel for the workflow of processing whole-slide pathology images (WSI).

Introduction to the Camelyon datasets:

https://zhuanlan.zhihu.com/p/50672544

Camelyon16 winning solution:

https://zhuanlan.zhihu.com/p/51247262

Camelyon17 winning solution:

https://zhuanlan.zhihu.com/p/51735826

4. Platform

My original plan was to pull the Kaggle dataset onto Colab via the API, but I hit two problems: 1. Colab clears the dataset every time the runtime restarts. 2. Memory consumption during the actual run was too high to finish. So in the end I ran everything on the lab server.
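For reference, pulling the data with the official Kaggle API looks roughly like this (a sketch; it assumes the kaggle.json API token is already configured, and the target directory is my own choice):

pip install kaggle
kaggle competitions download -c histopathologic-cancer-detection -p ./input
unzip -q ./input/histopathologic-cancer-detection.zip -d ./input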

5. Reference code

Original author's kernel:

A complete ML pipeline (Fast.ai) | Kaggle

Starting from that kernel, I removed the visualization parts and adjusted the paths. As long as you install the right fastai version (see the environment section below), it should run through.

6. Environment setup

I installed Anaconda on the server and then installed the various libraries/packages with conda, which was mostly trouble-free. Installing fastai itself with conda kept failing, though, and even when it succeeded the code threw errors at runtime. What finally worked:

pip install fastai==1.0.50.post1

After that, everything ran smoothly.
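For reference, a clean setup might look like this (a sketch; the environment name and Python version are my own choices, not from the original run):

conda create -n fastai-pcam python=3.7
conda activate fastai-pcam
pip install fastai==1.0.50.post1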

7. The modified bare-bones script (paths need to be adjusted)

import numpy as np
import pandas as pd
import os
import cv2
from sklearn.utils import shuffle
from tqdm import tqdm  # plain tqdm for scripts; use tqdm.notebook when running inside Jupyter

data = pd.read_csv('/kaggle/input/train_labels.csv')  # adjust these paths to your environment
train_path = '/kaggle/input/train/'
test_path = '/kaggle/input/test/'

# random sampling
shuffled_data = shuffle(data)

# Data augmentation
import random

ORIGINAL_SIZE = 96  # original size of the images - do not change

# AUGMENTATION VARIABLES
CROP_SIZE = 90  # final size after the crop; this is also the network input size
RANDOM_ROTATION = 3  # rotation range in degrees (0-180); 180 allows all rotations, 0 = no change
RANDOM_SHIFT = 2  # center-crop shift in the x and y axes, 0 = no change; cannot exceed (ORIGINAL_SIZE - CROP_SIZE) // 2 = 3
RANDOM_BRIGHTNESS = 7  # range (0-100), 0 = no change
RANDOM_CONTRAST = 5  # range (0-100), 0 = no change
RANDOM_90_DEG_TURN = 1  # 0 or 1; 1 adds a random turn of -90, 0, or +90 degrees


def readCroppedImage(path, augmentations=True):
    # The augmentations flag exists so that image statistics can be computed without augmentation

    # OpenCV reads images in BGR order by default
    bgr_img = cv2.imread(path)
    # We flip it to rgb for visualization purposes
    b, g, r = cv2.split(bgr_img)
    rgb_img = cv2.merge([r, g, b])

    if (not augmentations):
        return rgb_img / 255

    # random rotation: a small angle, plus optional 90-degree turns
    rotation = random.randint(-RANDOM_ROTATION, RANDOM_ROTATION)
    if (RANDOM_90_DEG_TURN == 1):
        rotation += random.randint(-1, 1) * 90
    M = cv2.getRotationMatrix2D((48, 48), rotation, 1)  # the center point (48, 48) is the rotation anchor
    rgb_img = cv2.warpAffine(rgb_img, M, (96, 96))

    # random x,y-shift of the crop window
    x = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)
    y = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)

    # crop to the center and normalize to the 0-1 range
    start_crop = (ORIGINAL_SIZE - CROP_SIZE) // 2  # (96 - 90) // 2 = 3
    end_crop = start_crop + CROP_SIZE  # 3 + 90 = 93
    rgb_img = rgb_img[(start_crop + x):(end_crop + x), (start_crop + y):(end_crop + y)] / 255

    # Random flips
    flip_hor = bool(random.getrandbits(1))
    flip_ver = bool(random.getrandbits(1))
    if (flip_hor):
        rgb_img = rgb_img[:, ::-1]
    if (flip_ver):
        rgb_img = rgb_img[::-1, :]

    # Random brightness
    br = random.randint(-RANDOM_BRIGHTNESS, RANDOM_BRIGHTNESS) / 100.
    rgb_img = rgb_img + br

    # Random contrast
    cr = 1.0 + random.randint(-RANDOM_CONTRAST, RANDOM_CONTRAST) / 100.
    rgb_img = rgb_img * cr

    # clip values to the 0-1 range
    rgb_img = np.clip(rgb_img, 0, 1.0)

    return rgb_img  # the processed image


# Compute image statistics (no augmentation here)
# Counting the statistics should give channel means [0.702447, 0.546243, 0.696453] and stds [0.238893, 0.282094, 0.216251].

# While counting the statistics, we can also check whether any images are completely black or white

dark_th = 10 / 255  # If no pixel reaches this threshold, the image is considered too dark
bright_th = 245 / 255  # If no pixel falls below this threshold, the image is considered too bright
too_dark_idx = []
too_bright_idx = []

x_tot = np.zeros(3)
x2_tot = np.zeros(3)
counted_ones = 0
for i, idx in tqdm(enumerate(shuffled_data['id']), 'computing statistics...(220025 it total)'):
    path = os.path.join(train_path, idx)
    imagearray = readCroppedImage(path + '.tif', augmentations=False).reshape(-1, 3)
    # is this too dark?
    if (imagearray.max() < dark_th):
        too_dark_idx.append(idx)
        continue  # do not include in the statistics
    # is this too bright?
    if (imagearray.min() > bright_th):
        too_bright_idx.append(idx)
        continue  # do not include in the statistics
    x_tot += imagearray.mean(axis=0)
    x2_tot += (imagearray ** 2).mean(axis=0)
    counted_ones += 1

# Channel means and standard deviations
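# (std is recovered from the running sums via the identity Var[X] = E[X^2] - (E[X])^2)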
channel_avr = x_tot / counted_ones
channel_std = np.sqrt(x2_tot / counted_ones - channel_avr ** 2)

# The result should be 1 image that is too dark and 6 that are too bright
print('There was {0} extremely dark image'.format(len(too_dark_idx)))
print('and {0} extremely bright images'.format(len(too_bright_idx)))
print('Dark one:')
print(too_dark_idx)
print('Bright ones:')
print(too_bright_idx)

#  Split the dataset
#  Split the training data into a 90% training part and a 10% validation part, keeping the same negative/positive ratio (60/40) in both parts

from sklearn.model_selection import train_test_split

# We read the csv into a pandas dataframe earlier; now we set the index to id so that we can drop rows by id
train_df = data.set_index('id')

# Remove the outliers (the too-dark / too-bright images found above)
# and print how many samples there are before and after removal
print('Before removing outliers we had {0} training samples.'.format(train_df.shape[0]))
train_df = train_df.drop(labels=too_dark_idx, axis=0)
train_df = train_df.drop(labels=too_bright_idx, axis=0)
print('After removing outliers we have {0} training samples.'.format(train_df.shape[0]))

train_names = train_df.index.values
train_labels = np.asarray(train_df['label'].values)

# Split; train_test_split returns more than we need, since fastai only needs the validation indexes.
# Note the return order: names_train, names_valid, idx_train, idx_valid
tr_n, val_n, tr_idx, val_idx = train_test_split(train_names, range(len(train_names)), test_size=0.1,
                                                stratify=train_labels, random_state=123)
# Use the fastai library (fastai 1.0)
from fastai import *
from fastai.vision import *
from torchvision.models import *    # import *=all the models from torchvision

# Hyperparameters
arch = densenet169                  # model architecture; densenet169 seems to perform well on this data, but you could experiment
BATCH_SIZE = 128                    # batch size is limited by hardware; too large and it runs out of GPU memory
sz = CROP_SIZE                      # the network input size is the crop size
MODEL_PATH = str(arch).split()[1]   # extract the model name to use as the model file name, e.g. 'densenet169'

# We load the images into an ImageDataBunch for training. This fastai data object is easy to customize
# to load images with our own readCroppedImage function; we only need to subclass ImageList.
# First, create dataframes for the fastai loader.

# training dict (file paths + labels)
train_dict = {'name': train_path + train_names, 'label': train_labels}
df = pd.DataFrame(data=train_dict)
# create the test dataframe; test_names holds the test file paths
test_names = []
for f in os.listdir(test_path):
    test_names.append(test_path + f)
df_test = pd.DataFrame(np.asarray(test_names), columns=['name'])

# Subclass ImageList to use our own image-opening function
class MyImageItemList(ImageList):
    def open(self, fn:PathOrStr)->Image:
        img = readCroppedImage(fn.replace('/./','').replace('//','/'))
        # The ndarray image has to be converted to a tensor before being passed on as a fastai Image; pil2tensor does that
        return vision.Image(px=pil2tensor(img, np.float32))

# Create an ImageDataBunch using the fastai data block API
imgDataBunch = (MyImageItemList.from_df(path='/', df=df, suffix='.tif')
        #Where to find the data? Files listed in the 'name' column of df, with the .tif suffix
        .split_by_idx(val_idx)
        #How to split into train/valid? By the validation indexes from train_test_split
        .label_from_df(cols='label')
        #Where are the labels? In the 'label' column of df
        .add_test(MyImageItemList.from_df(path='/', df=df_test))
        #The dataframe pointing to the test set
        .transform(tfms=[[],[]], size=sz)
        # Our custom augmentations are implemented in the image loader, but transformations could also be applied here.
        # Even though we apply none, tfms must be two (empty) lists: the train and validation augmentations.
        .databunch(bs=BATCH_SIZE)
        # convert to databunch
        .normalize([tensor([0.702447, 0.546243, 0.696453]), tensor([0.238893, 0.282094, 0.216251])])
        # Normalize with the training-set stats: the per-channel means and stds we computed earlier in the statistics step
       )

# Training
# Next, we create a convnet learner object, tying together the model architecture and our databunch.
# ps = dropout percentage (0-1) in the final layer
# create_cnn is the fastai 1.0 API (later versions renamed it cnn_learner)

def getLearner():
    return create_cnn(imgDataBunch, arch, pretrained=True, path='.', metrics=accuracy, ps=0.5, callback_fns=ShowGraph)

# construct the learner
learner = getLearner()


# The 1cycle policy
# We can run lr_find with different weight decays and record all losses so that we can plot them on the same graph.
# The number of iterations defaults to 100, but at such a low count the variance from random sampling
# makes it difficult to compare weight decays; at least 300 iterations gives more consistent results.

lrs = []
losses = []
wds = []  # weight decay values
iter_count = 600

# WEIGHT DECAY = 1e-6
learner.lr_find(wd=1e-6, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-6')
learner = getLearner()  # reset the learner - this gives more consistent starting conditions

# WEIGHT DECAY = 1e-4
learner.lr_find(wd=1e-4, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-4')
learner = getLearner() #reset learner - this gets more consistent starting conditions

# WEIGHT DECAY = 1e-2
learner.lr_find(wd=1e-2, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-2')
learner = getLearner() #reset learner

# Pick values based on the LR-finder curves above
max_lr = 2e-2
wd = 1e-4
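# The original kernel trains the frozen model (head only) with the 1cycle policy before saving stage 1;
# that step seems to have been dropped while stripping the script, so it is restored here as in the kernel (8 epochs)
learner.fit_one_cycle(cyc_len=8, max_lr=max_lr, wd=wd)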

interp = ClassificationInterpretation.from_learner(learner)

learner.save(MODEL_PATH + '_stage1')


# Fine-tune the baseline model (with lower learning rates)
# Next, we can unfreeze all trainable parameters of the model and continue training.
# The model already performs well. The lower layers were pretrained on a large set of generic images to detect
# common shapes and patterns, and most weights are already well adjusted, so we should now train with a much lower learning rate.

# load the baseline model
learner.load(MODEL_PATH + '_stage1')

# unfreeze the trainable parameters and run the learning rate finder again
learner.unfreeze()
learner.lr_find(wd=wd)

# Now, smaller learning rates; this time we give the min and max lr of the cycle
# (slice spreads discriminative learning rates across the fastai layer groups)
learner.fit_one_cycle(cyc_len=12, max_lr=slice(4e-5,4e-4))


interp = ClassificationInterpretation.from_learner(learner)

# save as the stage-2 model
learner.save(MODEL_PATH + '_stage2')



# Validation and analysis
preds, y, loss = learner.get_preds(with_loss=True)
# get accuracy
acc = accuracy(preds, y)
print('The validation accuracy is {0} %.'.format(acc.item() * 100))


# ROC and AUC: remember, AUC is the metric used to score submissions. We can compute it on the
# validation set here, but it will most likely differ from the final score.

from sklearn.metrics import roc_curve, auc
# The kernel treats preds as log-probabilities and exponentiates them; exp is monotonic, so the ROC/AUC is unaffected either way
probs = np.exp(preds[:, 1])
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y, probs, pos_label=1)

# Compute the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)
print('ROC area is {0}'.format(roc_auc))

# This should come out around 0.99


# TTA (test-time augmentation)
# To evaluate the model, we run inference on all test images. Since our loader augments at test
# time too, predicting each image several times and averaging the results may improve the score.

# make sure the best-performing model stage is loaded
learner.load(MODEL_PATH + '_stage2')

# Fastai has a TTA function, but we don't want the extra augmentations it applies on top of ours
# (our image loader already augments), so we just use get_preds repeatedly
#preds_test,y_test=learner.TTA(ds_type=DatasetType.Test)

# We run a fair number of iterations to cover different combinations of flips and rotations,
# then average the predictions.

n_aug = 12
preds_n_avg = np.zeros((len(learner.data.test_ds.items),2))
for n in tqdm(range(n_aug), 'Running TTA...'):
    preds,y = learner.get_preds(ds_type=DatasetType.Test, with_loss=False)
    preds_n_avg += preds.numpy()
preds_n_avg = preds_n_avg / n_aug

# Next, reduce the two class probabilities to just the tumor-class probability
print('Negative and Tumor Probabilities: ' + str(preds_n_avg[0]))
tumor_preds = preds_n_avg[:, 1]
print('Tumor probability: ' + str(tumor_preds[0]))
# If we wanted the predicted class instead, argmax would give the index of the max probability
class_preds = np.argmax(preds_n_avg, axis=1)
classes = ['Negative','Tumor']
print('Class prediction: ' + classes[class_preds[0]])

# Submission: for every test sample, output the probability (0-1) that it contains cancer

# get the test ids from sample_submission.csv and keep their original order
SAMPLE_SUB = '/kaggle/input/sample_submission.csv'  # adjust to your environment
sample_df = pd.read_csv(SAMPLE_SUB)
sample_list = list(sample_df.id)

# List of tumor preds.
# These are in the order of our test dataset and not necessarily in the same order as in sample_submission
pred_list = [p for p in tumor_preds]

# To match ids to predictions, build a dict keyed by the file stem (the id); this keeps the
# lookup independent of how fastai prefixes the stored paths
pred_dic = dict((os.path.splitext(os.path.basename(str(fn)))[0], pred) for (fn, pred) in zip(learner.data.test_ds.items, pred_list))

# Now we can build a new list in the same order as in sample_submission
pred_list_cor = [pred_dic[id] for id in sample_list]

# Next, a Pandas dataframe with id and label columns.
df_sub = pd.DataFrame({'id':sample_list,'label':pred_list_cor})

# Export to csv
df_sub.to_csv('{0}_submission.csv'.format(MODEL_PATH), header=True, index=False)
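The finished csv can then also be submitted from the command line with the Kaggle API (a sketch; the submission message is arbitrary):

kaggle competitions submit -c histopathologic-cancer-detection -f densenet169_submission.csv -m "densenet169 + TTA"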

8. Experimental results
