1. Kaggle competition page
Histopathologic Cancer Detection | Kaggle
2. Overview
This is a binary image classification problem. First, let's look at the structure of the official dataset.
train: the training set. Histopathology images as .tif files, 96 x 96 px each. (Each sample is called a patch.)
test: the test set. Images in the same format as the training set.
train_labels.csv: labels for the training set. The file name of each training sample is its id, and the label is 0 or 1. 0 marks a negative sample (normal, no cancer), 1 a positive sample (cancer present). A sample is positive when the center region of the image (32 x 32 px) contains at least one pixel of tumor tissue; tumor tissue outside the center region does not affect the label. The outer region is provided so that convolutions need no padding, which makes training more convenient.
sample_submission.csv: lists the id (file name) of each test sample; the label column is all 0 for now.
Task: identify metastases in the 96 x 96 px digital histopathology images. That is, for each id (patch) in the test set, predict the probability that the center 32 x 32 px region of the patch contains at least one pixel of tumor tissue.
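To make the labeling rule concrete, here is a minimal sketch of how the center region decides a label. The tumor mask below is hypothetical (the competition provides only the CSV labels, not pixel masks); it is purely to illustrate which pixels matter:

import numpy as np
# Hypothetical binary tumor mask for one 96 x 96 patch (1 = tumor pixel).
mask = np.zeros((96, 96), dtype=np.uint8)
mask[50, 50] = 1  # a single tumor pixel inside the center region
# The label depends only on the central 32 x 32 window, i.e. rows/cols 32..63.
center = mask[32:64, 32:64]
label = int(center.any())  # 1 if at least one tumor pixel lies in the center
print(label)  # -> 1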
3. Dataset background
The dataset used in the Kaggle competition is a subset of the PCam dataset with the duplicated images removed. PCam, i.e. PatchCamelyon, is derived from the Camelyon16 challenge dataset: the original whole-slide images are extremely large, and slicing them into patches (essentially a preprocessing step) produces PatchCamelyon. The three links below introduce the Camelyon datasets and give a feel for the pipeline used to process whole-slide pathology images (WSI).
Introduction to the Camelyon datasets:
https://zhuanlan.zhihu.com/p/50672544
Camelyon16 winning solution:
https://zhuanlan.zhihu.com/p/51247262
Camelyon17 winning solution:
https://zhuanlan.zhihu.com/p/51735826
4. Platform
I originally wanted to pull the Kaggle dataset into Colab through the API, but ran into two problems: (1) the dataset is wiped every time Colab restarts; (2) memory consumption during actual runs was too high to finish. So in the end I ran everything on the lab server.
5. Reference code
Original author's notebook:
A complete ML pipeline (Fast.ai) | Kaggle
Starting from that notebook, I removed the visualization parts and adjusted the paths; as long as you install the right fastai version (see below), it should run through.
6. Environment setup
I installed Anaconda on the server and then installed the various libraries/packages with conda, which was mostly problem-free. As for fastai, installing it with conda kept failing, and even when it succeeded the code would throw errors at runtime. What finally worked was:
pip install fastai==1.0.50.post1
After that, everything ran smoothly.
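To confirm that the right version was picked up, a quick check (fastai 1.x exposes its version string):

import fastai
print(fastai.__version__)  # expected: 1.0.50.post1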
7. The modified bare-bones script (paths need to be adjusted)
import numpy as np
import pandas as pd
import os
import cv2
from sklearn.utils import shuffle
from tqdm.notebook import tqdm
data = pd.read_csv('/kaggle/input/train_labels.csv')
train_path = '/kaggle/input/train/'
test_path = '/kaggle/input/test/'
# shuffle the samples into a random order
shuffled_data = shuffle(data)
# Data augmentation
import random
ORIGINAL_SIZE = 96  # original size of the images - do not change
# AUGMENTATION VARIABLES
CROP_SIZE = 90  # final size after crop; this is also the network input size
RANDOM_ROTATION = 3  # range (0-180), 180 allows all rotation variations, 0=no change
RANDOM_SHIFT = 2  # center crop shift in x and y axes, 0=no change; cannot exceed (ORIGINAL_SIZE - CROP_SIZE) // 2 = 3
RANDOM_BRIGHTNESS = 7  # range (0-100), 0=no change
RANDOM_CONTRAST = 5  # range (0-100), 0=no change
RANDOM_90_DEG_TURN = 1  # 0 or 1 = additionally turn the patch randomly by +/-90 degrees
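# Optional guard (not in the original script): the random shift must keep the
# 90x90 crop window inside the 96x96 image, as the comment above notes.
assert RANDOM_SHIFT <= (ORIGINAL_SIZE - CROP_SIZE) // 2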
def readCroppedImage(path, augmentations=True):
    # The augmentations flag lets us compute statistics from the raw images,
    # where we don't want any augmentation applied
    # OpenCV reads the image in BGR format by default
    bgr_img = cv2.imread(path)
    # We flip it to RGB for visualization purposes
    b, g, r = cv2.split(bgr_img)
    rgb_img = cv2.merge([r, g, b])
    if not augmentations:
        return rgb_img / 255
    # random rotation
    rotation = random.randint(-RANDOM_ROTATION, RANDOM_ROTATION)
    if RANDOM_90_DEG_TURN == 1:
        rotation += random.randint(-1, 1) * 90
    M = cv2.getRotationMatrix2D((48, 48), rotation, 1)  # the image center is the rotation anchor
    rgb_img = cv2.warpAffine(rgb_img, M, (96, 96))
    # random x,y-shift of the crop window
    x = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)
    y = random.randint(-RANDOM_SHIFT, RANDOM_SHIFT)
    # crop to the center and normalize to the 0-1 range
    start_crop = (ORIGINAL_SIZE - CROP_SIZE) // 2  # (96 - 90) // 2 = 3
    end_crop = start_crop + CROP_SIZE  # 3 + 90 = 93
    rgb_img = rgb_img[(start_crop + x):(end_crop + x), (start_crop + y):(end_crop + y)] / 255
    # random horizontal/vertical flip
    flip_hor = bool(random.getrandbits(1))
    flip_ver = bool(random.getrandbits(1))
    if flip_hor:
        rgb_img = rgb_img[:, ::-1]
    if flip_ver:
        rgb_img = rgb_img[::-1, :]
    # random brightness
    br = random.randint(-RANDOM_BRIGHTNESS, RANDOM_BRIGHTNESS) / 100.
    rgb_img = rgb_img + br
    # random contrast
    cr = 1.0 + random.randint(-RANDOM_CONTRAST, RANDOM_CONTRAST) / 100.
    rgb_img = rgb_img * cr
    # clip values to the 0-1 range
    rgb_img = np.clip(rgb_img, 0, 1.0)
    return rgb_img  # the processed image
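# Optional sanity check (not in the original script; assumes the paths above are valid):
# load one patch with and without augmentation and confirm the shapes. Without
# augmentation we keep the full 96x96 image; with augmentation we get the 90x90 crop.
sample_path = os.path.join(train_path, shuffled_data['id'].iloc[0] + '.tif')
print(readCroppedImage(sample_path, augmentations=False).shape)  # (96, 96, 3)
print(readCroppedImage(sample_path).shape)                       # (90, 90, 3)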
# Compute the image statistics (no augmentation here)
# The statistics should give channel means [0.702447, 0.546243, 0.696453]
# and standard deviations [0.238893, 0.282094, 0.216251].
# While counting the statistics, we can also check for completely black or white images
dark_th = 10 / 255  # if no pixel reaches this threshold, the image is considered too dark
bright_th = 245 / 255  # if no pixel is under this threshold, the image is considered too bright
too_dark_idx = []
too_bright_idx = []
x_tot = np.zeros(3)
x2_tot = np.zeros(3)
counted_ones = 0
for i, idx in tqdm(enumerate(shuffled_data['id']), 'computing statistics...(220025 it total)'):
    path = os.path.join(train_path, idx)
    imagearray = readCroppedImage(path + '.tif', augmentations=False).reshape(-1, 3)
    # is this image too dark?
    if imagearray.max() < dark_th:
        too_dark_idx.append(idx)
        continue  # do not include in the statistics
    # is this image too bright?
    if imagearray.min() > bright_th:
        too_bright_idx.append(idx)
        continue  # do not include in the statistics
    x_tot += imagearray.mean(axis=0)
    x2_tot += (imagearray ** 2).mean(axis=0)
    counted_ones += 1
# per-channel mean and standard deviation
channel_avr = x_tot / counted_ones
channel_std = np.sqrt(x2_tot / counted_ones - channel_avr ** 2)
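# The std above uses the identity Var[x] = E[x^2] - (E[x])^2, accumulated per channel
# across images. A quick self-contained check of that identity (optional, illustrative):
v = np.random.rand(1000)
assert np.isclose(np.sqrt((v ** 2).mean() - v.mean() ** 2), v.std())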
# The result should be 1 image that is too dark and 6 that are too bright
print('There was {0} extremely dark image'.format(len(too_dark_idx)))
print('and {0} extremely bright images'.format(len(too_bright_idx)))
print('Dark one:')
print(too_dark_idx)
print('Bright ones:')
print(too_bright_idx)
# Split the dataset
# Split the training data into a 90% training part and a 10% validation part. We want to
# keep the same negative/positive ratio (roughly 60/40) in both parts.
from sklearn.model_selection import train_test_split
# We read the csv file into a pandas dataframe earlier; now we set the index to id
# so that we can drop rows by file name
train_df = data.set_index('id')
# Remove the outliers (the too-dark / too-bright images found above);
# print how many samples we have before and after
print('Before removing outliers we had {0} training samples.'.format(train_df.shape[0]))
train_df = train_df.drop(labels=too_dark_idx, axis=0)
train_df = train_df.drop(labels=too_bright_idx, axis=0)
print('After removing outliers we have {0} training samples.'.format(train_df.shape[0]))
train_names = train_df.index.values
train_labels = np.asarray(train_df['label'].values)
# Split; train_test_split returns more than we need, as we only need the validation
# indexes for fastai. Note the return order: (names_train, names_valid, idx_train, idx_valid).
tr_n, val_n, tr_idx, val_idx = train_test_split(train_names, range(len(train_names)), test_size=0.1,
                                                stratify=train_labels, random_state=123)
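# Optional check (illustrative, not in the original script): stratification should keep
# the ~60/40 negative/positive ratio in both splits; val_idx indexes into train_labels.
print('overall positive rate: {:.3f}'.format(train_labels.mean()))
print('validation positive rate: {:.3f}'.format(train_labels[val_idx].mean()))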
# Use the fastai library (fastai 1.0)
from fastai import *
from fastai.vision import *
from torchvision.models import *  # import all the models from torchvision
# 设置超参数
arch = densenet169 # specify model architecture, densenet169 seems to perform well for this data but you could experiment(实验)
BATCH_SIZE = 128 # specify batch size, hardware restrics (硬件会限制这个参数)this one. Large batch sizes may run out of GPU memory(过大会占据完GPU内存)
sz = CROP_SIZE # input size is the crop size (输入大小(输入进网络的大小)
MODEL_PATH = str(arch).split()[1] # this will extract the model name as the model file name e.g. 'resnet50' 提取模型名称作为模型文件的名称
# We load the images into an ImageDataBunch for training. This fastai data object is easy
# to customize to load images with our own readCroppedImage function: we only need to
# subclass ImageList.
# Create the training dataframe (file name + label) for the fastai loader
train_dict = {'name': train_path + train_names, 'label': train_labels}
df = pd.DataFrame(data=train_dict)
# Create the test dataframe; test_names holds the full paths of the test files
test_names = []
for f in os.listdir(test_path):
    test_names.append(test_path + f)
df_test = pd.DataFrame(np.asarray(test_names), columns=['name'])
# Subclass ImageList so that fastai uses our own image-opening function
class MyImageItemList(ImageList):
    def open(self, fn: PathOrStr) -> Image:
        img = readCroppedImage(fn.replace('/./', '').replace('//', '/'))
        # The ndarray image has to be converted to a tensor before being passed on as a
        # fastai Image; pil2tensor does the conversion
        return vision.Image(px=pil2tensor(img, np.float32))
# Create the ImageDataBunch using the fastai data block API
imgDataBunch = (MyImageItemList.from_df(path='/', df=df, suffix='.tif')
                # Where to find the data? The .tif files listed in df, relative to path
                .split_by_idx(val_idx)
                # How to split into train/valid? By the validation indexes from train_test_split
                .label_from_df(cols='label')
                # Where are the labels? In the 'label' column of df
                .add_test(MyImageItemList.from_df(path='/', df=df_test))
                # The dataframe pointing to the test set
                .transform(tfms=[[], []], size=sz)
                # Our custom augmentations live in the image loader, but transformations could
                # also be applied here; we still pass two empty lists (train and validation tfms)
                .databunch(bs=BATCH_SIZE)
                # Convert to a DataBunch
                .normalize([tensor([0.702447, 0.546243, 0.696453]), tensor([0.238893, 0.282094, 0.216251])])
                # Normalize with the training-set statistics: the per-channel means and stds
                # we calculated in the statistics step
                )
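# Optional check (illustrative, not in the original script): pull one batch to verify the
# whole loading pipeline; x should be (BATCH_SIZE, 3, 90, 90) and y (BATCH_SIZE,).
x_batch, y_batch = imgDataBunch.one_batch()
print(x_batch.shape, y_batch.shape)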
# Training
# Next, we create a convnet learner object, passing in the model architecture and our
# DataBunch; ps is the dropout probability (0-1) in the final layer.
# fastai's create_cnn builds the learner
def getLearner():
    return create_cnn(imgDataBunch, arch, pretrained=True, path='.', metrics=accuracy, ps=0.5, callback_fns=ShowGraph)

# construct a learner
learner = getLearner()
# The 1cycle policy
# We can run lr_find with several weight decays and record all the losses so that we can
# plot them on the same graph. The number of iterations defaults to 100, but at such a low
# count there may be too much variance from random sampling to compare weight decays
# reliably; an iteration count of at least 300 gives more consistent results.
lrs = []
losses = []
wds = []  # weight decay values tried
iter_count = 600
# WEIGHT DECAY = 1e-6
learner.lr_find(wd=1e-6, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-6')
learner = getLearner()  # reset the learner - this gives more consistent starting conditions
# WEIGHT DECAY = 1e-4
learner.lr_find(wd=1e-4, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-4')
learner = getLearner() #reset learner - this gets more consistent starting conditions
# WEIGHT DECAY = 1e-2
learner.lr_find(wd=1e-2, num_it=iter_count)
lrs.append(learner.recorder.lrs)
losses.append(learner.recorder.losses)
wds.append('1e-2')
learner = getLearner() #reset learner
# Chosen from the lr_find curves
max_lr = 2e-2
wd = 1e-4
# Train the head with the 1cycle policy (the pretrained backbone is still frozen)
learner.fit_one_cycle(cyc_len=8, max_lr=max_lr, wd=wd)
interp = ClassificationInterpretation.from_learner(learner)
learner.save(MODEL_PATH + '_stage1')
# Fine-tune the baseline model (with lower learning rates)
# Next, we can unfreeze all trainable parameters of the model and continue training.
# The model already performs well. When we unfreeze the lower layers, which were pretrained
# on a large set of general images to detect common shapes and patterns, most of the weights
# are already well adjusted, so we should now train with much lower learning rates.
# load the baseline model
learner.load(MODEL_PATH + '_stage1')
# unfreeze the trainable parameters and run the learning rate finder again
learner.unfreeze()
learner.lr_find(wd=wd)
# Now, smaller learning rates. This time we define the min and max lr of the cycle:
# slice(4e-5, 4e-4) applies discriminative learning rates, from 4e-5 for the earliest
# layers up to 4e-4 for the head
learner.fit_one_cycle(cyc_len=12, max_lr=slice(4e-5,4e-4))
interp = ClassificationInterpretation.from_learner(learner)
# save as the stage-2 model
learner.save(MODEL_PATH + '_stage2')
# Validation and analysis
preds, y, loss = learner.get_preds(with_loss=True)
# get the validation accuracy
acc = accuracy(preds, y)
print('The accuracy is {0} %.'.format(acc * 100))
# ROC and AUC: remember that AUC is the metric used to score submissions. We can compute
# it on the validation set here, but it will likely differ from the final score.
from sklearn.metrics import roc_curve, auc
# probabilities of the positive class from the log-predictions (exponentiate)
probs = np.exp(preds[:, 1])
# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y, probs, pos_label=1)
# Compute the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)
print('ROC area is {0}'.format(roc_auc))
# this should come out around 0.99
# TTA (test-time augmentation)
# To evaluate the model, we run inference on all the test images. Because our loader
# augments at prediction time too, predicting each image several times and averaging the
# results may improve the score.
# make sure the best performing model stage is loaded
learner.load(MODEL_PATH + '_stage2')
# Fastai has a TTA function, but it applies additional augmentations of its own (our image
# loader already augments), so we just call get_preds repeatedly:
# preds_test, y_test = learner.TTA(ds_type=DatasetType.Test)
# We run a fair number of iterations to cover different combinations of flips and
# rotations, then average the predictions.
n_aug = 12
preds_n_avg = np.zeros((len(learner.data.test_ds.items), 2))
for n in tqdm(range(n_aug), 'Running TTA...'):
    preds, y = learner.get_preds(ds_type=DatasetType.Test, with_loss=False)
    preds_n_avg = np.sum([preds_n_avg, preds.numpy()], axis=0)
preds_n_avg = preds_n_avg / n_aug
# Next, reduce the class probabilities to just the tumor-class probability
print('Negative and Tumor Probabilities: ' + str(preds_n_avg[0]))
tumor_preds = preds_n_avg[:, 1]
print('Tumor probability: ' + str(tumor_preds[0]))
# If we wanted the predicted class instead, argmax gives the index of the max probability
class_preds = np.argmax(preds_n_avg, axis=1)
classes = ['Negative','Tumor']
print('Class prediction: ' + classes[class_preds[0]])
# Build the submission: for each test sample, output a 0-1 probability of cancer.
# Get the test ids from sample_submission.csv and keep their original order.
SAMPLE_SUB = '/kaggle/input/sample_submission.csv'
sample_df = pd.read_csv(SAMPLE_SUB)
sample_list = list(sample_df.id)
# List of tumor preds.
# These are in the order of our test dataset and not necessarily in the same order as in sample_submission
pred_list = [p for p in tumor_preds]
# To recover the ids, we build a dict of path: prediction
pred_dic = dict((key, value) for (key, value) in zip(learner.data.test_ds.items, pred_list))
# Now we can create a new list in the same order as sample_submission; the key prefix
# must match the test_path used to build test_names above
pred_list_cor = [pred_dic[test_path + id + '.tif'] for id in sample_list]
# Next, a Pandas dataframe with id and label columns.
df_sub = pd.DataFrame({'id':sample_list,'label':pred_list_cor})
# Export to csv
df_sub.to_csv('{0}_submission.csv'.format(MODEL_PATH), header=True, index=False)
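# Optional check (illustrative, not in the original script): the submission must contain
# the same ids, in the same order, as sample_submission.csv.
assert list(df_sub.id) == sample_list
print(df_sub.head())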
8. Experimental results