Baidu PaddlePaddle Deep Learning Fundamentals: Completing the Week 2 Practical Assignment
1. Assignment Requirements
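This code uses ResNet for eye-disease screening. It is already complete and can be run as is.
Task requirements:
- Consult the API documentation and use a decaying learning rate. Tune it over several runs to find the best decay step count, so that the loss falls faster than with the original code.
- Plot the loss curves yourself, before and after changing the learning rate.
Notes:
- Only the learning-rate part of the original code needs to change.
- If the loss does not fall noticeably, you may raise epoch_num to 10.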
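2. Overview:
The program was modified as follows, and each change is marked with a matching comment in the code:
1) Changed epoch_num to 10 so that training runs for more epochs.
2) Defined a learning-rate variable with the Paddle API so that the learning rate decays automatically.
3) Defined the lists losses_train and iters so that the loss can be plotted.
4) Defined a function that plots the loss curve, so the plot takes a single call.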
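Item 2 relies on `fluid.dygraph.PolynomialDecay`. The sketch below reproduces that schedule in plain Python so you can see how quickly the rate reaches its floor. It assumes the non-cyclic polynomial-decay formula from the Paddle 1.x documentation with the default `power=1.0`. `polynomial_decay` is a hypothetical helper and is not part of Paddle.

```python
def polynomial_decay(step, learning_rate=0.005, decay_steps=102.5,
                     end_learning_rate=0.001, power=1.0):
    """Hypothetical helper: learning rate after `step` optimizer updates,
    following the non-cyclic polynomial-decay formula (assumed to match
    fluid.dygraph.PolynomialDecay with cycle=False)."""
    # After decay_steps updates the rate stays at end_learning_rate.
    step = min(step, decay_steps)
    remaining = 1.0 - step / decay_steps
    return (learning_rate - end_learning_rate) * remaining ** power + end_learning_rate


if __name__ == "__main__":
    steps_per_epoch = 400 // 10 + 1  # 41, as in the training code
    for multiplier in (1, 2.5, 10):
        decay_steps = steps_per_epoch * multiplier
        lrs = [polynomial_decay(s, decay_steps=decay_steps) for s in (0, 40, 80, 120)]
        print(multiplier, [round(lr, 5) for lr in lrs])
```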
2.1 Added code:
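```python
# 4. Plot how the loss changes
# [Reference: Textbook 2, Project 8: Visual analysis]
# 4.1. Import the plotting library
import matplotlib.pyplot as plt
# Render figures inline in Jupyter (IPython magic; remove it when running a plain .py script)
%matplotlib inline

# 4.2. Define a function that plots the loss curve
def plot_change_loss(iters, losses_train):
    """Plot how the training loss changes during training.

    @param iters: x-axis values (iteration counts)
    @param losses_train: training losses
    """
    plt.figure()
    plt.title("train loss", fontsize=24)
    plt.xlabel("iter", fontsize=14)
    plt.ylabel("loss", fontsize=14)
    plt.plot(iters, losses_train, color='red', label='train loss')
    plt.legend()  # display the 'train loss' label set above
    plt.grid()
    plt.show()
```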
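The recorded losses are noisy. In test 1, for example, the loss jumps to 1.79 in epoch 2 and to 1.11 in epoch 4, which makes before/after curves hard to compare by eye. One optional addition, not part of the original code, is to smooth the values with a trailing moving average before calling `plot_change_loss`. The names `moving_average` and `window` are illustrative.

```python
def moving_average(values, window=5):
    """Trailing moving average; the first window-1 points average over fewer samples."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that has just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed
```

In the training loop each entry of `losses_train` is a one-element NumPy array, so convert the entries first. For example: `plot_change_loss(iters, moving_average([l.item() for l in losses_train]))`.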
2.2 The training procedure was changed as follows:
```python
import numpy as np
import paddle.fluid as fluid

# Define the training procedure.
# data_loader, valid_data_loader, DATADIR, DATADIR2 and CSVFILE come from the original course notebook.
def train(model):
    with fluid.dygraph.guard():
        print('start training ... ')
        model.train()
        epoch_num = 10  # 1. changed epoch_num from 1 to 10
        # 2.1. Define the learning-rate schedule (handed to the optimizer below)
        #      [Reference: Lesson 9, Project 9. Model loading and resuming training --> visual analysis]
        # Eye-disease data: fundus images of 1,200 subjects, 400 each for training, validation and test
        BATCH_SIZE = 10
        # total_steps = (int(400//BATCH_SIZE) + 1) * epoch_num
        total_steps = (int(400//BATCH_SIZE) + 1) * 2.5  # 41 * 2.5 = 102.5 updates
        lr = fluid.dygraph.PolynomialDecay(0.005, total_steps, 0.001)
        # 3.1. Iteration counter plus iteration and loss lists for plotting
        iter_count = 0
        iters = []
        losses_train = []  # training losses
        # Define the optimizer
        # opt = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9, parameter_list=model.parameters())
        # 2.2. Replace the fixed learning rate with the decaying one
        opt = fluid.optimizer.Momentum(learning_rate=lr, momentum=0.9, parameter_list=model.parameters())
        # Data readers for training and validation
        train_loader = data_loader(DATADIR, batch_size=BATCH_SIZE, mode='train')
        valid_loader = valid_data_loader(DATADIR2, CSVFILE)
        for epoch in range(epoch_num):
            for batch_id, data in enumerate(train_loader()):
                x_data, y_data = data
                img = fluid.dygraph.to_variable(x_data)
                label = fluid.dygraph.to_variable(y_data)
                # Forward pass: compute the predictions (logits)
                logits = model(img)
                # Compute the loss
                loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
                avg_loss = fluid.layers.mean(loss)
                if batch_id % 10 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))
                    # 3.2. Record the iteration count and the loss (one point every 10 batches)
                    iters.append(iter_count)
                    losses_train.append(avg_loss.numpy())
                    iter_count += 10
                # Backpropagate, update the weights, clear the gradients
                avg_loss.backward()
                opt.minimize(avg_loss)
                model.clear_gradients()
            model.eval()
            accuracies = []
            losses = []
            for batch_id, data in enumerate(valid_loader()):
                x_data, y_data = data
                img = fluid.dygraph.to_variable(x_data)
                label = fluid.dygraph.to_variable(y_data)
                # Forward pass: compute the predictions (logits)
                logits = model(img)
                # Binary classification: the sigmoid output is split into two classes at 0.5
                # Compute the sigmoid probability and the loss
                pred = fluid.layers.sigmoid(logits)
                loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, label)
                # Probability of the other class (1 - p)
                pred2 = pred * (-1.0) + 1.0
                # Stack the two class probabilities along axis 1
                pred = fluid.layers.concat([pred2, pred], axis=1)
                acc = fluid.layers.accuracy(pred, fluid.layers.cast(label, dtype='int64'))
                accuracies.append(acc.numpy())
                losses.append(loss.numpy())
            print("[validation] accuracy/loss: {}/{}".format(np.mean(accuracies), np.mean(losses)))
            model.train()
        # save params of model
        fluid.save_dygraph(model.state_dict(), 'palm')
        # save optimizer state
        fluid.save_dygraph(opt.state_dict(), 'palm')
        # 4.3. Plot the loss curve
        plot_change_loss(iters, losses_train)
```
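The validation step turns a single sigmoid probability p into the two-column tensor [1 - p, p]. This lets `fluid.layers.accuracy`, which compares the top-scoring column with the label, be used. The plain-Python sketch below, with hypothetical helper names, shows that this is the same as predicting class 1 whenever p > 0.5, that is, whenever the logit is positive. Ties at p == 0.5 go to class 0 here; Paddle's own tie-breaking is not assumed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def two_class_accuracy(logits, labels):
    """Mirror the validation step: stack [1 - p, p] per sample, pick the
    larger column (top-1), and compare it with the 0/1 label."""
    correct = 0
    for z, y in zip(logits, labels):
        p = sigmoid(z)           # probability of class 1
        pair = [1.0 - p, p]      # same layout as concat([pred2, pred], axis=1)
        predicted = 0 if pair[0] >= pair[1] else 1
        correct += int(predicted == y)
    return correct / len(labels)


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.3, -0.2]
    labels = [1, 0, 0, 0]
    print(two_class_accuracy(logits, labels))  # prints 0.75
```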
3. Summary
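1) Every test run is very slow, and even a small change means a long wait. It made clear how much computing power matters.
2) Use the Paddle API documentation together with earlier course material. The instructor already introduced many related topics and design ideas in earlier lessons, or demonstrated them in those lessons' code. If the API examples alone do not show how to write the program, the instructor's similar code from earlier lessons is a good basis for your own.
3) Batches are read in random order, so repeated runs with the same parameters give different results.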
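4. Notes on the results of the test cases in the next section:
After finishing the modifications, I made the following test runs. Each run took about 45 minutes. For some runs the detailed process was recorded; for others only the final loss plot was kept.
1) Test 1 ran the program with the fixed learning rate 0.001 (the instructor's original setting). Validation accuracy reached 93%.
2) Tests 2-6 varied the start and end values of the decaying learning rate. The best validation accuracy reached 95-96%.
3) From test 7 on, the number of decay steps was varied. The idea was to try a large step count and a small one first, compare their effects, and then keep bisecting toward the middle. The step count of test 8 gave the fastest loss decrease, so the chosen setting is total_steps = (int(400//BATCH_SIZE) + 1) * 2.5, with the learning rate decaying from 0.005 to 0.001. The best validation accuracy reached 95-96%.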
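The search over decay steps in item 3 can be sketched as a simple interval-halving loop. This is a hypothetical illustration, not code from the original runs. `evaluate` stands in for a full training run of about 45 minutes that returns a score where lower is better, such as the final training loss. The searched value is the multiplier on steps per epoch. The loop assumes the score is roughly unimodal in that multiplier; because of the run-to-run randomness mentioned in the summary, a real score would also be noisy.

```python
def halving_search(evaluate, low, high, rounds=6):
    """Hypothetical sketch of the manual search: score both ends, then
    repeatedly move the worse end to the midpoint. Lower scores are better."""
    scores = {low: evaluate(low), high: evaluate(high)}
    for _ in range(rounds):
        mid = (low + high) / 2
        scores[mid] = evaluate(mid)
        if scores[low] <= scores[high]:
            high = mid  # the low end scored better: keep the half next to it
        else:
            low = mid   # the high end scored better: keep the other half
    best = min(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy stand-in for a training run, assumed to be best at multiplier 2.5
    best, score = halving_search(lambda m: (m - 2.5) ** 2, low=1, high=10)
    print(best, score)
```

Each round costs one training run. Six rounds plus the two endpoints is eight runs, roughly six hours at about 45 minutes each.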
5. Test Runs
5.1 Test 1. Original learning rate, 10 epochs:
Run time: 44 min 44 s 478 ms
Finished: 2020-08-22 08:54:13
start training ...
epoch: 0, batch_id: 0, loss is: [0.61399436]
epoch: 0, batch_id: 10, loss is: [0.5397954]
epoch: 0, batch_id: 20, loss is: [0.6404487]
epoch: 0, batch_id: 30, loss is: [0.6857702]
[validation] accuracy/loss: 0.7849999666213989/0.48529601097106934
epoch: 1, batch_id: 0, loss is: [0.6386715]
epoch: 1, batch_id: 10, loss is: [0.4865201]
epoch: 1, batch_id: 20, loss is: [0.50081044]
epoch: 1, batch_id: 30, loss is: [0.34162915]
[validation] accuracy/loss: 0.7024999856948853/0.6229860186576843
epoch: 2, batch_id: 0, loss is: [0.25662675]
epoch: 2, batch_id: 10, loss is: [1.7949547]
epoch: 2, batch_id: 20, loss is: [0.19568667]
epoch: 2, batch_id: 30, loss is: [0.19662617]
[validation] accuracy/loss: 0.9149999618530273/0.24236293137073517
epoch: 3, batch_id: 0, loss is: [0.84582233]
epoch: 3, batch_id: 10, loss is: [0.12374055]
epoch: 3, batch_id: 20, loss is: [0.39764705]
epoch: 3, batch_id: 30, loss is: [0.2201365]
[validation] accuracy/loss: 0.862500011920929/0.3080523610115051
epoch: 4, batch_id: 0, loss is: [0.11742544]
epoch: 4, batch_id: 10, loss is: [0.33280876]
epoch: 4, batch_id: 20, loss is: [0.13732623]
epoch: 4, batch_id: 30, loss is: [1.1103892]
[validation] accuracy/loss: 0.8575000762939453/0.3615248501300812
epoch: 5, batch_id: 0, loss is: [0.13193114]
epoch: 5, batch_id: 10, loss is: [0.5138872]
epoch: 5, batch_id: 20, loss is: [0.3979571]
epoch: 5, batch_id: 30, loss is: [0.42524424]
[validation] accuracy/loss: 0.9350000619888306/0.18516869843006134
epoch: 6, batch_id: 0, loss is: [0.21446273]
epoch: 6, batch_id: 10, loss is: [0.29208523]
epoch: 6, batch_id: 20, loss is: [0.71075696]
epoch: 6, batch_id: 30, loss is: [0.16396093]
[validation] accuracy/loss: 0.9299999475479126/0.2387014776468277
epoch: 7, batch_id: 0, loss is: [0.31918693]
epoch: 7, batch_id: 10, loss is: [0.05028909]
epoch: 7, batch_id: 20, loss is: [0.16989382]
epoch: 7, batch_id: 30, loss is: [0.13365436]
[validation] accuracy/loss: 0.9350000619888306/0.2305712103843689
epoch: 8, batch_id: 0, loss is: [0.13404241]
epoch: 8, batch_id: 10, loss is: [0.14186636]
epoch: 8, batch_id: 20, loss is: [0.03710499]
epoch: 8, batch_id: 30, loss is: [0.12268938]
[validation] accuracy/loss: 0.9325000047683716/0.19645212590694427
epoch: 9, batch_id: 0, loss is: [0.23441052]
epoch: 9, batch_id: 10, loss is: [0.07912876]
epoch: 9, batch_id: 20, loss is: [0.29366916]
epoch: 9, batch_id: 30, loss is: [0.34971437]
[validation] accuracy/loss: 0.9325000643730164/0.19805531203746796
5.2 Test 2. Learning rate decaying 0.01 --> 0.001, 2 epochs:
Run time: 8 min 58 s 267 ms
Finished: 2020-08-21 23:48:37
start training ...
epoch: 0, batch_id: 0, loss is: [0.84265745]
epoch: 0, batch_id: 10, loss is: [1.1629268]
epoch: 0, batch_id: 20, loss is: [0.529528]
epoch: 0, batch_id: 30, loss is: [0.69051445]
[validation] accuracy/loss: 0.7024999856948853/2.064648389816284
epoch: 1, batch_id: 0, loss is: [0.16901067]
epoch: 1, batch_id: 10, loss is: [0.32736033]
epoch: 1, batch_id: 20, loss is: [0.3370896]
epoch: 1, batch_id: 30, loss is: [0.01843431]
[validation] accuracy/loss: 0.8575000762939453/0.371820330619812
5.3 Test 3. Learning rate decaying 0.01 --> 0.001, 10 epochs:
Run time: 44 min 30 s 125 ms
Finished: 2020-08-22 00:35:34
start training ...
epoch: 0, batch_id: 0, loss is: [0.7350406]
epoch: 0, batch_id: 10, loss is: [1.5526597]
epoch: 0, batch_id: 20, loss is: [1.4809185]
epoch: 0, batch_id: 30, loss is: [1.2971458]
[validation] accuracy/loss: 0.4749999940395355/3.1085333824157715
epoch: 1, batch_id: 0, loss is: [1.6478646]
epoch: 1, batch_id: 10, loss is: [0.87493515]
epoch: 1, batch_id: 20, loss is: [0.433599]
epoch: 1, batch_id: 30, loss is: [0.1780614]
[validation] accuracy/loss: 0.7949999570846558/0.5167998671531677
epoch: 2, batch_id: 0, loss is: [0.08465642]
epoch: 2, batch_id: 10, loss is: [0.7266342]
epoch: 2, batch_id: 20, loss is: [0.05047063]
epoch: 2, batch_id: 30, loss is: [0.09345372]
[validation] accuracy/loss: 0.887499988079071/0.6409258842468262
epoch: 3, batch_id: 0, loss is: [0.27393234]
epoch: 3, batch_id: 10, loss is: [2.053777]
epoch: 3, batch_id: 20, loss is: [0.07622384]
epoch: 3, batch_id: 30, loss is: [0.28289387]
[validation] accuracy/loss: 0.9050000309944153/0.33014366030693054
epoch: 4, batch_id: 0, loss is: [0.8107643]
epoch: 4, batch_id: 10, loss is: [4.6561575]
epoch: 4, batch_id: 20, loss is: [0.4564219]
epoch: 4, batch_id: 30, loss is: [0.35920316]
[validation] accuracy/loss: 0.9149999618530273/0.20150217413902283
epoch: 5, batch_id: 0, loss is: [0.5040611]
epoch: 5, batch_id: 10, loss is: [0.24063113]
epoch: 5, batch_id: 20, loss is: [0.5784434]
epoch: 5, batch_id: 30, loss is: [0.11537111]
[validation] accuracy/loss: 0.8899999856948853/0.3055678606033325
epoch: 6, batch_id: 0, loss is: [0.32148445]
epoch: 6, batch_id: 10, loss is: [0.10687976]
epoch: 6, batch_id: 20, loss is: [0.01122489]
epoch: 6, batch_id: 30, loss is: [0.69238436]
[validation] accuracy/loss: 0.9424999952316284/0.2373284101486206
epoch: 7, batch_id: 0, loss is: [0.22534744]
epoch: 7, batch_id: 10, loss is: [0.21905744]
epoch: 7, batch_id: 20, loss is: [0.24537715]
epoch: 7, batch_id: 30, loss is: [0.02578714]
[validation] accuracy/loss: 0.9624999761581421/0.18944025039672852
epoch: 8, batch_id: 0, loss is: [0.06122933]
epoch: 8, batch_id: 10, loss is: [1.0541042]
epoch: 8, batch_id: 20, loss is: [0.66766155]
epoch: 8, batch_id: 30, loss is: [0.03150726]
[validation] accuracy/loss: 0.9475000500679016/0.16870255768299103
epoch: 9, batch_id: 0, loss is: [0.03649266]
epoch: 9, batch_id: 10, loss is: [0.10476176]
epoch: 9, batch_id: 20, loss is: [0.10077281]
epoch: 9, batch_id: 30, loss is: [0.16713764]
[validation] accuracy/loss: 0.9524999856948853/0.1519453078508377
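5.3.1 Test 3.1. Learning rate decaying 0.05 --> 0.001, 10 epochs:
5.4 Test 4. Learning rate decaying 0.1 --> 0.001, 10 epochs:
5.5 Test 5. Learning rate decaying 0.005 --> 0.001, 10 epochs:
5.6 Test 6. Learning rate decaying 0.005 --> 0.0005, 10 epochs:
However, this run was very unstable, with the results swinging up and down.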
In all of the runs above, the number of decay steps was
total_steps = (int(400//BATCH_SIZE) + 1) * epoch_num
with BATCH_SIZE = 10 and epoch_num = 10.
The following tests change the number of decay steps.
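For reference, here is the arithmetic behind these settings. It assumes 400 training images, and therefore 40 batches (optimizer updates) per epoch, as implied by the dataset comment in the training code and by the logged batch_id values. It also assumes polynomial decay with the default power of 1:

$$
\text{steps per epoch} = \left\lfloor \tfrac{400}{10} \right\rfloor + 1 = 41, \qquad
\text{total\_steps} = 41 \times 10 = 410 \ (\text{tests 2-6}), \qquad
41 \times 2.5 = 102.5 \ (\text{test 8})
$$

Ten epochs give 400 updates. For a 0.005 --> 0.001 schedule with the original setting, the rate at the end of training is therefore still above the floor:

$$
\eta_{400} = (0.005 - 0.001)\left(1 - \tfrac{400}{410}\right) + 0.001 \approx 0.0011
$$

With total_steps = 102.5, the rate reaches 0.001 after about 103 updates, early in the third epoch.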
5.7 Test 7. Learning rate decaying 0.005 --> 0.001, total_steps = (int(400//BATCH_SIZE) + 1), 10 epochs:
5.8 Test 8. Learning rate decaying 0.005 --> 0.001, total_steps = (int(400//BATCH_SIZE) + 1) * 2.5, 10 epochs:
This setting worked comparatively well.
Results of a second run with the same setting:
5.9 Test 9. Learning rate decaying 0.005 --> 0.001, total_steps = (int(400//BATCH_SIZE) + 1) * 5, 10 epochs:
5.10 Test 10. Learning rate decaying 0.005 --> 0.001, total_steps = (int(400//BATCH_SIZE) + 1) * 3.5, 10 epochs:
Validation accuracy reached 97%.