This project is a solution for a competition on recognizing netizen sentiment during the COVID-19 epidemic. It uses PaddleHub with a pretrained Chinese Transformer to classify the sentiment of Weibo posts from the epidemic period (the run below uses chinese-roberta-wwm-ext; an ERNIE module can be loaded the same way, as noted later).
Data Analysis

First, unzip the datasets.
# Unzip the datasets
!cd data/data22724 && unzip -o test_dataset.zip
!cd data/data22724 && unzip -o "train_ dataset.zip"

Archive:  test_dataset.zip
  inflating: nCov_10k_test.csv
Archive:  train_ dataset.zip
  inflating: nCoV_100k_train.labled.csv
  inflating: nCoV_900k_train.unlabled.csv
The files are encoded in GB2312, so we first read them, convert them to UTF-8, and write them back, which makes later processing with pandas easier.
# Convert file encoding to UTF-8
def re_encode(path):
    with open(path, 'r', encoding='GB2312', errors='ignore') as file:
        lines = file.readlines()
    with open(path, 'w', encoding='utf-8') as file:
        file.write(''.join(lines))

re_encode('data/data22724/nCov_10k_test.csv')
re_encode('data/data22724/nCoV_100k_train.labled.csv')
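Note that errors='ignore' silently drops any bytes that are not valid GB2312. A gentler alternative is to decode with GB18030, a superset of GB2312, which usually avoids losing characters; a minimal sketch (the helper name is ours):

# Sketch: re-encode via GB18030, a superset of GB2312,
# so fewer characters need to be dropped.
def re_encode_gb18030(path):
    with open(path, 'r', encoding='GB18030') as file:
        text = file.read()
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)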
Data Preview

Read the data and check its shape and column names.
# Read the data
import pandas as pd

train_labled = pd.read_csv('data/data22724/nCoV_100k_train.labled.csv', engine='python')
test = pd.read_csv('data/data22724/nCov_10k_test.csv', engine='python')

print(train_labled.shape)
print(test.shape)
print(train_labled.columns)

(100000, 7)
(10000, 6)
Index(['微博id', '微博发布时间', '发布人账号', '微博中文内容', '微博图片', '微博视频', '情感倾向'], dtype='object')

train_labled.head(3)

   微博id              微博发布时间        发布人账号       微博中文内容                                              微博图片                                                微博视频  情感倾向
0  4456072029125500  01月01日 23:50  存曦1988      写在年末冬初孩子流感的第五天,我们仍然没有忘记热情拥抱这2020年的第一天。带着一丝迷信,早...  ['https://ww2.sinaimg.cn/orj360/005VnA1zly1gah...  []    0
1  4456074167480980  01月01日 23:58  LunaKrys    开年大模型…累到以为自己发烧了腰疼膝盖疼腿疼胳膊疼脖子疼#Luna的Krystallife#?    []                                                 []    -1
2  4456054253264520  01月01日 22:39  小王爷学辩论o_O   邱晨这就是我爹,爹,发烧快好,毕竟美好的假期拿来养病不太好,假期还是要好好享受快乐,爹,新年...  ['https://ww2.sinaimg.cn/thumb150/006ymYXKgy1g...  []    1
Labels

The labels fall into three classes: 1 (positive), 0 (neutral), and -1 (negative).
# Label distribution
%matplotlib inline
train_labled['情感倾向'].value_counts(normalize=True).plot(kind='bar');
(bar chart: normalized frequency of each sentiment label)
Some rows carry labels outside {-1, 0, 1} (stray values in the 情感倾向 column), so we drop them, leaving the 99,560 texts counted in the length statistics below.

# Remove rows with anomalous labels
train_labled = train_labled[train_labled['情感倾向'].isin(['-1', '0', '1'])]
Text Length

Training texts are at most 241 characters long, with a mean of about 87.
train_labled['微博中文内容'].str.len().describe()

count    99560.000000
mean        87.276416
std         49.355898
min          1.000000
25%         42.000000
50%         86.000000
75%        139.000000
max        241.000000
Name: 微博中文内容, dtype: float64
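These statistics matter for the max_seq_len=128 used later: BERT-style Chinese tokenizers emit roughly one token per character, and 128 tokens include [CLS] and [SEP], so posts beyond about 126 characters get truncated. A quick sketch to estimate how many, under that per-character assumption:

# Sketch: estimate the fraction of training texts that fit within
# max_seq_len=128 (about 126 content characters after [CLS]/[SEP]),
# assuming roughly one token per Chinese character.
lens = train_labled['微博中文内容'].str.len()
print((lens <= 126).mean())  # fraction of posts that are not truncated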
Data Preparation

We split the data into training and validation sets at an 8:2 ratio, then save them as text files with the two columns separated by a tab.
# Split off a validation set; save in the format text[\t]label
from sklearn.model_selection import train_test_split

train_labled = train_labled[['微博中文内容', '情感倾向']]
train, valid = train_test_split(train_labled, test_size=0.2, random_state=2020)
train.to_csv('/home/aistudio/data/data22724/train.txt', index=False, header=False, sep='\t')
valid.to_csv('/home/aistudio/data/data22724/valid.txt', index=False, header=False, sep='\t')
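The split above is purely random. Since the classes are imbalanced (as the bar chart showed), a stratified split that preserves the label proportions in both sets can give a more reliable validation score; a minimal variant:

# Sketch: stratified split keeping the label proportions identical
# in the training and validation sets.
train, valid = train_test_split(
    train_labled, test_size=0.2, random_state=2020,
    stratify=train_labled['情感倾向'])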
Custom Data Loading

To load a custom text dataset, you only need to subclass BaseNLPDataset and set the dataset location and label categories.

Since we have no labeled test set, test_file simply reuses the validation set, "valid.txt".
# Custom dataset
import os
import codecs
import csv

from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class MyDataset(BaseNLPDataset):
    """DemoDataset"""
    def __init__(self):
        # Location of the dataset files
        self.dataset_dir = "/home/aistudio/data/data22724"
        super(MyDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.txt",
            dev_file="valid.txt",
            test_file="valid.txt",
            train_file_with_header=False,
            dev_file_with_header=False,
            test_file_with_header=False,
            # The set of label categories
            label_list=["-1", "0", "1"])

dataset = MyDataset()
for e in dataset.get_train_examples()[:3]:
    print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))

0	【#吃中药后新冠肺炎患者紧张心理得到缓解#生病时,你一般吃中药还是西药?】国家中医药管理局医疗救治专家组组长、中国工程院院士、中国中医科学院院长黄璐琦14日介绍,在新冠肺炎患者救治中,老百姓对中医药有种迫切需求,吃了中医药后紧张心理得到一定程度缓解。生病时,你一般吃中药还是西药?	0
1	又是上班的一天今天依然很爱很爱我的宝贝?	1
2	//@淡梦就是爱吐槽:还有患渐冻症不离一线的张定宇,高龄赶赴现场的钟南山、李兰娟,抗击疫病的医生,
Loading the Model

# Load the pretrained model
import paddlehub as hub

module = hub.Module(name="chinese-roberta-wwm-ext")

[2020-08-30 17:40:14,124] [    INFO] - Installing chinese-roberta-wwm-ext module
Downloading chinese-roberta-wwm-ext
[==================================================] 100.00%
Uncompress /home/aistudio/.paddlehub/tmp/tmpdrngwfdm/chinese-roberta-wwm-ext
[==================================================] 100.00%
[2020-08-30 17:52:25,274] [    INFO] - Successfully installed chinese-roberta-wwm-ext-1.0.0
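Although the introduction mentions ERNIE, this run uses chinese-roberta-wwm-ext. In PaddleHub 1.x the pretrained text modules expose the same interface, so swapping models is a one-line change; assuming the standard module names:

# Sketch: any PaddleHub pretrained text module can be dropped in here,
# e.g. ERNIE instead of RoBERTa; the rest of the pipeline is unchanged.
module = hub.Module(name="ernie")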
Building the Reader

Next we build a text-classification reader. The reader preprocesses the dataset: it tokenizes each text, then packs the results into the format the model expects for training.
# Build the reader
reader = hub.reader.ClassifyReader(
    dataset=dataset,
    vocab_path=module.get_vocab_path(),
    sp_model_path=module.get_spm_path(),
    word_dict_path=module.get_word_dict_path(),
    max_seq_len=128)

[2020-08-30 17:52:25,291] [    INFO] - Dataset label map = {'-1': 0, '0': 1, '1': 2}
Fine-tuning Strategy

# Fine-tuning strategy
strategy = hub.AdamWeightDecayStrategy(
    weight_decay=0.01,
    warmup_proportion=0.1,
    learning_rate=5e-5)
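AdamWeightDecayStrategy combines warmup with linear decay (the "slanted triangle" schedule named in the training log below): the learning rate rises linearly from 0 to learning_rate over the first warmup_proportion of all steps, then decays linearly back toward 0. A hand-written sketch of that schedule, for illustration only:

# Sketch: the slanted-triangle learning-rate schedule implied by
# warmup_proportion=0.1 and learning_rate=5e-5 (illustrative only;
# total_steps=2497 is taken from the training log below).
def scheduled_lr(step, total_steps=2497, warmup_proportion=0.1, peak=5e-5):
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak * step / warmup_steps  # linear warmup
    return peak * (total_steps - step) / (total_steps - warmup_steps)  # linear decay

print(scheduled_lr(100), scheduled_lr(250), scheduled_lr(2400))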
Run Configuration

Set training parameters such as the number of epochs, the batch size, and the checkpoint directory.

Here we train for one epoch (num_epoch=1) and save checkpoints to checkpoint_dir="model_bert"; every 100 steps (eval_interval) the model is scored on the validation set and the best model is saved.
# Run configuration
config = hub.RunConfig(
    use_cuda=True,
    num_epoch=1,
    checkpoint_dir="model_bert",
    batch_size=32,
    eval_interval=100,
    strategy=strategy)

[2020-08-30 17:52:25,337] [    INFO] - Checkpoint dir: model_bert
Building the Fine-tune Task

For a text classification task, we take the model's pooled output and attach a fully connected layer on top to perform the classification.

So we first fetch the module's context, which holds its input and output variables, and take the pooled output from it as the text feature. Hooking a fully connected layer onto that feature produces the Task.

The competition is scored by F1, so we set metrics_choices=["f1"].
# Fine-tune task
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=128)

# Use "pooled_output" for classification tasks on an entire sentence.
pooled_output = outputs["pooled_output"]

feed_list = [
    inputs["input_ids"].name,
    inputs["position_ids"].name,
    inputs["segment_ids"].name,
    inputs["input_mask"].name,
]

cls_task = hub.TextClassifierTask(
    data_reader=reader,
    feature=pooled_output,
    feed_list=feed_list,
    num_classes=dataset.num_labels,
    config=config,
    metrics_choices=["f1"])

[2020-08-30 17:52:29,000] [    INFO] - Load pretraining parameters from /home/aistudio/.paddlehub/modul
Starting Fine-tuning

Calling the finetune_and_eval interface starts training; during fine-tuning, the model is evaluated periodically.
# Fine-tune
run_states = cls_task.finetune_and_eval()

(training log abridged; the dev F1 climbs from 0.874 at step 100 to a best of 0.897 around step 900)

[2020-08-30 17:52:32,906] [    INFO] - Strategy with warmup, linear decay, slanted triangle learning rate, weight decay regularization,
[2020-08-30 17:52:38,881] [    INFO] - PaddleHub model checkpoint not found, start from scratch...
[2020-08-30 17:52:38,957] [    INFO] - PaddleHub finetune start
[2020-08-30 17:52:42,695] [   TRAIN] - step 10 / 2497: loss=1.01098 f1=0.85000 [step/sec: 2.88]
...
[2020-08-30 17:53:06,008] [   TRAIN] - step 100 / 2497: loss=0.78772 f1=0.86647 [step/sec: 3.88]
[2020-08-30 17:53:55,638] [    EVAL] - [dev dataset evaluation result] loss=0.69270 f1=0.87430 [step/sec: 12.82]
[2020-08-30 17:53:55,639] [    EVAL] - best model saved to model_bert/best_model [best f1=0.87430]
...
[2020-08-30 17:57:44,236] [    EVAL] - [dev dataset evaluation result] loss=0.61991 f1=0.89283 [step/sec: 12.96]
[2020-08-30 17:57:44,237] [    EVAL] - best model saved to model_bert/best_model [best f1=0.89283]
...
[2020-08-30 18:04:08,740] [    EVAL] - [dev dataset evaluation result] loss=0.56590 f1=0.89698 [step/sec: 12.53]
[2020-08-30 18:04:08,741] [    EVAL] - best model saved to model_bert/best_model [best f1=0.89698]
...
[2020-08-30 18:07:09,922] [   TRAIN] - step 1200 / 2497: loss=0.53961 f1=0.90303 [step/sec: 3.81]
[2020-08-30 18:07:09,924] [    INFO] - Evaluation on dev dataset start
Prediction

Once fine-tuning is done, calling the predict interface produces predictions.

The input data must be a two-dimensional list, one text per inner list:

[['first text'], ['second text'], [...], ...]
# Predict
import numpy as np

inv_label_map = {val: key for key, val in reader.label_map.items()}

# Data to be predicted
data = test[['微博中文内容']].fillna(' ').values.tolist()

run_states = cls_task.predict(data=data)
results = [run_state.run_results for run_state in run_states]
Generating Results
# Generate the submission file
proba = np.vstack([r[0] for r in results])
prediction = list(np.argmax(proba, axis=1))
prediction = [inv_label_map[p] for p in prediction]

submission = pd.DataFrame()
submission['id'] = test['微博id'].values
# Append a space, presumably to keep the long ids from being re-parsed as numbers
submission['id'] = submission['id'].astype(str) + ' '
submission['y'] = prediction
np.save('proba.npy', proba)
submission.to_csv('result_bert.csv', index=False)
submission.head()
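The class probabilities are also saved to proba.npy, which makes it easy to average this model's scores with those of other models later (the checkpoint name model_bert hints that more than one model may be trained). A minimal ensembling sketch, where proba_ernie.npy is a hypothetical second probability matrix saved the same way:

# Sketch: average class probabilities from two runs (proba_ernie.npy is
# a hypothetical file saved the same way as proba.npy above).
proba_bert = np.load('proba.npy')
proba_ernie = np.load('proba_ernie.npy')
ensemble = (proba_bert + proba_ernie) / 2
ensemble_pred = [inv_label_map[p] for p in np.argmax(ensemble, axis=1)]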
submission['text'] = test[['微博中文内容']].fillna(' ').values
# inv_label_map yields string labels, so the keys here must be strings
# ('消极' = negative, '中性' = neutral, '积极' = positive)
submission['label'] = submission['y'].map({'-1': '消极', '0': '中性', '1': '积极'})
display(submission[['text', 'label']][176:181])
As the samples show, the model identifies netizen sentiment successfully. The final submission scored F1 = 0.7128 on the leaderboard.
Summary

Over this period of summer practice I learned the basics of Python and how to run basic projects on PaddlePaddle. I ran into some difficulties along the way, but gradually solved them by searching for materials online. There is still much room for improvement, so I will keep working hard and practicing.