Text Segmentation in Python

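Text segmentation is an important data-preprocessing step in natural language understanding. The program below splits each article at the punctuation marks ",。?!…" and writes every resulting clause on its own line.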

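import re

pattern = r"([,。?!…])"  # capturing group, so re.split keeps the delimiters in its result
flags = [",", "。", "?", "!", "…"]
sentence_txt = []
with open("./train.txt", "r", encoding="utf-8") as reader_file:
    for line in reader_file:  # each line of the input file is one article
        split_list = re.split(pattern=pattern, string=line)
        segment = ""
        for segment_i in split_list:
            segment += segment_i
            if segment_i in flags:
                # strip spaces, \n, \t and other whitespace from the clause, then end it with "\n"
                sentence_txt.append("".join(segment.split()) + "\n")
                segment = ""
        # keep trailing text that is not closed by one of the punctuation marks
        if segment.strip():
            sentence_txt.append("".join(segment.split()) + "\n")
        sentence_txt.append("\n")  # blank line between articles
with open("./spilt.txt", "w", encoding="utf-8") as writer_file:
    writer_file.writelines(sentence_txt)
    print(len(sentence_txt))  # number of lines written, including the blank separators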
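
The same splitting logic can also be tried on a string in memory, without reading or writing files. The following is a minimal sketch, not part of the original script: the helper name `split_clauses` and the sample sentence are assumptions made for illustration. It relies on the fact that `re.split` with a capturing group returns the delimiters alongside the text pieces, so the result alternates between text and punctuation.

```python
import re

# Same delimiters as the script above; the capturing group keeps them in the split result.
PATTERN = re.compile(r"([,。?!…])")


def split_clauses(text):
    """Return the clauses of `text`, each ending with its delimiter.

    `split_clauses` is a hypothetical helper name used only for this sketch.
    """
    # parts alternates: text, delimiter, text, delimiter, ..., trailing text
    parts = PATTERN.split(text)
    clauses = []
    for i in range(0, len(parts) - 1, 2):
        # join the text piece with its delimiter and drop all whitespace
        clauses.append("".join((parts[i] + parts[i + 1]).split()))
    # trailing text without a closing delimiter is kept if it is not empty
    tail = "".join(parts[-1].split())
    if tail:
        clauses.append(tail)
    return clauses


print(split_clauses("你好,世界。 今天 好吗?好"))
# → ['你好,', '世界。', '今天好吗?', '好']
```

Keeping the logic in a function makes it easy to check edge cases such as an empty string or text that does not end with punctuation before running it over a whole corpus.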