Google Translate: hands-on with an open-source google translation API from GitHub

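So far, the only Google translation API I have found that holds up at scale (tens of thousands of sentences or more) is an open-source project on GitHub: https://github.com/Saravananslb/py-googletranslation
I have personally used it to translate a corpus of 40,000+ entries, and it works. Along the way I ran into a few small problems that the source code does not handle, so I made some changes and published the new code at: https://github.com/lu-tomato/py-googletranslation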

Below are my notes on how I used it and the problems I ran into:

A Prerequisites

* Network access to Google Translate (a VPN, if it is not directly reachable)

B Usage

Download the open-source project from the link above and install it locally:

pip install pygoogletranslation

After that it is ready to use!

How I used it:
Create a file translate.py in the project root and first run a simple test to check that translation works:

from pygoogletranslation import Translator
def trans_single():
    text="a trench is a type of excavation or depression in the ground that is generally deeper than it is wide ( as opposed to a wider gully, or ditch ), and narrow compared with its length ( as opposed to a simple hole ). in geology, trenches are created as a result of erosion by rivers or by geological movement of tectonic plates. in the civil engineering field, trenches are often created to install underground infrastructure or utilities ( such as gas mains, water mains or telephone lines ), or later to access these installations. trenches have also often been dug for military defensive purposes. in archaeology, the trench method is used for searching and excavating ancient ruins or to dig into strata of sedimented material."
    lang="zh-cn"

    translator = Translator(retry_messgae=True)
    t = translator.translate(text, dest=lang)
    translation = t.text
    print(text, translation)
# trans_single()

C Problems found during testing

1) When translating into French, two results can be returned, which makes the call fail

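For example:
In the normal case, such as translating the English word "good" into Chinese:
[Screenshot: Google Translate result for "good", English to Chinese]

But translating the same word into French:
[Screenshot: Google Translate result for "good", English to French]
Some French words are translated differently depending on grammatical gender, so more than one result comes back. This makes some short sentences raise an error when translated in code, because the library does not handle this case.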
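
To make the gender case concrete, here is a minimal sketch of choosing one variant from several candidates. The `(text, gender)` pair layout and the `pick_variant` helper are hypothetical and chosen only for illustration; they are not the layout of the real Google response.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical, simplified candidate: (translated text, gender tag or None).
Candidate = Tuple[str, Optional[str]]


def pick_variant(candidates: Sequence[Candidate], prefer: str = "(masculine)") -> str:
    """Return the text of the candidate tagged `prefer`, else the first candidate."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    for text, gender in candidates:
        if gender == prefer:
            return text
    # No candidate carries the preferred tag: fall back to the first one.
    return candidates[0][0]


# Illustrative data only, not an actual API response.
print(pick_variant([("bon", "(masculine)"), ("bonne", "(feminine)")]))
print(pick_variant([("bon", "(masculine)"), ("bonne", "(feminine)")], prefer="(feminine)"))
print(pick_variant([("好", None)]))
```

Making the preferred gender a parameter keeps the choice explicit instead of burying it in an index expression.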

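2) Incomplete translation: only the first sentence is translated

When translating long input, I found that no matter how long the text is, the source code only returns the translation of the first sentence. To address the two problems above, my main change is in pygoogletranslation/utils.py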

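def format_translation(translated, dest):
    text = ''
    pron = ''
    for _translated in translated:
        candidates = _translated[0][2][1][0]
        try:
            if dest == "fr":
                # French may return a masculine and a feminine candidate:
                # keep the first one if it is tagged '(masculine)', otherwise the second.
                # tran_list is then presumably a plain string, which the loop below
                # appends character by character.
                tran_list = candidates[0][0] if candidates[0][2] == '(masculine)' else candidates[1][0]
            else:
                tran_list = candidates[0][5]
            for tmp_tran in tran_list:  # walk through the translations of all sentences
                text += tmp_tran[0]
        except Exception:
            # Fallback when the response does not have the layout expected above
            for tmp_tran in candidates:
                text += tmp_tran[0]
    # (pronunciation handling and the return statement are not part of this change and are not shown)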
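
Long index chains like the ones above fail with an exception as soon as one level is missing. As a rough sketch of a more defensive alternative, the helpers below walk a nested list step by step and fall back to a default. `dig` and `join_segments` are hypothetical names and the sample data is made up; neither reflects the real response format.

```python
from typing import Any, Sequence


def dig(data: Any, path: Sequence[int], default: Any = None) -> Any:
    """Follow a chain of indices, returning `default` if any step fails."""
    current = data
    for index in path:
        try:
            current = current[index]
        except (IndexError, KeyError, TypeError):
            return default
    return current


def join_segments(segments: Any) -> str:
    """Concatenate the first element of every segment, skipping malformed ones."""
    parts = []
    for segment in segments or []:
        piece = dig(segment, [0])
        if isinstance(piece, str):
            parts.append(piece)
    return "".join(parts)


# Illustrative nested data; the shape is invented for this example.
response = [[["First sentence. "], ["Second sentence."]]]
print(join_segments(dig(response, [0], default=[])))
print(dig(response, [0, 5, 1], default="<missing>"))
```

Joining every segment rather than only index 0 is what keeps the second and later sentences from being dropped.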

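3) Some punctuation marks that break translation are not handled

The source code already strips several punctuation marks that interfere with translation. Based on what I saw in practice, I added a few more. The change is in pygoogletranslation/translate.py: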

            text = text.replace('"', '')
            text = text.replace("'", "")
            text = text.replace("“", "")
            text = text.replace("”", "")
            text = text.replace("‘", "")  # added
            text = text.replace("’", "")  # added
            text = text.replace("\\", "")  # added
            text = text.replace("?", "")  # added
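
The same clean-up can also be done in a single pass with a translation table. The sketch below is standalone and is not how the library does it; `STRIP_CHARS` and `strip_punctuation` are names made up for this example, and the character set simply mirrors the `replace()` calls above.

```python
# Characters to delete before sending text to the translator.
# The set mirrors the replace() calls above; adjust it to your data.
STRIP_CHARS = "\"'“”‘’\\?"

# With three arguments, str.maketrans maps every character of the third one to None.
STRIP_TABLE = str.maketrans("", "", STRIP_CHARS)


def strip_punctuation(text: str) -> str:
    """Remove all characters listed in STRIP_CHARS in a single pass."""
    return text.translate(STRIP_TABLE)


print(strip_punctuation("He said: “what’s a 'trench'?”"))
```

Keeping the characters in one string makes it easy to add another symbol when a new failure shows up in the data.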

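All of the changes above are available at https://github.com/lu-tomato/py-googletranslation and can be downloaded and used directly if needed.

D Translating my whole dataset

Finally, I used it in batch mode to translate my dataset: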

from pygoogletranslation import Translator
import jsonlines
import json
import time


def write_rows(out_path, rows):
    # Append the buffered rows to the output file, one JSON object per line
    with open(out_path, "a", encoding="utf-8") as wf:
        for row in rows:
            wf.write(json.dumps(row, ensure_ascii=False) + "\n")


def trans(dataset, path, lang):
    translator = Translator(retry_messgae=True)  # keyword spelled as in the original code
    row_path = "datapath/"
    out_path = row_path + dataset + path + lang + "_trans.jsonl"
    count = 0  # number of successfully translated lines
    write_in_list = []

    with open(row_path + dataset + path + ".jsonl", "r", encoding="utf-8") as f:
        lines = jsonlines.Reader(f)
        for line in lines:
            line_id = line["id"]
            text = line["text"]
            # Optional clean-up for special characters that make the translation fail:
            # text=line["text"].strip().replace("[","").replace(']','').replace("'",'').replace("‘","").replace("’","").replace("»","").replace("?","").replace('\\',"").replace("(","( ").replace(")"," )").replace("`","").replace('"','').replace("!","")
            try:  # try/except handles the special cases that cannot be translated
                t = translator.translate(text, dest=lang)
                translation = t.text
            except Exception:
                print(line_id, path, text)
                continue
            if translation.strip() == "":
                print(line_id, path, text)
                continue

            write_in_list.append({"id": line_id, "text": text, "translation": translation})
            count += 1

            if count % 200 == 0:  # write to the file every 200 translations
                write_rows(out_path, write_in_list)
                write_in_list = []
                time.sleep(3)

    if write_in_list:  # write whatever is left in the buffer
        write_rows(out_path, write_in_list)
        time.sleep(3)
    print(count)
    time.sleep(5)


for lang in ["de", "ja", "fr", "zh-cn"]:
    for path in ["dev", "test", "train"]:
        print("load dataset", path, lang)
        trans("dataset/", path, lang)  # translate the dev, test and train files one by one
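
A run over tens of thousands of lines can be interrupted partway through. The sketch below shows one possible way to make the batch job resumable by skipping ids that are already in the output file. It is not part of the original script: `load_done_ids`, `translate_remaining` and the `translate_fn` callback are hypothetical names, and the record layout (`id`, `text`, `translation`) is taken from the script above.

```python
import json
import os
from typing import Callable, Dict, Iterable, Set


def load_done_ids(out_path: str) -> Set[str]:
    """Collect the ids already present in a JSONL output file."""
    done: Set[str] = set()
    if not os.path.exists(out_path):
        return done
    with open(out_path, "r", encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if raw:
                done.add(str(json.loads(raw)["id"]))
    return done


def translate_remaining(records: Iterable[Dict], out_path: str,
                        translate_fn: Callable[[str], str]) -> int:
    """Translate only records whose id is not in out_path yet; return how many were written."""
    done = load_done_ids(out_path)
    written = 0
    with open(out_path, "a", encoding="utf-8") as wf:
        for record in records:
            if str(record["id"]) in done:
                continue
            try:
                translation = translate_fn(record["text"])
            except Exception as exc:  # keep going; the record is retried on the next run
                print("failed:", record["id"], exc)
                continue
            if not translation.strip():
                continue
            row = {"id": record["id"], "text": record["text"], "translation": translation}
            wf.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the library used in this post, `translate_fn` could be something like `lambda s: translator.translate(s, dest=lang).text`. Because failed records are never written, rerunning the function retries exactly those.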
