Deep Learning with CNTK (2): Training an RNN-Based Language Model

2022-06-23 20:36:05

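The previous article, Deep Learning with CNTK (1): Getting Started, showed how to build a simple feed-forward neural network with CNTK. This article assumes you already know the basics of using CNTK and moves on to something slightly more complex, and currently very popular in natural language processing: building a language model with a recurrent neural network.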

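Drawn as a diagram, a recurrent neural network (RNN) is a network whose hidden layer feeds back into itself (this is, of course, only one kind of RNN).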

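Unlike an ordinary neural network, an RNN does not assume that samples are independent of each other. For example, to predict the character that follows "上", the characters that came before it matter a great deal: if "工作" (work) appeared earlier, the phrase is probably "上班" (go to work); if "家乡" (hometown) appeared earlier, it is probably "上海" (Shanghai). An RNN is good at learning this kind of sequential feature. Put simply, an RNN treats the hidden-layer values from the previous time step as an extra set of features and feeds them in as part of the input at the next time step.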
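The recurrence described above can be written compactly. This is a sketch of the standard simple (Elman-style) RNN step; the symbols $x_t$, $h_t$, $y_t$, $W_{xh}$, $W_{hh}$ and $W_{hy}$ are notation chosen here for illustration and do not come from CNTK.

```latex
h_t = \sigma\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right), \qquad
y_t = \operatorname{softmax}\left(W_{hy}\, h_t\right)
```

Here $x_t$ is the input at time step $t$, $h_{t-1}$ is the hidden layer from the previous step, and $\sigma$ is the sigmoid function. The term $W_{hh}\, h_{t-1}$ is what lets earlier words influence the current prediction.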

Here we build the following kind of language model: given a word, predict the word that is likely to come next.

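The input to this RNN is a dim-dimensional vector, where dim equals the vocabulary size. The input vector is 1 in the component that represents the current word and 0 everywhere else, i.e. [0,0,0,...0,1,0,...0]. The output is also a dim-dimensional vector, giving the probability of each word being the next one.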
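As a rough illustration of the input and output vectors just described, the sketch below builds a one-hot input and turns some raw output scores into probabilities. The five-word vocabulary and the scores are made up for the example, and the helper names `one_hot` and `softmax` are hypothetical, not part of CNTK.

```python
import math


def one_hot(index, dim):
    """Return a dim-length list that is 1.0 at position index and 0.0 elsewhere."""
    if not 0 <= index < dim:
        raise ValueError("index out of range")
    vec = [0.0] * dim
    vec[index] = 1.0
    return vec


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary, for illustration only.
vocab = ["</s>", "go", "to", "work", "home"]
word_to_index = {word: i for i, word in enumerate(vocab)}

x = one_hot(word_to_index["to"], len(vocab))   # input vector for the word "to"
probs = softmax([0.1, 0.3, 0.2, 2.0, 1.5])     # made-up output scores
best = vocab[probs.index(max(probs))]          # most probable next word
```

With a real vocabulary the one-hot vector is almost entirely zeros, which is why a sparse input representation is the natural choice for this kind of model.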

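Building an RNN model in CNTK differs from building an ordinary neural network in two main ways:

(1) Input format. The input is text split into sentences, and the words within a sentence are ordered, so the reader has to be configured as an LMSequenceReader. This format is cumbersome (another gripe: I don't fully understand it myself, so I won't explain it in detail; you can work it out from the configuration).

(2) Model. A recurrent model is required, which mainly comes down to using the Delay() function.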

A working configuration follows (the official tutorial cost me a lot of time yet again; this version is adapted from CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\Config):

# Parameters can be overwritten on the command line

# for example: cntk configFile=myConfigFile RootDir=../..

# For running from Visual Studio add

# currentDirectory=$(SolutionDir)/<path to corresponding data folder>

RootDir = ".."

ConfigDir = "$RootDir$/Config"

DataDir = "$RootDir$/Data"

OutputDir = "$RootDir$/Output"

ModelDir = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available

deviceId = "-1"

command = writeWordAndClassInfo:train

#command = write

precision = "float"

traceLevel =

modelPath = "$ModelDir$/rnn.dnn"

# uncomment the following line to write logs to a file

stderr=$OutputDir$/rnnOutput

type = double

numCPUThreads = 

confVocabSize =

confClassSize = 

#trainFile = "ptb.train.txt"

trainFile = "review_tokens_split_first5w_lines.txt"

#validFile = "ptb.valid.txt"

testFile = "review_tokens_split_first10_lines.txt"

writeWordAndClassInfo = [

    action = "writeWordAndClass"

    inputFile = "$DataDir$/$trainFile$"

    outputVocabFile = "$ModelDir$/vocab.txt"

    outputWord2Cls = "$ModelDir$/word2cls.txt"

    outputCls2Index = "$ModelDir$/cls2idx.txt"

    vocabSize = "$confVocabSize$"

    nbrClass = "$confClassSize$"

    cutoff =

    printValues = true

]

#######################################

#  TRAINING CONFIG                    #

#######################################

train = [

    action = "train"

    minibatchSize =

    traceLevel =

    epochSize =

    recurrentLayer =

    defaultHiddenActivity = 0.1

    useValidation = true

    rnnType = "CLASSLM"

    # uncomment below and comment SimpleNetworkBuilder section to use NDL to train RNN LM

    NDLNetworkBuilder=[

        networkDescription="D:\tools\Deep Learning\CNTK-2016-02-08-Windows-64bit-CPU-Only\cntk\Examples\Text\PennTreebank\AdditionalFiles\RNNLM\rnnlm.ndl"

    ]

    SGD = [

        learningRatesPerSample = 0.1

        momentumPerMB =

        gradientClippingWithTruncation = true

        clippingThresholdPerSample = 15.0

        maxEpochs =

        unroll = false

        numMBsToShowResult =

        gradUpdateType = "none"

        loadBestModel = true

        # settings for Auto Adjust Learning Rate

        AutoAdjust = [

            autoAdjustLR = "adjustAfterEpoch"

            reduceLearnRateIfImproveLessThan = 0.001

            continueReduce = false

            increaseLearnRateIfImproveMoreThan =

            learnRateDecreaseFactor = 0.5

            learnRateIncreaseFactor = 1.382

            numMiniBatch4LRSearch =

            numPrevLearnRates =

            numBestSearchEpoch =

        ]

        dropoutRate = 0.0

    ]

    reader = [

        readerType = "LMSequenceReader"

        randomize = "none"

        nbruttsineachrecurrentiter = 

        # word class info

        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file

        # if the binary file exists, we will use it instead of parsing this file

        # writerType=BinaryReader

        # write definition

        wfile = "$OutputDir$/sequenceSentence.bin"

        # wsize - initial size of the file in MB

        # if calculated size would be bigger, that is used instead

        wsize = 

        # wrecords - number of records we should allocate space for in the file

        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file

        wrecords = 

        # windowSize - number of records we should include in BinaryWriter window

        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections

        # for now store as expanded category data (including label in)

        features = [

            # sentence has no features, so need to set dimension to zero

            dim =

            # write definition

            sectionType = "data"

        ]

        # sequence break table, list indexes into sequence records, so we know when a sequence starts/stops

        sequence = [

            dim =

            wrecords =

            # write definition

            sectionType = "data"

        ]

        #labels sections

        labelIn = [

            dim =

            labelType = "Category"

            beginSequence = "</s>"

            endSequence = "</s>"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

        # labels sections

        labels = [

            dim =

            labelType = "NextWord"

            beginSequence = "O"

            endSequence = "O"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

    ]

]

write = [

    action = "write"

    outputPath = "$OutputDir$/Write"

    #outputPath = "-"                    # "-" will write to stdout; useful for debugging

    outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word" # when processing one sentence per minibatch, this is the sentence posterior

    #format = [

        #sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)

        #type = "real"

    #]

    minibatchSize =               # choose this to be big enough for the longest sentence

    # need to be small since models are updated for each minibatch

    traceLevel =

    epochSize = 

    reader = [

        # reader to use

        readerType = "LMSequenceReader"

        randomize = "none"              # BUGBUG: This is ignored.

        nbruttsineachrecurrentiter =   # one sentence per minibatch

        cacheBlockSize =               # workaround to disable randomization

        # word class info

        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file

        # if the binary file exists, we will use it instead of parsing this file

        # writerType = "BinaryReader"

        # write definition

        wfile = "$OutputDir$/sequenceSentence.bin"

        # wsize - initial size of the file in MB

        # if calculated size would be bigger, that is used instead

        wsize = 

        # wrecords - number of records we should allocate space for in the file

        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file

        wrecords = 

        # windowSize - number of records we should include in BinaryWriter window

        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections

        # for now store as expanded category data (including label in)

        features = [

            # sentence has no features, so need to set dimension to zero

            dim =

            # write definition

            sectionType = "data"

        ]

        #labels sections

        labelIn = [

            dim = 

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            labelType = "Category"

            beginSequence = "</s>"

            endSequence = "</s>"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

        #labels sections

        labels = [

            dim =

            labelType = "NextWord"

            beginSequence = "O"

            endSequence = "O"

            # vocabulary size

            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition

            # sizeof(unsigned) which is the label index type

            elementSize =

            sectionType = "labels"

            mapping = [

                # redefine number of records for this section, since we don't need to save it for each data record

                wrecords =

                # variable size so use an average string size

                elementSize =

                sectionType = "labelMapping"

            ]

            category = [

                dim =

                # elementSize = sizeof(ElemType) is default

                sectionType = "categoryLabels"

            ]

        ]

    ]

]

rnnlm.ndl:

run=ndlCreateNetwork

ndlCreateNetwork=[

    # vocabulary size

    featDim=

    # vocabulary size

    labelDim=

    # hidden layer size

    hiddenDim=

    # number of classes

    nbrClass=

    initScale=

    features=SparseInput(featDim, tag="feature")

    # labels in classbasedCrossEntropy is dense and contain  values for each sample

    labels=Input(, tag="label")

    # define network

    WFeat2Hid=Parameter(hiddenDim, featDim, init="uniform", initValueScale=initScale)

    WHid2Hid=Parameter(hiddenDim, hiddenDim, init="uniform", initValueScale=initScale)

    # WHid2Word is special that it is hiddenSize X labelSize

    WHid2Word=Parameter( hiddenDim,labelDim,  init="uniform", initValueScale=initScale)

    WHid2Class=Parameter(nbrClass, hiddenDim, init="uniform", initValueScale=initScale)

    PastHid = Delay(hiddenDim, HidAfterSig, delayTime=, needGradient=true)

    HidFromHeat = Times(WFeat2Hid, features)

    HidFromRecur = Times(WHid2Hid, PastHid)

    HidBeforeSig = Plus(HidFromHeat, HidFromRecur)

    HidAfterSig = Sigmoid(HidBeforeSig)

    Out = TransposeTimes(WHid2Word, HidAfterSig)  #word part

    ClassProbBeforeSoftmax=Times(WHid2Class, HidAfterSig)

    cr = ClassBasedCrossEntropyWithSoftmax(labels, HidAfterSig, WHid2Word, ClassProbBeforeSoftmax, tag="criterion")

    EvalNodes=(Cr)

    OutputNodes=(Cr)

]

As the configuration shows, CNTK makes you spend a large share of your effort on the data reader.

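writeWordAndClassInfo simply collects statistics over the whole vocabulary and clusters the words. A class-based RNN is used here mainly to speed up computation: the words are first partitioned into a number of disjoint classes. The file this module outputs has four columns: word index, frequency, word, and class.
train, of course, trains the model. With a large amount of text, training is still quite slow.
write is the output module. Note this line: outputNodeNames = "Out,WFeat2Hid,WHid2Hid,WHid2Word"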
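To make the word-class step of writeWordAndClassInfo more concrete, here is a minimal sketch of one common way to split a vocabulary into disjoint classes: sort words by frequency and cut the sorted list into bins of roughly equal probability mass. This is an assumption for illustration; it is not confirmed to be the exact algorithm CNTK uses, and the function name `assign_classes` is hypothetical. The rows it produces follow the four-column layout described above (word index, frequency, word, class).

```python
from collections import Counter


def assign_classes(tokens, n_classes):
    """Sort words by frequency and cut the sorted list into n_classes
    contiguous, non-overlapping bins of roughly equal probability mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []   # (word index, count, word, class id)
    seen = 0    # number of tokens covered by the words placed so far
    # descending count; ties broken alphabetically so the result is deterministic
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for index, (word, count) in enumerate(ordered):
        class_id = min(seen * n_classes // total, n_classes - 1)
        rows.append((index, count, word, class_id))
        seen += count
    return rows


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat the end".split()
rows = assign_classes(tokens, n_classes=2)
for row in rows:
    print(*row, sep="\t")
```

The speed-up comes from predicting in two stages: first a distribution over the classes, then a distribution over only the words inside the chosen class, instead of one softmax over the whole vocabulary.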

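What most people probably care about is how to get the hidden-layer values for a sentence after running it through the trained RNN. My approach is to save the trained RNN's parameters; then, whether you use Java or Python, you can rebuild the RNN from those parameters and do whatever you like with it.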
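A minimal sketch of that reconstruction in plain Python follows. It replays the hidden-layer computation from rnnlm.ndl, HidAfterSig = Sigmoid(WFeat2Hid * features + WHid2Hid * PastHid), for one sentence given as word indices. Several things are assumptions: the matrices are taken to be already loaded as nested lists (parsing CNTK's output files is left out), the initial hidden value is assumed to match defaultHiddenActivity = 0.1 from the config, and the function name `hidden_states` and the toy weights are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_states(word_ids, w_feat2hid, w_hid2hid, initial=0.1):
    """Return the hidden-layer vector after each word of one sentence.

    w_feat2hid: hiddenDim rows x featDim columns
    w_hid2hid:  hiddenDim rows x hiddenDim columns
    initial:    hidden value used before the first word (assumed 0.1)
    """
    hidden_dim = len(w_hid2hid)
    past = [initial] * hidden_dim
    states = []
    for word in word_ids:
        current = []
        for i in range(hidden_dim):
            # the input is one-hot, so WFeat2Hid * x is just column `word`
            z = w_feat2hid[i][word]
            z += sum(w_hid2hid[i][j] * past[j] for j in range(hidden_dim))
            current.append(sigmoid(z))
        states.append(current)
        past = current  # plays the role of Delay(): next step sees this state
    return states


# Toy weights with hiddenDim = 2 and featDim = 3; real values would come
# from the saved model parameters.
w_feat2hid = [[0.5, -0.5, 0.0],
              [0.0, 1.0, -1.0]]
w_hid2hid = [[0.1, 0.2],
             [0.3, 0.4]]
states = hidden_states([0, 2], w_feat2hid, w_hid2hid)
```

Each entry of `states` is the hidden vector after reading one word, so the last entry can serve as a fixed-length representation of the whole sentence.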

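In the train section I used a network I defined myself, through NDLNetworkBuilder. You can also use the generic recurrent model, in which case you only need to supply a parameter block, for example: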

SimpleNetworkBuilder=[

        trainingCriterion=classcrossentropywithsoftmax

        evalCriterion=classcrossentropywithsoftmax

        nodeType=Sigmoid

        initValueScale=6.0

        layerSizes=::

        addPrior=false

        addDropoutNodes=false

        applyMeanVarNorm=false

        uniformInit=true;

        # these are for the class information for class-based language modeling

        vocabSize=

        nbrClass=

    ]

I use a self-defined network here mainly because I want to change it to an LSTM structure later.

Original blog post. Please do not repost without permission.
