Formal Grammars of English: Chapter 10 (Speech and Language Processing)

2023-10-09 22:53:04

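determiner (Det): e.g. a, the, this

proper noun: a name, e.g. Atlanta, Denver

NP (noun phrase)

mass noun: an uncountable noun

Det Noun: a determiner followed by a noun

relative pronoun: e.g. that, which, who

transitive verb: takes a direct object

intransitive verb: takes no direct object

conjunction: e.g. and, or, but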

10.1 Constituency

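noun phrase: a group of words that behaves as a single unit, or constituent
preposed: placed at the beginning of the sentence
postposed: placed at the end of the sentence
Examples (the constituent "on September seventeenth" moves as a unit):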
On September seventeenth, I’d like to fly from Atlanta to Denver
I’d like to fly on September seventeenth from Atlanta to Denver
I’d like to fly from Atlanta to Denver on September seventeenth

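CFG:
The most widely used formal system for modeling constituent structure in English and other natural languages is the context-free grammar (CFG), also called a phrase-structure grammar. The formalism is equivalent to Backus-Naur Form (BNF).
rules:
A CFG consists of a set of rules, or productions, each of which expresses a way that symbols of the language can be grouped and ordered together, plus a lexicon of words and symbols.
terminal: a symbol that corresponds to a word in the language, e.g. the, nightclub
lexicon: the set of rules that introduce the terminal symbols. In effect it is a large dictionary that specifies how each part-of-speech symbol can be rewritten as an actual word.
non-terminal: a symbol that expresses an abstraction over terminals
In each CFG rule, the right-hand side of the arrow (->) is an ordered list of one or more terminals and non-terminals; the left-hand side is a single non-terminal that expresses some cluster or generalization.
Two uses of a CFG:
a. generating sentences
b. assigning a structure to a given sentence
derivation: the sequence of rule expansions that produces a string; it is commonly represented by a parse tree
dominates: a node dominates all the nodes below it in the tree, e.g. the root NP of a noun-phrase tree dominates every other node in that tree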
start symbol: The formal language defined by a CFG is the set of strings that are derivable from the designated start symbol. Each grammar must have one designated start symbol, which is often called S.
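VP: verb phrase
PP: prepositional phrase
bracket notation: a compact way to write a parse tree, e.g. [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [Nom [N flight]]]]]]
formal language: a set of strings
grammatical: sentences (strings of words) that can be derived by a grammar are in the formal language defined by that grammar and are called grammatical
ungrammatical: sentences that cannot be derived by a given formal grammar are not in the language defined by that grammar and are called ungrammatical
generative grammar: the use of formal languages to model natural languages, so called because the language is defined by the set of possible sentences generated by the grammar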
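The bracket notation above can be read back into a tree mechanically. The following is a minimal sketch, assuming whitespace-separated words and square brackets only; the function names parse_brackets and leaves are hypothetical helpers introduced for illustration, not part of the chapter.

```python
def parse_brackets(text):
    """Parse bracket notation into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError(f"expected '[' at token {pos}")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())       # a sub-constituent
            else:
                children.append(tokens[pos])  # a terminal (word)
                pos += 1
        pos += 1  # consume the closing "]"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected input after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree from left to right (its yield)."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_brackets(
    "[S [NP [Pro I]] [VP [V prefer] [NP [Det a] "
    "[Nom [N morning] [Nom [N flight]]]]]]"
)
print(tree[0])                 # label of the root node
print(" ".join(leaves(tree)))  # the sentence the tree spans
```

Nested tuples are used here only because they keep the sketch short; a real project would more likely use a dedicated tree class.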

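Definition of a CFG: a 4-tuple (N, Σ, R, S)
N: a set of non-terminal symbols (or variables)
Σ: a set of terminal symbols, disjoint from N
R: a set of rules (productions), each of the form A -> β, where A is a non-terminal and β is a string of symbols from Σ ∪ N
S: a designated start symbol, a member of N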
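The 4-tuple definition maps directly onto a small data structure. The sketch below is one possible encoding, not the chapter's own code: the class name CFG, the methods validate and derive, and the toy grammar TOY are all assumptions made for illustration. The toy grammar is deliberately non-recursive so that a random derivation always terminates.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    rules: tuple             # R: pairs (A, β), with β a tuple of symbols
    start: str               # S

    def validate(self):
        """Raise ValueError if the 4-tuple conditions are violated."""
        if self.nonterminals & self.terminals:
            raise ValueError("N and Σ must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("S must be a member of N")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.rules:
            if lhs not in self.nonterminals:
                raise ValueError(f"left-hand side {lhs!r} is not in N")
            if any(sym not in symbols for sym in rhs):
                raise ValueError(f"unknown symbol in rule for {lhs!r}")

    def derive(self, rng):
        """Return the steps of one leftmost derivation from S."""
        form = [self.start]
        steps = [tuple(form)]
        while True:
            idx = next((i for i, s in enumerate(form)
                        if s in self.nonterminals), None)
            if idx is None:
                return steps  # only terminals remain
            options = [rhs for lhs, rhs in self.rules if lhs == form[idx]]
            form[idx:idx + 1] = rng.choice(options)
            steps.append(tuple(form))


# Hypothetical toy grammar (illustrative, non-recursive).
TOY = CFG(
    nonterminals=frozenset(
        {"S", "NP", "VP", "Nom", "Pro", "Det", "Noun", "Verb"}),
    terminals=frozenset({"I", "a", "the", "morning", "flight", "prefer"}),
    rules=(
        ("S", ("NP", "VP")),
        ("NP", ("Pro",)), ("NP", ("Det", "Nom")),
        ("Nom", ("Noun",)), ("Nom", ("Noun", "Noun")),
        ("VP", ("Verb", "NP")),
        ("Pro", ("I",)), ("Det", ("a",)), ("Det", ("the",)),
        ("Noun", ("morning",)), ("Noun", ("flight",)),
        ("Verb", ("prefer",)),
    ),
    start="S",
)
TOY.validate()
for step in TOY.derive(random.Random(0)):
    print(" ".join(step))
```

Each printed line is one sentential form of the derivation; the last line contains only terminals and is therefore a string of the language defined by the grammar.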

10.3
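Four common sentence-level constructions in English:
a. declarative: a noun phrase followed by a verb phrase, S -> NP VP
b. imperative: begins with a verb phrase and has no subject, S -> VP
c. yes-no question: auxiliary verb + noun phrase + verb phrase, S -> Aux NP VP; used to ask questions and also to make requests or suggestions
d. wh-questions:
d1: wh-subject-question: identical to the declarative structure, except that the first noun phrase contains a wh-word (who, whose, when, where, what, which, how, why), S -> Wh-NP VP
d2: wh-non-subject-question: the wh-phrase is not the subject, so the sentence contains another subject, preceded by an auxiliary, S -> Wh-NP Aux NP VP
long-distance dependencies: e.g. in a d2 question the Wh-NP (what flights) is far from the predicate it is related to (the main verb have)
trace or empty category: in some analyses of long-distance dependencies, a small marker inserted after the verb to record the syntactic relation with the displaced phrase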

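clause:
S can also occur on the right-hand side of grammar rules and so be embedded within larger sentences. What distinguishes sentence constructions (S rules) from the rest of the grammar is that they are in some sense complete. In this respect they correspond to the notion of a clause, which traditional grammars often describe as forming a complete thought. One way to make "complete thought" more precise: S is a node of the parse tree below which the main verb of the S has all of its arguments.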

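The Noun Phrase:
The most frequent types of noun phrase are pronouns, proper nouns, and the NP -> Det Nominal construction.
A noun phrase of the last type consists of a head, the central noun, along with various modifiers that can occur before or after the head noun.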

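determiners:
A noun phrase can begin with a simple lexical determiner such as a or the. The determiner role can also be filled by a possessive expression, which consists of a noun phrase followed by 's as the possessive marker, e.g. the driver's, mother's: Det -> NP 's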

before the Head Noun:
cardinal numbers, e.g. one
ordinal numbers, e.g. first
quantifiers, e.g. many, a few, few, several
AP (adjective phrase), e.g. least expensive in the least expensive fare

after the Head Noun:
Three kinds of nominal postmodifiers:
a. prepositional phrases (PP), e.g. all flights from Cleveland
b. non-finite clauses, e.g. any flights arriving after eleven a.m.
The non-finite forms are gerundive (-ing), -ed, and infinitive.
c. relative clauses, e.g. a flight that serves breakfast

before the Noun Phrase:
predeterminer (PreDet): a word that precedes and modifies the whole noun phrase, e.g. all in all the flights

the Verb Phrase:
VP -> Verb | Verb NP | Verb NP PP | Verb PP | Verb S | Verb VP
In Verb S, the embedded S is a sentential complement.

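coordination
Phrases can be joined with conjunctions to form coordinate phrases. The ability to form a coordinate phrase in this way is often used as a test for constituency.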

metarules
Metarules state grammar rules in a more general form; they are used, for example, in GPSG (Generalized Phrase Structure Grammar).

10.4 Treebanks

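treebank:
A sufficiently robust grammar consisting of context-free rules can be used to assign a parse tree to any sentence. This makes it possible to build a corpus in which every sentence is paired with a corresponding parse tree; such a syntactically annotated corpus is called a treebank.
Treebanks play an important role in parsing and in linguistic investigations of syntactic phenomena.
A wide variety of treebanks exist. They are generally created by running a parser over each sentence automatically and then correcting the parses by hand.
Many treebanks use the dependency representation introduced in Chapter 13, including many that are part of the Universal Dependencies project.
The Penn Treebank project produced treebanks from the Brown, Switchboard, ATIS, and Wall Street Journal corpora of English, as well as treebanks in Arabic and Chinese.
Representations: LISP-style parenthesized notation, bracket notation, and the standard node-and-line tree diagram.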

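traces, syntactic movement:
Traces (-NONE- nodes) mark long-distance dependencies (related words or phrases that are not adjacent in the sentence) or syntactic movement. For example, a quotation usually follows a quotative verb such as say. In the chapter's example, however, the quotation "We should have to wait until we have collected on those assets" comes before he said. An empty S containing only the node -NONE- marks the position after said, where the quoted sentence would normally occur. In Treebank II and III this empty node is marked with the index 2, as is the quotation S at the beginning of the sentence. Such co-indexing can make it easier for some parsers to recover the fact that the fronted or topicalized quotation is the complement of the verb said.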

grammar: e.g. S -> NP VP, PP -> IN NP
lexicon: PRP -> we | he, DT -> the | that | those
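Grammar and lexicon rules like these can be read off the trees of a treebank by visiting each node once. The sketch below assumes a simplified nested-tuple tree rather than the real Penn Treebank file format; the tree TREE and the function extract_rules are hypothetical and only loosely modeled on the example sentence.

```python
from collections import Counter

# Hypothetical, simplified tree: (label, child, ...), where a child is
# either another tuple or a word string. Not copied from the treebank.
TREE = (
    "S",
    ("NP", ("PRP", "we")),
    ("VP",
     ("VBD", "collected"),
     ("PP",
      ("IN", "on"),
      ("NP", ("DT", "those"), ("NNS", "assets")))),
)


def extract_rules(tree, grammar, lexicon):
    """Count the rule used at each node of the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        lexicon[(label, children[0])] += 1   # lexical rule, e.g. PRP -> we
        return
    # Non-lexical rule: parent label -> sequence of child labels.
    grammar[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        extract_rules(child, grammar, lexicon)


grammar, lexicon = Counter(), Counter()
extract_rules(TREE, grammar, lexicon)
for (lhs, rhs), count in grammar.items():
    print(f"{lhs} -> {' '.join(rhs)}  ({count})")
for (tag, word), count in lexicon.items():
    print(f"{tag} -> {word}  ({count})")
```

Run over a whole treebank, the two counters would give rule tokens (the sum of the counts) and rule types (the number of distinct keys).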

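Versions II and III of the Penn Treebank added further information to make it easier to recover the relations between predicates and arguments. Certain phrases carry tags that indicate their grammatical function (surface subject, logical topic, cleft, non-VP predicate), their presence in particular text categories (such as headlines and titles), and their semantic function (temporal phrases, locations). Examples: -SBJ marks the surface subject (he), -TMP marks a temporal phrase (until ...), and -PRD marks predicates that are not VPs.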

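The grammar used to parse the Penn Treebank is relatively flat, which results in very many and very long rules. For example, there are about 4,500 different rules just for expanding VPs.
The Penn Treebank III Wall Street Journal corpus alone contains about one million words and about one million non-lexical rule tokens, made up of about 17,500 distinct rule types. Such a large number of rules causes problems for probabilistic parsing algorithms, so it is more common to apply various modifications to a grammar extracted from a treebank, as discussed in Chapter 12.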

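Heads and head finding:
Each syntactic constituent is associated with a lexical head: N is the head of an NP, V is the head of a VP. The idea of a head for each constituent dates back to Bloomfield. It is central to constituent-based grammar formalisms such as Head-Driven Phrase Structure Grammar, and to the dependency-based approaches to grammar discussed in Chapter 13. Heads and head-dependent relations also play a central role in computational linguistics, in particular in probabilistic parsing.
The head is the word in the phrase that is grammatically the most important. Heads are passed up the parse tree, so each non-terminal in a parse tree is annotated with a single word, which is its lexical head.
A more practical way to find heads is not to specify head rules in the grammar itself but to identify heads dynamically in the context of the tree for a specific sentence: once a sentence is parsed, each node of the resulting tree is decorated with the appropriate head. Most current systems rely on a simple set of handwritten rules, such as a practical one for Penn Treebank grammars. The chapter gives an example for finding the head of an NP, written as a cascade of if / else / else ... conditions.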
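An if/else cascade of this kind can be written as an ordered list of searches over the children of a node. The sketch below is a simplified illustration for NPs only: the tag sets, their order, and the name find_np_head are assumptions loosely modeled on handwritten Penn Treebank head rules and should not be read as the exact rule from the chapter.

```python
# Illustrative search order for NP heads: (direction, candidate labels).
# The specific tag sets are assumptions made for this sketch.
NP_HEAD_RULES = [
    ("right-to-left", {"NN", "NNS", "NNP", "NNPS", "NX", "POS", "JJR"}),
    ("left-to-right", {"NP"}),
    ("right-to-left", {"$", "ADJP", "PRN"}),
    ("right-to-left", {"CD"}),
    ("right-to-left", {"JJ", "JJS", "RB", "QP"}),
]


def find_np_head(children):
    """Pick the head child of an NP.

    children: non-empty list of (label, head_word) pairs, left to right.
    """
    # First condition: a final possessive marker is the head.
    if children[-1][0] == "POS":
        return children[-1]
    # Then try each search in turn; the first match wins.
    for direction, labels in NP_HEAD_RULES:
        ordered = children[::-1] if direction == "right-to-left" else children
        for child in ordered:
            if child[0] in labels:
                return child
    # Fallback: the last child.
    return children[-1]


print(find_np_head([("DT", "a"), ("NN", "morning"), ("NN", "flight")]))
```

Keeping the rules in a table rather than in nested if statements makes the search order easy to inspect and change.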

10.5 Grammar Equivalence and Normal Form

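A formal language is defined as a (possibly infinite) set of strings, so we can ask whether two grammars are equivalent by asking whether they generate the same set of strings. It is in fact possible for two distinct context-free grammars to generate the same language.
There are two kinds of grammar equivalence. Two grammars are strongly equivalent if they generate the same set of strings and assign the same phrase structure to each sentence (allowing only for renaming of the non-terminal symbols). They are weakly equivalent if they generate the same set of strings but do not assign the same phrase structure to each sentence.
It is sometimes useful to have a normal form for grammars, in which every production takes a particular form. For example, a CFG is in Chomsky normal form (CNF) if the right-hand side of each rule is either two non-terminals or a single terminal. CNF grammars are binary branching, that is, their parse trees are binary trees. The CKY parsing algorithm exploits this property.
Any CFG can be converted into a weakly equivalent Chomsky normal form grammar.
Chomsky-adjunction: generating a symbol A with a potentially unbounded sequence of symbols B by a rule of the form A -> A B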
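Part of the conversion to Chomsky normal form can be sketched in a few lines. The function below covers only two steps: replacing terminals inside longer rules with new pre-terminal symbols, and splitting right-hand sides longer than two symbols. It does not remove unit productions or empty rules, so its output is in CNF only when the input has neither. The name to_binary and the generated symbol names (X1, T_to, ...) are assumptions, and the sketch assumes they do not clash with existing symbols.

```python
def to_binary(rules, terminals):
    """Binarize rules given as (lhs, rhs_tuple) pairs.

    Returns a new rule list in which every right-hand side has at most
    two symbols and terminals appear only in rules of length one.
    """
    out = []
    fresh = 0
    for lhs, rhs in rules:
        rhs = list(rhs)
        if len(rhs) >= 2:
            # Step 1: A -> to VP  becomes  T_to -> to,  A -> T_to VP
            for i, sym in enumerate(rhs):
                if sym in terminals:
                    pre = f"T_{sym}"
                    if (pre, (sym,)) not in out:
                        out.append((pre, (sym,)))
                    rhs[i] = pre
        # Step 2: A -> B C D  becomes  X1 -> B C,  A -> X1 D
        while len(rhs) > 2:
            fresh += 1
            new = f"X{fresh}"
            out.append((new, (rhs[0], rhs[1])))
            rhs = [new] + rhs[2:]
        out.append((lhs, tuple(rhs)))
    return out


# Hypothetical input rules, for illustration only.
example = [
    ("S", ("Aux", "NP", "VP")),
    ("INF-VP", ("to", "VP")),
]
for lhs, rhs in to_binary(example, terminals={"to"}):
    print(f"{lhs} -> {' '.join(rhs)}")
```

The new grammar generates the same strings as the old one but assigns them different trees, which is why the result is only weakly equivalent to the original.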

10.6 Lexicalized Grammars

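The approaches to grammar presented so far emphasize phrase-structure rules and minimize the role of the lexicon. However, for phenomena such as long-distance dependencies, agreement, and subcategorization, this approach leads to grammars that are redundant and hard to manage. To overcome these problems, many approaches have been developed that make better use of the lexicon, such as LFG, HPSG, TAG, and CCG. They differ in how lexicalized they are, that is, in the degree to which they rely on the lexicon rather than on phrase-structure rules to capture facts about the language.
CCG: a heavily lexicalized approach motivated by both syntactic and semantic considerations
dependency grammars: dispense with phrase-structure rules entirely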

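combinatory categorial grammar:
A categorial approach has three main elements: a set of categories, a lexicon that associates words with categories, and a set of rules that govern how categories combine in context.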
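The three elements can be illustrated with the two simplest combination rules, forward and backward function application. This is a minimal sketch under stated assumptions: categories are encoded as strings or (result, slash, argument) triples, the tiny lexicon is illustrative, and the function names are hypothetical. Other CCG rules such as composition and type raising are not covered.

```python
# A category is an atomic string ("S", "NP") or a triple
# (result, slash, argument); ("S", "\\", "NP") encodes S\NP.

def forward_apply(left, right):
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def combine(left, right):
    """Try forward application first, then backward application."""
    result = forward_apply(left, right)
    if result is not None:
        return result
    return backward_apply(left, right)


# Illustrative lexicon: a transitive verb has category (S\NP)/NP.
LEXICON = {
    "United": "NP",
    "Miami": "NP",
    "serves": (("S", "\\", "NP"), "/", "NP"),
}

vp = combine(LEXICON["serves"], LEXICON["Miami"])  # verb + object
s = combine(LEXICON["United"], vp)                 # subject + verb phrase
print(vp, s)
```

All of the word-specific information lives in the lexicon; the two application rules themselves mention no particular word or phrase type, which is the sense in which the approach is heavily lexicalized.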

Summary

This chapter has introduced a number of fundamental concepts in syntax through the use of context-free grammars.

• In many languages, groups of consecutive words act as a group or a constituent, which can be modeled by context-free grammars (which are also known as phrase-structure grammars).
• A context-free grammar consists of a set of rules or productions, expressed over a set of non-terminal symbols and a set of terminal symbols. Formally, a particular context-free language is the set of strings that can be derived from a particular context-free grammar.
• A generative grammar is a traditional name in linguistics for a formal language that is used to model the grammar of a natural language.
• There are many sentence-level grammatical constructions in English; declarative, imperative, yes-no question, and wh-question are four common types; these can be modeled with context-free rules.
• An English noun phrase can have determiners, numbers, quantifiers, and adjective phrases preceding the head noun, which can be followed by a number of postmodifiers; gerundive VPs, infinitive VPs, and past participial VPs are common possibilities.
• Subjects in English agree with the main verb in person and number.
• Verbs can be subcategorized by the types of complements they expect. Simple subcategories are transitive and intransitive; most grammars include many more categories than these.
• Treebanks of parsed sentences exist for many genres of English and for many languages. Treebanks can be searched with tree-search tools.
• Any context-free grammar can be converted to Chomsky normal form, in which the right-hand side of each rule has either two non-terminals or a single terminal.
• Lexicalized grammars place more emphasis on the structure of the lexicon, lessening the burden on pure phrase-structure rules.
• Combinatory categorial grammar (CCG) is an important computationally relevant lexicalized approach.
