The paper is available at: http://arxiv.org/abs/2010.12789
The bilingual (Chinese and English) version of the paper is being serialized here.
II. NEW CLASSIFICATION OF LEXICAL CHUNKS, INFORMATION ARCHITECTURE
第二章 – 词汇的新分类和信息架构
Based on grammatical function, morphology, and meaning, modern Chinese words are divided into nouns, pronouns, verbs, adjectives, adverbs, prepositions, quantifiers, onomatopoeia, etc. [8]. On this basis, the author dismantles and analyzes the information carried by each type of lexical chunk, then reclassifies the chunks into data, structure, pointer, and task chunks according to their functions and the roles they play in information transmission. As shown in Fig. 2, a task chunk is built up from data, structure, and pointer chunks.
根据语法功能、形态标准、意思标准,现代汉语词典分为名词、代词、动词、形容词、副词、介词、量词、词组、词组等[8]。在此基础上,作者根据各词法区块的功能和在信息传输中所起的作用,对每种词法区块所携带的信息进行拆解和分析,然后将它们重新分类为数据块、结构块、指针块和任务区块。如图 2 所示,任务区块由数据、结构和指针区块生成。
Fig. 2 – New classification of lexical chunks
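The composition described above, in which a task chunk is built up from data, structure, and pointer chunks, can be sketched as a small data model. This is only an illustrative sketch, not the author's implementation: the class names (`ChunkKind`, `Chunk`, `TaskChunk`) are hypothetical, and the assignment of the sample words "it", "is", and "red" to particular chunk kinds is an assumption made purely for demonstration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChunkKind(Enum):
    """The three building-block chunk kinds named in the text."""
    DATA = "data"
    STRUCTURE = "structure"
    POINTER = "pointer"


@dataclass(frozen=True)
class Chunk:
    """A word or phrase tagged with its chunk kind (hypothetical model)."""
    text: str
    kind: ChunkKind


@dataclass
class TaskChunk:
    """A task chunk assembled from data, structure, and pointer chunks."""
    parts: list[Chunk] = field(default_factory=list)

    def by_kind(self, kind: ChunkKind) -> list[str]:
        """Return the texts of all parts of the given kind, in order."""
        return [c.text for c in self.parts if c.kind is kind]


# Assumed tagging, for illustration only; the source does not classify
# these particular words.
task = TaskChunk([
    Chunk("it", ChunkKind.POINTER),
    Chunk("is", ChunkKind.STRUCTURE),
    Chunk("red", ChunkKind.DATA),
])

print(task.by_kind(ChunkKind.DATA))
```

Keeping the kind as an explicit tag on each part makes it easy to ask which portion of a task chunk carries data and which portion only organizes or refers.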
Due to constraints of ability and time, the author cannot list all categories, items, and usage scenarios, and only presents the approach for reference and discussion. The items marked with an asterisk in Fig. 2 are expounded in more detail below.
由于能力和时间的限制,作者不能列出所有类别、项目和使用方案,而只能提供供参考和讨论的方法。下面将更详细地阐述图2中标有星号的项目。
A. Data Chunks
2.1 – 数据块
Most words in natural language are data-oriented. To avoid misunderstanding, we call this type of word or phrase a data chunk rather than an information chunk. There is an obvious hierarchical structure among data chunks, and the concepts of elements and sets from discrete mathematics express this hierarchy well. The classification of data chunks in Fig. 3 is based on structure. For ease of understanding, we first introduce each sub-classification through examples and then elaborate on it from the perspective of structure. The hierarchical structure of data chunks is the key to NLU.
自然语言中的大多数单词都是面向数据的。为了避免误解,我们使用数据块来命名这种类型的单词和短语,而不是信息区块。数据块之间有明显的层次结构,离散数学中元素和集合的概念可以完美地表达这种分层结构。图 3 中数据区块的分类基于结构,为了理解,我们首先通过示例了解每个子分类,然后从结构的角度详细阐述它们。数据块的分层结构是 NLU 的关键。
- Attribute Chunks: Attribute chunks can be words or phrases; they represent attribute information abstracted from entities. According to the abstraction level and method, they can be further divided into the following three types.
属性块:属性块可以是单词或短语,它们表示从实体提取的属性信息。根据不同的抽象层次和方法,可以进一步分为以下三种类型。
a) Descriptive and Positional Attribute Chunks: Words and phrases of this type represent attribute information. The attribute information perceived by the human nervous system is the most basic kind. It comprises positional attribute information and descriptive attribute information, which map the information of space, time, and matter in the real world. The author prefers to classify descriptive attributes according to the basic information perception systems studied in the neurobiology of the brain. These are the visual system (color, shape), the chemical senses (taste, or gustation, and smell, or olfaction), the auditory and vestibular systems, and the somatic sensory system [1]. The neural-network-based population coding mechanism widely used in the cerebral cortex supports myriad activities of recognition, classification, and memory in the human brain [1].
- 描述属性块和定位属性块:这种类型的单词和短语是属性信息的表示形式。人类神经系统感知的属性信息是最基础的属性信息。它们是位置属性信息和描述性属性信息,它们映射了现实世界中的空间、时间和物质信息。作者倾向于在研究大脑神经生物学中的基本信息感知系统的基础上对描述性属性进行分类。基本信息感知系统是视觉系统(颜色、形状)、化学感官系统(味觉、阵风、嗅觉、嗅觉)、听觉和前庭系统以及体细胞感知系统[1]。大脑皮层中广泛使用的基于神经网络的人口编码机制允许在人脑中进行无数的识别、分类和记忆活动[1]。
It is easy to assign the related attribute words to the sensory systems above; see the examples below:
很容易将相关属性词分类到上述感觉系统中,请参阅下面的示例:
- Color {red, blue, green, orange}
- Shape {square, round, cubic}
- Taste {sweet, salty, sour, bitter}
- Smell {smelly, balmy, pungent, apple-flavored}
- Somatic sensation {smooth, soft, furry}
The attribute words in the braces above are adjectives. Besides this descriptive attribute information, human beings have also invented a great deal of extended attribute information, such as:
上面的大括号中的属性词是形容词。除了上述描述性属性信息外,人类还发明了许多扩展属性信息,例如:
- Kinds {fruit, vegetable, meat}
- Kinship {father, mother, son, wife}
- Profession {teacher, doctor, police, chef}
- Titles {professor, teaching assistant, student}
Note that most of the extended attribute words in the braces above are nouns, not adjectives.
需要注意的是,上述大括号中的大多数扩展属性词不是形容词,而是名词。
b) Attribute Space Chunks (ASCs): ASCs represent specific attribute spaces (ASs). In the lists above, the words outside the braces (Color, Shape, Taste, and so on; underlined in the original paper) are the ASCs. An ASC is a set whose elements all have similar properties; sets with this kind of constituent form are called lengthwise sets.
2) 属性空间块 (ASC):ASC 是特定 AS 的表示形式。在上面的段落中,大括号外的下划线单词是 ASC。ASC 是一种集合,它的所有元素都具有类似的属性,具有这种组成形式的集合称为纵向集合。
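The element-and-set relationship between attribute words and ASCs can be illustrated with ordinary sets. The sketch below is one possible encoding, not the author's: the word lists are copied from the examples above, while the names `ATTRIBUTE_SPACES` and `spaces_of` are hypothetical.

```python
# Each attribute space chunk (ASC) names a set of attribute words.
# The words are taken from the examples in the text; the dictionary
# layout and function name are assumptions for illustration.
ATTRIBUTE_SPACES: dict[str, frozenset[str]] = {
    "Color": frozenset({"red", "blue", "green", "orange"}),
    "Shape": frozenset({"square", "round", "cubic"}),
    "Taste": frozenset({"sweet", "salty", "sour", "bitter"}),
    "Smell": frozenset({"smelly", "balmy", "pungent", "apple-flavored"}),
    "Somatic sensation": frozenset({"smooth", "soft", "furry"}),
    "Kinds": frozenset({"fruit", "vegetable", "meat"}),
    "Kinship": frozenset({"father", "mother", "son", "wife"}),
    "Profession": frozenset({"teacher", "doctor", "police", "chef"}),
    "Titles": frozenset({"professor", "teaching assistant", "student"}),
}


def spaces_of(word: str) -> list[str]:
    """Return the names of all attribute spaces that contain `word`."""
    return sorted(
        name for name, elements in ATTRIBUTE_SPACES.items() if word in elements
    )


print(spaces_of("sweet"))    # ['Taste']
print(spaces_of("teacher"))  # ['Profession']
```

Returning a list rather than a single name leaves room for a word that belongs to more than one attribute space.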
c) Verbs: Verbs represent change features abstracted from sequences of attribute information. Only the abstracted change features are recorded and encoded as verbs; the corresponding sequences of attribute information are not recorded in memory. Thus, this type of set is called a change feature set.
3) 动词:动词是从属性信息序列中抽象出来的变化特征的表征。最后,将抽象变化特征记录为动词并编码为动词,但相应的属性信息序列不会记录在记忆表单中中。因此,这种类型的集合称为变化特征集合
- Fall – represents the change features abstracted from a sequence of spatial-position attribute information in the spatial AS.
- Sweetened – represents the change features abstracted from a sequence of taste attribute information in the taste AS.
- Run – represents the change features abstracted from sequences of spatial-position information, distance attribute information, body posture information, etc. in the corresponding ASs.
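The idea that a verb records only the change feature, and not the underlying sequence of attribute values, can be made concrete with a toy example for "fall". This is a minimal sketch under stated assumptions: spatial position is reduced to a single height value, the function name `change_feature` is hypothetical, and the "rise" label is added only for contrast and does not appear in the source.

```python
from typing import Optional, Sequence


def change_feature(heights: Sequence[float]) -> Optional[str]:
    """Abstract a verb-like label from a sequence of height readings.

    Returns "fall" if every step decreases, "rise" if every step
    increases, and None otherwise. Only the label is returned; the
    sequence itself is not stored anywhere.
    """
    if len(heights) < 2:
        return None  # a single reading carries no change information
    deltas = [later - earlier for earlier, later in zip(heights, heights[1:])]
    if all(d < 0 for d in deltas):
        return "fall"
    if all(d > 0 for d in deltas):
        return "rise"
    return None


print(change_feature([10.0, 7.5, 3.0, 0.0]))  # fall
```

A verb such as "run" would need several attribute sequences at once (position, distance, posture), so a realistic detector would take more than one input stream.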
d) Measurement Chunks: Compared with the basic and extended attribute chunks, measurement chunks abstract once more, capturing the group characteristics of basic attribute information of the same kind, of extended attribute information, or of selected measuring clusters. There are various abstraction methods and standards. According to the frame of reference, measurement can be divided into subjective measurement and objective measurement. According to the number of measurement dimensions (or ASs), it can be divided into single-dimension measurement and multi-dimension measurement. If the measuring result has only two values, such as "like" and "dislike", "agree" and "not agree", or "yes" and "no", it can be called a binary measurement; if the measuring result has multiple values, such as "good", "better", "best" or "fast", "faster", "fastest", it can be called a distribution measurement. Some examples are given below for better understanding.
4) 测量区块:与基本属性块和扩展属性区块进行比较,测量区块再次抽象出同类型基本属性信息、扩展属性信息或所选测量群集的组特征。有多种抽象方法和标准:根据不同的参照系,可分为主观测量和客观测量,根据测量尺寸(或 AS)的数量,可分为单维测量和多维测量。如果测量结果只有两个值,如"喜欢"和"不喜欢",“同意"和"不同意”,“是"和"否”,它们可以称为二元测量;如果测量结果具有多个值,如"好",“更好”,“最好"和"快”,“快”,“最快”,它们可以被称为分布测量。下面提供了一些示例,以便更好地理解。
- Distribution measurement: Describes the distribution area of target objects after statistical analysis of the attribute information in the selected measuring area. Table I gives a distribution model that describes the data distribution; each distribution area has corresponding measurement words, and Table I lists some examples of them.
分布度量:在对所选测量区域的属性信息执行统计分析后,描述目标对象的分布区域。表一中给出一个分布模型来描述数据分布,每个分布区域都有相应的度量词词来描述数据分布。表一中提供了分布度量词的一些示例。
- Objective measurement: This type of measurement adopts a unified measuring standard to minimize differences between individuals in how the same thing is recognized. E.g., Area (km², m²), Speed (m/s, km/h), Temperature (℃, ℉), Weight (g, kg, ton), Pressure (Pa), etc.
- 客观度量:采用统一测量标准,最大限度地降低不同个体间对同一事物的识别误差。 例如:面积(km², m²), 速度(m/s, km/h), 温度(℃, ℉), 重量(g, kg, ton), 压力(Pa), 等。
- Quantity/Order/Ranking measurement: We assume that quantity measurement is based on shape and spatial attributes, order measurement is based on quantity, time, spatial-position, and other attributes, and ranking measurement is based on order and other attributes.
- 数量/顺序/排名度量:我们假设数量度量基于形状和空间属性,顺序度量基于数量、时间、空间位置和其他属性,排名度量基于顺序和其他属性。
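The three classification axes for measurement chunks described above (frame of reference, number of dimensions, and number of result values) can be summarized in a small record type. This is a sketch only: the `Measurement` class and its field names are hypothetical, and the attribute spaces assigned to the two examples are assumptions for illustration rather than claims from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A measurement described along the three axes named in the text."""
    name: str
    values: tuple[str, ...]   # possible measuring results
    spaces: tuple[str, ...]   # attribute spaces (dimensions) involved
    objective: bool           # True if a unified standard is used

    def __post_init__(self) -> None:
        if len(set(self.values)) < 2:
            raise ValueError("a measurement needs at least two distinct values")
        if not self.spaces:
            raise ValueError("a measurement needs at least one dimension")

    @property
    def value_kind(self) -> str:
        """Two result values: binary; more than two: distribution."""
        return "binary" if len(set(self.values)) == 2 else "distribution"

    @property
    def dimension_kind(self) -> str:
        return "single-dimension" if len(self.spaces) == 1 else "multi-dimension"

    @property
    def frame(self) -> str:
        return "objective" if self.objective else "subjective"


# Assumed examples: the attribute spaces chosen here are illustrative.
preference = Measurement("preference", ("like", "dislike"), ("taste",), False)
speed = Measurement(
    "speed", ("fast", "faster", "fastest"), ("spatial position", "time"), False
)

print(preference.value_kind, preference.dimension_kind, preference.frame)
print(speed.value_kind, speed.dimension_kind, speed.frame)
```

Treating the three axes as independent properties reflects the text's point that the same measurement can be classified in several ways at once.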
Looking back at the attribute chunks above, ASCs and verbs are a further abstract representation of clusters of basic attribute information. Measurement chunks are a further classification of basic attribute information, ASCs, and verbs. Therefore, the attribute chunks in natural language together form a classification coding system for attribute information. Understanding this attribute information coding mechanism is key to NLU. Meanwhile, applying the coding mechanism in natural language greatly reduces the number of words needed and improves expression efficiency.
回顾上述所有属性区块,ASC 和谓词是基本属性信息群集的再次抽象表征。度量词块是在基本属性信息、ASC 和动词的再一次抽象分类。因此,自然语言中的所有属性区块也是属性信息的分类编码系统。对属性信息编码机制的理解是NLU的关键。同时,编码机制在自然语言中的应用,大大减少了单词的数量,提高了表达效率。