Python and Regular Expressions [0] -> Regular Expression Matching with the re Module

2022-10-11 16:04:51

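Regular Expression

Regular expression patterns
Overview of the re module
Matching with regular expressions

A regular expression (often abbreviated regex, regexp, or RE) is a concept from computer science. It uses a single string to describe, or match, a whole set of strings that follow some syntactic rule. Many text editors use regular expressions to search for and replace text that matches a pattern.

1 Regular Expression Patterns

The core of working with regular expressions is building a pattern. A pattern is usually made up of special characters and symbols, which are listed below.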

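Character	Description
\	Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline. The sequence "\\" matches "\", and "\(" matches "(".
^	Matches the start of the input string. In multiline mode (re.MULTILINE), ^ also matches immediately after each "\n".
$	Matches the end of the input string. In multiline mode (re.MULTILINE), $ also matches just before each "\n".
*	Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" as well as "zoo". Equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but not "z". Equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" as well as "does". Equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob" but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}	m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.
?	When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, against the string "oooo", "o+?" matches a single "o", while "o+" matches all the o's.
.	Matches any single character except "\n". To match any character including "\n", use a pattern such as "(.|\n)", or set the re.S (DOTALL) flag.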
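(pattern)	Matches pattern and captures the match. In Python the captured text is read from the match object with group() or groups(). To match literal parentheses, use "\(" or "\)".
(?:pattern)	Matches pattern but does not capture the result; this is a non-capturing group that is not stored for later use. It is useful when combining parts of a pattern with the alternation character "|". For example, "industr(?:y|ies)" is a more compact expression than "industry|industries".
(?=pattern)	Positive lookahead: matches at any position where the text that follows matches pattern. This is a non-capturing match. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000" but not the "Windows" in "Windows3.1". A lookahead consumes no characters: after a match, the search for the next match starts right after the last match, not after the characters that made up the lookahead.
(?!pattern)	Negative lookahead: matches at any position where the text that follows does not match pattern. This is a non-capturing match. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1" but not the "Windows" in "Windows2000". A lookahead consumes no characters: after a match, the search for the next match starts right after the last match, not after the characters that made up the lookahead.
(?<=pattern)	Positive lookbehind: like positive lookahead, but looking in the opposite direction. For example, "(?<=95|98|NT)Windows" matches the "Windows" in "98Windows" but not the "Windows" in "3.1Windows". Note that Python's re requires the lookbehind pattern to have a fixed width, so all alternatives must have the same length (mixing "95" and "2000" raises an error).
(?<!pattern)	Negative lookbehind: like negative lookahead, but looking in the opposite direction. For example, "(?<!95|98|NT)Windows" matches the "Windows" in "3.1Windows" but not the "Windows" in "98Windows". The same fixed-width rule applies.
(?P<name>pattern)	Named group: a capturing group tagged with the name name.
(?P=name)	Refers to what the named group matched. Note that it stands for the matched text, not for the group's pattern.
(?aiLmsux)	Inline flags that can be embedded at the start of a pattern to control the matching mode. For example, "(?im)Python" matches case-insensitively in multiline mode.
(?(id/name)Y|N)	If the group with the given id or name matched, the pattern Y is used, otherwise N. The "|N" part is optional.
x|y	Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".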
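[xyz]	Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]	Negated character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]	Character range. Matches any character in the given range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
[^a-z]	Negated character range. Matches any character not in the given range. For example, "[^a-z]" matches any character outside "a" to "z".
\b	Matches a word boundary, that is, the position between a word character and a non-word character such as a space. For example, "er\b" matches the "er" in "never" but not the "er" in "verb".
\B	Matches a non-word boundary. "er\B" matches the "er" in "verb" but not the "er" in "never".
\cx	Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. x must be one of A-Z or a-z. This escape comes from other regex dialects; Python's re module does not support it.
\d	Matches a digit character. Equivalent to [0-9] in ASCII mode; by default, str patterns also match other Unicode digits.
\D	Matches a non-digit character. Equivalent to [^0-9] in ASCII mode.
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v] in ASCII mode.
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] in ASCII mode.
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.
\w	Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]" in ASCII mode.
\W	Matches any non-word character. Equivalent to "[^A-Za-z0-9_]" in ASCII mode.
\xn	Matches n, where n is a hexadecimal escape value that must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". ASCII codes can be used in regular expressions this way.
\num	Matches num, where num is a positive integer: a backreference to a captured group. For example, "(.)\1" matches two consecutive identical characters.
\n	Identifies an octal escape or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape.
\nm	Identifies an octal escape or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml	If n is an octal digit (0-3) and both m and l are octal digits (0-7), matches the octal escape value nml.
\un	Matches n, where n is a Unicode character written as four hexadecimal digits. For example, \u00A9 matches the copyright sign (©).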
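Several entries in the table can be checked directly in Python. The snippet below is a minimal sketch: the sample strings are the ones used in the table, and the variable names are only illustrative.

```python
import re

# Quantifiers: "zo*" matches "z" and "zoo"; "zo+" needs at least one "o"
star_hits = [bool(re.fullmatch(r'zo*', s)) for s in ('z', 'zoo')]
plus_hits = [bool(re.fullmatch(r'zo+', s)) for s in ('z', 'zo', 'zoo')]

# Greedy vs. non-greedy against "oooo"
greedy = re.match(r'o+', 'oooo').group()
lazy = re.match(r'o+?', 'oooo').group()

# Word boundary: "er\b" is found in "never" but not in "verb"
boundary = [bool(re.search(r'er\b', s)) for s in ('never', 'verb')]

# Backreference: "(.)\1" finds two identical consecutive characters
doubled = re.search(r'(.)\1', 'food').group()

# Lookahead: "Windows" matches only when one of the listed versions follows
lookahead = [bool(re.search(r'Windows(?=95|98|NT|2000)', s))
             for s in ('Windows2000', 'Windows3.1')]

print(star_hits, plus_hits, greedy, lazy, boundary, doubled, lookahead)
```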

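2 The re Module

2.1 Constants

2.1.1 I / IGNORECASE

Value: 2

Purpose: case-insensitive matching

2.1.2 L / LOCALE

Value: 4

Purpose: makes \w, \W, \b, \B depend on the current locale (in Python 3 this flag can only be used with bytes patterns)

2.1.3 M / MULTILINE

Value: 8

Purpose: multiline matching, where ^ and $ also match at the start and end of each line

2.1.4 S / DOTALL

Value: 16

Purpose: lets "." match the newline character "\n" as well

2.1.5 U / UNICODE

Value: 32

Purpose: interprets characters according to the Unicode character set, which affects \w, \W, \b, \B; this is the default for str patterns and can be switched to ASCII

2.1.6 X / VERBOSE

Value: 64

Purpose: whitespace in the pattern is ignored and # starts a comment, which makes patterns easier to read

2.1.7 A / ASCII

Value: 256

Purpose: makes \w, \W, \b, \B, \d, \D (and \s, \S) match ASCII characters only, instead of the default full Unicode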
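The constants above are integer-like flags, so several can be combined with the bitwise OR operator. The following is a small sketch under that reading; the sample text and variable names are made up for illustration.

```python
import re

# Flags combine with | ; here: ignore case + multiline
combined = re.I | re.M
pattern = re.compile(r'^the', combined)
starts = pattern.findall('The first line\nthe second line')

# The inline form (?im) sets the same two flags inside the pattern itself
inline = re.findall(r'(?im)^the', 'The first line\nthe second line')

# re.A restricts \d to ASCII digits; by default \d also matches other Unicode digits
unicode_digits = re.findall(r'\d', '7\u0663')      # '\u0663' is ARABIC-INDIC DIGIT THREE
ascii_digits = re.findall(r'\d', '7\u0663', re.A)

print(int(re.I), int(re.M), int(combined))
print(starts, inline, unicode_digits, ascii_digits)
```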

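2.2 Functions

2.2.1 The compile() function

Call: pt = re.compile(pattern, flags=0)

Purpose: compiles a regular expression and returns a compiled pattern object

Parameters: pattern, flags

pattern: str, the regular expression to compile

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: the compiled regular expression object

Note: compiling turns the pattern string into an internal compiled form once, which helps performance. When the same regular expression is used many times, compile it once rather than passing the string on every call. The re module also caches recently compiled patterns, so the gain matters mostly when many different patterns are in use.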
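As an illustration of compiling once and reusing the result, here is a minimal sketch. The date-like pattern and the sample lines are invented for the example.

```python
import re

# Compile once, reuse many times
date_pat = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

lines = ['released 2022-10-11', 'no date here', 'updated 2023-01-05']
found = []
for line in lines:
    m = date_pat.search(line)   # same result as re.search(date_pat, line)
    if m:
        found.append(m.groups())

print(found)
```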

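2.2.2 The match() function

Call: pt = re.match(pattern, string, flags=0)

Purpose: tries to match the pattern at the beginning of the string

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: a match object, or None if the match fails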

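2.2.3 The fullmatch() function

Call: pt = re.fullmatch(pattern, string, flags=0)

Purpose: matches from the beginning of the string and succeeds only if the whole string matches the pattern

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: a match object, or None if the match fails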

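2.2.4 The search() function

Call: pt = re.search(pattern, string, flags=0)

Purpose: scans the string and returns the first location where the pattern matches

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: a match object, or None if the match fails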
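The difference between match(), fullmatch(), and search() is easiest to see on one string. The following sketch uses an arbitrary sample text.

```python
import re

text = 'In The End'

at_start = re.match(r'The', text)            # None: "The" is not at position 0
anywhere = re.search(r'The', text)           # found further into the string
whole = re.fullmatch(r'The', text)           # None: pattern does not cover the whole string
whole_ok = re.fullmatch(r'In \w+ End', text)

print(at_start, anywhere.span(), whole, whole_ok.group())
```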

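2.2.5 The sub() function

Call: pt = re.sub(pattern, repl, string, count=0, flags=0)

Purpose: finds the matches of the pattern in the target string and replaces them with the given replacement

Parameters: pattern, repl, string, count, flags

pattern: str or compiled pattern object, the regular expression

repl: str or function, the replacement string, or a function that returns a string

string: str, the string to search and replace in

count: int, the maximum number of replacements; by default every match is replaced

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: str, the string after replacement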

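2.2.6 The subn() function

Call: pt = re.subn(pattern, repl, string, count=0, flags=0)

Purpose: same as sub(), but also returns the number of replacements made

Parameters: pattern, repl, string, count, flags

pattern: str or compiled pattern object, the regular expression

repl: str or function, the replacement string, or a function that returns a string

string: str, the string to search and replace in

count: int, the maximum number of replacements; by default every match is replaced

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: (pt, n)

pt: str, the string after replacement

n: int, the number of replacements made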
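A short sketch of sub() and subn() side by side, including the count argument and a function as repl. The sample string is arbitrary.

```python
import re

s = 'This is X, X is acting'

all_replaced = re.sub(r'X', 'LIKE', s)
first_only = re.sub(r'X', 'LIKE', s, count=1)   # stop after one replacement
with_count = re.subn(r'X', 'LIKE', s)           # (new string, number of replacements)

# repl may also be a function that receives the match object
shouted = re.sub(r'\b\w{6}\b', lambda m: m.group().upper(), s)

print(all_replaced, first_only, with_count, shouted, sep='\n')
```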

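2.2.7 The split() function

Call: pt = re.split(pattern, string, maxsplit=0, flags=0)

Purpose: splits the target string at the places where the regular expression matches

Parameters: pattern, string, maxsplit, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to split

maxsplit: int, the maximum number of splits; by default the string is split at every match

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: list, the list of substrings after splitting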
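The effect of maxsplit, and of a capturing group in the separator pattern, can be sketched as follows. The input string is made up for the example.

```python
import re

s = 'one, two;three ,four'

parts = re.split(r'\s*[,;]\s*', s)
limited = re.split(r'\s*[,;]\s*', s, maxsplit=1)   # only the first separator is used
# A capturing group in the pattern keeps the separators in the result
with_seps = re.split(r'\s*([,;])\s*', s)

print(parts, limited, with_seps, sep='\n')
```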

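2.2.8 The findall() function

Call: pt = re.findall(pattern, string, flags=0)

Purpose: returns a list of all non-overlapping matches

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: list, the matched strings

Note: if the pattern contains exactly one subgroup, the list holds the text of that subgroup; if it contains several subgroups, each list element is a tuple whose items are the texts matched by the subgroups.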
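How the number of subgroups changes what findall() returns can be seen in this small sketch, using an invented key=value string.

```python
import re

s = 'a=1, b=22, c=333'

no_group = re.findall(r'\w=\d+', s)         # whole matches
one_group = re.findall(r'\w=(\d+)', s)      # only the subgroup's text
two_groups = re.findall(r'(\w)=(\d+)', s)   # tuples, one item per subgroup

print(no_group, one_group, two_groups, sep='\n')
```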

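2.2.9 The finditer() function

Call: pt = re.finditer(pattern, string, flags=0)

Purpose: returns an iterator over all non-overlapping matches

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: pt

pt: iterator, yielding one match object per match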
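Because finditer() yields match objects rather than strings, position information is available for every hit. A minimal sketch with an arbitrary sample sentence:

```python
import re

s = 'In The End, these things merged'

# Collect the matched text together with its start index
positions = [(m.group(), m.start()) for m in re.finditer(r'the', s, re.I)]

print(positions)
```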

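2.3 Classes

2.3.1 The __Regex class (compiled pattern object)

Instantiation: pt = re.compile(pattern, flags=0)

Purpose: the compiled pattern object offers the basic functions of the re module as methods, such as search/match/sub

Parameters: pattern, flags (as for compile() in 2.2.1)

Returns: pt, the compiled pattern object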
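The methods of a compiled pattern mirror the module-level functions, minus the pattern argument. The following sketch assumes a simple digit pattern chosen for illustration.

```python
import re

pat = re.compile(r'\d+')
s = 'a1b22c333'

first = pat.search(s).group()
# Pattern methods accept an optional start position, which the module-level functions do not
later = pat.search(s, 2).group()
everything = pat.findall(s)
replaced = pat.sub('#', s)

print(first, later, everything, replaced)
```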

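2.3.2 The __Match class (match object)

Instantiation: mt = re.match/search(pattern, string, flags=0)

Purpose: holds the result of a successful match

Parameters: pattern, string, flags

pattern: str or compiled pattern object, the regular expression

string: str, the string to match against

flags: int, one or more of the re flags (A, I, L, M, S, U, X) to control the matching mode

Returns: mt

mt: a match object, or None if the match fails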

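2.3.2.1 The group() method

Call: pt = mt.group(num=0)

Purpose: returns the whole match or the specified subgroup

Parameters: num

num: int or str, the subgroup number or name; 0 (the default) means the whole match

Returns: pt

pt: str, the text of the whole match or of the subgroup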

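2.3.2.2 The groups() method

Call: pt = mt.groups()

Purpose: returns a tuple containing all the subgroups of the match

Parameters: none

Returns: pt

pt: tuple, the text matched by each subgroup; an empty tuple if the pattern has no subgroups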

2.3.2.3 The groupdict() method

Call: pt = mt.groupdict()

Purpose: returns a dict of all the named subgroups of the match

Parameters: none

Returns: pt

pt: dict, mapping each subgroup name to the text it matched

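2.3.2.4 The span() method

Call: pt = mt.span(group=0)

Purpose: returns a tuple with the start and end positions of the match within the searched string

Parameters: group

group: int, the subgroup number

Returns: pt

pt: tuple, the position information, (start, end)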
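The four match-object methods described above can be tried together on one match. This is a minimal sketch; the pattern reuses the named groups from the examples in this article, and the input string is made up.

```python
import re

mt = re.search(r'(?P<digit>\d{3})-(?P<char>\w{4})', 'id: 123-abcd;')

whole = mt.group()           # same as mt.group(0)
by_number = mt.group(1)
by_name = mt.group('char')
several = mt.group(1, 2)     # a tuple when more than one group is requested
all_groups = mt.groups()
named = mt.groupdict()
whole_span = mt.span()
char_span = mt.span('char')

print(whole, by_number, by_name, several, all_groups, named, whole_span, char_span)
```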

3 Matching with Regular Expressions

3.1 Examples of common methods

Below are usage examples for commonly used re patterns and functions.

The complete code is as follows:

 import re

 """

 Regular Expression

 """

 def regexp(pattern, target, *args, grp='group', prt=True, func='search'):

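     pat = re.compile(pattern)  # compile once and call the chosen method on the compiled object

     try:

         # e.g. pat.search(target).group(*args)

         r = getattr(getattr(pat, func)(target), grp)(*args)

     except AttributeError:

         # the match returned None, so there is no group()/groups() to call

         r = None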

     if prt:

         print(r)

     return r

 # Use . to match all

 print(30*'-')

 regexp('.+', 'exit soon') # 'exit soon'

 # Use () to make sub-groups

 print(30*'-')

 regexp('(.+) (.+)', 'exit soon', 1) # 'exit'

 regexp('(.+) (.+)', 'exit soon', 2) # 'soon'

 regexp('(.+) (.+)', 'exit soon', grp='groups')  # ('exit', 'soon')

 # Use ^ to search from head

 print(30*'-')

 regexp('^The', 'The End')   # 'The'

 regexp('^The', 'In The End')    # None

 # Use \b to search boundary

 print(30*'-')

 regexp(r'\bThe\b', 'In The End') # 'The'

 regexp(r'\bThe', 'In TheEnd')   # 'The'

 regexp(r'The\b', 'In TheEnd')   # None

 # match and search

 print(30*'-')

 regexp('The', 'In The End', func='search') # 'The'

 regexp('The', 'In The End', func='match')   # None

 # findall and finditer

 # Note:

 # findall returns a list of the matched strings

 # finditer returns an iterator of match objects

 # re.IGNORECASE makes the match case-insensitive

 print(30*'-')

 print(re.findall('The', 'In The End, these things merged', re.I)) # ['The', 'the']

 itera = re.finditer('The', 'In The End, these things merged', re.I)

 for x in itera:

     print(x.group())                                        # 'The', then 'the'

 # sub and subn

 print(re.sub('X', 'LIKE', 'This is X, X is acting'))    # 'This is LIKE, LIKE is acting'

 print(re.subn('X', 'LIKE', 'This is X, X is acting'))   # ('This is LIKE, LIKE is acting', 2)

 # split: split(re-expression, string)

 print(re.split(', |\n', 'This is amazing\nin the end, those things merged'))    #['This is amazing', 'in the end', 'those things merged']

 # \N: use \N to represent sub group, N is the number of sub group

 print(re.sub(r'(.{3})-(.{3})-(.{3})', r'\2-\3-\1', '123-def-789'))  # 'def-789-123'

 # (?P<name>): similar to \N, add tag name for each sub group,

 # and use \g<name> to fetch sub group

 print(re.sub(r'(?P<first>\d{3})-(?P<second>\d{3})-(?P<third>\d{3})', r'\g<second>-\g<third>-\g<first>', '123-456-789')) # 456-789-123

 # (?P=name): use this expression to reuse former sub group result

 # Note: this expression only get the matched result, not the re pattern

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-abc-123'))    # Match obj, '123-abc-abc-123'

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-def-456'))    # None

 # Note: should use (?P=name) in a re expression with former named group

 print(re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato'))  # 'YYXXXYY'

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd')) # not substituted, prints the literal text: (?P=char)-(?P=digit)

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))    # 'abcd-123'

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))   # 'abcd-123'

 # groupdict(): return dict of named pattern

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').groupdict())    # {'digit': '123', 'char': 'abcd'}

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('char'))  # 'abcd'

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('digit')) # '123'

 # re extensions

 # use (?aiLmsux): re.A, re.I, re.L, re.M, re.S, re.X

 # use (?imsx)

 # (?i) --> re.I/re.IGNORECASE

 print(re.findall(r'(?i)yes', 'yes, Yes, YES'))  # ['yes', 'Yes', 'YES']

 # (?m) --> re.M/re.MULTILINE: match multiline, ^ for head of line, $ for end of line

 print(re.findall(r'(?im)^The|\w+e$', 'The first day, the last one,\nthe great guy, the puppy love'))  # ['The', 'the', 'love']

 # (?s) --> re.S/re.DOTALL: dot(.) also matches \n (by default it does not)

 print(re.findall(r'(?s)Th.+', "This is the biggest surprise\nI'd ever seen"))  # ["This is the biggest surprise\nI'd ever seen"]

 # (?x) --> re.X/re.VERBOSE: make re ignore the blanks and comments after '#' of re pattern

 # This extension can make re pattern easy to read and you can add comments as you want

 print(re.search(r'''(?x)

                     \((\d{3})\) # Area code

                     [ ]         # blank

                     (\d{3})     # prefix

                     -           # connecting line

                     (\d{4})     # suffix

                     ''', '(123) 456-7890').groups())    # ('123', '456', '7890')

 # (?:...): make a sub group that no need to save and never use later

 print(re.findall(r'(?:\w{3}\.)(\w+\.com)', 'www.google.com'))   # ['google.com']

 # (?=...) and (?!...)

 # (?=...): match the pattern which end with ..., pattern should place before (?=...)

 print(re.findall(r'\d{3}(?=Start)', '222Start333, this is foo, 777End666')) # ['222']

 # (?!...): match the pattern which not end with ..., pattern should place before (?!...)

 print(re.findall(r'\d{3}(?!End)', '222Start333, this is foo, 777End666')) # ['222', '333', '666']

 # (?<=...) and (?<!...)

 # (?<=...): match the pattern which start with ..., pattern should place after (?<=...)

 print(re.findall(r'(?<=Start)\d{3}', '222Start333, this is foo, 777End666')) # ['333']

 # (?<!...): match the pattern which does not start with ..., pattern should be placed after (?<!...)

 print(re.findall(r'(?<!End)\d{3}', '222Start333, this is foo, 777End666')) # ['222', '333', '777']

 # (?(id/name)Y|X): if sub group \id or name exists, match Y, otherwise match X

 # Below code first match the first char, if 'x' matched, store a sub group, if 'y', not to store

 # then match second char, if sub group stored('x' matched), match 'y', otherwise match 'x', finally return result

 print(re.search(r'(?:(x)|y)(?(1)y|x)', 'yx'))

 # Greedy match: '+' and '*' act greedy match, appending '?' to make no greedy match

 print(re.search(r'.+(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123

 print(re.search(r'.+?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123

 print(re.search(r'.*(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123

 print(re.search(r'.*?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123

 # match and fullmatch

 print(re.match(r'This is a full match', 'this is a full match string', re.I))   # Match object, match='this is a full match'

 print(re.fullmatch(r'This is a full match', 'this is a full match string', re.I))   # None

 # span function

 print(re.search(r'Google', 'www.google.com', re.I).span())  # (4, 10)

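Step-by-step explanation

After importing the re module, a helper function regexp is defined for convenience. It compiles the regular expression, runs the match, and prints the result. With the default arguments it matches with search and calls group on the result to get the matched text; both defaults can be changed through the arguments passed to regexp.

Note: this trades some readability for convenience in the examples that follow.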

 import re

 """

 Regular Expression

 """

 def regexp(pattern, target, *args, grp='group', prt=True, func='search'):

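     pat = re.compile(pattern)  # compile once and call the chosen method on the compiled object

     try:

         # e.g. pat.search(target).group(*args)

         r = getattr(getattr(pat, func)(target), grp)(*args)

     except AttributeError:

         # the match returned None, so there is no group()/groups() to call

         r = None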

     if prt:

         print(r)

     return r
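For readers who find the nested getattr calls hard to follow, the same helper can be written more explicitly. The function name regexp_explicit is hypothetical and not part of the original code; this is only a sketch of equivalent behavior.

```python
import re

def regexp_explicit(pattern, target, *args, grp='group', prt=True, func='search'):
    """Hypothetical, more explicit variant of the regexp helper."""
    pat = re.compile(pattern)
    matcher = getattr(pat, func)      # e.g. pat.search or pat.match
    m = matcher(target)
    # A failed match returns None, so there is nothing to call group() on
    r = getattr(m, grp)(*args) if m is not None else None
    if prt:
        print(r)
    return r

second = regexp_explicit('(.+) (.+)', 'exit soon', 2, prt=False)
both = regexp_explicit('(.+) (.+)', 'exit soon', grp='groups', prt=False)
missing = regexp_explicit('^The', 'In The End', prt=False)
```

Here second holds the text of subgroup 2, both holds the tuple of all subgroups, and missing is None because the match fails.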

The regexp function above is used for the matching examples that follow; the comments show the output.

Use "." to match any character except "\n"

 # Use . to match all

 print(30*'-')

 regexp('.+', 'exit soon\nHurry up.') # 'exit soon'

Use "()" to create capturing subgroups

 # Use () to make sub-groups

 print(30*'-')

 regexp('(.+) (.+)', 'exit soon', 1) # 'exit'

 regexp('(.+) (.+)', 'exit soon', 2) # 'soon'

 regexp('(.+) (.+)', 'exit soon', grp='groups')  # ('exit', 'soon')

Use "^" to match from the start of the string ("$" matches at the end)

 # Use ^ to search from head

 print(30*'-')

 regexp('^The', 'The End')   # 'The'

 regexp('^The', 'In The End')    # None

Use "\b" to match a word boundary (next to whitespace, for example)

 # Use \b to search boundary

 print(30*'-')

 regexp(r'\bThe\b', 'In The End') # 'The'

 regexp(r'\bThe', 'In TheEnd')   # 'The'

 regexp(r'The\b', 'In TheEnd')   # None

match versus search: the difference is where matching starts

 # match and search

 print(30*'-')

 regexp('The', 'In The End', func='search') # 'The'

 regexp('The', 'In The End', func='match')   # None

findall versus finditer: the difference is whether a list or an iterator is returned

 # findall and finditer

 # Note:

 # findall returns a list of the matched strings

 # finditer returns an iterator of match objects

 # re.IGNORECASE makes the match case-insensitive

 print(30*'-')

 print(re.findall('The', 'In The End, these things merged', re.I)) # ['The', 'the']

 itera = re.finditer('The', 'In The End, these things merged', re.I)

 for x in itera:

     print(x.group())                                        # 'The', then 'the'

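sub versus subn: the difference is whether the result includes the replacement count

 # sub and subn

 print(re.sub('X', 'LIKE', 'This is X, X is acting'))    # 'This is LIKE, LIKE is acting'

 print(re.subn('X', 'LIKE', 'This is X, X is acting'))   # ('This is LIKE, LIKE is acting', 2)

Use the split function to split the target string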

 # split: split(re-expression, string)

 print(re.split(', |\n', 'This is amazing\nin the end, those things merged'))

Use \N to get the text matched by the corresponding subgroup, and (?P<name>) to create a named subgroup

 # \N: use \N to represent sub group, N is the number of sub group

 print(re.sub(r'(.{3})-(.{3})-(.{3})', r'\2-\3-\1', '123-def-789'))  # 'def-789-123'

 # (?P<name>): similar to \N, add tag name for each sub group,

 # and use \g<name> to fetch sub group

 print(re.sub(r'(?P<first>\d{3})-(?P<second>\d{3})-(?P<third>\d{3})', r'\g<second>-\g<third>-\g<first>', '123-456-789')) # 456-789-123

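Use (?P=name) to refer to what a named subgroup matched

Note: (?P=name) must appear in the same pattern as its (?P<name>), and it refers to the matched text, not to the subgroup's pattern. To use a subgroup's result outside the pattern, for example in a replacement string, use \N or \g<name> instead.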

 # (?P=name): use this expression to reuse former sub group result

 # Note: this expression only get the matched result, not the re pattern

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-abc-123'))    # Match obj, '123-abc-abc-123'

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{3})-(?P=char)-(?P=digit)', '123-abc-def-456'))    # None

 # Note: should use (?P=name) in a re expression with former named group

 print(re.sub(r'(?P<myName>potato)(XXX)(?P=myName)', r'YY\2YY', 'potatoXXXpotato'))  # 'YYXXXYY'

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'(?P=char)-(?P=digit)', '123-abcd')) # not substituted, prints the literal text: (?P=char)-(?P=digit)

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\2-\1', '123-abcd'))    # 'abcd-123'

 print(re.sub(r'(?P<digit>\d{3})-(?P<char>\w{4})', r'\g<char>-\g<digit>', '123-abcd'))   # 'abcd-123'

Use groupdict() to get the match result as a dict

 # groupdict(): return dict of named pattern

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').groupdict())    # {'digit': '123', 'char': 'abcd'}

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('char'))  # 'abcd'

 print(re.match(r'(?P<digit>\d{3})-(?P<char>\w{4})', '123-abcd').group('digit'))

Using the aiLmsux inline flags in a pattern

 # re extensions

 # use (?aiLmsux): re.A, re.I, re.L, re.M, re.S, re.U, re.X

 # use (?imsx)

 # (?i) --> re.I/re.IGNORECASE

 print(re.findall(r'(?i)yes', 'yes, Yes, YES'))  # ['yes', 'Yes', 'YES']

 # (?m) --> re.M/re.MULTILINE: match multiline, ^ for head of line, $ for end of line

 print(re.findall(r'(?im)^The|\w+e$', 'The first day, the last one,\nthe great guy, the puppy love'))  # ['The', 'the', 'love']

 # (?s) --> re.S/re.DOTALL: dot(.) also matches \n (by default it does not)

 print(re.findall(r'(?s)Th.+', "This is the biggest surprise\nI'd ever seen"))  # ["This is the biggest surprise\nI'd ever seen"]

 # (?x) --> re.X/re.VERBOSE: make re ignore the blanks and comments after '#' of re pattern

 # This extension can make re pattern easy to read and you can add comments as you want

 print(re.search(r'''(?x)

                     \((\d{3})\) # Area code

                     [ ]         # blank

                     (\d{3})     # prefix

                     -           # connecting line

                     (\d{4})     # suffix

                     ''', '(123) 456-7890').groups())    # ('123', '456', '7890')

Use (?:…) to create a subgroup that is not saved for later reuse

 # (?:...): make a sub group that no need to save and never use later

 print(re.findall(r'(?:\w{3}\.)(\w+\.com)', 'www.google.com'))   # ['google.com']

Positive and negative lookahead

 # (?=...) and (?!...)

 # (?=...): match the pattern which end with ..., pattern should place before (?=...)

 print(re.findall(r'\d{3}(?=Start)', '222Start333, this is foo, 777End666')) # ['222']

 # (?!...): match the pattern which not end with ..., pattern should place before (?!...)

 print(re.findall(r'\d{3}(?!End)', '222Start333, this is foo, 777End666')) # ['222', '333', '666']

Positive and negative lookbehind

 # (?<=...) and (?<!...)

 # (?<=...): match the pattern which start with ..., pattern should place after (?<=...)

 print(re.findall(r'(?<=Start)\d{3}', '222Start333, this is foo, 777End666')) # ['333']

 # (?<!...): match the pattern which does not start with ..., pattern should be placed after (?<!...)

 print(re.findall(r'(?<!End)\d{3}', '222Start333, this is foo, 777End666')) # ['222', '333', '777']

Conditional matching

 # (?(id/name)Y|X): if sub group \id or name exists, match Y, otherwise match X

 # Below code first match the first char, if 'x' matched, store a sub group, if 'y', not to store

 # then match second char, if sub group stored('x' matched), match 'y', otherwise match 'x', finally return result

 print(re.search(r'(?:(x)|y)(?(1)y|x)', 'yx'))

Greedy and non-greedy matching

 # Greedy match: '+' and '*' act greedy match, appending '?' to make no greedy match

 print(re.search(r'.+(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123

 print(re.search(r'.+?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123

 print(re.search(r'.*(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 4-123-123

 print(re.search(r'.*?(\d+-\d+-\d+)', 'asdfgh1234-123-123').group(1)) # 1234-123-123

Full match versus partial match

 # match and fullmatch

 print(re.match(r'This is a full match', 'this is a full match string', re.I))   # Match object, match='this is a full match'

 print(re.fullmatch(r'This is a full match', 'this is a full match string', re.I))   # None

Use span() to see the position of the match

 # span function

 print(re.search(r'Google', 'www.google.com', re.I).span())  # (4, 10)

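3.2 Replacing with a function

This example uses a function, or a lambda, in place of a replacement string.

First the string to be matched is assigned, then a function double is defined that reads the subgroup, multiplies it by 2, and returns the result as a string. Passing double as the repl argument of sub performs the match and replacement; the same can be done with a lambda.

Note: when the repl argument of sub is a function, it must accept one argument, which is the match object for the current match.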

 import re

 s = 'AD892SDA213VC2'

 def double(matched):

     value = int(matched.group('value'))

     return str(value*2)

 print(re.sub(r'(?P<value>\d+)', double, s))

 print(re.sub(r'(?P<value>\d+)', lambda x: str(int(x.group('value'))*2), s))

 # Output: 'AD1784SDA426VC4'
