Regular Expressions

1. Rules

1.1. Classical Regex

L(ϵ) = L("") = {""}

If c is a character, L(c) = {"c"}

If R1, R2 are regular expressions, L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}

L(R1|R2) = L(R1)∪L(R2)

L(R∗) = L(ϵ) ∪ L(R) ∪ L(RR) ∪ ···

L((R)) = L(R)

Precedence, from highest to lowest: *, concatenation, union

Grouping: parentheses
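The precedence rules can be spot-checked in Python. This is a minimal sketch that assumes re.fullmatch is an acceptable stand-in for membership in L(R); the helper name in_language is made up for illustration.

```python
import re

def in_language(pattern, s):
    """Return True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

# * binds tighter than concatenation: ab* is a(b*), not (ab)*
star_binds_tightest = (
    in_language(r"ab*", "abbb") and not in_language(r"ab*", "abab")
)

# concatenation binds tighter than union: ab|c is (ab)|c, not a(b|c)
concat_before_union = (
    in_language(r"ab|c", "c") and not in_language(r"ab|c", "ac")
)

# parentheses override the default precedence
grouped = in_language(r"(ab)*", "abab") and in_language(r"a(b|c)", "ac")

print(star_binds_tightest, concat_before_union, grouped)
```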

 

1.2. Abbreviations

character lists: [a-zA-Z]

negative character lists: [^a-z]

character classes: . (dot, any character except a newline by default), \d (a digit), \s (a whitespace character)

L(R+) = L(RR∗)

L(R?) = L(ϵ|R)
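The two identities above can be checked on a handful of strings. This is only a spot check on samples, not a proof, and the helper same_language is a hypothetical name introduced for this sketch.

```python
import re

def same_language(p1, p2, samples):
    """Check that two patterns accept exactly the same strings among samples."""
    return all(
        (re.fullmatch(p1, s) is None) == (re.fullmatch(p2, s) is None)
        for s in samples
    )

samples = ["", "a", "aa", "aaa", "b", "ab"]

plus_ok = same_language(r"a+", r"aa*", samples)   # R+ behaves like RR*
opt_ok = same_language(r"a?", r"(|a)", samples)   # R? behaves like (empty | R)

# a negative character list accepts any single character outside the range
neg_ok = (
    re.fullmatch(r"[^a-z]", "A") is not None
    and re.fullmatch(r"[^a-z]", "q") is None
)

print(plus_ok, opt_ok, neg_ok)
```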

 

1.3. Extensions

“capture” parenthesized expressions

After m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34'), we have m.group(1) == '12' and m.group(2) == '34'

m.group(x) returns the text matched by the xth pair of parentheses, counted by opening parenthesis from left to right
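A short sketch of how groups are numbered, using the same pattern on an input with extra spaces (the input strings here are made up for illustration):

```python
import re

m = re.match(r"\s*(\d+)\s*,\s*(\d+)\s*", " 12 , 34 ")
first, second = m.group(1), m.group(2)   # '12', '34'

# group(0) is the entire match; groups() returns all captures as a tuple
whole = m.group(0)                       # ' 12 , 34 '
pair = m.groups()                        # ('12', '34')

# nested groups are numbered by their opening parenthesis, outer first
n = re.match(r"((\d+)-(\d+))", "10-20")
nested = n.groups()                      # ('10-20', '10', '20')

print(first, second, pair, nested)
```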

 

lazy vs. greedy quantifiers

re.match(r'(\d+).*', '1234ab') makes group(1) match '1234' (greedy: \d+ takes as many digits as it can)

re.match(r'(\d+?).*', '1234ab') makes group(1) match '1' (lazy: \d+? takes as few digits as it can)
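The two calls above can be run side by side. The third pattern is an addition for illustration: it shows that a lazy quantifier still grows when the rest of the pattern cannot match otherwise.

```python
import re

text = "1234ab"

greedy = re.match(r"(\d+).*", text).group(1)    # '1234'
lazy = re.match(r"(\d+?).*", text).group(1)     # '1'

# here 'ab' must follow the digits, so \d+? is forced to take all four
forced = re.match(r"(\d+?)ab", text).group(1)   # '1234'

print(greedy, lazy, forced)
```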

 

boundaries

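re.search(r'(^abc|qef)', L) matches abc only at the beginning of the string, and qef anywhere

re.search(r'(?m)(^abc|qef)', L) matches abc at the beginning of the string or of any line, and qef anywhere

(?m) turns on multiline mode (the same as passing re.MULTILINE), so ^ and $ also match at the start and end of each line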
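A minimal sketch of the difference, assuming a two-line sample value for L (the original does not say what L contains):

```python
import re

L = "xyz\nabc\n"   # assumed sample input: 'abc' starts the second line

plain = re.search(r"(^abc|qef)", L)               # None: abc is not at the string start
multi = re.search(r"(?m)(^abc|qef)", L)           # matches abc at the start of line 2
flag = re.search(r"(^abc|qef)", L, re.MULTILINE)  # same effect as the inline (?m)

print(plain, multi.span(), flag.span())
```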

re.search(r'rowr(?=baz)', L) matches an instance of 'rowr', but only if 'baz' immediately follows (the match does not include baz)

(?=baz) is a lookahead assertion: baz must come next, but it is not consumed

re.search(r'(?<=rowr)baz', L) matches an instance of 'baz', but only if it is immediately preceded by 'rowr' (the match does not include rowr)

(?<=rowr) is a lookbehind assertion: rowr must come immediately before the current position, but it is not consumed
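Both assertions can be tried on one sample string. The value of L below is an assumption chosen so that one occurrence of rowr is followed by baz and another is not.

```python
import re

L = "rowrbaz rowrqux"   # assumed sample input

ahead = re.search(r"rowr(?=baz)", L)
ahead_text, ahead_span = ahead.group(0), ahead.span()       # 'rowr', (0, 4)

behind = re.search(r"(?<=rowr)baz", L)
behind_text, behind_span = behind.group(0), behind.span()   # 'baz', (4, 7)

# without a following 'baz' the lookahead fails and nothing matches
no_match = re.search(r"rowr(?=baz)", "rowrqux")             # None

print(ahead_text, ahead_span, behind_text, behind_span, no_match)
```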

 

non-linear patterns

re.search(r'(\S+),\1', L) matches a word followed by the same word after a comma; \1 is a backreference to whatever group 1 matched
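A small sketch of the backreference on made-up inputs. Because \1 must repeat the exact text captured earlier, this kind of pattern goes beyond the classical rules in section 1.1.

```python
import re

pattern = r"(\S+),\1"

hit = re.search(pattern, "say hello,hello now")
hit_text = hit.group(0)    # 'hello,hello'
hit_word = hit.group(1)    # 'hello'

# the text after the comma differs from the captured word, so there is no match
miss = re.search(pattern, "hello,world")   # None

print(hit_text, hit_word, miss)
```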
