1. Rules
1.1. Classical Regex
L(ϵ) = L("") = {""}
If c is a character, L(c) = {"c"}
If R1, R2 are regular expressions, L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}
L(R1|R2) = L(R1)∪L(R2)
L(R∗) = L(ϵ) ∪ L(R) ∪ L(RR) ∪ ···
L((R)) = L(R)
Precedence (highest to lowest): *, concatenation, union
Grouping: parentheses
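The precedence rule can be checked with Python's re module. This is a small sketch, not part of the original notes; the helper name in_language and the sample strings are chosen here for illustration only. It uses re.fullmatch so that a string is accepted only if the whole string is in the language.

```python
import re

# Precedence, highest to lowest: *, concatenation, union.
# So ab*|c is read as (a(b*))|c, not as (ab)*|c or a(b*|c).
pattern = re.compile(r"ab*|c")

def in_language(s):
    """Return True if the whole string s is in L(ab*|c). (Illustrative helper.)"""
    return pattern.fullmatch(s) is not None

samples = ["a", "abbb", "c", "abab", "ac", ""]
accepted = [s for s in samples if in_language(s)]
print(accepted)
```

If * bound more loosely than concatenation, "abab" would be accepted; if union bound more tightly than concatenation, "ac" would be accepted. Neither is in the language.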
1.2. Abbreviations
character lists: [a-zA-Z]
negative character lists: [^a-z]
character classes: . (dot: any character except newline), \d (digit), \s (whitespace)
L(R+) = L(RR∗)
L(R?) = L(ϵ|R)
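The abbreviations add no expressive power: each one can be rewritten with the classical operators. The sketch below compares the two forms on a small, arbitrarily chosen sample set (it is a spot check, not a proof); the helper name language is an assumption made for this example.

```python
import re

samples = ["", "a", "aaa", "ab", "b"]

def language(regex, candidates):
    """Return the subset of candidates that regex matches in full. (Illustrative helper.)"""
    return {s for s in candidates if re.fullmatch(regex, s)}

# R+ abbreviates RR*
plus_same = language(r"a+", samples) == language(r"aa*", samples)

# R? abbreviates (ϵ|R); the empty alternative is written as nothing before the bar
opt_same = language(r"a?", samples) == language(r"(|a)", samples)

# Character lists and negative character lists each match one character
letters_ok = re.fullmatch(r"[a-zA-Z]+", "Hello") is not None
not_lower_ok = re.fullmatch(r"[^a-z]", "Q") is not None

print(plus_same, opt_same, letters_ok, not_lower_ok)
```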
1.3. Extensions
“capture” parenthesized expressions
m = re.match(r'\s*(\d+)\s*,\s*(\d+)\s*', '12,34') gives m.group(1) == '12', m.group(2) == '34'
m.group(x) returns the text matched by the xth pair of parentheses, counting opening parentheses from the left
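A short sketch of capture groups in Python follows. The input strings are made up for illustration. It also shows group(0), which holds the entire match, and how nested groups are numbered by the position of their opening parenthesis.

```python
import re

# Two capture groups separated by a comma, with optional whitespace around them.
m = re.match(r"\s*(\d+)\s*,\s*(\d+)\s*", " 12 , 34 ")
first, second = m.group(1), m.group(2)
whole = m.group(0)        # group 0 is the entire matched text
pair = m.groups()         # all numbered groups as a tuple

# Nested groups are numbered by their opening parenthesis, left to right.
n = re.match(r"((\d+)-(\d+))", "3-7")
nested = (n.group(1), n.group(2), n.group(3))

print(first, second, repr(whole), pair, nested)
```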
lazy vs. greedy quantifiers
re.match(r'(\d+).*', '1234ab') makes group(1) match '1234'
re.match(r'(\d+?).*', '1234ab') makes group(1) match '1'
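The two calls above can be run directly. The third pattern in this sketch is an addition for illustration: it shows that a lazy quantifier still takes more characters when the rest of the pattern cannot match otherwise.

```python
import re

# Greedy: \d+ takes as many digits as it can.
greedy = re.match(r"(\d+).*", "1234ab").group(1)

# Lazy: \d+? takes as few digits as it can; .* absorbs the rest.
lazy = re.match(r"(\d+?).*", "1234ab").group(1)

# Lazy does not mean "always one": here 'ab' must follow the digits,
# so \d+? is extended until the remainder of the pattern can match.
forced = re.match(r"(\d+?)ab", "1234ab").group(1)

print(greedy, lazy, forced)
```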
boundaries
re.search(r'(^abc|qef)', L) matches abc only at the beginning of the string, and qef anywhere
re.search(r'(?m)(^abc|qef)', L) matches abc at the beginning of the string or of any line, and qef anywhere
(?m) turns on multiline mode, in which ^ (and $) also match at the start (and end) of each line
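The effect of (?m) can be seen on a two-line string. This is a sketch with a made-up input; the variable names are chosen here. The inline flag (?m) is equivalent to passing re.MULTILINE as the flags argument.

```python
import re

text = "xyz\nabc\n"   # 'abc' starts the second line, not the string

# Without (?m), ^ matches only at the very start of the string.
plain = re.search(r"(^abc|qef)", text)

# With (?m), ^ also matches right after each newline.
multi = re.search(r"(?m)(^abc|qef)", text)

# Same behavior using the flag argument instead of the inline form.
flagged = re.search(r"(^abc|qef)", text, re.MULTILINE)

# 'qef' is not anchored, so it is found anywhere in either mode.
anywhere = re.search(r"(^abc|qef)", "xx qef")

print(plain, multi.start(), flagged.start(), anywhere.start())
```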
re.search(r'rowr(?=baz)', L) matches an instance of 'rowr', but only if 'baz' follows (does not match baz)
(?=baz) is a lookahead: the match must be immediately followed by baz, which is not consumed
re.search(r'(?<=rowr)baz', L) matches an instance of 'baz', but only if immediately preceded by 'rowr' (does not match rowr)
(?<=rowr) is a lookbehind: the match must be immediately preceded by rowr, which is not consumed
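A sketch of both lookaround forms on a made-up string. The spans show that the text inside the lookahead or lookbehind is checked but is not part of the match.

```python
import re

text = "rowrbaz rowrqux"

# Lookahead: match 'rowr' only where 'baz' comes next.
ahead = re.search(r"rowr(?=baz)", text)

# Lookbehind: match 'baz' only where 'rowr' comes immediately before.
behind = re.search(r"(?<=rowr)baz", text)

# The condition fails when the required context is absent.
no_ahead = re.search(r"rowr(?=baz)", "rowrqux")
no_behind = re.search(r"(?<=rowr)baz", "xbaz")

print(ahead.group(0), ahead.span(), behind.group(0), behind.span())
```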
non-linear patterns
re.search(r'(\S+),\1', L) matches a word (a run of non-whitespace characters) followed by a comma and then the same word again
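A sketch of the backreference pattern on made-up inputs. \1 must match exactly the text that group 1 captured, so the pattern is called non-linear because the same captured piece is used twice. The last case is a caveat: since the pattern is not anchored, it can match part of a longer word.

```python
import re

pattern = re.compile(r"(\S+),\1")

# The same word on both sides of the comma: match.
same = pattern.search("say hello,hello there")

# Different words on the two sides: no match.
different = pattern.search("hello,world")

# Not anchored to word boundaries: 'abc,abc' is found inside 'abc,abcd'.
partial = pattern.search("abc,abcd")

print(same.group(0), same.group(1), different, partial.group(0))
```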