Python Quick Start (5): Regular Expressions

Python Regular Expressions:
##Regular expressions are a powerful language for matching text patterns.
##The Python "re" module provides regular expression support.

##In Python a regular expression search is typically written as:
match = re.search(pat, str)

## search() is usually followed by an if-statement.
## The following example searches for the pattern 'word:' followed by a 3-letter word (details below):
		import re

		str = 'an example word:cat!!'
		match = re.search(r'word:\w\w\w', str)
		# If-statement after search() tests if it succeeded
		if match:
			print('found', match.group())  ## 'found word:cat'
		else:
			print('did not find')


## The code match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match: if true, the search succeeded and match.group() is the matching text (e.g. 'word:cat'). If the search fails, match is None and there is no matching text.

## The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through without change. This is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.
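
## To see what the 'r' prefix changes, here is a small sketch (Python 3 syntax; the variable names are only illustrative). In a normal string, Python itself turns '\b' into a backspace character before the re module ever sees it, while the raw string passes the backslash through so that re reads it as a word boundary:

```python
import re

# A raw string keeps the backslash: r'\n' is two characters, '\n' is one.
normal_len = len('\n')   # 1 (a single newline character)
raw_len = len(r'\n')     # 2 (a backslash followed by the letter n)

# Without the r prefix, '\b' becomes a backspace character, so the
# pattern looks for a literal backspace before 'cat' and finds none.
no_raw = re.search('\bcat', 'a cat')

# With the r prefix, re receives \b and treats it as a word boundary.
with_raw = re.search(r'\bcat', 'a cat')

print(normal_len, raw_len)   # 1 2
print(no_raw)                # None
print(with_raw.group())      # cat
```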

Basic Patterns:
a, X, 9, < ##-- ordinary characters just match themselves exactly. The meta-characters, which do not match themselves because they have special meanings, are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period) ##-- matches any single character except newline '\n'
\w ##-- (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b ##-- boundary between word and non-word
\s ##-- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r ##-- tab, newline, return
\d ##-- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end ##-- match the start or end of the string
\ ##-- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
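
## A few of the entries above that the later examples do not exercise (\b, \S, the escaped period, and the ^/$ anchors) can be tried out with a sketch like this one. It uses Python 3 syntax, and the sample strings and variable names are made up for illustration:

```python
import re

# \b is a word boundary: this finds 'cat' as a whole word,
# not the 'cat' buried inside 'concatenate'.
whole_word = re.search(r'\bcat\b', 'concatenate a cat')

# An unescaped . matches any character; \. matches only a literal period.
any_char = re.search(r'a.c', 'abc')
literal_dot_miss = re.search(r'a\.c', 'abc')
literal_dot_hit = re.search(r'a\.c', 'a.c')

# \S is one non-whitespace character, \s is one whitespace character.
around_space = re.search(r'\S\s\S', 'hi there')

# ^ and $ tie the pattern to the start and end of the string.
anchored_hit = re.search(r'^\d\d$', '42')
anchored_miss = re.search(r'^\d\d$', 'x42')

print(whole_word.span())        # (14, 17)
print(any_char.group())         # abc
print(literal_dot_miss)         # None
print(literal_dot_hit.group())  # a.c
print(around_space.group())     # i t
print(anchored_hit.group())     # 42
print(anchored_miss)            # None
```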


Basic rules:
1) The search proceeds through the string from start to end, stopping at the first match found.
2) If match = re.search(pat, str) succeeds, match is not None and match.group() holds the matched text.

		## Search for pattern 'iii' in string 'piiig'.
		## All of the pattern must match, but it may appear anywhere.
		## On success, match.group() is matched text.
		match = re.search(r'iii', 'piiig')  ## found, match.group() == "iii"
		match = re.search(r'igs', 'piiig')  ## not found, match == None

		## . = any char but \n
		match = re.search(r'..g', 'piiig')  ## found, match.group() == "iig"

		## \d = digit char, \w = word char
		match = re.search(r'\d\d\d', 'p123g')  ## found, match.group() == "123"
		match = re.search(r'\w\w\w', '@@abcd!!')  ## found, match.group() == "abc"
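
## Rule 1 (the search stops at the first match) can be checked with the match object's start(), end() and span() methods. These methods are not covered in the text above; they are part of the standard re module, and the sample string is only an illustration (Python 3 syntax):

```python
import re

# Two runs of two digits exist, but search() reports only the first one.
match = re.search(r'\d\d', 'a12b34')

first_text = match.group()   # the matched text
first_start = match.start()  # index where the match begins
first_end = match.end()      # index just past the end of the match

print(first_text)            # 12
print(first_start, first_end)  # 1 3
print(match.span())          # (1, 3)
```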

Repetition:
+ -- ## 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* -- ## 0 or more occurrences of the pattern to its left
? -- ## match 0 or 1 occurrences of the pattern to its left

Examples:
			## i+ = one or more i's, as many as possible.
			match = re.search(r'pi+', 'piiig')  ## found, match.group() == "piii"

			## Finds the first/leftmost solution, and within it drives the +
			## as far as possible (aka 'leftmost and largest').
			## In this example, note that it does not get to the second set of i's.
			match = re.search(r'i+', 'piigiiii')  ## found, match.group() == "ii"

			## \s* = zero or more whitespace chars
			## Here look for 3 digits, possibly separated by whitespace.
			match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')  ## found, match.group() == "1 2   3"
			match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')  ## found, match.group() == "12  3"
			match = re.search(r'\d\s*\d\s*\d', 'xx123xx')  ## found, match.group() == "123"

			## ^ = matches the start of string, so this fails:
			match = re.search(r'^b\w+', 'foobar')  ## not found, match == None
			## but without the ^ it succeeds:
			match = re.search(r'b\w+', 'foobar')  ## found, match.group() == "bar"
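
## The examples above exercise + and *, but not ?. As a small sketch of ? (Python 3 syntax; the words and the results dictionary are made up for illustration), the pattern r'colou?r' accepts zero or one 'u' and rejects two:

```python
import re

# ? makes the character to its left optional (zero or one occurrence).
results = {}
for word in ['color', 'colour', 'colouur']:
    match = re.search(r'colou?r', word)
    results[word] = match.group() if match else None

# * also allows zero occurrences, so 'ab*' still matches a lone 'a'.
zero_b = re.search(r'ab*', 'a').group()
many_b = re.search(r'ab*', 'abbbc').group()

print(results)  # {'color': 'color', 'colour': 'colour', 'colouur': None}
print(zero_b)   # a
print(many_b)   # abbb
```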

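E-mail example:
		str = 'purple alice-b@google.com monkey dishwasher'
		match = re.search(r'\w+@\w+', str)
		if match:
			print(match.group())  ## 'b@google'

The search does not get the whole email address here, because \w matches neither the '-' nor the '.' in the address. The example below improves on it.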


Square Brackets: []
## Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. So [\w.-] matches a single word character, a '.', or a '-'. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

			match = re.search(r'[\w.-]+@[\w.-]+', str)
			if match:
				print(match.group())  ## 'alice-b@google.com'

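Square brackets can also contain ranges, so [a-z] matches a single lowercase letter.
In [abc-], however, the dash comes last, so it is a literal dash rather than a range.
In addition, [^ab] matches any single character except 'a' or 'b'.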
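
## The three bracket forms just described (a range, a trailing dash, and a negated set) can be compared side by side. This is a minimal sketch in Python 3 syntax with made-up sample strings:

```python
import re

# [a-z] is a range: one lowercase letter; + repeats it.
lower_run = re.search(r'[a-z]+', 'ABCdefGHI').group()

# [^ab] is a negated set: any character except 'a' or 'b'.
not_ab_run = re.search(r'[^ab]+', 'abbacdab').group()

# In [abc-] the dash comes last, so it is a literal dash, not a range.
# 'z' is not in the set, so it is skipped.
set_with_dash = re.findall(r'[abc-]', 'a-z')

print(lower_run)      # def
print(not_ab_run)     # cd
print(set_with_dash)  # ['a', '-']
```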

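Group Extraction:
The "group" feature lets you pick out the parts of the matching text that you want. In the e-mail example, suppose we want to extract the username and the host as two separate pieces. This is where parentheses ( ) come in,
like this: r'([\w.-]+)@([\w.-]+)'

The parentheses do not change what the pattern matches; they establish groups inside the matching text.
On a successful search, match.group(1) is the text matched by the first parenthesized group from the left,
match.group(2) is the text matched by the second parenthesized group from the left,
and match.group() is still the whole matched text.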
			str = 'purple alice-b@google.com monkey dishwasher'
			match = re.search(r'([\w.-]+)@([\w.-]+)', str)
			if match:
				print(match.group())   ## 'alice-b@google.com' (the whole match)
				print(match.group(1))  ## 'alice-b' (the username, group 1)
				print(match.group(2))  ## 'google.com' (the host, group 2)
				
			##Tips: A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
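
## One way to package that workflow is a small helper that returns the extracted parts or None. The function name split_email is hypothetical, and match.groups() (which returns all groups as one tuple) is a standard re method not mentioned above. A sketch in Python 3 syntax:

```python
import re

def split_email(text):
    """Return (username, host) for the first email-like string in text, or None."""
    match = re.search(r'([\w.-]+)@([\w.-]+)', text)
    if match is None:
        return None
    # groups() bundles group(1), group(2), ... into a single tuple
    return match.groups()

found = split_email('purple alice-b@google.com monkey dishwasher')
missing = split_email('no address in this text')

print(found)    # ('alice-b', 'google.com')
print(missing)  # None
```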

findall:
## findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. As its name suggests, findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
			## Suppose we have a text with many email addresses
			str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

			## Here re.findall() returns a list of all the found email strings
			emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)  ## ['alice@google.com', 'bob@abc.com']
			for email in emails:
				# do something with each found email string
				print(email)

Files and findall:
## For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):
			# Open the file; the with-block closes it automatically
			with open('test.txt', 'r') as f:
				# Feed the file text into findall(); it returns a list of all the found strings
				strings = re.findall(r'some pattern', f.read())

This way, all the matches in a file are returned in a single step.
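
## For a version that can be run as-is, the sketch below first writes a throwaway file and then feeds its whole text to findall(). The file contents, the email pattern, and the use of the tempfile module are assumptions made for the demonstration (Python 3 syntax):

```python
import os
import re
import tempfile

# Create a small temporary file so the example does not depend on test.txt
sample = 'alice@google.com wrote to\nbob@abc.com and\ncarol@example.org\n'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# One findall() call over f.read() covers every line of the file
with open(path, 'r') as f:
    found = re.findall(r'[\w.-]+@[\w.-]+', f.read())

os.remove(path)  # clean up the temporary file

print(found)  # ['alice@google.com', 'bob@abc.com', 'carol@example.org']
```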

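Groups and findall:
If the pattern contains 2 or more parenthesis groups, findall() returns a list of tuples instead of a list of strings. Each tuple represents one match of the pattern and holds the text of group(1), group(2), and so on.
			str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
			tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
			print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
			for pair in tuples:
				print(pair[0])  ## username
				print(pair[1])  ## host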
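
## How the shape of the findall() result depends on the number of groups can be seen by running the same text through three variants of the pattern. With exactly one group, findall() returns a list of strings holding just that group. A sketch in Python 3 syntax; the variable names are illustrative:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: each item is the whole match
whole = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: each item is the text of that single group
users = re.findall(r'([\w.-]+)@[\w.-]+', text)

# Two groups: each item is a (group 1, group 2) tuple
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(whole)  # ['alice@google.com', 'bob@abc.com']
print(users)  # ['alice', 'bob']
print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]
```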

Options:

## The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to search(), findall(), etc., e.g. re.search(pat, str, re.IGNORECASE).


IGNORECASE ##-- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
DOTALL ##-- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
MULTILINE ##-- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
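
## The effect of each flag is easiest to see by running the same pattern with and without it. The three-line sample text below is made up, and the last line also shows that flags can be combined with the | operator. A sketch in Python 3 syntax:

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: 'line' also matches 'LINE'
case_sensitive = re.findall(r'line', text)
case_insensitive = re.findall(r'line', text, re.IGNORECASE)

# DOTALL: by default .* stops at the newline; with DOTALL it runs past it
dot_default = re.search(r'First.*', text).group()
dot_all = re.search(r'First.*', text, re.DOTALL).group()

# MULTILINE: ^ matches at the start of every line, not only the string
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# Flags can be combined with |
s_lines = re.findall(r'^S\w+', text, re.IGNORECASE | re.MULTILINE)

print(case_sensitive)      # ['line', 'line']
print(case_insensitive)    # ['line', 'LINE', 'line']
print(dot_default)         # First line
print(dot_all == text)     # True
print(starts_default)      # ['First']
print(starts_multiline)    # ['First', 'second', 'third']
print(s_lines)             # ['second']
```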


Substitution:

## The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.
			str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
			## re.sub(pat, replacement, str) -- returns new string with all replacements,
			## \1 is group(1), \2 group(2) in the replacement
			print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
			## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
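
## The example above uses only \1. To show \1 and \2 together, the sketch below swaps the username and the host of each address. The swap itself is only an illustration, and the variable names are made up (Python 3 syntax). Note that the replacement is also written as a raw string so that the backslashes reach re.sub() unchanged:

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \2 inserts the host (group 2) and \1 the username (group 1)
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\2@\1', text)

# re.sub() returns a new string; the original text is left unchanged
print(swapped)
# purple google.com@alice, blah monkey abc.com@bob blah dishwasher
```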

Optional further study:
see here: https://developers.google.com/edu/python/regular-expressions
