AJING NOTE: [Python] Regular Expressions and the re Module

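Regular expressions come up constantly when working with strings, usually to find text that follows some syntactic rule. I use them fairly often, yet I can never remember the syntax... and every time I end up spending time looking it up again. So I decided to write these notes and study the topic properly. (Even if I still can't remember it, at least it will be easy to look up~)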

Regular expressions:

First, get familiar with what the special symbols do:

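Symbol	Description	Example
.	Matches any character except '\n'
^	Matches at the start	^a (matches only if the string starts with a)
$	Matches at the end	abc$ (matches only if the string ends with abc)
*	Matches zero or more times	ab* (a, ab, abb... all match)
+	Matches one or more times
?	Matches zero or one time
*?,+?,??	Non-greedy versions of *, +, ?: match as few characters as possible
{m}	Matches exactly m times
{m,n}	Matches m to n times; omitting n means no upper bound, omitting m means 0
{m,n}?	Matches m to n times, but as few as possible	a{3,5}? (on 'aaaaa' matches only 'aaa')
[]	Character set; any character in the set matches	[a-z] (any lowercase ASCII letter matches)
|	a|b means a or b
(...)	A capturing group	(abc)(12) (abc12 matches and yields 2 groups)
(?:...)	Same as (...), but the group is not captured
(?#...)	A comment; the contents of the parentheses are ignored
(?<=...)	Matches only if preceded by ...	(?<=abc)def (on abcdef matches def)
(?<!...)	Matches only if not preceded by ...	(?<!\d)a (does not match in 12a)
(?=...)	Lookahead: ... must match at the current position, but it is not consumed or included in the returned match	(?=a123)a (on cba123 matches a)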
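
To check the lookaround and non-greedy rows of the table for myself, here is a minimal sketch. The patterns and sample strings are the ones from the table; the variable names are only for illustration.

```python
import re

# (?<=...) lookbehind: "def" matches only when "abc" comes right before it.
lookbehind = re.search(r'(?<=abc)def', 'abcdef')
print(lookbehind.group())        # def

# (?<!...) negative lookbehind: the only "a" follows a digit, so no match.
negative = re.search(r'(?<!\d)a', '12a')
print(negative)                  # None

# (?=...) lookahead: "a123" is checked, but only "a" is consumed.
lookahead = re.search(r'(?=a123)a', 'cba123')
print(lookahead.group(), lookahead.span())   # a (2, 3)

# {m,n} is greedy; {m,n}? takes as few repetitions as allowed.
greedy = re.search(r'a{3,5}', 'aaaaa').group()
lazy = re.search(r'a{3,5}?', 'aaaaa').group()
print(greedy, lazy)              # aaaaa aaa
```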

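Symbol	Meaning
\d	Any digit; same as [0-9]
\D	Any non-digit; same as [^0-9]
\w	Any word character; same as [a-zA-Z0-9_]
\W	Any non-word character; same as [^a-zA-Z0-9_]
\s	Any whitespace character; same as [ \t\n\r\f\v]
\S	Any non-whitespace character; same as [^ \t\n\r\f\v]
(These equivalences hold with the re.ASCII flag; by default, str patterns in Python 3 also match the corresponding Unicode characters.)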
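
A quick sketch of the shorthand classes in action. The sample text is made up for illustration; the last two lines show how re.ASCII narrows \d to the ASCII digits only.

```python
import re

text = 'Order 66 shipped_to: A-12\t(ok)'

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)
spaces = re.findall(r'\s', text)
print(digits)   # ['66', '12']
print(words)    # ['Order', '66', 'shipped_to', 'A', '12', 'ok']
print(spaces)   # [' ', ' ', ' ', '\t']

# Full-width digits count as digits for str patterns unless re.ASCII is set.
unicode_digits = re.findall(r'\d', '１２3')
ascii_digits = re.findall(r'\d', '１２3', flags=re.ASCII)
print(unicode_digits, ascii_digits)   # ['１', '２', '3'] ['3']
```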

Additional notes:
1. (...) vs. (?:...):

>>> import re
>>> re.search(r'(ba)(?:cc)(12)', 'bbacc123').group()
'bacc12'
>>> re.search(r'(ba)(?:cc)(12)', 'bbacc123').group(1)
'ba'
>>> re.search(r'(ba)(?:cc)(12)', 'bbacc123').group(2)
'12'   ## group 2 is not 'cc', because (?:cc) is not captured

2. .* vs .*?:
Given the string "eeeAiiZuuuuAoooZeeee":
● A.*Z: finds one match ("AiiZuuuuAoooZ")
● A.*?Z: finds two matches ("AiiZ", "AoooZ")
(Reference: https://stackoverflow.com/questions/3075130/what-is-the-difference-between-and-regular-expressions)
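
The same comparison as a runnable sketch, using re.findall on the string above (the variable names are mine):

```python
import re

text = 'eeeAiiZuuuuAoooZeeee'

# Greedy: .* runs as far as it can, so the match ends at the last Z.
greedy = re.findall(r'A.*Z', text)
# Non-greedy: .*? stops at the first Z after each A.
lazy = re.findall(r'A.*?Z', text)

print(greedy)   # ['AiiZuuuuAoooZ']
print(lazy)     # ['AiiZ', 'AoooZ']
```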

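The re module:

The re module is Python's standard library module for working with regular expressions.

Key functions:
1. re.match(pattern, string, flags=0):
Matches pattern at the start of string only. Returns a match object on success, or None if the start of the string does not match.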

2. re.search(pattern, string, flags=0):
Scans the whole string and returns a match object for the first successful match, or None if there is none.
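
The difference between the two is easiest to see side by side. A small sketch with a made-up sample string:

```python
import re

text = 'say hello'

# re.match only tries at index 0, where the text is "say", so it fails.
at_start = re.match(r'hello', text)
# re.search scans forward and finds "hello" later in the string.
anywhere = re.search(r'hello', text)

print(at_start)          # None
print(anywhere.span())   # (4, 9)

# Both return None on failure, so check before calling .group().
if anywhere:
    print(anywhere.group())   # hello
```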

3. re.compile(pattern, flags=0):
Compiles a regular expression into a regular expression object, which can then be reused.

4. re.fullmatch(pattern, string, flags=0):
Returns a match object only if the entire string matches pattern; otherwise returns None.
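
A sketch combining compile and fullmatch. The date pattern and the name date_re are hypothetical, chosen only to show the contrast with match:

```python
import re

# Compile once, reuse many times. date_re is an illustrative name.
date_re = re.compile(r'\d{4}-\d{2}-\d{2}')

whole = date_re.fullmatch('2018-11-02')          # entire string matches
trailing = date_re.fullmatch('2018-11-02 Fri')   # extra text, so None
prefix = date_re.match('2018-11-02 Fri')         # match only needs the start

print(whole.group())    # 2018-11-02
print(trailing)         # None
print(prefix.span())    # (0, 10)
```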

5. re.split(pattern, string, maxsplit=0, flags=0):
Splits string at each match of pattern and returns a list of strings. If nothing matches, the returned list contains the whole string, unsplit.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']  ## if the pattern contains capturing (), the matched groups are included in the list too
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

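6. re.findall(pattern, string, flags=0):
Finds all non-overlapping matches of pattern in string and returns them as a list of strings. (If the pattern contains capturing groups, the list holds the groups instead of the full matches.)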
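
A short sketch of how the return value of findall changes once the pattern has capturing groups. The sample text is made up:

```python
import re

text = 'x=1, y=22, z=333'

# No groups: a list of the full matches.
numbers = re.findall(r'\d+', text)
# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\w)=(\d+)', text)

print(numbers)   # ['1', '22', '333']
print(pairs)     # [('x', '1'), ('y', '22'), ('z', '333')]
```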

7. re.sub(pattern, repl, string, count=0, flags=0):
Replaces each match of pattern with repl and returns the modified string; if nothing matches, string is returned unchanged.
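
Some typical uses of re.sub as a sketch. The date string and variable names are illustrative; repl can be a string with backreferences such as \1, or a function that receives the match object.

```python
import re

text = '2018-11-02'

# Backreferences in repl reorder the captured groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)
# count limits how many matches are replaced.
first_only = re.sub(r'-', '/', text, count=1)
# No match: the original string comes back unchanged.
unchanged = re.sub(r'x', 'y', text)
# repl may also be a function that receives the match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b20')

print(swapped)      # 02/11/2018
print(first_only)   # 2018/11-02
print(unchanged)    # 2018-11-02
print(doubled)      # a2 b40
```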

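Key objects:
● Match Objects:
Match objects have the following methods and attributes:
1. Match.group([group1, ...]): returns one or more groups by number, or the entire match when called with no argument or with 0.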

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

2. Match.groups(): returns all groups as a tuple.
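
A sketch of groups(), including what happens when an optional group does not take part in the match. The variable names are mine:

```python
import re

m = re.match(r'(\d+)\.(\d+)', '24.1632')
both = m.groups()
print(both)              # ('24', '1632')

# The second group is optional here and does not participate in the match.
m2 = re.match(r'(\d+)\.?(\d+)?', '24')
missing = m2.groups()
filled = m2.groups('0')  # the argument is the default for missing groups
print(missing)           # ('24', None)
print(filled)            # ('24', '0')
```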

3. Match.groupdict(): returns the named groups as a dict.

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

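4. Match.start([group]), Match.end([group]), Match.span([group]): return the start index, the end index, and the (start, end) pair of the match, or of the given group.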

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
>>> m.span()
(7, 18)  ## same as (m.start(), m.end())

● Regular Expression Objects:
Regular expression objects have methods such as search, match, and sub. They simply wrap a pattern as an object (usage mirrors the re module functions above, so I won't repeat it):
1. Pattern.match(string[, pos[, endpos]]):

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".

2. Pattern.fullmatch(string[, pos[, endpos]])
3. Pattern.search(string[, pos[, endpos]])
4. Pattern.split(string, maxsplit=0)
5. Pattern.sub(repl, string, count=0) ...and more; they work much like the functions in the re module
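
For completeness, a sketch of the compiled-pattern versions of split and sub. The separator pattern and the name sep are hypothetical:

```python
import re

# Commas with optional surrounding whitespace; sep is an illustrative name.
sep = re.compile(r'\s*,\s*')
line = 'a , b,c ,  d'

fields = sep.split(line)
first_split = sep.split(line, maxsplit=1)
joined = sep.sub(';', line)

print(fields)        # ['a', 'b', 'c', 'd']
print(first_split)   # ['a', 'b,c ,  d']
print(joined)        # a;b;c;d
```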

That's it for now; I only covered the parts I expect to use most.

References:
1. https://docs.python.org/3/library/re.html

Friday, November 2, 2018