Advanced Python Notes (1): Regular Expressions


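Regular Expressions

  • Regular Expressions
  • 1. Matching Symbols
  • 2. findall()
  • 2.1 \w and \W
  • 2.2 \s and \S
  • 2.3 \d and \D
  • 2.4 ^ and $
  • 2.5 . * ?
  • 2.6 \
  • 2.7 []
  • 2.8 ()
  • 2.9 |
  • 3. search()
  • 3.1 Matching Phone Numbers
  • 3.2 Grouping with Parentheses
  • 3.3 Matching Multiple Groups with the Pipe
  • 3.4 Optional Matching with the Question Mark
  • 3.5 Matching Zero or More Times with the Star
  • 3.6 Matching Specific Repetitions with Braces
  • 3.7 Greedy and Non-Greedy Matching
  • 3.8 Exercises
  • 3.9 Practice: Matching Phone Numbers and E-mail Addresses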


Regular Expressions

Requires: import re

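1. Matching Symbols

The common shorthand character classes are listed below.

Class    Meaning
\W       Matches any character that is not a letter, digit, underscore, or Chinese character
\w       Matches a letter, digit, underscore, or Chinese character
\S       Matches any non-whitespace character
\s       Matches any whitespace character
\D       Matches any non-digit character
\d       Matches any digit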

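The common metacharacters are listed below.

Metacharacter    Meaning
.                Matches any character except the newline \n (by default, \r is matched)
^                Matches the start of the string
$                Matches the end of the string
*                Matches the preceding element any number of times (including zero)
?                Matches the preceding element zero times or once
\                Escape character: the metacharacter that follows loses its special meaning and matches itself
()               The expression inside () is a group; the text a group matches can be extracted
[]               Character set: any single character in the set can be matched
|                Logical OR between alternatives
[abc]            Matches any one character inside the brackets
[^abc]           Matches any one character not inside the brackets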
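A few of these entries can be checked directly. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original notes.

```python
import re

# "." matches any character except "\n" by default, so "\r" is matched.
dot_matches = re.findall(r'.', 'a\r\nb')

# "?" makes the preceding element optional (zero times or once).
optional = re.findall(r'colou?r', 'color colour')

# "[^abc]" matches any single character that is not a, b, or c.
negated = re.findall(r'[^abc]', 'abcde')

print(dot_matches)  # ['a', '\r', 'b']
print(optional)     # ['color', 'colour']
print(negated)      # ['d', 'e']
```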

2. findall()

The findall() function in the re module finds every match in a string and returns the matches as a list.

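2.1 \w and \W

import re
str1 = '123Qwe!_@#你我他'
print(re.findall(r'\w', str1)) # matches digits, letters, underscores, and Chinese characters
print(re.findall(r'\W', str1)) # matches everything that is not a digit, letter, underscore, or Chinese character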

['1', '2', '3', 'Q', 'w', 'e', '_', '你', '我', '他']
['!', '@', '#']

2.2 \s and \S

import re
str2 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\s', str2)) # matches any whitespace character, such as a space, tab, \n, or \r
print(re.findall(r'\S', str2)) # matches any non-whitespace character

['\t', ' ', '\n', '\r']
['1', '2', '3', 'Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他']

2.3 \d and \D

import re
str3 = "123Qwe!_@#你我他\t \n\r"
print(re.findall(r'\d', str3)) # matches digits
print(re.findall(r'\D', str3)) # matches non-digits

['1', '2', '3']
['Q', 'w', 'e', '!', '_', '@', '#', '你', '我', '他', '\t', ' ', '\n', '\r']

2.4 ^ and $

import re
str4 = '你好吗,我很好'
print(re.findall('^你好', str4)) # matches 你好 at the start of the string
str5 = '我很好,你好'
print(re.findall('你好$', str5)) # matches 你好 at the end of the string

['你好']
['你好']

2.5 . * ?

import re
str6 = 'abcaaabb'
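print(re.findall('a.b', str6)) # . matches any single character (except the newline \n)
print(re.findall('a?b', str6)) # a? matches the character a zero times or once
print(re.findall('a*b', str6)) # a* matches the character a any number of times (including zero)
print(re.findall('a.*b', str6)) # .* matches any characters any number of times (greedy: as long as possible)
print(re.findall('a.*?b', str6)) # .*? matches any characters any number of times (non-greedy: as short as possible)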

['aab']
['ab', 'ab', 'b']
['ab', 'aaab', 'b']
['abcaaabb']
['ab', 'aaab']

2.6 \

import re
str7 = '\t123456'
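print(re.findall('t', str7)) # no match: '\t' is a single tab character, so the string contains no letter t
str8 = '\\t123456'
print(re.findall('t', str8)) # the backslash is escaped, so the string holds a literal backslash followed by t, and t is found
str9 = r'\t123456'
print(re.findall('t', str9)) # the r prefix makes a raw string, which also keeps the backslash literal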

[]
['t']
['t']
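The examples above escape the backslash inside a Python string. The backslash also escapes metacharacters inside the pattern itself, as the metacharacter table describes. The following is a minimal sketch of that second use; the sample text and variable names are made up for illustration.

```python
import re

price_text = 'Total: $3.50 (was $4.00)'

# Unescaped, "$" anchors the end of the string and "." matches any character.
# Escaped with a backslash, both match the literal characters.
prices = re.findall(r'\$\d+\.\d+', price_text)
print(prices)  # ['$3.50', '$4.00']

# re.escape() adds the backslashes needed to match a string literally.
literal = re.escape('$3.50')
print(re.findall(literal, price_text))  # ['$3.50']
```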

2.7 []

import re
str10 = 'aab abb acb azb a1b'
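print(re.findall('a[a-z]b', str10)) # matches when the middle character is a letter from a to z
print(re.findall('a[0-9]b', str10)) # matches when the middle character is a digit from 0 to 9
print(re.findall('a[ac1]b', str10)) # matches when the middle character is one of a, c, or 1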

['aab', 'abb', 'acb', 'azb']
['a1b']
['aab', 'acb', 'a1b']

2.8 ()

import re
str11 = '123qwer'
print(re.findall(r'(\w+)q(\w+)', str11)) # \w+ matches one or more digits, letters, underscores, or Chinese characters

[('123', 'wer')]

2.9 |

import re
str12 = '你好,女士们先生们,大家好好学习呀'
print(re.findall('女士|先生', str12)) # matches 女士 or 先生

['女士', '先生']

3. search()

3.1 Matching Phone Numbers

Example: finding phone numbers

def isPhoneNumber(text):
    r"""Check whether text has the form \d\d\d-\d\d\d-\d\d\d\d (version without regular expressions)."""
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

message = "Call me at 415-555-1011 tomorrow. 415-555-9999 is my office"
for i in range(len(message)):
    chunk = message[i:i+12]  # slide a 12-character window over the message
    if isPhoneNumber(chunk):
        print("Phone number found: " + chunk)
print("Done")

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

Example: finding a phone number with a regular expression

import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print("Phone number found: " + mo.group())

Phone number found: 415-555-4242

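Summary of the steps for using a regular expression:

  1. Import the regular expression module with import re
  2. Create a Regex object with the re.compile() function (remember to use a raw string)
  3. Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches
  4. Call the Match object's group() method to get the text that was actually matched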
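The steps above can be sketched as a small function. Because search() returns None when nothing matches, calling group() on the result without a check raises AttributeError. The helper name find_phone is hypothetical and used only for illustration.

```python
import re

# Step 2: compile the pattern from a raw string.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none.

    Hypothetical helper for illustration only.
    """
    mo = phone_regex.search(text)  # Step 3: a Match object, or None
    if mo is None:
        return None
    return mo.group()              # Step 4: the matched text

print(find_phone('My number is 415-555-4242.'))  # 415-555-4242
print(find_phone('No number here.'))             # None
```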

3.2 Grouping with Parentheses

import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search("My number is 415-555-4242.")
print(mo.group())
print(mo.group(1))
print(mo.group(2))

print(mo.groups()) # get all groups at once
areaCode, mainNumber = mo.groups()
print(areaCode, mainNumber)

415-555-4242
415
555-4242
('415', '555-4242')
415 555-4242

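3.3 Matching Multiple Groups with the Pipe

The | character is called a "pipe". Use it when you want to match one of several expressions.

When more than one alternative occurs in the searched text, the first occurrence is returned as the Match object.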

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group()) # search() returns only the first occurrence; findall() finds all of them

mo2 = heroRegex.search("Tina Fey and Batman.")
print(mo2.group())

Batman
Tina Fey

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile

3.4 Optional Matching with the Question Mark

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())

mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

Batman
Batwoman

Example 2

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())

mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

415-555-4242
555-4242

3.5 Matching Zero or More Times with the Star

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())

mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowowoman

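3.6 Matching Specific Repetitions with Braces

(Ha){3} matches the string HaHaHa
(Ha){3,5} matches HaHaHa, HaHaHaHa, or HaHaHaHaHa
(Ha){3,} matches three or more repetitions
(Ha){,5} matches zero to five repetitions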

haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())

mo2 = haRegex.search('Ha')
print(mo2 == None) # (Ha){3} matches HaHaHa but not Ha, so search() returns None

HaHaHa
True
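The other brace forms listed in this section can be checked the same way. The following is a minimal sketch that uses re.fullmatch() so that the whole string has to match; the helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r'(Ha){3,5}', 'HaHaHaHa'))  # True: 4 repetitions
print(matches_whole(r'(Ha){3,5}', 'HaHa'))      # False: only 2 repetitions
print(matches_whole(r'(Ha){3,}', 'Ha' * 10))    # True: 3 or more
print(matches_whole(r'(Ha){,5}', ''))           # True: zero repetitions allowed
print(matches_whole(r'(Ha){,5}', 'Ha' * 6))     # False: more than 5
```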

3.7 Greedy and Non-Greedy Matching

# greedy
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1)

# non-greedy
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2)

<re.Match object; span=(0, 10), match='HaHaHaHaHa'>
<re.Match object; span=(0, 6), match='HaHaHa'>

3.8 Exercises

Example: the difference between search() and findall()

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

415-555-9999
['415-555-9999', '212-555-0000']
[('415', '555', '9999'), ('212', '555', '0000')]
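The shape of the list returned by findall() depends on how many groups the pattern has. The following is a minimal sketch covering the single-group case, which the examples above do not show, plus re.finditer() as one way to get Match objects for every match. The variable names are made up for illustration.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: a list of the whole matches.
whole = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(whole)  # ['415-555-9999', '212-555-0000']

# Exactly one group: a list of that group's text only.
area_codes = re.findall(r'(\d{3})-\d{3}-\d{4}', text)
print(area_codes)  # ['415', '212']

# finditer() yields Match objects, so both the whole match
# and each group stay available.
pairs = [(mo.group(), mo.group(1))
         for mo in re.finditer(r'(\d{3})-\d{3}-\d{4}', text)]
print(pairs)  # [('415-555-9999', '415'), ('212-555-0000', '212')]
```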

Example: matching vowels

  • [abc]: matches any character inside the brackets
  • [^abc]: matches any character not inside the brackets

# match every vowel
vowelRegex = re.compile(r'[aeiouAEIOU]')
print(vowelRegex.findall('RoboCop eats baby food. BABY FOOD.'))

# match every character that is not a vowel
consonantRegex = re.compile(r'[^aeiouAEIOU]')
print(consonantRegex.findall('RoboCop eats baby food. BABY FOOD.'))
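The two character classes in this example are complements of each other, so between them they account for every character in the text. The following is a minimal sketch that checks this; the variable names are made up for illustration.

```python
import re

text = 'RoboCop eats baby food. BABY FOOD.'

vowels = re.findall(r'[aeiouAEIOU]', text)
others = re.findall(r'[^aeiouAEIOU]', text)  # includes spaces and periods

print(vowels)
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

# Every character is matched by exactly one of the two classes.
print(len(vowels) + len(others) == len(text))  # True
```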

Example: the caret character

beginWithHello = re.compile(r'^Hello')
print(beginWithHello.search('Hello world!'))
print(beginWithHello.search('He said hello.') == None)

<re.Match object; span=(0, 5), match='Hello'>
True

Example: the dollar character

endWithNumber = re.compile(r'\d$')
print(endWithNumber.search('Your number is 42'))
print(endWithNumber.search("Your number is forty two") == None)

<re.Match object; span=(16, 17), match='2'>
True

Example: matching the whole string from start to end

wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('123456789'))
print(wholeStringIsNum.search('12345xyz678') == None)
print(wholeStringIsNum.search('123 456789') == None)

<re.Match object; span=(0, 9), match='123456789'>
True
True

Example: the wildcard character

atRegex = re.compile(r'.at')
print(atRegex.findall('The cat in the hat sat on the flat mat.'))

['cat', 'hat', 'sat', 'lat', 'mat']

Example: matching everything with dot-star

nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search("First Name: A1 Last Name: Sweigart")
print(mo.group(1))
print(mo.group(2))

A1
Sweigart

Example: greedy and non-greedy dot-star

# non-greedy
nongreedyRegex = re.compile(r'<.*?>')
print(nongreedyRegex.search('<To serve man> for dinner.>'))

# greedy
greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search('<To serve man> for dinner.>'))

<re.Match object; span=(0, 14), match='<To serve man>'>
<re.Match object; span=(0, 27), match='<To serve man> for dinner.>'>

Example: matching newlines with the dot character

# by default, the dot does not match a newline
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search("Serve the public trust.\nProtect the innocent.\nUphold the law.").group())

# pass re.DOTALL as the second argument so that the dot also matches newlines
NewlineRegex = re.compile('.*', re.DOTALL)
print(NewlineRegex.search("aaa.\nbbb.").group())

Serve the public trust.
aaa.
bbb.

Example: case-insensitive matching

# pass re.I as the second argument to ignore case
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part machine, all cop.').group())
print(robocop.search('ROBOcop is part machine, all cop.').group())

RoboCop
ROBOcop
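re.DOTALL and re.I can be used together by combining them with the bitwise OR operator. The following is a minimal sketch; the pattern and sample text are made up for illustration.

```python
import re

sample = 'RoboCop protects the innocent.\nHe upholds the LAW.'

# re.I ignores case; re.DOTALL lets "." cross the newline.
both_flags = re.compile(r'robocop.*law', re.I | re.DOTALL)
mo = both_flags.search(sample)
print(mo is not None)  # True

# With re.I alone, ".*" stops at the newline, so nothing matches.
only_ignore_case = re.compile(r'robocop.*law', re.I)
print(only_ignore_case.search(sample) is None)  # True
```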

Example: replacing text with the sub() method

namesRegex = re.compile(r'Agent \w+')
print(namesRegex.sub("CENSORED", "Agent Alice gave the secret documents to Agent Bob."))

CENSORED gave the secret documents to CENSORED.

agentNamesRegex = re.compile(r'Agent (\w)\w*')
print(agentNamesRegex.sub(r'\1****', "Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent."))

A**** told C**** that E**** knew B**** was a double agent.

3.9 Practice: Matching Phone Numbers and E-mail Addresses

import pyperclip, re  # pyperclip is a third-party package

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code (optional)
    (\s|-|\.)?                       # separator (optional)
    (\d{3})                          # first three digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension (optional)
)''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # user name
    @
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # top-level domain
)''', re.VERBOSE)

text = str(pyperclip.paste()) # read the clipboard
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]]) # join the parts of the phone number
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Found the following phone numbers and e-mail addresses:')
    print('\n'.join(matches))
else:
    print("No phone numbers or e-mail addresses found!")
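The script above depends on the clipboard, which makes it awkward to try out. The following is a minimal sketch of the same extraction logic applied to an ordinary string. The function name extract_contacts and the sample text are hypothetical, and the user-name class here allows "+", which is an assumption of this sketch.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code (optional)
    (\s|-|\.)?                       # separator (optional)
    (\d{3})                          # first three digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension (optional)
)''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # user name (the "+" is an assumption)
    @
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # top-level domain
)''', re.VERBOSE)

def extract_contacts(text):
    """Return phone numbers, then e-mail addresses, found in text."""
    matches = []
    for groups in phone_regex.findall(text):
        # groups[0] is the whole match; 1, 3, and 5 are the digit parts.
        phone = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone += ' x' + groups[8]
        matches.append(phone)
    for groups in email_regex.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Call 415-555-1234 ext 42 or 800.555.0000, or write to info@example.com.'
print(extract_contacts(sample))
# ['415-555-1234 x42', '800-555-0000', 'info@example.com']
```

Keeping the extraction in a function that takes a string means the clipboard calls can stay in a thin wrapper around it.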
