凤凰山笔记 (Fenghuangshan Notes)

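A Quick-Start Tutorial on Regular Expressions

Abstract:
I came across a very good regular expression tutorial in a book and wrote it up here. The tutorial works through examples with the Linux grep command so that you can pick up regular expressions quickly. Regular expressions are used heavily in nginx configuration and in Linux commands. I have tried to keep things simple. Even a long, intimidating expression can be understood with a little patience, because a regular expression is built from small pieces that can be read one at a time, unlike complex, abstract program logic.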

grep is a commonly used Linux command that searches text for lines matching a pattern and prints the matching lines.

grep 'search string' filename

An example:

grep 'root' /etc/passwd
root:x:0:0:root:/root:/bin/bash

To highlight matches in color, you can define an alias for grep:

alias grep='grep --color=auto'

Sample file r.txt

On Linux you can fetch it with the following commands:

wget http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt
mv regular_express.txt r.txt
cat r.txt
"Open Source" is a good mechanism to develop programs.
apple is my favorite food.
Football game is not use feet only.
this dress doesn't fit me.
However, this dress is about $ 3183 dollars.
GNU is free air not free beer.
Her hair is very beauty.
I can't finish the test.
Oh! The soup taste good.
motorcycle is cheap than car.
This window is clear.
the symbol '*' is represented as start.
Oh! My god!
The gd software is a library for drafting programs.
You are the best is mean you are the no. 1.
The world <Happy> is the same with "glad".
I like dog.
google is the best tools for search keyword.
goooooogle yes!
go! go! Let's go.
# I am VBird

The file has 22 lines in total; the last line is blank.

Basic regular expression exercises

Example 1: searching for a specific string

# grep -n 'the' r.txt

8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

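Example 2: searching with square brackets []

Suppose you want to find the words test and taste. What they have in common is the shape 't?st', where ? is either a or e. You can search like this:

# grep -n 't[ae]st' r.txt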

8:I can't finish the test.
9:Oh! The soup taste good.
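To make the "one character only" behavior of brackets concrete, here is a minimal sketch on a made-up word list (the words are assumptions chosen for illustration, not taken from r.txt).

```shell
# One word per line; [ae] stands for exactly one character, either a or e.
words=$(printf '%s\n' test taste tast tost taest)

# "tost" fails because o is not in the set; "taest" fails because
# the bracket cannot consume both a and e.
matched=$(printf '%s\n' "$words" | grep 't[ae]st')

echo "$matched"
```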

No matter how many characters appear inside [], the bracket expression stands for exactly one character. To find lines containing oo:

# grep -n 'oo' r.txt

1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

If you do not want a g in front of the oo:

# grep -n '[^g]oo' r.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!
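Lines 18 and 19 may look surprising here, since both begin with "goo". A quick way to see which part of each line actually matched is the -o option, which prints only the matched text. Note that -o is widely supported (GNU and BSD grep) but is not required by POSIX, so treat this as a sketch for those implementations.

```shell
line18='google is the best tools for search keyword.'
line19='goooooogle yes!'

# In line 18 the match comes from "tools", not from "google".
part18=$(printf '%s\n' "$line18" | grep -o '[^g]oo')

# In line 19 an o (which is "not g") precedes another pair of o's.
part19=$(printf '%s\n' "$line19" | grep -o '[^g]oo')

echo "line 18 matched: $part18"
echo "line 19 matched:"
echo "$part19"
```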

If you do not want a lowercase letter in front of the oo:

# grep -n '[^a-z]oo' r.txt
3:Football game is not use feet only.

The same idea applies to [a-z], [A-Z], [0-9], [a-zA-Z0-9], and so on. For example:

# grep -n '[0-9]' r.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
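Ranges such as [a-z] can depend on the collation order of the current locale. POSIX bracket classes such as [[:digit:]] and [[:upper:]] are an alternative that does not rely on that order. The sketch below compares the two on lines copied from r.txt; the sample helper is an illustrative assumption.

```shell
sample() {
  printf '%s\n' \
    'However, this dress is about $ 3183 dollars.' \
    'apple is my favorite food.' \
    'You are the best is mean you are the no. 1.'
}

# Lines containing a digit, written as a range and as a POSIX class.
by_range=$(sample | grep -c '[0-9]')
by_class=$(sample | grep -c '[[:digit:]]')

# Lines whose first character is an uppercase letter.
upper_start=$(sample | grep -c '^[[:upper:]]')

echo "range=$by_range class=$by_class upper_start=$upper_start"
```

Note the doubled brackets: the class name [:digit:] is only valid inside a bracket expression.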

Example 3: the line-start and line-end anchors ^ and $

To list only the lines that begin with the:

# grep -n '^the' r.txt
12:the symbol '*' is represented as start.

To list the lines that begin with a lowercase letter:

# grep -n '^[a-z]' r.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

To list the lines that do not begin with an English letter:

# grep -n '^[^a-zA-Z]' r.txt
1:"Open Source" is a good mechanism to develop programs.
21:# I am VBird

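Note: the ^ symbol means different things inside and outside square brackets []. Inside [] it negates the set ("reverse selection"); outside [] it anchors the match to the start of the line.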
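The two roles of ^ can be seen side by side in a minimal sketch. The four input lines are shortened, made-up variants of lines in r.txt, and the sample helper is hypothetical.

```shell
sample() {
  printf '%s\n' 'the symbol' 'The world' '# I am VBird' 'at the end'
}

# ^ outside brackets: the line must start with "the".
starts_with_the=$(sample | grep '^the')

# ^ inside brackets: "he" preceded by any character except t.
not_t_he=$(sample | grep '[^t]he')

# Both roles at once: the first character of the line is not a letter.
non_letter_start=$(sample | grep '^[^a-zA-Z]')

echo "$starts_with_the"
echo "$not_t_he"
echo "$non_letter_start"
```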
To find the lines that end with a period (.):

# grep -n '\.$' r.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.

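The dot has a special meaning in regular expressions (explained below), so it has to be escaped with a backslash (\). Lines 5 through 9 also end with a period, so why were they not printed? Use cat -A to display lines 5 through 10:

# cat -An r.txt | head -n 10 | tail -n 6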
5 However, this dress is about $ 3183 dollars.^M$
6 GNU is free air not free beer.^M$
7 Her hair is very beauty.^M$
8 I can't finish the test.^M$
9 Oh! The soup taste good.^M$
10 motorcycle is cheap than car.$

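Lines 5 through 9 use Windows (DOS) line endings, shown as ^M$, while line 10 uses the Linux line ending, shown as $. On the DOS lines the last character before the end of the line is the carriage return (^M) rather than the period, so '\.$' does not match them. This also shows why the $ symbol is used to denote the end of a line. To find blank lines:

# grep -n '^$' r.txt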
22:
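The effect of DOS line endings on '\.$' can be reproduced without r.txt. The sketch below builds two lines by hand, one ending in LF and one in CRLF, and then strips the carriage returns with tr. The make_sample helper and its text are assumptions for illustration.

```shell
# Two lines ending in a period: one with a Unix ending (LF),
# one with a DOS ending (CR followed by LF).
make_sample() {
  printf 'unix line.\n'
  printf 'dos line.\r\n'
}

# Only the Unix line matches: on the DOS line the character
# just before the end of the line is a carriage return.
unix_only=$(make_sample | grep -c '\.$')

# Deleting the carriage returns first lets both lines match.
after_strip=$(make_sample | tr -d '\r' | grep -c '\.$')

echo "before=$unix_only after=$after_strip"
```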

Linux configuration files contain many comment lines that start with #. To hide both blank lines and comments:

# grep -v '^$' /etc/deluser.conf | grep -v '^#'
REMOVE_HOME = 0
REMOVE_ALL_FILES = 0
BACKUP = 0
BACKUP_TO = "."
ONLY_IF_EMPTY = 0
EXCLUDE_FSTYPES = "(proc|sysfs|usbfs|devpts|tmpfs|afs)"
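The two chained grep commands can also be written as a single call, because grep accepts several patterns through repeated -e options. The sketch below checks that both forms give the same result on a small made-up configuration; the conf helper and its contents are hypothetical.

```shell
# A tiny stand-in for a configuration file: comments, blank lines, settings.
conf() {
  printf '%s\n' \
    '# comment line' \
    '' \
    'REMOVE_HOME = 0' \
    '' \
    '# another comment' \
    'BACKUP = 0'
}

# The pipeline from the text: drop blank lines, then drop comments.
two_greps=$(conf | grep -v '^$' | grep -v '^#')

# One grep with two patterns: -v drops a line if either pattern matches.
one_grep=$(conf | grep -v -e '^$' -e '^#')

echo "$one_grep"
```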

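Example 4: any single character (.) and repetition (*)

. (dot): matches exactly one arbitrary character;
* (asterisk): repeats the preceding character zero or more times.
Suppose you want to find strings of the form g??d, that is, a g, any two characters, and then a d:

# grep -n 'g..d' r.txt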
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

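To list data such as oo, ooo, and oooo, you need the asterisk. Note that 'o*' matches '', 'o', 'oo', 'ooo', and so on; in other words, it also matches the empty string. 'oo*' matches 'o', 'oo', 'ooo', and so on, that is, at least one o. By the same reasoning, at least two o's is written 'ooo*':

# grep -n 'ooo*' r.txt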
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
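The difference between 'o*', 'oo*', and 'ooo*' is easiest to see by counting matches on a short word list. The words below are made up for illustration and contain zero, one, two, and three o's.

```shell
words() { printf '%s\n' gd god good goood; }

# 'o*' also matches the empty string, so every line is selected.
zero_or_more=$(words | grep -c 'o*')

# 'oo*' needs at least one o.
one_or_more=$(words | grep -c 'oo*')

# 'ooo*' needs at least two consecutive o's.
two_or_more=$(words | grep -c 'ooo*')

echo "o*=$zero_or_more oo*=$one_or_more ooo*=$two_or_more"
```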

To find two g's with at least one o between them, that is, gog, goog, gooog, and so on:

# grep -n 'goo*g' r.txt
18:google is the best tools for search keyword.
19:goooooogle yes!

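To find a string that starts with g and ends with g, is the pattern 'g*g'? No: 'g*g' means zero or more g's followed by a g, so it matches any line that contains a single g. The correct pattern is 'g.*g', where '.*' stands for zero or more arbitrary characters:

# grep -n 'g.*g' r.txt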
1:"Open Source" is a good mechanism to develop programs.
14:The gd software is a library for drafting programs.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.
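A minimal sketch of why 'g*g' is the wrong pattern and 'g.*g' is the right one, using a few made-up inputs:

```shell
words() { printf '%s\n' 'g' 'gg' 'go! go!' 'dog'; }

# 'g*g' is "zero or more g, then g": a single g anywhere is enough,
# so all four lines are selected.
star_only=$(words | grep -c 'g*g')

# 'g.*g' is "g, then anything (possibly nothing), then g":
# the line must contain at least two g's.
dot_star=$(words | grep 'g.*g')

echo "g*g selects $star_only lines"
echo "$dot_star"
```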

To allow only English letters between the two g's:

# grep -n 'g[a-zA-Z]*g' r.txt
18:google is the best tools for search keyword.
19:goooooogle yes!

To find lines that contain a number of any length:

# grep -n '[0-9][0-9]*' r.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
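To pull out the numbers themselves rather than the whole lines, the same pattern can be combined with -o, which prints only the matched text. As a caveat, -o is supported by GNU and BSD grep but is not part of POSIX. The sample helper is an illustrative assumption that reuses lines from r.txt.

```shell
sample() {
  printf '%s\n' \
    'However, this dress is about $ 3183 dollars.' \
    'I like dog.' \
    'You are the best is mean you are the no. 1.'
}

# Lines that contain at least one digit.
with_digits=$(sample | grep -c '[0-9][0-9]*')

# Each matched run of digits, one per output line.
numbers=$(sample | grep -o '[0-9][0-9]*')

echo "lines with digits: $with_digits"
echo "$numbers"
```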

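Example 5: limiting the repetition range with {}

So far, . and * have been used to match anywhere from zero to an unlimited number of repetitions. What if you need to limit the repetition count? That is what the interval braces {} are for. In the basic regular expression syntax that grep uses by default, an unescaped { is an ordinary character, so the braces must be escaped with backslashes and written as \{ and \}. To find strings containing two consecutive o's:

# grep -n 'o\{2\}' r.txt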
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

To find a g followed by 2 to 5 o's and then another g:

# grep -n 'go\{2,5\}g' r.txt
18:google is the best tools for search keyword.

Line 19 has 6 o's between the two g's, so it is not selected.
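The interval operator has three forms in basic regular expressions: \{n\} (exactly n), \{n,m\} (n to m), and \{n,\} (n or more). The sketch below tries each of them on made-up words with 1, 2, 5, and 6 o's between two g's.

```shell
# Words with 1, 2, 5, and 6 o's between the two g's.
words() { printf '%s\n' gog goog gooooog goooooog; }

# Exactly two o's between the g's.
exactly_two=$(words | grep 'go\{2\}g')

# Two to five o's: the 6-o word is rejected, like line 19 of r.txt.
two_to_five=$(words | grep -c 'go\{2,5\}g')

# Two or more o's: leaving out the upper bound removes the limit.
two_or_more=$(words | grep -c 'go\{2,\}g')

echo "exactly two: $exactly_two"
echo "2 to 5: $two_to_five, 2 or more: $two_or_more"
```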

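Summary of basic regular expressions

RE character   Meaning
^word          The search string (word) is at the start of the line
word$          The search string (word) is at the end of the line
.              Exactly one arbitrary character
\              Escape character: removes the special meaning of the next character
*              Zero or more repetitions of the preceding character
[list]         One character from the listed set; e.g. 'a[al]y' matches aay and aly
[n1-n2]        One character from the given range; e.g. '[0-9]' matches a decimal digit
[^list]        One character not in the listed set or range; e.g. '[^A-Z]' matches any character that is not an uppercase letter
\{n,m\}        n to m consecutive repetitions of the preceding RE character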

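Extended regular expressions

To use extended regular expressions with grep, add the -E option or use the equivalent egrep command.

RE character   Meaning
+              One or more repetitions of the preceding character
?              Zero or one occurrence of the preceding character
|              Alternation (or): matches any of several strings, e.g. egrep -n 'gd|good' r.txt
()             Grouping: e.g. to find glad or good, egrep -n 'g(la|oo)d' r.txt
()+            One or more repetitions of the preceding group, e.g. to match "AxyzxyzxyzxyzC": echo 'AxyzxyzxyzxyzC' | egrep 'A(xyz)+C'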
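The extended operators in the table can be tried out with grep -E on a handful of lines. The sketch below assumes four lines copied from r.txt plus the AxyzxyzxyzxyzC test string; the sample helper is hypothetical.

```shell
sample() {
  printf '%s\n' \
    'The gd software is a library for drafting programs.' \
    'The world <Happy> is the same with "glad".' \
    'Oh! The soup taste good.' \
    'I like dog.' \
    'AxyzxyzxyzxyzC'
}

# Alternation: lines containing gd or good.
alt=$(sample | grep -Ec 'gd|good')

# Grouping: g, then la or oo, then d (glad or good).
group=$(sample | grep -Ec 'g(la|oo)d')

# A repeated group: A, one or more "xyz", then C.
plus=$(sample | grep -E 'A(xyz)+C')

# ? makes the preceding o optional: matches gd or god, but not good.
optional=$(sample | grep -E 'go?d')

echo "alt=$alt group=$group"
echo "$plus"
echo "$optional"
```

With -E, the characters +, ?, |, (, and ) are operators without a backslash.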

It is worth emphasizing that the exclamation mark ! is not a special character in regular expressions.

That is all. I hope it helps.
