Linux文本处理与分析工具

  • 文件读取:cat、head、tail
  • 字符处理:tr、sed、awk
  • 字符截取:cut
  • 搜索过滤:grep、egrep、fgrep、zgrep、zegrep、zfgrep
  • 统计聚合:sort、uniq、wc

OpenBSD源码地址:https://github.com/openbsd/src
GNU项目的coreutils源码地址:https://github.com/coreutils/coreutils

cat,head,tail

cat是最常用的读取文件的命令,有时候还可以用-n参数输出每一行数据的行号:

1
cat -n file

head用来读取文件的前几行,-n参数可以指定行数(默认10行):

1
head -n 100 file

tail用来读取文件末尾的几行,和head一样可以-n参数指定行数(默认10行):

1
tail -n 100 file

tail命令通过-f参数可以读取文件最新的输入,按Ctrl+C可以中止:

1
tail -f file

tr

tr命令translate characters的缩写,用来单个字符或字符集。

最简单的用法是直接传两个参数,将第一个参数字符替换成第二个参数字符。

1
echo 'abcdefg' | tr 'a' 'b'                  # bbcdefg

或者将一个字符集替换成另一个字符集:

1
2
echo 'abcdefg' | tr 'abcd' 'jkmn'            # jkmnefg
echo 'abcdefg' | tr "[:lower:]" "[:upper:]" # ABCDEFG

如果带上-d参数的话,可以删除(delete)参数里指定的字符:

1
echo 'abcdefg' | tr -d 'abc'                 # defg

如果带上-c参数的话,表示取反:

1
2
3
4
# 只保留aceg,其他替换为b
echo 'abcdefg' | tr -c 'aceg' 'b' # abcbebgb
# 只保留aceg,其他全部删除
echo 'abcdefg-abcdefg' | tr -cd 'aceg' # acegaceg

sed

sed命令stream editor的缩写,和tr有点类似,但是功能更强大。

sed命令,用法如下

1
sed command [file ...]

其中command表示editing commands,命令语法如下:

1
[address[,address]]function[arguments]

address是可选的,是命令执行的范围

function是sed的功能函数

arguments是函数的参数。

下面看最常用的s——替换(substitute)函数。

s替换函数

1
echo 'dcbaefg-dcbaefg' | sed 's/dcba/abcd/g'    # abcdefg-abcdefg

上面例子是:s/regular expression/replacement/flagss表示**替换(substitute)**函数,g表示全局(global)替换的flag。

flag

flag除了最常用的g以外,还有i(ignore,忽略大小写),还可以指定替换具体第几个匹配的字符串。

1
2
echo 'DCBAefg-dcbaefg' | sed 's/dcba/abcd/gi'   # abcdefg-abcdefg
echo 'DCBAefg-dcbaefg' | sed 's/dcba/abcd/2i' # DCBAefg-abcdefg

正则替换

注意s是支持regular expression(正则表达式)的,替换(replacement)的字符支持占位符。

1
2
echo 'DCBAefg-dcbaefg' | sed 's/^\([A-Z]*\).*-/\1/g'
# DCBA

上面^\([A-Z]*\).*正则匹配大写字母开头的字符,\1把整个匹配的字符串第一个匹配组的替换整个匹配组。

1
2
echo 'DCBAefg-dcbaefg' | sed 's/^\([A-Z]*\).*-\([a-d]*\).*/\1-\2/g'
# DCBA-dcba

把正则写的再复杂一点,后面的-\([a-d]*\).*匹配-后面的a到d的字符,然后使用\1-\2替换整个匹配组。

正则的规则可以参考regex cheat sheet,这个命令使用的多好,就看正则表达式掌握的多好了。

范围替换

s命令前面还可以指定替换第几行的替换范围。

1
2
3
4
5
6
7
8
9
10
echo 'dcbaefg-dcbaefg\ndcbaefg-dcbaefg\ndcbaefg-dcbaefg\ndcbaefg-dcbaefg' > a.txt
dcbaefg-dcbaefg
dcbaefg-dcbaefg
dcbaefg-dcbaefg
dcbaefg-dcbaefg
cat a.txt | sed "2,3s/dcba/ABCD/g"
dcbaefg-dcbaefg
ABCDefg-ABCDefg
ABCDefg-ABCDefg
dcbaefg-dcbaefg

这个范围也就是address,数字表示的是行号,$表示最后一行

1
2
3
4
5
cat a.txt | sed '3,$s/dcba/abcd/g'
dcbaefg-dcbaefg
dcbaefg-dcbaefg
ABCDefg-ABCDefg
ABCDefg-ABCDefg

address还可以是正则表达式

1
2
3
4
5
6
7
8
9
10
echo 'DCBAefg-dcbaefg\ndcbaefg-dcbaefg\ndcbaefg-dcbaefg\nDCBAefg-dcbaefg' > a.txt
DCBAefg-dcbaefg
dcbaefg-dcbaefg
dcbaefg-dcbaefg
DCBAefg-dcbaefg
cat a.txt | sed '/DCBA/s/dcba/ABCD/g'
DCBAefg-ABCDefg
dcbaefg-dcbaefg
dcbaefg-dcbaefg
DCBAefg-ABCDefg

多个匹配

可以在一个表达式里写多个替换规则,用分号;隔开,比如下面的1,2s/dcba/abcd/g指定1~2行把dcba替换成abcd3,$s/DCBA/abcd/g指定第三行3之后的行DCBA替换成abcd

1
2
3
4
5
6
7
8
9
10
echo 'DCBAefg-dcbaefg\nDCBAefg-dcbaefg\nDCBAefg-dcbaefg\nDCBAefg-dcbaefg' > a.txt
DCBAefg-dcbaefg
DCBAefg-dcbaefg
DCBAefg-dcbaefg
DCBAefg-dcbaefg
cat a.txt | sed '1,2s/dcba/abcd/g; 3,$s/DCBA/abcd/g'
DCBAefg-abcdefg
DCBAefg-abcdefg
abcdefg-dcbaefg
abcdefg-dcbaefg

上面以;分割开的多个替换规则,与下面多个-e参数的方式等价:

1
sed -e '1,2s/dcba/abcd/g' -e '3,$s/DCBA/abcd/g'

d函数

d函数是用来删除(delete)指定行的

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
cat a.txt
1abcdefg
2abcdefg
3abcdefg
4abcdefg
cat a.txt | sed '3d'
1abcdefg
2abcdefg
4abcdefg
cat a.txt | sed '2,3d'
1abcdefg
4abcdefg
cat a.txt | sed '/3a/d'
1DCBAefg
2DCBAefg
4DCBAefg
cat a.txt | sed '/[24]a/d'
1abcdefg
3abcdefg

a函数和i函数

a函数和i函数分别表示追加(append)和插入(insert)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
cat a.txt
1abcdefg
2abcdefg
3abcdefg
4abcdefg
cat a.txt | sed '3i---'
1abcdefg
2abcdefg
---
3abcdefg
4abcdefg
cat a.txt | sed '2,3i---'
1abcdefg
---
2abcdefg
---
3abcdefg
4abcdefg
cat a.txt | sed '3a---'
1abcdefg
2abcdefg
3abcdefg
---
4abcdefg
cat a.txt | sed '2,3a---'
1abcdefg
2abcdefg
---
3abcdefg
---
4abcdefg

更多的函数使用可以参考man手册

awk

awk是一个文本处理语言,没错它是个编程语言,像其他语言一样支持if else,while循环等各种流程控制语句,也支持函数。让我们一步步地来慢慢丰满一个翔实的awk脚本。

我们就拿netstat输出结果来做示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
netstat -nap tcp
tcp4 0 0 192.168.10.101.62566 199.232.68.133.443 SYN_SENT
tcp4 0 0 192.168.10.101.62564 112.60.8.158.443 ESTABLISHED
tcp4 0 0 192.168.10.101.62563 112.60.14.170.443 ESTABLISHED
tcp4 0 0 192.168.10.101.62560 17.57.145.20.5223 ESTABLISHED
tcp4 0 0 192.168.10.101.62558 120.241.186.223.8080 ESTABLISHED
tcp4 0 0 192.168.10.101.62554 104.224.178.228.10000 ESTABLISHED
tcp4 0 0 127.0.0.1.1087 127.0.0.1.62553 ESTABLISHED
tcp4 0 0 127.0.0.1.62553 127.0.0.1.1087 ESTABLISHED
tcp4 0 0 127.0.0.1.62547 127.0.0.1.1087 CLOSE_WAIT
tcp4 0 0 127.0.0.1.1087 127.0.0.1.62542 FIN_WAIT_2
tcp4 129 0 127.0.0.1.62542 127.0.0.1.1087 CLOSE_WAIT
tcp4 0 0 192.168.10.101.62529 104.224.178.228.10000 ESTABLISHED
...

如果我们只想要远端的socket地址(也就是第5列数据),那我们可以用:

1
netstat -nap tcp | awk '{print $5}'

看起来很简单,其实这里面隐含了几个awk默认的设置:

RS(record separator),默认是换行符,也就是说awk默认会把一行当成一条记录进行处理

FS(field separator),默认是空格符,也就是说awk默认会把以空格符隔开的字符当成一个字段处理

OFS(output field separator),默认是空格,也就是说输出的时候两个字段之间用空格来分开

ORS(output record separator),默认是换行,也就是说输出的时候两条记录之间用换行分开

NR(number of record),扫描过程中当前行的行号,类似的还有NF(number of field),FNR(current file number of record)

另外,$1~`$n表示第几个字段,$0`表示整条记录。

awk更多的内置变量可以看awk的man手册

如果我只想要TCP状态为LISTEN行数据,那我们可以用:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
netstat -nap tcp | awk '$6=="LISTEN"'
tcp46 0 0 *.58299 *.* LISTEN
tcp4 0 0 127.0.0.1.47833 *.* LISTEN
tcp4 0 0 127.0.0.1.1087 *.* LISTEN
tcp4 0 0 127.0.0.1.1080 *.* LISTEN
tcp46 0 0 *.11085 *.* LISTEN
tcp6 0 0 *.57977 *.* LISTEN
tcp4 0 0 *.57977 *.* LISTEN
tcp4 0 0 192.168.101.143.5786 *.* LISTEN
tcp46 0 0 *.5432 *.* LISTEN
tcp4 0 0 *.59390 *.* LISTEN
tcp4 0 0 127.0.0.1.59376 *.* LISTEN
tcp4 0 0 127.0.0.1.63342 *.* LISTEN
tcp4 0 0 127.0.0.1.6942 *.* LISTEN
tcp46 0 0 *.13231 *.* LISTEN
tcp4 0 0 127.0.0.1.8080 *.* LISTEN
tcp46 0 0 *.8080 *.* LISTEN
tcp6 0 0 *.16800 *.* LISTEN
tcp4 0 0 *.16800 *.* LISTEN

如果我们想把状态为ESTABLISHED的第4列数据和第5列数据都拿出来:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
netstat -nap tcp | awk '$6=="ESTABLISHED" {print $4,$5}'
192.168.101.143.58585 54.182.0.111.443
192.168.101.143.58171 59.36.121.194.8080
192.168.101.143.58148 108.177.97.188.5228
192.168.101.143.58144 59.37.96.203.8080
192.168.101.143.58143 59.37.96.203.8080
192.168.101.143.58141 203.119.169.109.443
192.168.101.143.58124 59.82.40.137.443
192.168.101.143.58029 104.22.52.136.443
192.168.101.143.58027 113.96.202.106.8080
fe80::32b5:4c0b:.1025 fe80::4e79:e738:.9763
fe80::32b5:4c0b:.1024 fe80::4e79:e738:.1024
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

有些地址是ipv6的,我只想要ipv4的状态为ESTABLISHED的连接,那可以用:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED"'
tcp4 0 0 192.168.101.143.58654 114.55.105.214.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58653 222.73.192.244.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58652 203.208.40.34.443 ESTABLISHED
tcp4 0 0 127.0.0.1.59390 127.0.0.1.58651 ESTABLISHED
tcp4 0 0 127.0.0.1.58651 127.0.0.1.59390 ESTABLISHED
tcp4 0 0 192.168.101.143.58585 54.182.0.111.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58171 59.36.121.194.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58148 108.177.97.188.5228 ESTABLISHED
tcp4 0 0 192.168.101.143.58144 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58143 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58141 203.119.169.109.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58124 59.82.40.137.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58029 104.22.52.136.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58027 113.96.202.106.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.53477 17.57.145.132.5223 ESTABLISHED
tcp4 0 0 192.168.101.143.55123 199.232.68.133.443 ESTABLISHED

当然也可以用正则:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
netstat -nap tcp | awk '/^tcp4 / && $6=="ESTABLISHED"'
tcp4 0 0 192.168.101.143.58654 114.55.105.214.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58653 222.73.192.244.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58652 203.208.40.34.443 ESTABLISHED
tcp4 0 0 127.0.0.1.59390 127.0.0.1.58651 ESTABLISHED
tcp4 0 0 127.0.0.1.58651 127.0.0.1.59390 ESTABLISHED
tcp4 0 0 192.168.101.143.58585 54.182.0.111.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58171 59.36.121.194.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58148 108.177.97.188.5228 ESTABLISHED
tcp4 0 0 192.168.101.143.58144 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58143 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58141 203.119.169.109.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58124 59.82.40.137.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58029 104.22.52.136.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58027 113.96.202.106.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.53477 17.57.145.132.5223 ESTABLISHED
tcp4 0 0 192.168.101.143.55123 199.232.68.133.443 ESTABLISHED

也可以对某一列进行正则匹配:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
netstat -nap tcp | awk ' $6=="ESTABLISHED" && $1~/tcp4$/'
tcp4 0 0 127.0.0.1.59390 127.0.0.1.59071 ESTABLISHED
tcp4 0 0 127.0.0.1.59071 127.0.0.1.59390 ESTABLISHED
tcp4 0 0 192.168.101.143.59061 13.67.9.5.443 ESTABLISHED
tcp4 0 0 192.168.101.143.59058 54.182.0.111.443 ESTABLISHED
tcp4 0 0 192.168.101.143.59040 59.82.33.252.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58171 59.36.121.194.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58148 108.177.97.188.5228 ESTABLISHED
tcp4 0 0 192.168.101.143.58144 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58143 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58141 203.119.169.109.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58029 104.22.52.136.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58027 113.96.202.106.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.53477 17.57.145.132.5223 ESTABLISHED
tcp4 0 0 192.168.101.143.55123 199.232.68.133.443 ESTABLISHED

如果想要把表头也带上,就可以用NR变量:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" || NR==2'
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 127.0.0.1.59390 127.0.0.1.58672 ESTABLISHED
tcp4 0 0 127.0.0.1.58672 127.0.0.1.59390 ESTABLISHED
tcp4 0 0 192.168.101.143.58670 13.67.9.5.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58669 13.67.9.5.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58665 117.18.232.200.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58652 203.208.40.34.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58585 54.182.0.111.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58171 59.36.121.194.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58148 108.177.97.188.5228 ESTABLISHED
tcp4 0 0 192.168.101.143.58144 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58143 59.37.96.203.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.58141 203.119.169.109.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58124 59.82.40.137.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58029 104.22.52.136.443 ESTABLISHED
tcp4 0 0 192.168.101.143.58027 113.96.202.106.8080 ESTABLISHED
tcp4 0 0 192.168.101.143.53477 17.57.145.132.5223 ESTABLISHED
tcp4 0 0 192.168.101.143.55123 199.232.68.133.443 ESTABLISHED

只想看两个socket地址:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" {print $4,$5}'
127.0.0.1.59390 127.0.0.1.58699
127.0.0.1.58699 127.0.0.1.59390
192.168.101.143.58680 59.36.89.161.443
192.168.101.143.58676 203.208.41.33.443
192.168.101.143.58171 59.36.121.194.8080
192.168.101.143.58148 108.177.97.188.5228
192.168.101.143.58144 59.37.96.203.8080
192.168.101.143.58143 59.37.96.203.8080
192.168.101.143.58141 203.119.169.109.443
192.168.101.143.58124 59.82.40.137.443
192.168.101.143.58029 104.22.52.136.443
192.168.101.143.58027 113.96.202.106.8080
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

为这两列加上表头:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" || NR==2 {print $4,$5}'
Local Address
127.0.0.1.59390 127.0.0.1.58707
127.0.0.1.58707 127.0.0.1.59390
192.168.101.143.58705 180.163.151.34.443
192.168.101.143.58703 54.182.0.111.443
192.168.101.143.58702 13.67.9.5.443
192.168.101.143.58701 13.67.9.5.443
192.168.101.143.58171 59.36.121.194.8080
192.168.101.143.58148 108.177.97.188.5228
192.168.101.143.58144 59.37.96.203.8080
192.168.101.143.58143 59.37.96.203.8080
192.168.101.143.58141 203.119.169.109.443
192.168.101.143.58124 59.82.40.137.443
192.168.101.143.58029 104.22.52.136.443
192.168.101.143.58027 113.96.202.106.8080
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

但是发现表头有问题,因为表头行的Local Address被当成了两列,$4变成了Local$5变成了Address。那我们手动加上表头:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" {print $4,$5} BEGIN {print "本机地址","远程地址"}'
本机地址 远程地址
192.168.101.143.58743 113.96.233.155.443
192.168.101.143.58732 199.193.127.100.17898
127.0.0.1.1080 127.0.0.1.58731
127.0.0.1.58731 127.0.0.1.1080
192.168.101.143.58715 122.228.66.242.443
192.168.101.143.58705 180.163.151.34.443
192.168.101.143.58703 54.182.0.111.443
192.168.101.143.58171 59.36.121.194.8080
192.168.101.143.58148 108.177.97.188.5228
192.168.101.143.58144 59.37.96.203.8080
192.168.101.143.58143 59.37.96.203.8080
192.168.101.143.58141 203.119.169.109.443
192.168.101.143.58124 59.82.40.137.443
192.168.101.143.58029 104.22.52.136.443
192.168.101.143.58027 113.96.202.106.8080
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

输出结果没有对齐,把输出用\t隔开:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" {print $4,$5} BEGIN {print "本机地址","远程地址"}' OFS="\t"
本机地址 远程地址
192.168.101.143.58751 13.67.9.5.443
127.0.0.1.59390 127.0.0.1.58750
127.0.0.1.58750 127.0.0.1.59390
192.168.101.143.58743 113.96.233.155.443
192.168.101.143.58732 199.193.127.100.17898
127.0.0.1.1080 127.0.0.1.58731
127.0.0.1.58731 127.0.0.1.1080
192.168.101.143.58703 54.182.0.111.443
192.168.101.143.58171 59.36.121.194.8080
192.168.101.143.58148 108.177.97.188.5228
192.168.101.143.58144 59.37.96.203.8080
192.168.101.143.58143 59.37.96.203.8080
192.168.101.143.58141 203.119.169.109.443
192.168.101.143.58124 59.82.40.137.443
192.168.101.143.58029 104.22.52.136.443
192.168.101.143.58027 113.96.202.106.8080
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

\tprint不好对齐,那换成printf命令吧:

1
2
3
4
5
6
7
8
9
10
11
12
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" {printf "%-40s %s\n",$4,$5} BEGIN {printf "%-40s %s\n","本机地址","远程地址"}'
本机地址 远程地址
192.168.101.143.59457 13.67.9.5.443
192.168.101.143.59261 104.22.52.136.443
192.168.101.143.59260 104.22.52.136.443
192.168.101.143.59208 183.3.235.67.8080
192.168.101.143.59200 14.215.158.119.443
192.168.101.143.59182 183.3.235.67.8080
192.168.101.143.59181 183.3.235.67.8080
192.168.101.143.59176 203.119.169.22.443
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443

最后输出一下总共有多少行:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
netstat -nap tcp | awk '$1=="tcp4" && $6=="ESTABLISHED" {printf "%-40s %s\n",$4,$5; s+=1} BEGIN {printf "%-40s %s\n","本机地址","远程地址"} END {print "总连接数" s}'
本机地址 远程地址
192.168.101.143.59555 13.224.161.27.443
192.168.101.143.59554 108.177.125.188.443
192.168.101.143.59550 59.82.40.137.443
192.168.101.143.59548 222.73.192.116.443
192.168.101.143.59547 222.73.192.116.443
192.168.101.143.59208 183.3.235.67.8080
192.168.101.143.59200 14.215.158.119.443
192.168.101.143.59182 183.3.235.67.8080
192.168.101.143.59181 183.3.235.67.8080
192.168.101.143.59176 203.119.169.22.443
192.168.101.143.53477 17.57.145.132.5223
192.168.101.143.55123 199.232.68.133.443
总连接数12

最后让我们看看这个慢慢被我们丰满的awk脚本:

1
2
3
4
5
6
7
8
9
10
$1=="tcp4" && $6=="ESTABLISHED" {
printf "%-40s %s\n",$4,$5;
s+=1
}
BEGIN {
printf "%-40s %s\n","本机地址","远程地址"
}
END {
print "总连接数" s
}

BEGINEND是awk执行开始前与执行完成后的标记,$1=="tcp4" && $6=="ESTABLISHED"是一个记录过滤规则的标记,{}内的是标记里的应该执行的命令。

1
2
3
4
5
6
7
8
9
10
11
12
BEGIN {
print "netstat 统计每个连接状态的连接个数"
}

NR != 1 && NR != 2 {
map[$6]+=1;
}

END {
for (key in map)
print key ":" map[key]
}

把这个脚本存到awk.awk文件中,在用-f选项使用这个文件:

1
2
3
4
5
6
7
8
netstat -nap tcp | awk -f awk.awk
netstat 统计每个连接状态的连接个数
LISTEN:16
FIN_WAIT_1:1
FIN_WAIT_2:1
CLOSE_WAIT:2
TIME_WAIT:22
ESTABLISHED:47

Refs: Linux三剑客老大awk:https://zhuanlan.zhihu.com/p/68188159

更详细的awk语言的教程可以参考gnu的awk用户指南

cut

cut命令和上面几个相比简单很多

比如我们想对下面的结果进行截取

1
2
3
4
5
6
7
8
9
10
11
cat /dev/random | od -x | head > hex.dump
0000000 4b94 cf26 b92e 2750 1dba efed 3d03 bf9c
0000020 adcb 9faf d760 2474 71ae 773c 4a96 44aa
0000040 ec58 8cc8 8035 3bf8 8f33 7290 c97c 3e96
0000060 9c01 58d3 4d35 2330 918b b745 ed5f 368b
0000100 11d1 2129 c554 ab0d 354a 9c75 cc35 7fbb
0000120 7943 4404 bb45 fd5f 299f 11c3 426d ba39
0000140 a8d6 4c9c 22ec ab00 1ea9 a5b7 e50e 7453
0000160 ff52 c73b ad9b 2a2d 8fb8 8eff 0049 17d2
0000200 4401 8894 d714 7279 d44b 137d bd42 ad76
0000220 5af5 7747 d8b2 9842 f3c1 8f20 f156 92cd

截取每一行的1-25个字符:

1
2
3
4
5
6
7
8
9
10
11
cat hex.dump | cut -c1-25
0000000 4b94 cf26
0000020 adcb 9faf
0000040 ec58 8cc8
0000060 9c01 58d3
0000100 11d1 2129
0000120 7943 4404
0000140 a8d6 4c9c
0000160 ff52 c73b
0000200 4401 8894
0000220 5af5 7747

-c代表以字符的方式进行截取1-25代表截取的范围,这个范围可以有多个,然后用,分隔开:

1
2
3
4
5
6
7
8
9
10
11
cat hex.dump | cut -c-10,25-35,40-
0000000 6 b92e 50 1dba efed 3d03 bf9c
0000020 f d760 74 71ae 773c 4a96 44aa
0000040 8 8035 f8 8f33 7290 c97c 3e96
0000060 3 4d35 30 918b b745 ed5f 368b
0000100 9 c554 0d 354a 9c75 cc35 7fbb
0000120 4 bb45 5f 299f 11c3 426d ba39
0000140 c 22ec 00 1ea9 a5b7 e50e 7453
0000160 b ad9b 2d 8fb8 8eff 0049 17d2
0000200 4 d714 79 d44b 137d bd42 ad76
0000220 7 d8b2 42 f3c1 8f20 f156 92cd

没有开头的范围代表从1开始,没有结尾代表截到最末尾。

cut还支持以字节为单位截取每一行:

1
2
3
4
5
6
7
8
9
10
11
cat hex.dump | cut -b -10,40-
0000000 50 1dba efed 3d03 bf9c
0000020 74 71ae 773c 4a96 44aa
0000040 f8 8f33 7290 c97c 3e96
0000060 30 918b b745 ed5f 368b
0000100 0d 354a 9c75 cc35 7fbb
0000120 5f 299f 11c3 426d ba39
0000140 00 1ea9 a5b7 e50e 7453
0000160 2d 8fb8 8eff 0049 17d2
0000200 79 d44b 137d bd42 ad76
0000220 42 f3c1 8f20 f156 92cd

cut也可以和awk一样实现简单的字段截取,但是cut只支持单字符的分隔符,所以先把上面的文件转成,分割

1
cat hex.dump | sed -E 's/ +/,/g' | cut -d',' -f1,4,5

-d(delimiter)指定分隔字符,-f(fields)指定字段的范围。

grep,egrep,fgrep,zgrep,zegrep,zfgrep

grep是使用正则搜索过滤的常用工具。与之相关的还有egrep、fgrep、zgrep、zegrep、zfgrep,他们的含义如下:

  • grep: Global Regular Expressions Print
  • egrep: Extended Global Regular Expressions Print,与grep -E等价。egrep是对grep的扩展,原来的grep不支持+,?,{m,n},(a|b)等正则,egrep提供了这些支持。
  • fgrep: Fixed-string Global Regular Expressions Print,与grep -F等价。fgrep就是简单的字符匹配,不使用正则。
  • zgrep: Zip Global Regular Expressions Print,与grep -Z等价,可以对压缩文件进行匹配。
  • zegrep: zgrep 和 egrep的结合
  • zfgrep: zgrep 和 fgrep的结合

像线程堆栈这种,有时候需要查看匹配项的前几行或后几行,-A(after),-B(before),-C(context)返回匹配项上下文的几行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
cat numbers.txt
1
2
3
4
5
6
7
8
9
# 后两行
cat numbers.txt | grep -A2 '5'
5
6
7
# 前两行
cat numbers.txt | grep -B2 '5'
3
4
5
# 前后各两行
cat numbers.txt | grep -C2 '5'
3
4
5
6
7

使用-v(invert)反向匹配,也就是现实不匹配规则的行

1
2
3
4
5
6
cat numbers.txt | egrep -v '(0|2|4|6|8)'
1
3
5
7
9

注意上面使用了egrep,因为grep不支持(|)的写法
使用-c参数可以现实匹配的行数

1
2
cat numbers.txt | egrep -c '(0|2|4|6|8)'
5

使用-m可以选择显示几个匹配行,-n可以现实匹配的行号

1
2
3
cat numbers.txt | egrep -nm2 '(0|2|4|6|8)'
2:2
4:4

sort

sort用来进行排序的,sort是支持大文件排序的,用到外排序的算法,所以sort的源码也是学习外排序算法的最佳工程实践。

1
cat file | sort

sort命令默认是按照字母序排序的,所以55会排在6的前面,如果使用数字序可以用-n(numeric)参数:

1
cat file | sort -n

对于包含1.6E-35这类科学记数法的数字排序,可以使用-g(general-numeric)参数

1
cat file | sort -g

逆序-r(reverse):

1
cat file | sort -r

随机序(或者叫随机打乱)-R(random):

1
cat file | sort -R

对字段(key)进行排序,比如下面对/etc/passwd文件,按照:字符进行字段分割,然后对第3个字段进行排序:

1
sort --field-separator=: -k=3n /etc/passwd

uniq

uniq通常和sort配合使用,因为重复的行如果不相邻,uniq命令无法检测重复,所以排重之前必须先排序。

文件内容去重,每个唯一的行只显示一次:

1
sort file | uniq

只显示不重复的唯一行:

1
sort file | uniq -u

只显示重复的行:

1
sort file | uniq -d

去重并显示每个唯一行出现的次数:

1
sort file | uniq -c

按照每个唯一行出现的次数倒序排序:

1
sort file | uniq -c | sort -nr

wc

wc(word count)不仅可以用来统计单词个数,还可以统计数据行数,字符个数。

行数:

1
cat file | wc -l

单词个数:

1
cat file | wc -w

字符个数:

1
2
3
4
# 单字节字符
cat file | wc -c
# 多字节字符
cat file | wc -m

实践试试看

统计nginx访问次数最多的10个ip地址:

1
cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 10

统计ssh登陆失败的用户与来访ip地址:

1
cat /var/log/secure | grep 'Failed password' | sed 's/.*Failed password for \(.*\) from \(.*\) port.*/\1 \2/g' | sed 's/invalid user //g' | sort

统计ssh登陆失败的最多的10个ip地址:

1
2
3
4
5
6
7
8
9
10
11
cat /var/log/secure | grep 'Failed password' | sed 's/.*Failed password for \(.*\) from \(.*\) port.*/\1 \2/g' | sed 's/invalid user //g' | awk '{print $2}' | sort | uniq -c | sort -nr | head
8801 115.84.121.234
3147 115.159.25.205
2232 175.190.126.170
1051 95.111.236.177
989 221.234.36.43
383 45.248.189.3
367 61.222.164.71
315 205.185.120.48
315 159.75.91.118
279 222.186.150.98

统计ssh暴力攻击最喜欢用那些用户名:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
cat /var/log/secure | grep 'Failed password' | sed 's/.*Failed password for \(.*\) from \(.*\) port.*/\1 \2/g' | sed 's/invalid user //g' | awk '{print $1}' | sort | uniq -c | sort -nr | head -20
34167 root
1585 admin
960 user
791 oracle
757 test
462 ftpuser
421 ubuntu
375 postgres
352 test1
323 test2
313 git
292 usuario
197 mysql
170 guest
158 hadoop
157 nagios
150 deploy
141 pi
133 jenkins
119 support

本作品采用 知识共享署名 4.0 国际许可协议 进行许可。

转载时请注明原文链接:https://blog.hufeifei.cn/2021/05/Linux/tr_grep_sed_awk/

鼓励一下
支付宝微信