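An In-Depth Look at Linux Load Average

When we log in to a Linux server, what should we look at besides the output of top?

1. How does load average relate to uptime?

First, a look at the uptime command.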

$ uptime
19:41  up 41 mins, 5 users, load averages: 2.13 2.27 3.39

$ man uptime
The uptime utility displays the current time,
the length of time the system has been up,
the number of users, and
the load average of the system over the last 1, 5, and 15 minutes.

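Load average: the average number of processes that are in the runnable or uninterruptible state per unit of time, in other words the average number of active processes.

Runnable processes: processes that are currently using a CPU or waiting for a CPU.

Uninterruptible processes: processes in the middle of a critical kernel-mode code path that must not be interrupted. The uninterruptible state is in effect a mechanism by which the system protects processes and hardware devices.

Put simply, the load average is the average number of active processes.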
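To make the definition concrete, here is a minimal sketch that counts processes by state. It assumes one state string per line on stdin, which is what `ps -eo state=` prints on Linux with procps (R for runnable, D for uninterruptible sleep). The function name `count_active` is made up for this example, and the result is an instantaneous snapshot, whereas load average is an average of such counts over time.

```shell
# count_active: read one process state per line on stdin and print how many
# entries are runnable (R), uninterruptible (D), and the sum of both.
count_active() {
  awk '
    $1 ~ /^R/ { r++ }
    $1 ~ /^D/ { d++ }
    END { printf "runnable=%d uninterruptible=%d active=%d\n", r, d, r + d }
  '
}

# Example on a live Linux system (assumes procps ps):
#   ps -eo state= | count_active
```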

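Since it is the average number of active processes, the ideal case is exactly one process running on each CPU, so that every CPU is fully used.

For example, with a load average of 1:

On a 1-CPU system, the CPU is fully occupied.
On a 2-CPU system, half of the CPU capacity is idle.

Tip: the CPU count here means the number of logical CPUs. For example, a dual-core, 4-thread i3 counts as 4 CPUs.

As a rule of thumb, once the load average exceeds 70% of the number of CPUs, it is time to start watching and investigating.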
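The 70% rule of thumb is simple arithmetic: divide the load average by the number of logical CPUs and compare the result with 0.70. The sketch below does just that; `load_ratio` is a hypothetical helper name, and the threshold is the one from the text, not a hard limit.

```shell
# load_ratio LOAD NCPU: print the load per logical CPU (two decimals) and a
# status word: "investigate" above the 0.70 rule of thumb, "ok" otherwise.
load_ratio() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN {
    if (ncpu <= 0) { print "ncpu must be positive"; exit 1 }
    ratio = load / ncpu
    status = (ratio > 0.70) ? "investigate" : "ok"
    printf "%.2f %s\n", ratio, status
  }'
}

# Example on a live Linux system (the first field of /proc/loadavg is the
# 1-minute load average, and nproc prints the number of available CPUs):
#   load_ratio "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the uptime sample above, a 1-minute load of 2.13 on a 4-CPU machine gives a ratio of about 0.53, which stays under the threshold.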

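2. Load average versus CPU utilization

Load average counts not only the processes that are using a CPU, but also those waiting for a CPU and those waiting for I/O.

CPU utilization measures how busy the CPUs are per unit of time, so it does not necessarily track load average.

CPU-intensive processes use a lot of CPU, which raises load average; here the two metrics agree.
I/O-intensive processes raise load average while waiting for I/O, but CPU utilization is not necessarily high.
A large number of processes waiting to be scheduled onto a CPU also raises load average, and CPU utilization is high in this case as well.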
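The three cases above can be turned into a rough first-pass check. The sketch below is only an illustration: `classify_load` is a hypothetical helper, and every threshold in it is an assumption chosen for the example rather than a value from any tool or from the original notes.

```shell
# classify_load RATIO BUSY IOWAIT
#   RATIO  : load average divided by the number of logical CPUs
#   BUSY   : overall CPU utilization in percent (100 - %idle)
#   IOWAIT : overall %iowait
# All thresholds are illustrative assumptions; tune them for your system.
classify_load() {
  awk -v ratio="$1" -v busy="$2" -v iowait="$3" 'BEGIN {
    if (ratio <= 0.70)                print "ok"
    else if (iowait >= 20)            print "io-bound"
    else if (busy >= 90 && ratio > 1) print "cpu-contended"
    else if (busy >= 40)              print "cpu-bound"
    else                              print "unclear"
  }'
}
```

A result of "unclear" simply means these coarse numbers are not enough and per-process metrics are needed.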

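3. Load average case studies

3.1 Tools:

stress: a Linux stress-testing tool
mpstat (Multiprocessor Statistics): a multi-core CPU analysis tool that shows per-CPU performance metrics in real time
pidstat: a widely used process analysis tool that shows per-process CPU, memory, I/O, and context-switch metrics in real time
htop: a real-time, color terminal tool for CPU, memory, and I/O metrics

3.1.1 Using mpstat

mpstat [ -P { cpu [,…] | ON | ALL } ] [ -V ] [ interval [ count ] ]

-P: selects which CPUs to report on (a CPU number, or ALL for every CPU)

interval: time between reports, in seconds

count: number of reports to print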

$ mpstat -P ALL 5 2
Linux 4.9.93-linuxkit-aufs (c6e8ee3c2c02) 	12/22/18 	_x86_64_	(2 CPU)

10:39:12     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:39:17     all    0.20    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.60
10:39:17       0    0.20    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   99.40
10:39:17       1    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80

10:39:17     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:39:22     all    0.00    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   99.60
10:39:22       0    0.00    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   99.60
10:39:22       1    0.00    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   99.60
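When reading this output, the column that matters most at a glance is %idle, since CPU utilization is roughly 100 minus %idle. As a small sketch, the filter below reads mpstat output on stdin and prints each CPU label with its busy percentage. `mpstat_busy` is a name invented for this example, and the code assumes the header row contains the literal column names CPU and %idle, as in the sample above.

```shell
# mpstat_busy: read mpstat output on stdin and print "<cpu> <busy%>" for each
# data row, where busy% = 100 - %idle. Column positions are taken from the
# header row, so rows before the first header are ignored.
mpstat_busy() {
  awk '
    /%idle/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "CPU")   c = i
        if ($i == "%idle") d = i
      }
      next
    }
    c && d && NF >= d { printf "%s %.2f\n", $c, 100 - $d }
  '
}

# Example on a live system (Average rows, if present, are printed as well):
#   mpstat -P ALL 5 2 | mpstat_busy
```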

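3.1.2 Using pidstat

pidstat [ options ] [ interval [ count ] ]

-u: report CPU usage per process
-r: report memory usage per process
-d: report I/O usage per process

interval: time between reports, in seconds
count: number of reports to print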

root@c6e8ee3c2c02:/# pidstat -u 5 1
Linux 4.9.93-linuxkit-aufs (c6e8ee3c2c02) 	12/22/18 	_x86_64_	(2 CPU)

10:53:53      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
10:53:58        0      1281   99.80    0.00    0.00    0.20   99.80     1  stress

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0      1281   99.80    0.00    0.00    0.20   99.80     -  stress

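3.2 Hands-on analysis

I used an Ubuntu 18.04 Docker container to simulate, in turn, a CPU-intensive process, an I/O-intensive process, and a large number of processes waiting to be scheduled onto a CPU.

3.2.1 CPU-intensive case

Terminal 1: use stress-ng to simulate a CPU-intensive task: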

root@c6e8ee3c2c02:/# stress-ng -c 1 --timeout 600
stress-ng: info:  [1628] dispatching hogs: 1 cpu

Terminal 2: watch how the load average changes:

# -d highlights the parts of the output that change between updates
$ watch -d uptime
Every 2.0s: uptime    c6e8ee3c2c02: Sat Dec 22 14:14:07 2018
14:14:07 up  4:06,  0 users,  load average: 0.80, 0.67, 0.45

Terminal 3: use mpstat to check the state of the CPUs and pidstat to see per-process details:

root@c6e8ee3c2c02:/# mpstat -P ALL 5 1
Linux 4.9.93-linuxkit-aufs (c6e8ee3c2c02) 	12/22/18 	_x86_64_	(2 CPU)
14:03:55     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:04:00     all   49.85    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   49.65
14:04:00       0   92.80    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00    6.60
14:04:00       1    6.81    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   92.79
root@c6e8ee3c2c02:/# pidstat -u 5 1
Linux 4.9.93-linuxkit-aufs (c6e8ee3c2c02) 	12/22/18 	_x86_64_	(2 CPU)

14:18:06      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:18:11        0      1629   99.80    0.20    0.00    0.00  100.00     0  stress-ng-cpu

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0      1629   99.80    0.20    0.00    0.00  100.00     -  stress-ng-cpu

3.2.2 I/O-intensive case

Terminal 1: use stress-ng to simulate an I/O-intensive task:

root@c6e8ee3c2c02:/# stress-ng -i 1  --timeout 600
stress-ng: info:  [2568] dispatching hogs: 1 io

Terminal 2: watch how the load average changes:

# -d highlights the parts of the output that change between updates
$ watch -d uptime
Every 2.0s: uptime    c6e8ee3c2c02: Sat Dec 22 14:14:07 2018
14:14:07 up  4:06,  0 users,  load average: 0.80, 0.67, 0.45

Terminal 3: use mpstat to check the state of the CPUs and pidstat to see per-process details:

root@c6e8ee3c2c02:/# mpstat -P ALL 5
14:24:42     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:24:47     all    0.45    0.00   12.99   21.13    0.00   23.62    0.00    0.00    0.00   41.81
14:24:47       0    0.24    0.00    9.74    9.98    0.00   49.64    0.00    0.00    0.00   30.40
14:24:47       1    0.65    0.00   15.95   31.25    0.00    0.00    0.00    0.00    0.00   52.16

3.2.3 Many processes contending for the CPUs

Terminal 1: use stress-ng to simulate 4 CPU-bound processes:

root@c6e8ee3c2c02:/# stress-ng -c 4 --timeout 600
stress-ng: info:  [3353] dispatching hogs: 4 cpu

Terminal 2: watch how the load average changes. This machine has 2 CPUs, but the load average is about 4, clearly more than the CPUs can serve at once:

root@c6e8ee3c2c02:/# watch -d uptime
Every 2.0s:...  c6e8ee3c2c02: Sat Dec 22 14:34:41 2018
14:34:41 up  4:27,  0 users,  load average: 4.25, 2.51, 1.29

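Terminal 3: mpstat shows that both CPUs are saturated, and pidstat then shows which processes are using them. Four processes are competing for 2 CPUs, so each one spends about 50% of its time waiting for a CPU (%wait):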

root@c6e8ee3c2c02:/# mpstat -P ALL 5
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       0   99.90    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       1   99.70    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00
root@c6e8ee3c2c02:/# pidstat -u 5
Linux 4.9.93-linuxkit-aufs (c6e8ee3c2c02) 	12/22/18 	_x86_64_	(2 CPU)
14:39:10      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:39:15        0      3354   49.50    0.00    0.00   50.30   49.50     0  stress-ng-cpu
14:39:15        0      3355   49.70    0.00    0.00   50.30   49.70     1  stress-ng-cpu
14:39:15        0      3356   49.50    0.20    0.00   50.10   49.70     0  stress-ng-cpu
14:39:15        0      3357   49.30    0.00    0.00   50.30   49.30     1  stress-ng-cpu
14:39:15        0      3565    0.00    0.20    0.00    0.00    0.20     0  watch

14:39:15      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:39:20        0      3354   49.80    0.00    0.00   50.20   49.80     0  stress-ng-cpu
14:39:20        0      3355   49.60    0.00    0.00   50.40   49.60     1  stress-ng-cpu
14:39:20        0      3356   49.60    0.00    0.00   50.40   49.60     0  stress-ng-cpu
14:39:20        0      3357   49.80    0.00    0.00   50.40   49.80     1  stress-ng-cpu
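The telltale sign in this case is the %wait column. As a rough sketch, the filter below reads `pidstat -u` output on stdin and lists the processes whose %wait is at or above a threshold. `pidstat_waiters` is a hypothetical helper, the default threshold of 20 is an arbitrary assumption, and the code assumes the header row contains the column names PID, %wait, and Command as in the output above.

```shell
# pidstat_waiters [MIN]: read `pidstat -u` output on stdin and print
# "<pid> <%wait> <command>" for every sampled row whose %wait >= MIN
# (default 20). Average rows are skipped to avoid listing a process twice.
pidstat_waiters() {
  awk -v min="${1:-20}" '
    /^Average:/ { next }
    /%wait/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "PID")     p = i
        if ($i == "%wait")   w = i
        if ($i == "Command") c = i
      }
      next
    }
    c && NF >= c && $w + 0 >= min + 0 { print $p, $w, $c }
  '
}

# Example on a live system:
#   pidstat -u 5 1 | pidstat_waiters 20
```

Run against the samples above, it would list the four stress-ng-cpu workers and leave out watch, whose %wait is 0.00.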

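4. Summary

A high load average may come from CPU-intensive tasks or from I/O-intensive tasks. Use mpstat, pidstat, and htop to pin down the actual cause.

These are my notes based on 《Linux 性能优化实战》 (Linux Performance Optimization in Practice). If they infringe on anyone's rights, please leave a comment to let me know.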
