The server in question has two CPUs with 24 physical cores each and runs Ubuntu 23.10. I noticed that one of my Python scripts slows down dramatically, to about 1/100 of its speed, when it uses threads. Since threading in Python can be problematic, I wrote a simple test program in C to investigate further. This code shares no memory (or anything else) in its compute-intensive part.
Here is the relevant information:
joe@galileo:~$ cat /proc/version
Linux version 6.5.0-28-generic (buildd@lcy02-amd64-001) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024
The CPU is:
root@galileo:~# cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
The test C code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int _memsize;   /* bytes of private buffer per thread */
int _N;         /* number of worker threads */

/* Wall-clock time in seconds with microsecond resolution. */
double time_usec(void) {
    struct timeval tv;
    if (gettimeofday(&tv, NULL) == 0) {
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    } else {
        perror("gettimeofday failed");
        return 0.0;
    }
}

/* Worker: sums random bytes from its own private buffer and prints
   its throughput roughly every two seconds. */
void *test(void *arg) {
    int M = 10000;              /* iterations between clock checks */
    double t0, t1;
    int count;
    int tot;
    unsigned char *mem;
    long int id = (long int)arg;
    int N = _N;
    int memsize = _memsize;

    printf("Threads=%d memsize=%d\n", N, memsize);
    mem = (unsigned char *)malloc(memsize);
    if (mem == NULL) {
        perror("malloc failed");
        pthread_exit(NULL);
    }
    for (int i = 0; i < memsize; i++) mem[i] = random();

    t0 = time_usec();
    count = 0;
    tot = 0;
    while (1) {
        for (int i = 0; i < M; i++) {
            /* tot gives the loop a side effect so it cannot be optimized away */
            tot += mem[rand() % memsize];
            if (tot > 10000000) tot -= 10000000;
            count++;
        }
        t1 = time_usec();
        if ((t1 - t0) > 2.0) {
            double nop_per_sec = (double)count / (t1 - t0);
            t0 = t1;
            printf("%3ld %15.2lf nop/sec\n", id, nop_per_sec / 1000000.0);
            count = 0;
            M = nop_per_sec * 0.1;  /* re-check the clock about 10 times per second */
        }
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    pthread_t thread;
    int rc;
    long t;

    _N = atoi(argv[1]);
    _memsize = atoi(argv[2]);
    /* Thread handles are not kept; the workers never exit. */
    for (t = 0; t < _N; t++) {
        printf("Creating thread %ld\n", t);
        rc = pthread_create(&thread, NULL, test, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    while (1) sleep(1);
    return 0;
}
Performance with a single thread:
joe@galileo:~$ ./perf 1 1000
Creating thread 0
Threads=1 memsize=1000
0 60.61 nop/sec
0 60.97 nop/sec
0 61.05 nop/sec
0 61.05 nop/sec
This shows about 60M operations per second (an arbitrary but consistent unit). With 2 threads:
joe@galileo:~$ ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2 memsize=1000
Threads=2 memsize=1000
0 0.79 nop/sec
1 0.71 nop/sec
0 0.80 nop/sec
1 0.72 nop/sec
0 0.79 nop/sec
1 0.67 nop/sec
That is a slowdown of roughly 100x. CPU load is about 200%, as expected. When I instead run two processes with one thread each, I do not observe the slowdown:
joe@galileo:~$ ./perf 1 1000 & ./perf 1 1000
[1] 4989
Creating thread 0
Threads=1 memsize=1000
Creating thread 0
Threads=1 memsize=1000
0 60.76 nop/sec
0 60.76 nop/sec
0 61.48 nop/sec
0 61.48 nop/sec
0 61.44 nop/sec
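To rule out the scheduler placing both threads on the same physical core, the threads could also be pinned explicitly. I have not done this in the runs shown here; the helper below is only a sketch (pin_to_cpu is my own name, and it relies on glibc's non-portable pthread_setaffinity_np) that each worker could call with its id at the top of test():

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Sketch only: pin the calling thread to one CPU. Which CPU numbers
   map to which physical cores depends on the topology (see lscpu). */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Alternatively the whole process can be pinned from the shell, e.g. taskset -c 0,2 ./perf 2 1000, with core numbers chosen from the lscpu output.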
The scheduler settings are as follows:
root@galileo:~# sysctl -a | grep sched
kernel.sched_autogroup_enabled = 1
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_deadline_period_max_us = 4194304
kernel.sched_deadline_period_min_us = 100
kernel.sched_energy_aware = 1
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_util_clamp_max = 1024
kernel.sched_util_clamp_min = 1024
kernel.sched_util_clamp_min_rt_default = 1024
I experimented with the scheduler settings, with no effect. 'perf stat' reports questionable 'LLC-load-misses' and 'iTLB-load-misses' figures, as shown here:
joe@galileo:~$ sudo perf stat -d -d -d --timeout 10000 ./perf 2 1000
Creating thread 0
Creating thread 1
Threads=2 memsize=1000
Threads=2 memsize=1000
1 1.21 nop/sec
0 0.76 nop/sec
0 0.72 nop/sec
1 1.00 nop/sec
0 2.30 nop/sec
1 1.30 nop/sec
1 1.28 nop/sec
0 2.32 nop/sec
./perf: Terminated
Performance counter stats for './perf 2 1000':
19,875.36 msec task-clock # 1.985 CPUs utilized
24,396 context-switches # 1.227 K/sec
0 cpu-migrations # 0.000 /sec
100 page-faults # 5.031 /sec
62,051,572,900 cycles # 3.122 GHz (38.44%)
12,305,915,977 instructions # 0.20 insn per cycle (46.07%)
1,770,935,151 branches # 89.102 M/sec (46.19%)
20,833,461 branch-misses # 1.18% of all branches (46.14%)
2,886,437,444 L1-dcache-loads # 145.227 M/sec (46.08%)
62,673,587 L1-dcache-load-misses # 2.17% of all L1-dcache accesses (46.16%)
36,380,110 LLC-loads # 1.830 M/sec (30.73%)
36,335,895 LLC-load-misses # 99.88% of all L1-icache accesses (30.80%)
<not supported> L1-icache-loads
14,122,757 L1-icache-load-misses (30.91%)
2,893,359,729 dTLB-loads # 145.575 M/sec (30.77%)
15,591,131 dTLB-load-misses # 0.54% of all dTLB cache accesses (30.84%)
154,787 iTLB-loads # 7.788 K/sec (30.91%)
30,148,243 iTLB-load-misses # 19477.24% of all iTLB cache accesses (30.72%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
10.010814854 seconds time elapsed
11.662038000 seconds user
8.215249000 seconds sys
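In case event multiplexing (the percentages in parentheses) is distorting those two counters, they can also be measured on their own; a command along these lines (event names taken from the output above) should work:

sudo perf stat -e LLC-loads,LLC-load-misses,iTLB-loads,iTLB-load-misses --timeout 10000 ./perf 2 1000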
I find this very odd. I have not seen this kind of slowdown on gcloud or AWS EC2 instances. It seems to defeat the whole purpose of threading.
Does anyone have any insight?