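Dalian AI Computing Platform - Huawei Ascend AI Platform - High-Performance Computing (HPC) - Fixing an Error in the Official Run Configuration File - MPI Launch Configuration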

2023-11-02

The official HPC run configuration file:

#!/usr/bin/env bash
echo ----- print env vars -----

if [ "${CCS_ALLOC_FILE}" != "" ]; then
    echo "   "
    ls -la ${CCS_ALLOC_FILE}
    echo ------ cat ${CCS_ALLOC_FILE}
    cat ${CCS_ALLOC_FILE}
fi

export HOSTFILE=/tmp/hostfile.$$
rm -rf $HOSTFILE
touch $HOSTFILE

# parse CCS_ALLOC_FILE
## node name,  cores, tasks, task_list
#  hpcbuild002 8 1 container_22_default_00001_e01_000002
#  hpctest005 8 1 container_22_default_00000_e01_000001

ntask=`cat ${CCS_ALLOC_FILE} | awk -v fff="$HOSTFILE" '{}
{
    split($0, a, " ")
    if (length(a[1]) >0 && length(a[3]) >0) {
        print a[1]" slots="a[2] >> fff
        total_task+=a[3]
    }
}END{print total_task}'`

echo "openmpi hostfile $HOSTFILE generated:"
echo "-----------------------"
cat $HOSTFILE
echo "-----------------------"
echo "Total tasks is $ntask"
echo "mpirun -hostfile $HOSTFILE -n $ntask <your application>"

 

 

 

 

The background was covered in detail in earlier posts and is not repeated here; see the previous blog posts if needed.

 

 

===========================================

 

 

The command used to launch MPI on the HPC cluster:

/opt/batch/cli/bin/dsub  -n xxxxxxx -A xxxxxxxxxxxx --priority 9999 --job_retry 10 --job_type hmpi -R "cpu=10;mem=128" -N 100  -eo error.txt -oo output.txt    xxxxxxxx.sh

As shown, the command above launches 100 tasks (-N 100), and each task requests 10 CPUs and 128 MB of memory (-R "cpu=10;mem=128").
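
As a quick sanity check on the request, the total CPU count implied by these flags can be worked out as follows. This is a minimal sketch: `parse_resource_spec` is a hypothetical helper written for illustration, not part of dsub, and it assumes the -R string is a semicolon-separated list of key=value pairs as in the command above.

```python
def parse_resource_spec(spec):
    """Parse a string like "cpu=10;mem=128" into a dict of ints (hypothetical helper)."""
    result = {}
    for item in spec.split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        key, value = item.split("=", 1)
        result[key.strip()] = int(value)
    return result


num_tasks = 100                                   # from -N 100
per_task = parse_resource_spec("cpu=10;mem=128")  # from -R "cpu=10;mem=128"
total_cpus = num_tasks * per_task["cpu"]

print(per_task)    # {'cpu': 10, 'mem': 128}
print(total_cpus)  # 1000
```

The 1000 CPUs computed here match the sum of the core column in the allocation file printed by the run.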

 

 

Running with the HPC launch configuration file provided by Huawei and printing the configuration:

----- print env vars -----
   
-rw-rw----+ 1 ccs_agent ccs_agent 4777 Aug 25 08:40 /tmp/.ccscheduler/xxxxxx/mpi/allocFile_container_21302_default_00000_e01_000001
------ cat /tmp/.ccscheduler/xxxxxx/mpi/allocFile_container_21302_default_00000_e01_000001
dlhpcshare-agent-46 40 4 container_21302_default_00000_e01_000001 container_21302_default_00022_e01_000023 container_21302_default_00074_e01_000075 container_21302_default_00082_e01_000083
dlhpcshare-agent-25 10 1 container_21302_default_00039_e01_000040 
dlhpcshare-agent-49 30 3 container_21302_default_00060_e01_000061 container_21302_default_00009_e01_000010 container_21302_default_00054_e01_000055 
dlhpcshare-agent-28 70 7 container_21302_default_00047_e01_000048 container_21302_default_00023_e01_000024 container_21302_default_00094_e01_000095 container_21302_default_00053_e01_000054 container_21302_default_00018_e01_000019 container_21302_default_00079_e01_000080 container_21302_default_00013_e01_000014 
dlhpcshare-agent-21 10 1 container_21302_default_00067_e01_000068 
dlhpcshare-agent-44 20 2 container_21302_default_00099_e01_000100 container_21302_default_00077_e01_000078 
dlhpcshare-agent-61 10 1 container_21302_default_00069_e01_000070 
dlhpcshare-agent-40 10 1 container_21302_default_00033_e01_000034 
dlhpcshare-agent-41 20 2 container_21302_default_00042_e01_000043 container_21302_default_00003_e01_000004 
dlhpcshare-agent-42 30 3 container_21302_default_00034_e01_000035 container_21302_default_00037_e01_000038 container_21302_default_00015_e01_000016 
dlhpcshare-agent-20 80 8 container_21302_default_00011_e01_000012 container_21302_default_00056_e01_000057 container_21302_default_00093_e01_000094 container_21302_default_00002_e01_000003 container_21302_default_00076_e01_000077 container_21302_default_00087_e01_000088 container_21302_default_00092_e01_000093 container_21302_default_00031_e01_000032 
dlhpcshare-agent-8 60 6 container_21302_default_00050_e01_000051 container_21302_default_00058_e01_000059 container_21302_default_00030_e01_000031 container_21302_default_00055_e01_000056 container_21302_default_00012_e01_000013 container_21302_default_00014_e01_000015 
dlhpcshare-agent-6 10 1 container_21302_default_00044_e01_000045 
dlhpcshare-agent-2 20 2 container_21302_default_00081_e01_000082 container_21302_default_00064_e01_000065 
dlhpcshare-agent-14 60 6 container_21302_default_00084_e01_000085 container_21302_default_00086_e01_000087 container_21302_default_00043_e01_000044 container_21302_default_00071_e01_000072 container_21302_default_00098_e01_000099 container_21302_default_00052_e01_000053 
dlhpcshare-agent-36 100 10 container_21302_default_00095_e01_000096 container_21302_default_00038_e01_000039 container_21302_default_00061_e01_000062 container_21302_default_00091_e01_000092 container_21302_default_00090_e01_000091 container_21302_default_00001_e01_000002 container_21302_default_00078_e01_000079 container_21302_default_00085_e01_000086 container_21302_default_00066_e01_000067 container_21302_default_00007_e01_000008 
dlhpcshare-agent-59 10 1 container_21302_default_00065_e01_000066 
dlhpcshare-agent-15 40 4 container_21302_default_00051_e01_000052 container_21302_default_00072_e01_000073 container_21302_default_00073_e01_000074 container_21302_default_00068_e01_000069 
dlhpcshare-agent-39 10 1 container_21302_default_00080_e01_000081 
dlhpcshare-agent-17 40 4 container_21302_default_00057_e01_000058 container_21302_default_00070_e01_000071 container_21302_default_00075_e01_000076 container_21302_default_00028_e01_000029 
dlhpcshare-agent-54 30 3 container_21302_default_00083_e01_000084 container_21302_default_00010_e01_000011 container_21302_default_00059_e01_000060 
dlhpcshare-agent-12 50 5 container_21302_default_00097_e01_000098 container_21302_default_00096_e01_000097 container_21302_default_00048_e01_000049 container_21302_default_00063_e01_000064 container_21302_default_00089_e01_000090 
dlhpcshare-agent-34 120 12 container_21302_default_00027_e01_000028 container_21302_default_00016_e01_000017 container_21302_default_00032_e01_000033 container_21302_default_00036_e01_000037 container_21302_default_00020_e01_000021 container_21302_default_00029_e01_000030 container_21302_default_00019_e01_000020 container_21302_default_00040_e01_000041 container_21302_default_00024_e01_000025 container_21302_default_00004_e01_000005 container_21302_default_00017_e01_000018 container_21302_default_00045_e01_000046 
dlhpcshare-agent-57 30 3 container_21302_default_00006_e01_000007 container_21302_default_00049_e01_000050 container_21302_default_00088_e01_000089 
dlhpcshare-agent-13 30 3 container_21302_default_00005_e01_000006 container_21302_default_00041_e01_000042 container_21302_default_00035_e01_000036 
dlhpcshare-agent-53 60 6 container_21302_default_00021_e01_000022 container_21302_default_00062_e01_000063 container_21302_default_00046_e01_000047 container_21302_default_00008_e01_000009 container_21302_default_00025_e01_000026 container_21302_default_00026_e01_000027 

openmpi hostfile /tmp/hostfile.1217297 generated:
-----------------------
dlhpcshare-agent-46 slots=40
dlhpcshare-agent-25 slots=10
dlhpcshare-agent-49 slots=30
dlhpcshare-agent-28 slots=70
dlhpcshare-agent-21 slots=10
dlhpcshare-agent-44 slots=20
dlhpcshare-agent-61 slots=10
dlhpcshare-agent-40 slots=10
dlhpcshare-agent-41 slots=20
dlhpcshare-agent-42 slots=30
dlhpcshare-agent-20 slots=80
dlhpcshare-agent-8 slots=60
dlhpcshare-agent-6 slots=10
dlhpcshare-agent-2 slots=20
dlhpcshare-agent-14 slots=60
dlhpcshare-agent-36 slots=100
dlhpcshare-agent-59 slots=10
dlhpcshare-agent-15 slots=40
dlhpcshare-agent-39 slots=10
dlhpcshare-agent-17 slots=40
dlhpcshare-agent-54 slots=30
dlhpcshare-agent-12 slots=50
dlhpcshare-agent-34 slots=120
dlhpcshare-agent-57 slots=30
dlhpcshare-agent-13 slots=30
dlhpcshare-agent-53 slots=60
-----------------------
Total tasks is 100
mpirun -hostfile /tmp/hostfile.1217297 -n 100 <your application>

 

 

The MPI test code run on the HPC cluster:

hello.py

import os
import mpi4py.MPI as MPI
import sys
import numpy as np
import time



def func1(queue, num):
    # time.sleep(num)
    # time.sleep(180)

    x = np.random.rand(100)
    for _ in range(2000000):
        x += np.random.rand(100)
    num += np.sum(x)


    queue.put(num)


def run_queue():
    from multiprocessing import Process, Queue

    ps = 10  # number of worker processes started by each MPI rank

    queue = Queue(maxsize=200)  # shared queue that collects one result per worker

    process = [Process(target=func1, args=(queue, num)) for num in range(ps)]
    for p in process:
        p.start()
    for p in process:
        p.join()
    return sum(queue.get() for _ in process)

 
comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
node_name = MPI.Get_processor_name()
 
# point to point communication
data_send = [comm_rank]*1

comm.send(data_send,dest=(comm_rank+1)%comm_size)

res = run_queue() ###

data_recv =comm.recv(source=(comm_rank-1)%comm_size)

 

Results:

my rank is 0/100, and node_name: dlhpcshare-agent-46 Ireceived:[99]1000016705.767213
my rank is 1/100, and node_name: dlhpcshare-agent-46 Ireceived:[0]999965729.3735983
my rank is 2/100, and node_name: dlhpcshare-agent-46 Ireceived:[1]1000055354.1766404
my rank is 3/100, and node_name: dlhpcshare-agent-46 Ireceived:[2]999938702.6090965
my rank is 4/100, and node_name: dlhpcshare-agent-46 Ireceived:[3]999932647.7499591
my rank is 5/100, and node_name: dlhpcshare-agent-46 Ireceived:[4]999970098.8907137
my rank is 6/100, and node_name: dlhpcshare-agent-46 Ireceived:[5]1000035017.3560394
my rank is 7/100, and node_name: dlhpcshare-agent-46 Ireceived:[6]999983266.860093
my rank is 8/100, and node_name: dlhpcshare-agent-46 Ireceived:[7]1000019754.1181364
my rank is 9/100, and node_name: dlhpcshare-agent-46 Ireceived:[8]1000028583.4510313
my rank is 10/100, and node_name: dlhpcshare-agent-46 Ireceived:[9]999948045.2254038
my rank is 11/100, and node_name: dlhpcshare-agent-46 Ireceived:[10]1000000535.3682672
my rank is 12/100, and node_name: dlhpcshare-agent-46 Ireceived:[11]1000060573.6321007
my rank is 13/100, and node_name: dlhpcshare-agent-46 Ireceived:[12]999989400.5199456
my rank is 14/100, and node_name: dlhpcshare-agent-46 Ireceived:[13]1000004276.757213
my rank is 15/100, and node_name: dlhpcshare-agent-46 Ireceived:[14]1000036378.0970109
my rank is 16/100, and node_name: dlhpcshare-agent-46 Ireceived:[15]999941671.5626221
my rank is 17/100, and node_name: dlhpcshare-agent-46 Ireceived:[16]1000016780.6566744
my rank is 18/100, and node_name: dlhpcshare-agent-46 Ireceived:[17]999976054.5543907
my rank is 19/100, and node_name: dlhpcshare-agent-46 Ireceived:[18]999981141.0937772
my rank is 20/100, and node_name: dlhpcshare-agent-46 Ireceived:[19]1000015375.1100855
my rank is 21/100, and node_name: dlhpcshare-agent-46 Ireceived:[20]1000073895.3809344
my rank is 22/100, and node_name: dlhpcshare-agent-46 Ireceived:[21]999949026.6361238
my rank is 23/100, and node_name: dlhpcshare-agent-46 Ireceived:[22]999966888.9603195
my rank is 24/100, and node_name: dlhpcshare-agent-46 Ireceived:[23]999972898.4622048
my rank is 25/100, and node_name: dlhpcshare-agent-46 Ireceived:[24]999949881.6119624
my rank is 26/100, and node_name: dlhpcshare-agent-46 Ireceived:[25]999952754.392655
my rank is 27/100, and node_name: dlhpcshare-agent-46 Ireceived:[26]1000003606.3229789
my rank is 28/100, and node_name: dlhpcshare-agent-46 Ireceived:[27]999978698.6484457
my rank is 29/100, and node_name: dlhpcshare-agent-46 Ireceived:[28]1000023654.3494583
my rank is 30/100, and node_name: dlhpcshare-agent-46 Ireceived:[29]999979493.4737692
my rank is 31/100, and node_name: dlhpcshare-agent-46 Ireceived:[30]999952955.1322489
my rank is 32/100, and node_name: dlhpcshare-agent-46 Ireceived:[31]1000023147.510355
my rank is 33/100, and node_name: dlhpcshare-agent-46 Ireceived:[32]999988621.8661311
my rank is 34/100, and node_name: dlhpcshare-agent-46 Ireceived:[33]1000032886.1024017
my rank is 35/100, and node_name: dlhpcshare-agent-46 Ireceived:[34]1000080786.645305
my rank is 36/100, and node_name: dlhpcshare-agent-46 Ireceived:[35]999985738.2168791
my rank is 37/100, and node_name: dlhpcshare-agent-46 Ireceived:[36]1000035874.901148
my rank is 38/100, and node_name: dlhpcshare-agent-46 Ireceived:[37]999986126.0029303
my rank is 39/100, and node_name: dlhpcshare-agent-46 Ireceived:[38]1000005594.235587
my rank is 40/100, and node_name: dlhpcshare-agent-25 Ireceived:[39]999994753.8205698
my rank is 41/100, and node_name: dlhpcshare-agent-25 Ireceived:[40]999999613.1839917
my rank is 42/100, and node_name: dlhpcshare-agent-25 Ireceived:[41]1000032886.4796938
my rank is 43/100, and node_name: dlhpcshare-agent-25 Ireceived:[42]1000015299.8141469
my rank is 44/100, and node_name: dlhpcshare-agent-25 Ireceived:[43]999978839.5215104
my rank is 45/100, and node_name: dlhpcshare-agent-25 Ireceived:[44]1000056945.34637
my rank is 46/100, and node_name: dlhpcshare-agent-25 Ireceived:[45]999987546.3340306
my rank is 47/100, and node_name: dlhpcshare-agent-25 Ireceived:[46]999977891.1204312
my rank is 48/100, and node_name: dlhpcshare-agent-25 Ireceived:[47]999964171.7260697
my rank is 49/100, and node_name: dlhpcshare-agent-25 Ireceived:[48]1000002815.8499967
my rank is 50/100, and node_name: dlhpcshare-agent-49 Ireceived:[49]1000088232.1541688
my rank is 51/100, and node_name: dlhpcshare-agent-49 Ireceived:[50]999922549.1663526
my rank is 52/100, and node_name: dlhpcshare-agent-49 Ireceived:[51]1000036873.3701472
my rank is 53/100, and node_name: dlhpcshare-agent-49 Ireceived:[52]999946075.0287105
my rank is 54/100, and node_name: dlhpcshare-agent-49 Ireceived:[53]1000071587.8812076
my rank is 55/100, and node_name: dlhpcshare-agent-49 Ireceived:[54]999950657.5166117
my rank is 56/100, and node_name: dlhpcshare-agent-49 Ireceived:[55]999933878.6885785
my rank is 57/100, and node_name: dlhpcshare-agent-49 Ireceived:[56]1000066159.5095633
my rank is 58/100, and node_name: dlhpcshare-agent-49 Ireceived:[57]999963883.8407422
my rank is 59/100, and node_name: dlhpcshare-agent-49 Ireceived:[58]1000010886.356683
my rank is 60/100, and node_name: dlhpcshare-agent-49 Ireceived:[59]999998936.3161815
my rank is 61/100, and node_name: dlhpcshare-agent-49 Ireceived:[60]1000032457.9024019
my rank is 62/100, and node_name: dlhpcshare-agent-49 Ireceived:[61]1000008817.6049147
my rank is 63/100, and node_name: dlhpcshare-agent-49 Ireceived:[62]999916332.1086562
my rank is 64/100, and node_name: dlhpcshare-agent-49 Ireceived:[63]1000028239.993777
my rank is 65/100, and node_name: dlhpcshare-agent-49 Ireceived:[64]1000025534.4065487
my rank is 66/100, and node_name: dlhpcshare-agent-49 Ireceived:[65]999965383.8654947
my rank is 67/100, and node_name: dlhpcshare-agent-49 Ireceived:[66]999901860.5872655
my rank is 68/100, and node_name: dlhpcshare-agent-49 Ireceived:[67]1000061659.5934151
my rank is 69/100, and node_name: dlhpcshare-agent-49 Ireceived:[68]1000018931.4276459
my rank is 70/100, and node_name: dlhpcshare-agent-49 Ireceived:[69]1000066546.8614273
my rank is 71/100, and node_name: dlhpcshare-agent-49 Ireceived:[70]999979458.4319509
my rank is 72/100, and node_name: dlhpcshare-agent-49 Ireceived:[71]1000010398.3486695
my rank is 73/100, and node_name: dlhpcshare-agent-49 Ireceived:[72]999996877.6207495
my rank is 74/100, and node_name: dlhpcshare-agent-49 Ireceived:[73]999963902.0169674
my rank is 75/100, and node_name: dlhpcshare-agent-49 Ireceived:[74]999949408.9980878
my rank is 76/100, and node_name: dlhpcshare-agent-49 Ireceived:[75]1000033206.5989758
my rank is 77/100, and node_name: dlhpcshare-agent-49 Ireceived:[76]999980319.9703398
my rank is 78/100, and node_name: dlhpcshare-agent-49 Ireceived:[77]1000096159.5354401
my rank is 79/100, and node_name: dlhpcshare-agent-49 Ireceived:[78]1000012855.660957
my rank is 80/100, and node_name: dlhpcshare-agent-28 Ireceived:[79]999945256.5891613
my rank is 81/100, and node_name: dlhpcshare-agent-28 Ireceived:[80]999936662.6228058
my rank is 82/100, and node_name: dlhpcshare-agent-28 Ireceived:[81]1000047847.3015394
my rank is 83/100, and node_name: dlhpcshare-agent-28 Ireceived:[82]1000009892.1034278
my rank is 84/100, and node_name: dlhpcshare-agent-28 Ireceived:[83]1000017076.6121353
my rank is 85/100, and node_name: dlhpcshare-agent-28 Ireceived:[84]1000018412.7315669
my rank is 86/100, and node_name: dlhpcshare-agent-28 Ireceived:[85]999982459.9707924
my rank is 87/100, and node_name: dlhpcshare-agent-28 Ireceived:[86]1000033177.8524352
my rank is 88/100, and node_name: dlhpcshare-agent-28 Ireceived:[87]999969004.9046781
my rank is 89/100, and node_name: dlhpcshare-agent-28 Ireceived:[88]999985019.450158
my rank is 90/100, and node_name: dlhpcshare-agent-28 Ireceived:[89]1000057133.9164723
my rank is 91/100, and node_name: dlhpcshare-agent-28 Ireceived:[90]1000076183.055097
my rank is 92/100, and node_name: dlhpcshare-agent-28 Ireceived:[91]1000001585.1896353
my rank is 93/100, and node_name: dlhpcshare-agent-28 Ireceived:[92]999997285.7032727
my rank is 94/100, and node_name: dlhpcshare-agent-28 Ireceived:[93]999958897.370945
my rank is 95/100, and node_name: dlhpcshare-agent-28 Ireceived:[94]999987025.9069097
my rank is 96/100, and node_name: dlhpcshare-agent-28 Ireceived:[95]1000040273.2979946
my rank is 97/100, and node_name: dlhpcshare-agent-28 Ireceived:[96]1000019695.0531096
my rank is 98/100, and node_name: dlhpcshare-agent-28 Ireceived:[97]1000010986.5963597
my rank is 99/100, and node_name: dlhpcshare-agent-28 Ireceived:[98]1000039139.5256413

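Tallying the hosts that actually ran ranks in the results above: dlhpcshare-agent-46 ran 40 ranks (0 to 39), dlhpcshare-agent-25 ran 10 (40 to 49), dlhpcshare-agent-49 ran 30 (50 to 79), and dlhpcshare-agent-28 ran 20 (80 to 99). Only 4 hosts were used.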
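
The per-host tally can be reproduced from the result lines with a few lines of Python. This is a sketch that assumes each line has the "my rank is R/N, and node_name: HOST Ireceived:..." shape shown above; the three sample lines are copied from the results, and `count_hosts` is a name introduced here for illustration.

```python
import re
from collections import Counter

# Matches the host name that follows "node_name:" in each result line.
NODE_RE = re.compile(r"node_name:\s*(\S+)")


def count_hosts(lines):
    """Return a Counter mapping host name -> number of ranks that ran on it."""
    counts = Counter()
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "my rank is 0/100, and node_name: dlhpcshare-agent-46 Ireceived:[99]1000016705.767213",
    "my rank is 1/100, and node_name: dlhpcshare-agent-46 Ireceived:[0]999965729.3735983",
    "my rank is 40/100, and node_name: dlhpcshare-agent-25 Ireceived:[39]999994753.8205698",
]
for host, n in count_hosts(sample).items():
    print(host, n)
```

Feeding it all 100 result lines instead of the three samples gives one count per host, which makes it easy to see how many hosts a run really touched.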

 

 

----------------------------------------------------------

 

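This result shows that the HPC scheduler allocates resources by container, with the containers making up the task_list. Our intent was to launch 100 tasks with 10 CPUs each, and the scheduler allocated exactly that. The error is in the MPI launch configuration file: it writes the core count (column 2 of the allocation file) into the slots field of the hostfile, so mpirun treats every allocated core as a slot for one rank. The job therefore starts 100 tasks with 1 CPU each, packed onto the first hosts in the hostfile. The scheduler allocated 26 hosts and 1000 CPUs, yet the run used only 4 hosts and 100 CPUs. The performance lost to this error is very large.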
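
Why only four hosts are used can be illustrated with a small simulation. This is a sketch under a stated assumption: ranks are placed by filling each host's slots in hostfile order before moving to the next host. That is consistent with the results above, but it is a simplification of mpirun's actual mapping policy. `fill_by_slot` is a hypothetical helper, and the entries are the first five lines of the generated hostfile.

```python
def fill_by_slot(hostfile, num_ranks):
    """Assign ranks to hosts by filling each host's slots in order.

    hostfile: list of (host, slots) tuples in hostfile order.
    Returns a list of (host, ranks_placed) for hosts that received ranks.
    """
    placement = []
    remaining = num_ranks
    for host, slots in hostfile:
        if remaining <= 0:
            break
        placed = min(slots, remaining)
        placement.append((host, placed))
        remaining -= placed
    return placement


# First entries of the hostfile generated by the official script (slots = cores).
hostfile = [
    ("dlhpcshare-agent-46", 40),
    ("dlhpcshare-agent-25", 10),
    ("dlhpcshare-agent-49", 30),
    ("dlhpcshare-agent-28", 70),
    ("dlhpcshare-agent-21", 10),
]

for host, ranks in fill_by_slot(hostfile, 100):
    print(host, ranks)
# dlhpcshare-agent-46 40
# dlhpcshare-agent-25 10
# dlhpcshare-agent-49 30
# dlhpcshare-agent-28 20
```

The 100 ranks are exhausted partway through the fourth host, so the remaining 22 hosts in the allocation never receive a rank.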

 

 

----------------------------------------------------------

 

Modify the MPI code being run:

Add os.cpu_count() to the output.

 

Results:

my rank is 0/100, and node_name: dlhpcshare-agent-53 Ireceived:[99] 128 1000106515.606065
my rank is 1/100, and node_name: dlhpcshare-agent-53 Ireceived:[0] 128 999939847.1516143
my rank is 2/100, and node_name: dlhpcshare-agent-53 Ireceived:[1] 128 999893013.0117773
my rank is 3/100, and node_name: dlhpcshare-agent-53 Ireceived:[2] 128 1000006650.8617853
my rank is 4/100, and node_name: dlhpcshare-agent-53 Ireceived:[3] 128 1000014617.9322758
my rank is 5/100, and node_name: dlhpcshare-agent-53 Ireceived:[4] 128 1000035850.8999186
my rank is 6/100, and node_name: dlhpcshare-agent-53 Ireceived:[5] 128 1000056091.3767813
my rank is 7/100, and node_name: dlhpcshare-agent-53 Ireceived:[6] 128 1000092650.9862612
my rank is 8/100, and node_name: dlhpcshare-agent-53 Ireceived:[7] 128 999992376.4017116
my rank is 9/100, and node_name: dlhpcshare-agent-53 Ireceived:[8] 128 999990154.7197974
my rank is 10/100, and node_name: dlhpcshare-agent-53 Ireceived:[9] 128 999945251.9710757
my rank is 11/100, and node_name: dlhpcshare-agent-53 Ireceived:[10] 128 1000018748.3208461
my rank is 12/100, and node_name: dlhpcshare-agent-53 Ireceived:[11] 128 1000014784.1006019
my rank is 13/100, and node_name: dlhpcshare-agent-53 Ireceived:[12] 128 1000018181.9926168
my rank is 14/100, and node_name: dlhpcshare-agent-53 Ireceived:[13] 128 999958054.3204675
my rank is 15/100, and node_name: dlhpcshare-agent-53 Ireceived:[14] 128 999924067.4712228
my rank is 16/100, and node_name: dlhpcshare-agent-53 Ireceived:[15] 128 999988603.3355628
my rank is 17/100, and node_name: dlhpcshare-agent-53 Ireceived:[16] 128 1000012689.8649071
my rank is 18/100, and node_name: dlhpcshare-agent-53 Ireceived:[17] 128 999984526.9629173
my rank is 19/100, and node_name: dlhpcshare-agent-53 Ireceived:[18] 128 1000021401.943769
my rank is 20/100, and node_name: dlhpcshare-agent-53 Ireceived:[19] 128 999929078.5520971
my rank is 21/100, and node_name: dlhpcshare-agent-53 Ireceived:[20] 128 999952623.3748196
my rank is 22/100, and node_name: dlhpcshare-agent-53 Ireceived:[21] 128 999959034.3172395
my rank is 23/100, and node_name: dlhpcshare-agent-53 Ireceived:[22] 128 999987730.0943791
my rank is 24/100, and node_name: dlhpcshare-agent-53 Ireceived:[23] 128 999965389.0376471
my rank is 25/100, and node_name: dlhpcshare-agent-53 Ireceived:[24] 128 1000037427.562461
my rank is 26/100, and node_name: dlhpcshare-agent-53 Ireceived:[25] 128 1000018241.0460006
my rank is 27/100, and node_name: dlhpcshare-agent-53 Ireceived:[26] 128 999971382.2166605
my rank is 28/100, and node_name: dlhpcshare-agent-53 Ireceived:[27] 128 999972307.7894782
my rank is 29/100, and node_name: dlhpcshare-agent-53 Ireceived:[28] 128 999946836.4048017
my rank is 30/100, and node_name: dlhpcshare-agent-53 Ireceived:[29] 128 999964303.4481132
my rank is 31/100, and node_name: dlhpcshare-agent-53 Ireceived:[30] 128 999984879.3717456
my rank is 32/100, and node_name: dlhpcshare-agent-53 Ireceived:[31] 128 999965798.9835973
my rank is 33/100, and node_name: dlhpcshare-agent-53 Ireceived:[32] 128 1000106757.4002326
my rank is 34/100, and node_name: dlhpcshare-agent-53 Ireceived:[33] 128 999995905.8208525
my rank is 35/100, and node_name: dlhpcshare-agent-53 Ireceived:[34] 128 999939982.0805119
my rank is 36/100, and node_name: dlhpcshare-agent-53 Ireceived:[35] 128 1000040325.6754566
my rank is 37/100, and node_name: dlhpcshare-agent-53 Ireceived:[36] 128 999974755.9749557
my rank is 38/100, and node_name: dlhpcshare-agent-53 Ireceived:[37] 128 999990948.2711018
my rank is 39/100, and node_name: dlhpcshare-agent-53 Ireceived:[38] 128 1000005991.5601768
my rank is 40/100, and node_name: dlhpcshare-agent-53 Ireceived:[39] 128 1000026937.8002824
my rank is 41/100, and node_name: dlhpcshare-agent-53 Ireceived:[40] 128 1000013302.4165831
my rank is 42/100, and node_name: dlhpcshare-agent-53 Ireceived:[41] 128 1000048050.6615318
my rank is 43/100, and node_name: dlhpcshare-agent-53 Ireceived:[42] 128 1000050569.9395372
my rank is 44/100, and node_name: dlhpcshare-agent-53 Ireceived:[43] 128 999994676.4451874
my rank is 45/100, and node_name: dlhpcshare-agent-53 Ireceived:[44] 128 999965814.8078717
my rank is 46/100, and node_name: dlhpcshare-agent-53 Ireceived:[45] 128 1000063779.5651985
my rank is 47/100, and node_name: dlhpcshare-agent-53 Ireceived:[46] 128 999912822.4157392
my rank is 48/100, and node_name: dlhpcshare-agent-53 Ireceived:[47] 128 999928593.3258204
my rank is 49/100, and node_name: dlhpcshare-agent-53 Ireceived:[48] 128 1000079311.6429831
my rank is 50/100, and node_name: dlhpcshare-agent-53 Ireceived:[49] 128 999969785.781623
my rank is 51/100, and node_name: dlhpcshare-agent-53 Ireceived:[50] 128 999976691.4177382
my rank is 52/100, and node_name: dlhpcshare-agent-53 Ireceived:[51] 128 999981874.8187928
my rank is 53/100, and node_name: dlhpcshare-agent-53 Ireceived:[52] 128 999978211.3062496
my rank is 54/100, and node_name: dlhpcshare-agent-53 Ireceived:[53] 128 1000030697.1055744
my rank is 55/100, and node_name: dlhpcshare-agent-53 Ireceived:[54] 128 999949893.7772596
my rank is 56/100, and node_name: dlhpcshare-agent-53 Ireceived:[55] 128 1000048383.3265806
my rank is 57/100, and node_name: dlhpcshare-agent-53 Ireceived:[56] 128 999981985.1777877
my rank is 58/100, and node_name: dlhpcshare-agent-53 Ireceived:[57] 128 1000038613.5794754
my rank is 59/100, and node_name: dlhpcshare-agent-53 Ireceived:[58] 128 1000025518.5350524
my rank is 60/100, and node_name: dlhpcshare-agent-25 Ireceived:[59] 128 999977504.5852437
my rank is 61/100, and node_name: dlhpcshare-agent-25 Ireceived:[60] 128 1000064099.5901086
my rank is 62/100, and node_name: dlhpcshare-agent-25 Ireceived:[61] 128 999958375.7325293
my rank is 63/100, and node_name: dlhpcshare-agent-25 Ireceived:[62] 128 1000032327.017775
my rank is 64/100, and node_name: dlhpcshare-agent-25 Ireceived:[63] 128 999973698.7042892
my rank is 65/100, and node_name: dlhpcshare-agent-25 Ireceived:[64] 128 1000005174.9065778
my rank is 66/100, and node_name: dlhpcshare-agent-25 Ireceived:[65] 128 1000014848.157102
my rank is 67/100, and node_name: dlhpcshare-agent-25 Ireceived:[66] 128 1000007693.8751913
my rank is 68/100, and node_name: dlhpcshare-agent-25 Ireceived:[67] 128 1000007725.2601458
my rank is 69/100, and node_name: dlhpcshare-agent-25 Ireceived:[68] 128 999985807.2807001
my rank is 70/100, and node_name: dlhpcshare-agent-49 Ireceived:[69] 128 999996348.5611703
my rank is 71/100, and node_name: dlhpcshare-agent-49 Ireceived:[70] 128 1000019451.4848017
my rank is 72/100, and node_name: dlhpcshare-agent-49 Ireceived:[71] 128 999930652.0451655
my rank is 73/100, and node_name: dlhpcshare-agent-49 Ireceived:[72] 128 1000056083.7573292
my rank is 74/100, and node_name: dlhpcshare-agent-49 Ireceived:[73] 128 999974813.2241176
my rank is 75/100, and node_name: dlhpcshare-agent-49 Ireceived:[74] 128 1000010925.1539313
my rank is 76/100, and node_name: dlhpcshare-agent-49 Ireceived:[75] 128 1000053204.7815492
my rank is 77/100, and node_name: dlhpcshare-agent-49 Ireceived:[76] 128 1000002623.7976538
my rank is 78/100, and node_name: dlhpcshare-agent-49 Ireceived:[77] 128 999995721.4458407
my rank is 79/100, and node_name: dlhpcshare-agent-49 Ireceived:[78] 128 1000016805.8214403
my rank is 80/100, and node_name: dlhpcshare-agent-28 Ireceived:[79] 128 999993741.0873865
my rank is 81/100, and node_name: dlhpcshare-agent-28 Ireceived:[80] 128 999985355.8043406
my rank is 82/100, and node_name: dlhpcshare-agent-28 Ireceived:[81] 128 1000020033.2797922
my rank is 83/100, and node_name: dlhpcshare-agent-28 Ireceived:[82] 128 1000025818.03303
my rank is 84/100, and node_name: dlhpcshare-agent-28 Ireceived:[83] 128 999994803.3162259
my rank is 85/100, and node_name: dlhpcshare-agent-28 Ireceived:[84] 128 999949729.7594
my rank is 86/100, and node_name: dlhpcshare-agent-28 Ireceived:[85] 128 1000026843.4581102
my rank is 87/100, and node_name: dlhpcshare-agent-28 Ireceived:[86] 128 1000021547.2364391
my rank is 88/100, and node_name: dlhpcshare-agent-28 Ireceived:[87] 128 999972651.1080276
my rank is 89/100, and node_name: dlhpcshare-agent-28 Ireceived:[88] 128 999978283.9195337
my rank is 90/100, and node_name: dlhpcshare-agent-28 Ireceived:[89] 128 1000060266.1070893
my rank is 91/100, and node_name: dlhpcshare-agent-28 Ireceived:[90] 128 1000058939.073361
my rank is 92/100, and node_name: dlhpcshare-agent-28 Ireceived:[91] 128 1000003214.8985409
my rank is 93/100, and node_name: dlhpcshare-agent-28 Ireceived:[92] 128 999972223.4514489
my rank is 94/100, and node_name: dlhpcshare-agent-28 Ireceived:[93] 128 999999667.1805131
my rank is 95/100, and node_name: dlhpcshare-agent-28 Ireceived:[94] 128 999999557.5432729
my rank is 96/100, and node_name: dlhpcshare-agent-28 Ireceived:[95] 128 1000041692.9375429
my rank is 97/100, and node_name: dlhpcshare-agent-28 Ireceived:[96] 128 1000001575.5039369
my rank is 98/100, and node_name: dlhpcshare-agent-28 Ireceived:[97] 128 999997981.0942448
my rank is 99/100, and node_name: dlhpcshare-agent-28 Ireceived:[98] 128 1000024796.3548242

 

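As the output shows, every process sees all 128 CPUs of the host it runs on. This further supports the analysis in the earlier posts: the container partitioning in Huawei's HPC scheduler does not actually partition the CPUs. Every process running on a host can see and use all of that host's CPUs, 128 in this case. The amount requested from the HPC scheduler is used only for billing and monitoring, and it can be bypassed entirely.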
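
One caveat when reproducing this check: `os.cpu_count()` reports the logical CPUs of the whole host, not the CPUs the current process is allowed to run on. A complementary check, sketched below, compares it with the size of the process's affinity mask from `os.sched_getaffinity`, which is available only on some platforms such as Linux (hence the `hasattr` guard). `usable_cpu_report` is a name made up for this example, and neither value reflects cgroup CPU quotas.

```python
import os


def usable_cpu_report():
    """Return (logical CPUs on the host, CPUs this process may run on)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Size of the affinity mask of the current process (pid 0 means "self").
        usable = len(os.sched_getaffinity(0))
    else:
        # Platform without sched_getaffinity: fall back to the host-wide count.
        usable = total
    return total, usable


total, usable = usable_cpu_report()
print("os.cpu_count():", total)
print("CPUs in affinity mask:", usable)
```

If both numbers come back equal inside a job, the process is not pinned to a subset of the host's cores, which would strengthen the conclusion drawn from `os.cpu_count()` alone.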

 

With this analysis in hand, we can modify the launch configuration provided by Huawei:

Change this snippet:

ntask=`cat ${CCS_ALLOC_FILE} | awk -v fff="$HOSTFILE" '{}
{
    split($0, a, " ")
    if (length(a[1]) >0 && length(a[3]) >0) {
        print a[1]" slots="a[2] >> fff
        total_task+=a[3]
    }
}END{print total_task}'`

to the following, which writes the task count a[3] instead of the core count a[2] into slots:

ntask=`cat ${CCS_ALLOC_FILE} | awk -v fff="$HOSTFILE" '{}
{
    split($0, a, " ")
    if (length(a[1]) >0 && length(a[3]) >0) {
        print a[1]" slots="a[3] >> fff
        total_task+=a[3]
    }
}END{print total_task}'`
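
For readers less familiar with awk, the corrected logic can be sketched in Python. This is an illustrative equivalent, not the platform's script: `build_hostfile` is a hypothetical function, and the sample allocation lines are shortened versions of the ones in the log above (the container list is cut to one entry per line, since only the first three columns are used).

```python
def build_hostfile(alloc_lines):
    """Turn allocation lines into hostfile lines with slots = task count.

    Each allocation line is: <node> <cores> <tasks> <container ...>.
    Returns (hostfile_lines, total_tasks).
    """
    hostfile_lines = []
    total_tasks = 0
    for line in alloc_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        node, tasks = fields[0], int(fields[2])
        # Use the task count (column 3), not the core count (column 2).
        hostfile_lines.append(f"{node} slots={tasks}")
        total_tasks += tasks
    return hostfile_lines, total_tasks


alloc = [
    "dlhpcshare-agent-46 40 4 container_21302_default_00000_e01_000001",
    "dlhpcshare-agent-25 10 1 container_21302_default_00039_e01_000040",
    "dlhpcshare-agent-49 30 3 container_21302_default_00060_e01_000061",
]
lines, ntask = build_hostfile(alloc)
print("\n".join(lines))
print("Total tasks is", ntask)
```

With slots equal to the task count, the slots across all hosts add up to the number of ranks passed to mpirun, so every allocated host receives its share of ranks.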

 

 

 

============================================

 

 

After the change, verify it:

Code:

import os
import mpi4py.MPI as MPI
import sys
import numpy as np
import time



def func1(queue, num):
    # time.sleep(num)
    # time.sleep(180)

    x = np.random.rand(100)
    for _ in range(2000000):
        x += np.random.rand(100)
    num += np.sum(x)


    queue.put(num)


def run_queue():
    from multiprocessing import Process, Queue

    ps = 10  # number of worker processes started by each MPI rank

    queue = Queue(maxsize=200)  # shared queue that collects one result per worker

    process = [Process(target=func1, args=(queue, num)) for num in range(ps)]
    for p in process:
        p.start()
    for p in process:
        p.join()
    return sum(queue.get() for _ in process)

 
comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()
node_name = MPI.Get_processor_name()
 
# point to point communication
data_send = [comm_rank]*1

comm.send(data_send,dest=(comm_rank+1)%comm_size)

res = run_queue() ###

data_recv =comm.recv(source=(comm_rank-1)%comm_size)

# print("my rank is %d, and Ireceived:" % comm_rank, data_recv, file=sys.stdout, flush=True)
# print(data_recv)

with open("/home/share/xxxxxxxxxx/home/xxxxxx/xxxxxx/results/{}.txt".format(comm_rank, ), "w") as f:
    f.write("my rank is %d/%d, and node_name: %s Ireceived:" % (comm_rank, comm_size, node_name) + str(data_recv)+" "+str(os.cpu_count()) +" "+ str(res) + "\n" )

 

Configuration log from the run:

----- print env vars -----
   
-rw-rw----+ 1 ccs_agent ccs_agent 4777 Aug 25 10:49 /tmp/.ccscheduler/xxxxxx/mpi/allocFile_container_21305_default_00000_e01_000001
------ cat /tmp/.ccscheduler/xxxxxx/mpi/allocFile_container_21305_default_00000_e01_000001
dlhpcshare-agent-28 40 4 container_21305_default_00000_e01_000001 container_21305_default_00046_e01_000047 container_21305_default_00092_e01_000093 container_21305_default_00033_e01_000034
dlhpcshare-agent-25 40 4 container_21305_default_00086_e01_000087 container_21305_default_00009_e01_000010 container_21305_default_00002_e01_000003 container_21305_default_00070_e01_000071 
dlhpcshare-agent-49 20 2 container_21305_default_00020_e01_000021 container_21305_default_00061_e01_000062 
dlhpcshare-agent-21 10 1 container_21305_default_00051_e01_000052 
dlhpcshare-agent-44 30 3 container_21305_default_00066_e01_000067 container_21305_default_00076_e01_000077 container_21305_default_00099_e01_000100 
dlhpcshare-agent-46 110 11 container_21305_default_00098_e01_000099 container_21305_default_00029_e01_000030 container_21305_default_00055_e01_000056 container_21305_default_00064_e01_000065 container_21305_default_00079_e01_000080 container_21305_default_00072_e01_000073 container_21305_default_00083_e01_000084 container_21305_default_00041_e01_000042 container_21305_default_00022_e01_000023 container_21305_default_00023_e01_000024 container_21305_default_00090_e01_000091 
dlhpcshare-agent-61 10 1 container_21305_default_00097_e01_000098 
dlhpcshare-agent-40 10 1 container_21305_default_00087_e01_000088 
dlhpcshare-agent-42 10 1 container_21305_default_00084_e01_000085 
dlhpcshare-agent-20 30 3 container_21305_default_00013_e01_000014 container_21305_default_00030_e01_000031 container_21305_default_00062_e01_000063 
dlhpcshare-agent-60 80 8 container_21305_default_00089_e01_000090 container_21305_default_00004_e01_000005 container_21305_default_00063_e01_000064 container_21305_default_00065_e01_000066 container_21305_default_00080_e01_000081 container_21305_default_00056_e01_000057 container_21305_default_00060_e01_000061 container_21305_default_00054_e01_000055 
dlhpcshare-agent-8 60 6 container_21305_default_00091_e01_000092 container_21305_default_00024_e01_000025 container_21305_default_00040_e01_000041 container_21305_default_00038_e01_000039 container_21305_default_00047_e01_000048 container_21305_default_00081_e01_000082 
dlhpcshare-agent-6 10 1 container_21305_default_00019_e01_000020 
dlhpcshare-agent-2 10 1 container_21305_default_00015_e01_000016 
dlhpcshare-agent-14 50 5 container_21305_default_00031_e01_000032 container_21305_default_00053_e01_000054 container_21305_default_00093_e01_000094 container_21305_default_00059_e01_000060 container_21305_default_00069_e01_000070 
dlhpcshare-agent-36 50 5 container_21305_default_00006_e01_000007 container_21305_default_00085_e01_000086 container_21305_default_00039_e01_000040 container_21305_default_00045_e01_000046 container_21305_default_00021_e01_000022 
dlhpcshare-agent-59 30 3 container_21305_default_00036_e01_000037 container_21305_default_00073_e01_000074 container_21305_default_00074_e01_000075 
dlhpcshare-agent-37 50 5 container_21305_default_00011_e01_000012 container_21305_default_00018_e01_000019 container_21305_default_00025_e01_000026 container_21305_default_00050_e01_000051 container_21305_default_00005_e01_000006 
dlhpcshare-agent-15 60 6 container_21305_default_00037_e01_000038 container_21305_default_00095_e01_000096 container_21305_default_00096_e01_000097 container_21305_default_00048_e01_000049 container_21305_default_00016_e01_000017 container_21305_default_00003_e01_000004 
dlhpcshare-agent-17 20 2 container_21305_default_00077_e01_000078 container_21305_default_00094_e01_000095 
dlhpcshare-agent-54 50 5 container_21305_default_00068_e01_000069 container_21305_default_00043_e01_000044 container_21305_default_00034_e01_000035 container_21305_default_00026_e01_000027 container_21305_default_00049_e01_000050 
dlhpcshare-agent-12 60 6 container_21305_default_00082_e01_000083 container_21305_default_00057_e01_000058 container_21305_default_00058_e01_000059 container_21305_default_00017_e01_000018 container_21305_default_00071_e01_000072 container_21305_default_00008_e01_000009 
dlhpcshare-agent-34 100 10 container_21305_default_00012_e01_000013 container_21305_default_00007_e01_000008 container_21305_default_00088_e01_000089 container_21305_default_00035_e01_000036 container_21305_default_00075_e01_000076 container_21305_default_00027_e01_000028 container_21305_default_00001_e01_000002 container_21305_default_00042_e01_000043 container_21305_default_00044_e01_000045 container_21305_default_00032_e01_000033 
dlhpcshare-agent-57 10 1 container_21305_default_00028_e01_000029 
dlhpcshare-agent-13 20 2 container_21305_default_00052_e01_000053 container_21305_default_00010_e01_000011 
dlhpcshare-agent-53 30 3 container_21305_default_00067_e01_000068 container_21305_default_00014_e01_000015 container_21305_default_00078_e01_000079 

openmpi hostfile /tmp/hostfile.1610738 generated:
-----------------------
dlhpcshare-agent-28 slots=4
dlhpcshare-agent-25 slots=4
dlhpcshare-agent-49 slots=2
dlhpcshare-agent-21 slots=1
dlhpcshare-agent-44 slots=3
dlhpcshare-agent-46 slots=11
dlhpcshare-agent-61 slots=1
dlhpcshare-agent-40 slots=1
dlhpcshare-agent-42 slots=1
dlhpcshare-agent-20 slots=3
dlhpcshare-agent-60 slots=8
dlhpcshare-agent-8 slots=6
dlhpcshare-agent-6 slots=1
dlhpcshare-agent-2 slots=1
dlhpcshare-agent-14 slots=5
dlhpcshare-agent-36 slots=5
dlhpcshare-agent-59 slots=3
dlhpcshare-agent-37 slots=5
dlhpcshare-agent-15 slots=6
dlhpcshare-agent-17 slots=2
dlhpcshare-agent-54 slots=5
dlhpcshare-agent-12 slots=6
dlhpcshare-agent-34 slots=10
dlhpcshare-agent-57 slots=1
dlhpcshare-agent-13 slots=2
dlhpcshare-agent-53 slots=3
-----------------------
Total tasks is 100
mpirun -hostfile /tmp/hostfile.1610738 -n 100 <your application>
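
Judging from the listing above, each hostfile line now carries the task count (third column of the allocation file) as its slots value, not the core count (second column). The following is only a minimal Python sketch of that mapping, not the author's actual script. The function name build_hostfile and the SAMPLE_ALLOC constant are made up for illustration; the three sample lines are copied from the allocation file printed above.

```python
# Sketch: derive Open MPI hostfile lines from CCS_ALLOC_FILE-style text.
# Assumed column layout (from the comments in the official script):
#   node_name  cores  tasks  container_id...
SAMPLE_ALLOC = """\
dlhpcshare-agent-6 10 1 container_21305_default_00019_e01_000020
dlhpcshare-agent-2 10 1 container_21305_default_00015_e01_000016
dlhpcshare-agent-17 20 2 container_21305_default_00077_e01_000078 container_21305_default_00094_e01_000095
"""


def build_hostfile(alloc_text):
    """Return (hostfile_lines, total_tasks) with slots = tasks per node."""
    hostfile_lines = []
    total_tasks = 0
    for raw in alloc_text.splitlines():
        fields = raw.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        node, tasks = fields[0], int(fields[2])
        # Use the task count (column 3), not the core count (column 2),
        # so that the slots add up to the value passed to mpirun -n.
        hostfile_lines.append(f"{node} slots={tasks}")
        total_tasks += tasks
    return hostfile_lines, total_tasks


if __name__ == "__main__":
    lines, total = build_hostfile(SAMPLE_ALLOC)
    print("\n".join(lines))
    print("Total tasks is", total)
```

With slots taken from the task column, the slots over all nodes sum to the same number that is passed to mpirun -n (100 in the run above).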

 

 

Run results:

my rank is 0/100, and node_name: dlhpcshare-agent-28 Ireceived:[99] 128 1000001548.0369567
my rank is 1/100, and node_name: dlhpcshare-agent-28 Ireceived:[0] 128 1000044246.7773606
my rank is 2/100, and node_name: dlhpcshare-agent-28 Ireceived:[1] 128 999962097.0663284
my rank is 3/100, and node_name: dlhpcshare-agent-28 Ireceived:[2] 128 1000016400.6064295
my rank is 4/100, and node_name: dlhpcshare-agent-25 Ireceived:[3] 128 1000008126.1840497
my rank is 5/100, and node_name: dlhpcshare-agent-25 Ireceived:[4] 128 999961675.040424
my rank is 6/100, and node_name: dlhpcshare-agent-25 Ireceived:[5] 128 1000021516.2687825
my rank is 7/100, and node_name: dlhpcshare-agent-25 Ireceived:[6] 128 1000044540.8409615
my rank is 8/100, and node_name: dlhpcshare-agent-49 Ireceived:[7] 128 999952778.3406386
my rank is 9/100, and node_name: dlhpcshare-agent-49 Ireceived:[8] 128 1000005490.1392791
my rank is 10/100, and node_name: dlhpcshare-agent-21 Ireceived:[9] 128 999986083.2240833
my rank is 11/100, and node_name: dlhpcshare-agent-44 Ireceived:[10] 128 999979176.7722001
my rank is 12/100, and node_name: dlhpcshare-agent-44 Ireceived:[11] 128 1000024127.996607
my rank is 13/100, and node_name: dlhpcshare-agent-44 Ireceived:[12] 128 999986125.1458977
my rank is 14/100, and node_name: dlhpcshare-agent-46 Ireceived:[13] 128 999968627.061783
my rank is 15/100, and node_name: dlhpcshare-agent-46 Ireceived:[14] 128 999972671.8341721
my rank is 16/100, and node_name: dlhpcshare-agent-46 Ireceived:[15] 128 999956437.583436
my rank is 17/100, and node_name: dlhpcshare-agent-46 Ireceived:[16] 128 999998544.5230424
my rank is 18/100, and node_name: dlhpcshare-agent-46 Ireceived:[17] 128 1000062838.9180896
my rank is 19/100, and node_name: dlhpcshare-agent-46 Ireceived:[18] 128 999977571.1755748
my rank is 20/100, and node_name: dlhpcshare-agent-46 Ireceived:[19] 128 1000026783.9051983
my rank is 21/100, and node_name: dlhpcshare-agent-46 Ireceived:[20] 128 999944040.7103922
my rank is 22/100, and node_name: dlhpcshare-agent-46 Ireceived:[21] 128 1000032367.5440634
my rank is 23/100, and node_name: dlhpcshare-agent-46 Ireceived:[22] 128 1000000648.6932456
my rank is 24/100, and node_name: dlhpcshare-agent-46 Ireceived:[23] 128 1000000311.0008336
my rank is 25/100, and node_name: dlhpcshare-agent-61 Ireceived:[24] 128 1000081391.8491123
my rank is 26/100, and node_name: dlhpcshare-agent-40 Ireceived:[25] 128 999958520.3115258
my rank is 27/100, and node_name: dlhpcshare-agent-42 Ireceived:[26] 128 1000029731.7126155
my rank is 28/100, and node_name: dlhpcshare-agent-20 Ireceived:[27] 128 999989692.1419857
my rank is 29/100, and node_name: dlhpcshare-agent-20 Ireceived:[28] 128 1000007198.1436479
my rank is 30/100, and node_name: dlhpcshare-agent-20 Ireceived:[29] 128 1000033574.3678787
my rank is 31/100, and node_name: dlhpcshare-agent-60 Ireceived:[30] 128 999947556.4409302
my rank is 32/100, and node_name: dlhpcshare-agent-60 Ireceived:[31] 128 1000064412.7062727
my rank is 33/100, and node_name: dlhpcshare-agent-60 Ireceived:[32] 128 1000007761.7041193
my rank is 34/100, and node_name: dlhpcshare-agent-60 Ireceived:[33] 128 999925453.4237553
my rank is 35/100, and node_name: dlhpcshare-agent-60 Ireceived:[34] 128 1000017841.1207856
my rank is 36/100, and node_name: dlhpcshare-agent-60 Ireceived:[35] 128 999973881.0131118
my rank is 37/100, and node_name: dlhpcshare-agent-60 Ireceived:[36] 128 999970098.9094045
my rank is 38/100, and node_name: dlhpcshare-agent-60 Ireceived:[37] 128 999952826.1036501
my rank is 39/100, and node_name: dlhpcshare-agent-8 Ireceived:[38] 128 1000046255.7144414
my rank is 40/100, and node_name: dlhpcshare-agent-8 Ireceived:[39] 128 1000000229.5421265
my rank is 41/100, and node_name: dlhpcshare-agent-8 Ireceived:[40] 128 1000022290.0252079
my rank is 42/100, and node_name: dlhpcshare-agent-8 Ireceived:[41] 128 1000011328.9696794
my rank is 43/100, and node_name: dlhpcshare-agent-8 Ireceived:[42] 128 999943992.9039626
my rank is 44/100, and node_name: dlhpcshare-agent-8 Ireceived:[43] 128 1000026325.9412429
my rank is 45/100, and node_name: dlhpcshare-agent-6 Ireceived:[44] 128 999996594.9427018
my rank is 46/100, and node_name: dlhpcshare-agent-2 Ireceived:[45] 128 1000041001.7400087
my rank is 47/100, and node_name: dlhpcshare-agent-14 Ireceived:[46] 128 1000019117.6120685
my rank is 48/100, and node_name: dlhpcshare-agent-14 Ireceived:[47] 128 1000004723.1141306
my rank is 49/100, and node_name: dlhpcshare-agent-14 Ireceived:[48] 128 1000000249.2617867
my rank is 50/100, and node_name: dlhpcshare-agent-14 Ireceived:[49] 128 999937020.4400109
my rank is 51/100, and node_name: dlhpcshare-agent-14 Ireceived:[50] 128 999991492.5603293
my rank is 52/100, and node_name: dlhpcshare-agent-36 Ireceived:[51] 128 999966666.7490039
my rank is 53/100, and node_name: dlhpcshare-agent-36 Ireceived:[52] 128 999923990.1804005
my rank is 54/100, and node_name: dlhpcshare-agent-36 Ireceived:[53] 128 999961214.0865831
my rank is 55/100, and node_name: dlhpcshare-agent-36 Ireceived:[54] 128 999962973.8227273
my rank is 56/100, and node_name: dlhpcshare-agent-36 Ireceived:[55] 128 1000033223.1139216
my rank is 57/100, and node_name: dlhpcshare-agent-59 Ireceived:[56] 128 999994139.1621871
my rank is 58/100, and node_name: dlhpcshare-agent-59 Ireceived:[57] 128 999996580.0694407
my rank is 59/100, and node_name: dlhpcshare-agent-59 Ireceived:[58] 128 999932381.9617747
my rank is 60/100, and node_name: dlhpcshare-agent-37 Ireceived:[59] 128 999952736.7008783
my rank is 61/100, and node_name: dlhpcshare-agent-37 Ireceived:[60] 128 999961251.6714846
my rank is 62/100, and node_name: dlhpcshare-agent-37 Ireceived:[61] 128 999999478.042908
my rank is 63/100, and node_name: dlhpcshare-agent-37 Ireceived:[62] 128 999979645.2011644
my rank is 64/100, and node_name: dlhpcshare-agent-37 Ireceived:[63] 128 999990214.933797
my rank is 65/100, and node_name: dlhpcshare-agent-15 Ireceived:[64] 128 999921783.2922219
my rank is 66/100, and node_name: dlhpcshare-agent-15 Ireceived:[65] 128 999930730.5282868
my rank is 67/100, and node_name: dlhpcshare-agent-15 Ireceived:[66] 128 999998188.9918685
my rank is 68/100, and node_name: dlhpcshare-agent-15 Ireceived:[67] 128 999967715.6977485
my rank is 69/100, and node_name: dlhpcshare-agent-15 Ireceived:[68] 128 1000079199.5909462
my rank is 70/100, and node_name: dlhpcshare-agent-15 Ireceived:[69] 128 999994935.3312064
my rank is 71/100, and node_name: dlhpcshare-agent-17 Ireceived:[70] 128 999971604.2596891
my rank is 72/100, and node_name: dlhpcshare-agent-17 Ireceived:[71] 128 1000072224.268106
my rank is 73/100, and node_name: dlhpcshare-agent-54 Ireceived:[72] 128 999990133.4711875
my rank is 74/100, and node_name: dlhpcshare-agent-54 Ireceived:[73] 128 999978027.9042766
my rank is 75/100, and node_name: dlhpcshare-agent-54 Ireceived:[74] 128 999990773.9722083
my rank is 76/100, and node_name: dlhpcshare-agent-54 Ireceived:[75] 128 999985109.0805329
my rank is 77/100, and node_name: dlhpcshare-agent-54 Ireceived:[76] 128 1000040580.0881176
my rank is 78/100, and node_name: dlhpcshare-agent-12 Ireceived:[77] 128 999952277.021426
my rank is 79/100, and node_name: dlhpcshare-agent-12 Ireceived:[78] 128 1000006113.7835764
my rank is 80/100, and node_name: dlhpcshare-agent-12 Ireceived:[79] 128 1000016257.5420581
my rank is 81/100, and node_name: dlhpcshare-agent-12 Ireceived:[80] 128 1000009820.5689441
my rank is 82/100, and node_name: dlhpcshare-agent-12 Ireceived:[81] 128 999948478.4103069
my rank is 83/100, and node_name: dlhpcshare-agent-12 Ireceived:[82] 128 999974697.9367532
my rank is 84/100, and node_name: dlhpcshare-agent-34 Ireceived:[83] 128 999979055.5054404
my rank is 85/100, and node_name: dlhpcshare-agent-34 Ireceived:[84] 128 1000041499.0674162
my rank is 86/100, and node_name: dlhpcshare-agent-34 Ireceived:[85] 128 1000032224.6438636
my rank is 87/100, and node_name: dlhpcshare-agent-34 Ireceived:[86] 128 999951417.6075771
my rank is 88/100, and node_name: dlhpcshare-agent-34 Ireceived:[87] 128 1000032051.0978342
my rank is 89/100, and node_name: dlhpcshare-agent-34 Ireceived:[88] 128 1000000387.5620804
my rank is 90/100, and node_name: dlhpcshare-agent-34 Ireceived:[89] 128 999916308.7998385
my rank is 91/100, and node_name: dlhpcshare-agent-34 Ireceived:[90] 128 999932974.9200188
my rank is 92/100, and node_name: dlhpcshare-agent-34 Ireceived:[91] 128 1000075592.2132264
my rank is 93/100, and node_name: dlhpcshare-agent-34 Ireceived:[92] 128 1000026815.8564372
my rank is 94/100, and node_name: dlhpcshare-agent-57 Ireceived:[93] 128 1000013768.6777565
my rank is 95/100, and node_name: dlhpcshare-agent-13 Ireceived:[94] 128 1000045400.8612764
my rank is 96/100, and node_name: dlhpcshare-agent-13 Ireceived:[95] 128 999995092.0445201
my rank is 97/100, and node_name: dlhpcshare-agent-53 Ireceived:[96] 128 999993829.6496035
my rank is 98/100, and node_name: dlhpcshare-agent-53 Ireceived:[97] 128 1000046106.0476944
my rank is 99/100, and node_name: dlhpcshare-agent-53 Ireceived:[98] 128 999911786.4949732

 

As the output shows, the modified MPI code on the HPC platform successfully uses all of the hosts and CPU cores allocated by the scheduler.
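
One way to check this claim mechanically is to count the ranks reported per node and compare the counts with the slots in the hostfile. The sketch below is an assumption about how such a check could be written, not something the original run included. The helper names are hypothetical, and the sample text is a three-line excerpt (ranks 8 to 10) of the output above.

```python
import re

# Matches lines such as:
#   my rank is 8/100, and node_name: dlhpcshare-agent-49 Ireceived:[7] ...
RANK_LINE = re.compile(r"my rank is (\d+)/(\d+), and node_name: (\S+)")
HOST_LINE = re.compile(r"(\S+)\s+slots=(\d+)")

SAMPLE_OUTPUT = """\
my rank is 8/100, and node_name: dlhpcshare-agent-49 Ireceived:[7] 128 999952778.3406386
my rank is 9/100, and node_name: dlhpcshare-agent-49 Ireceived:[8] 128 1000005490.1392791
my rank is 10/100, and node_name: dlhpcshare-agent-21 Ireceived:[9] 128 999986083.2240833
"""

SAMPLE_HOSTFILE = """\
dlhpcshare-agent-49 slots=2
dlhpcshare-agent-21 slots=1
"""


def ranks_per_node(output_text):
    """Count how many ranks reported each node name."""
    counts = {}
    for line in output_text.splitlines():
        match = RANK_LINE.search(line)
        if match:
            node = match.group(3)
            counts[node] = counts.get(node, 0) + 1
    return counts


def slots_per_node(hostfile_text):
    """Parse 'node slots=N' lines into a dict."""
    slots = {}
    for line in hostfile_text.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            slots[match.group(1)] = int(match.group(2))
    return slots


if __name__ == "__main__":
    # True when every node runs exactly as many ranks as it has slots.
    print(ranks_per_node(SAMPLE_OUTPUT) == slots_per_node(SAMPLE_HOSTFILE))
```

Feeding the full output.txt and the generated hostfile to the same two functions would cover all 100 ranks.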

 

 

 

PS:

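Note that the main purpose of the change in this post is to fix the error in the official configuration file, namely that each task within a single job cannot use multiple CPU cores. In traditional MPI designs a single task usually runs on a single CPU. Today's compute jobs are often complex, however (AI workloads, for example), and a design in which each sub-task of a job uses only one CPU is no longer enough. With the corrected configuration given here, a single task can be allotted multiple CPU cores. Inside that task we can then use from multiprocessing import Process, Queue to put those cores to work and achieve higher performance.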
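
As a rough illustration of the last point, here is a minimal sketch of a single task spreading work over several cores with Process and Queue. The workload (summing a range of integers) and the names partial_sum and parallel_sum are invented for the example. The worker count of 10 only mirrors the cpu=10 request in the dsub command and is an assumption, not a measured setting.

```python
from multiprocessing import Process, Queue


def partial_sum(start, stop, out):
    """Worker: sum the integers in [start, stop) and report the result."""
    out.put(sum(range(start, stop)))


def parallel_sum(n, workers):
    """Sum range(n) using up to `workers` processes inside one task."""
    if n <= 0:
        return 0
    out = Queue()
    step = (n + workers - 1) // workers  # chunk size, rounded up
    procs = [
        Process(target=partial_sum, args=(start, min(start + step, n), out))
        for start in range(0, n, step)
    ]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    total = sum(out.get() for _ in procs)
    for proc in procs:
        proc.join()
    return total


if __name__ == "__main__":
    # 10 workers, assumed to match the cpu=10 resource request per task.
    print(parallel_sum(1_000_000, workers=10))
```

In a real job each MPI rank would run something like parallel_sum on its own share of the data, so the MPI layer distributes work across tasks and multiprocessing distributes it across the cores of each task.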

 

 

============================================

 
