- Create the Endpoints

      apiVersion: v1
      kind: Endpoints
      metadata:
        name: nvidia-gpu-exporter
        namespace: monitoring
      subsets:
      - addresses:
        - ip: 10.1.12.17
        ports:
        - name: http
          port: 9835
          protocol: TCP
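
The ip above is the address of the GPU server. If there are several GPU servers, list each one as a further entry under addresses, for example:

      - ip: *.*.*.*
      - ip: *.*.*.*

Creating the object prints:

      endpoints/nvidia-gpu-exporter created

Listing it shows the address and port:

      NAME                  ENDPOINTS         AGE
      nvidia-gpu-exporter   10.1.12.17:9835   39s

Describing it gives the full details:

      Name:         nvidia-gpu-exporter
      Namespace:    monitoring
      Labels:       <none>
      Annotations:  <none>
      Subsets:
        Addresses:          10.1.12.17
        NotReadyAddresses:  <none>
        Ports:
          Name  Port  Protocol
          ----  ----  --------
          http  9835  TCP

      Events:  <none>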
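
For the multi-server case noted above, a minimal sketch of the same Endpoints object with a second address might look like the following. The second IP, 10.1.12.18, is a hypothetical placeholder and does not come from the original setup; the sketch also assumes that every GPU server exposes the exporter on the same port, 9835.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
subsets:
- addresses:
  - ip: 10.1.12.17   # GPU server from the original example
  - ip: 10.1.12.18   # hypothetical second GPU server, replace with a real address
  ports:
  - name: http       # must keep the same port name that the Service refers to
    port: 9835
    protocol: TCP
```

Because all addresses in one subset share the same ports list, a server that exposes the exporter on a different port would need its own entry under subsets.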
- Create the Service

      apiVersion: v1
      kind: Service
      metadata:
        labels:
          app: nvidia-gpu-exporter
        name: nvidia-gpu-exporter
        namespace: monitoring
      spec:
        ports:
        - name: http
          protocol: TCP
          port: 9835
          targetPort: http
        type: ClusterIP
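
If a Service of the same name already exists, it is deleted first; the Service is then created from the manifest:

      service "nvidia-gpu-exporter" deleted
      kubectl create -f gpu-exporter-svc.yaml
      service/nvidia-gpu-exporter created

Listing it shows:

      NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
      nvidia-gpu-exporter   ClusterIP   10.10.75.226   <none>        9835/TCP   12s

Describing it gives the full details:

      Name:              nvidia-gpu-exporter
      Namespace:         monitoring
      Labels:            app=nvidia-gpu-exporter
      Annotations:       <none>
      Selector:          <none>
      Type:              ClusterIP
      IP:                10.10.235.70
      Port:              http  9835/TCP
      TargetPort:        http/TCP
      Endpoints:         10.1.12.17:9835
      Session Affinity:  None
      Events:            <none>

The Endpoints value in the output above must match the IP and port of the Endpoints object created earlier. The Service has no selector, so Kubernetes associates it with the manually created Endpoints object that shares its name and namespace.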
- Create the ServiceMonitor

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        labels:
          app: nvidia-gpu-exporter
        name: nvidia-gpu-exporter
        namespace: monitoring
      spec:
        endpoints:
        - interval: 30s
          port: http
        jobLabel: app
        selector:
          matchLabels:
            app: nvidia-gpu-exporter

Create the ServiceMonitor from the manifest:

      kubectl create -f gpu-exporter-serviceMonitor.yaml
      servicemonitor.monitoring.coreos.com/nvidia-gpu-exporter created

Listing it shows:

      NAME                  AGE
      nvidia-gpu-exporter   12s

Describing the ServiceMonitor shows the applied spec:

      Name:         nvidia-gpu-exporter
      Namespace:    monitoring
      Labels:       app=nvidia-gpu-exporter
      Annotations:  <none>
      API Version:  monitoring.coreos.com/v1
      Kind:         ServiceMonitor
      Metadata:
        Creation Timestamp:  2022-05-13T09:50:35Z
        Generation:          1
        Managed Fields:
          API Version:  monitoring.coreos.com/v1
          Fields Type:  FieldsV1
          fieldsV1:
            f:metadata:
              f:labels:
                .:
                f:app:
            f:spec:
              .:
              f:endpoints:
              f:jobLabel:
              f:selector:
                .:
                f:matchLabels:
                  .:
                  f:app:
          Manager:         kubectl-create
          Operation:       Update
          Time:            2022-05-13T09:50:35Z
        Resource Version:  14080381
        Self Link:         /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/nvidia-gpu-exporter
        UID:               7fdb365b-8bcd-4fc2-9772-9ad7de6155bf
      Spec:
        Endpoints:
          Interval:  30s
          Port:      http
        Job Label:   app
        Selector:
          Match Labels:
            App:  nvidia-gpu-exporter
      Events:  <none>
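- Verify in the Prometheus UI

On the Targets page of the Prometheus UI, check that nvidia_gpu_exporter is listed.

On the Graph page, search for nvidia.

The search results show that this GPU server has two 3090 GPUs.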