Setting Up an Old GPU Server, Again

Background

This post mainly records how the tools were installed and configured.

Host configuration:

24 cores, E5-2678 v3 @ 2.50GHz
92 GB RAM
2 × 1080 Ti GPUs, 11 GB VRAM each

Configuration process

System upgrade

cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

yum clean all
yum update

reboot

cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

Clean the repo cache

yum clean all
yum makecache
yum update

Install dependencies

yum group install "Development Tools"
yum install gcc gcc-c++ -y
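
After the packages above are installed, a quick check that the compilers are actually on PATH can save a failed driver build later. The sketch below is only an illustration: the function name `missing_tools` and the tool list (gcc, g++, make) are my own choices, not something the original setup prescribes.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list: the compilers installed above plus make.
missing="$(missing_tools gcc g++ make)"
if [ -n "$missing" ]; then
    echo "still missing:" $missing
else
    echo "build tools look complete"
fi
```

If anything is reported as missing, rerunning the two yum commands above is the first thing to try.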

NTP time synchronization

# If the ntpdate command is missing, install it
yum -y install ntpdate
# Sync and check the time
ntpdate ntp1.aliyun.com
date

# Add a scheduled job
crontab -e

*/10 * * * * /usr/sbin/ntpdate ntp1.aliyun.com >/dev/null 2>&1
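
For anyone who prefers not to edit the crontab by hand, the entry can also be added from a script in a way that is safe to run more than once. This is a minimal sketch, not the method used on this server: `merge_cron_entry` is a hypothetical helper, and the entry string is my variant of the line above that silences both stdout and stderr, so adjust it to the line you actually use.

```shell
# Read an existing crontab on stdin and print it with exactly one copy of
# the given entry at the end (an identical line already present is dropped).
merge_cron_entry() {
    entry="$1"
    grep -vxF -- "$entry" || true
    printf '%s\n' "$entry"
}

# Assumed entry: sync every 10 minutes, discarding all output.
NTP_ENTRY='*/10 * * * * /usr/sbin/ntpdate ntp1.aliyun.com >/dev/null 2>&1'

# Dry run: print what the new crontab would look like without changing it.
if command -v crontab >/dev/null 2>&1; then
    { crontab -l 2>/dev/null || true; } | merge_cron_entry "$NTP_ENTRY"
fi
# To apply it, pipe the same output back into crontab:
#   { crontab -l 2>/dev/null || true; } | merge_cron_entry "$NTP_ENTRY" | crontab -
```

Because the helper filters out an exact copy of the entry before appending it, rerunning the script does not pile up duplicate jobs.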

Check the GPUs

yum install -y lshw
lshw -numeric -C display

# Add the ELRepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

yum install -y nvidia-detect
nvidia-detect
lspci -nnk | grep -i nvi
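
The grep above is fine for eyeballing, but a count is handier when the check has to go into a script. The sketch below assumes `lspci -nn` style output, where each device line carries a class code such as [0300] and a vendor:device pair such as [10de:....] (10de is NVIDIA's PCI vendor ID). The function name and the expected count of 2 are mine.

```shell
# Count NVIDIA display controllers in `lspci -nn` output read from stdin.
# Class 0300 is a VGA controller and 0302 a 3D controller; the HDMI audio
# function on the same card has a different class and is not counted.
count_nvidia_gpus() {
    grep -Eci '\[03(00|02)\]:.*\[10de:[0-9a-f]{4}\]' || true
}

EXPECTED_GPUS=2   # this host has two cards
if command -v lspci >/dev/null 2>&1; then
    found="$(lspci -nn | count_nvidia_gpus)"
    echo "NVIDIA GPUs on the PCI bus: $found (expected $EXPECTED_GPUS)"
fi
```

A count lower than expected at this stage points at a hardware or BIOS problem rather than a driver problem, since no NVIDIA driver is involved yet.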

Install the GPU driver

# Check the kernel release
uname -r
3.10.0-327.el7.x86_64
# Install the kernel sources
# https://buildlogs.centos.org/c7.1511.00/kernel/20151119220809/3.10.0-327.el7.x86_64/
# Download the matching kernel-headers and kernel-devel packages
yum install kernel-headers-3.10.0-327.el7.x86_64 -y
yum install kernel-devel-3.10.0-327.el7.x86_64 -y
# Find the matching link on the official site
# https://www.nvidia.com/Download/index.aspx?lang=en-us
# Download the GPU driver
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/470.74/NVIDIA-Linux-x86_64-470.74.run
# Make it executable
chmod a+x NVIDIA-Linux-x86_64-470.74.run
# Install the driver
./NVIDIA-Linux-x86_64-470.74.run -no-x-check -no-nouveau-check -no-opengl-files
# During installation:
# When asked "Install NVIDIA's 32-bit compatibility libraries", choose No
# When asked "Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.", choose Yes
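
The driver installer builds a kernel module, so the kernel-headers and kernel-devel versions have to match the running kernel exactly. Rather than typing the release string by hand, the package names can be derived from `uname -r`. The sketch below is one way to do that; `kernel_build_pkgs` is a hypothetical helper name, and the closing `nvidia-smi` query is just a convenient post-install check, assuming the driver installed that tool as it normally does.

```shell
# Print the kernel-headers and kernel-devel package names for a kernel
# release. Defaults to the running kernel when no argument is given.
kernel_build_pkgs() {
    release="${1:-$(uname -r)}"
    printf 'kernel-headers-%s kernel-devel-%s\n' "$release" "$release"
}

echo "packages for the running kernel: $(kernel_build_pkgs)"
# To install them (left commented out so the sketch changes nothing):
#   yum install -y $(kernel_build_pkgs)

# After the driver is installed, list what it sees.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv \
        || echo "nvidia-smi is present but could not talk to the driver"
fi
```

If yum cannot find those exact versions, the running kernel is probably older than what the repositories carry, which is why the buildlogs link above is useful.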

Install CUDA

# Find the download link on the official site
# https://developer.nvidia.com/cuda-toolkit-archive
wget https://developer.download.nvidia.com/compute/cuda/11.4.2/local_installers/cuda_11.4.2_470.57.02_linux.run
sh cuda_11.4.2_470.57.02_linux.run
# The driver was already installed above, so deselect the driver in the installer options

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.4/
Samples:  Installed in /home/liuzexu/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11.4/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.4/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.4/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 470.00 is required for CUDA 11.4 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

vim ~/.bashrc

# Add the following
export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
# Then reload it
source ~/.bashrc

vim /etc/profile
export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
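
Since the same exports end up in both ~/.bashrc and /etc/profile, the CUDA directories get prepended twice in a login shell, and an empty LD_LIBRARY_PATH leaves a trailing colon behind. A small guard avoids both. This is a sketch under my own naming: `prepend_once` and the `CUDA_HOME` variable are not part of the original configuration.

```shell
# Prepend directory $1 to the colon-separated list $2 unless it is already
# in the list. An empty list yields just the directory, with no stray colon.
prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

CUDA_HOME=/usr/local/cuda-11.4   # install prefix from the installer summary
PATH="$(prepend_once "$CUDA_HOME/bin" "$PATH")"
LD_LIBRARY_PATH="$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH
```

Sourcing a file with these lines any number of times leaves each directory in the list exactly once.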


cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.74  Mon Sep 13 23:09:15 UTC 2021
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Aug_15_21:14:11_PDT_2021
Cuda compilation tools, release 11.4, V11.4.120
Build cuda_11.4.r11.4/compiler.30300941_0
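
When several toolkits are installed side by side, it helps to confirm that the `nvcc` found on PATH is the release you expect. The sketch below pulls the release number out of the `nvcc -V` output shown above. The function name and the expected value 11.4 are assumptions for this machine.

```shell
# Extract "MAJOR.MINOR" from `nvcc -V` output read on stdin.
# It looks for the line of the form "... release 11.4, V11.4.120".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

WANT_RELEASE=11.4   # the toolkit installed in this post
if command -v nvcc >/dev/null 2>&1; then
    got="$(nvcc -V | nvcc_release)"
    if [ "$got" = "$WANT_RELEASE" ]; then
        echo "nvcc release $got matches"
    else
        echo "nvcc release is '$got', expected $WANT_RELEASE; check PATH order"
    fi
else
    echo "nvcc not found on PATH"
fi
```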

CUDA is now installed.

GCC needs to be upgraded before running the CUDA samples

yum install centos-release-scl -y 
yum install devtoolset-7 -y 
scl enable devtoolset-7 bash
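
Note that `scl enable devtoolset-7 bash` only affects the new shell it starts, so it is easy to end up back on the system GCC 4.8.5 without noticing. A quick guard before running `make` might look like the sketch below. The helper name is hypothetical, and the threshold of 7 simply mirrors the devtoolset version installed above rather than a documented CUDA requirement.

```shell
# Succeed if the dotted version $1 has a major component of at least $2.
# Works for both "4.8.5" and a bare "7", which newer gcc -dumpversion prints.
major_at_least() {
    major="${1%%.*}"
    [ "$major" -ge "$2" ]
}

WANT_GCC_MAJOR=7   # assumed threshold, matching devtoolset-7
if command -v gcc >/dev/null 2>&1; then
    current="$(gcc -dumpversion)"
    if major_at_least "$current" "$WANT_GCC_MAJOR"; then
        echo "gcc $current is new enough"
    else
        echo "gcc $current is too old; run: scl enable devtoolset-7 bash"
    fi
fi
```

In a script, sourcing /opt/rh/devtoolset-7/enable has the same effect as the scl command without spawning an interactive shell.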

# Verify with the CUDA sample tool
cd /home/liuzexu/NVIDIA_CUDA-11.4_Samples/1_Utilities/deviceQuery
make
./deviceQuery

# Output after running:
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1620 MHz (1.62 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1620 MHz (1.62 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce GTX 1080 Ti (GPU0) -> NVIDIA GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from NVIDIA GeForce GTX 1080 Ti (GPU1) -> NVIDIA GeForce GTX 1080 Ti (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.4, CUDA Runtime Version = 11.4, NumDevs = 2
Result = PASS

The line "Detected 2 CUDA Capable device(s)" means both GPUs were detected.
PASS means everything is working.
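
The two things to look for in that output can be checked mechanically, which is useful after a kernel update or a reboot. The sketch below reads deviceQuery output on stdin and succeeds only if both lines are present. `check_device_query` is my own helper name, and the device count of 2 is specific to this host.

```shell
# Succeed only if deviceQuery output on stdin reports the expected number
# of devices and ends with "Result = PASS".
check_device_query() {
    expected="$1"
    output="$(cat)"
    printf '%s\n' "$output" \
        | grep -q "^Detected $expected CUDA Capable device(s)" || return 1
    printf '%s\n' "$output" | grep -q '^Result = PASS' || return 1
}

# Run from the deviceQuery sample directory after `make`.
if [ -x ./deviceQuery ]; then
    if ./deviceQuery | check_device_query 2; then
        echo "both GPUs detected and deviceQuery passed"
    else
        echo "deviceQuery did not report 2 devices with PASS"
    fi
fi
```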

Install cuDNN

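# Download from the official site; registration and login are required
# https://developer.nvidia.com/rdp/cudnn-download
# Downloading is a bit of a hassle: the server validates a user token, and without one it returns 403 Forbidden
# Workaround: open the download page in a desktop browser, click download, then copy the download link from the browser's downloads list; that link carries the user's token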
wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/8.2.4/11.4_20210831/cudnn-11.4-linux-x64-v8.2.4.15.tgz
tar zxvf cudnn-11.4-linux-x64-v8.2.4.15.tgz -C ./

cp cuda/include/cudnn*.h /usr/local/cuda-11.4/include/
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64/

chmod a+r /usr/local/cuda-11.4/include/cudnn*.h
chmod a+r /usr/local/cuda-11.4/lib64/libcudnn*

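# These packages can be found on the same page; I downloaded them but did not install them
#libcudnn8-8.2.2.26-1.cuda11.4.x86_64.rpm
#libcudnn8-devel-8.2.2.26-1.cuda11.4.x86_64.rpm
#libcudnn8-samples-8.2.2.26-1.cuda11.4.x86_64.rpm
# If the file name is too long to download, use wget -O to set the output file name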
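
cuDNN has no command-line tool of its own, so a common way to confirm which version is in place is to read the version macros from the header. In cuDNN 8 these macros (CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL) live in cudnn_version.h. The sketch below parses them; the function name is mine, and the two candidate paths are assumptions based on the extract and copy steps above.

```shell
# Print "MAJOR.MINOR.PATCH" from the version macros in a cuDNN header file.
# Prints nothing if the macros are not found.
cudnn_header_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}

# Check the extracted copy and the installed copy, whichever exists.
for header in cuda/include/cudnn_version.h \
              /usr/local/cuda-11.4/include/cudnn_version.h; do
    if [ -r "$header" ]; then
        echo "$header: cuDNN $(cudnn_header_version "$header")"
    else
        echo "$header: not found"
    fi
done
```

If the installed path reports "not found" while the extracted one works, the version header was not copied into the CUDA include directory.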

Install Anaconda

# Download from the official site; the version chosen here is Anaconda3-2021.05-Linux-x86_64.sh
# https://repo.anaconda.com/archive/
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
bash Anaconda3-2021.05-Linux-x86_64.sh

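# There is a long license to read; keep pressing Enter to the end, then type yes
# Then set the install location to the absolute path /usr/local/anaconda3 instead of /root/anaconda3
# The last step asks whether to run conda init and add conda to the environment; answer yes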
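
Because the installer is a large shell script run as root, it may be worth comparing its checksum with the hash published alongside the file before executing it. The sketch below assumes `sha256sum` is available and that you paste the published digest into `EXPECTED_SHA256` yourself. It is deliberately left empty here, and `sha256_matches` is a hypothetical helper name.

```shell
# Succeed if the SHA-256 digest of file $1 equals the hex string $2.
sha256_matches() {
    actual="$(sha256sum "$1" | awk '{ print $1 }')"
    [ "$actual" = "$2" ]
}

INSTALLER=Anaconda3-2021.05-Linux-x86_64.sh
EXPECTED_SHA256="${EXPECTED_SHA256:-}"   # paste the published digest here

if [ ! -f "$INSTALLER" ]; then
    echo "$INSTALLER not found in the current directory"
elif [ -z "$EXPECTED_SHA256" ]; then
    echo "no expected digest set; skipping the check"
elif sha256_matches "$INSTALLER" "$EXPECTED_SHA256"; then
    echo "checksum OK"
else
    echo "checksum MISMATCH; download the installer again"
fi
```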

vim ~/.bashrc
export PATH=/usr/local/anaconda3/bin:$PATH
source ~/.bashrc
vim /etc/profile
export PATH=/usr/local/anaconda3/bin:$PATH

After installation, close the terminal, open a new one, and run the following command to test

conda --version
conda 4.10.1

All done~ 👊

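This article was written by Chakhsu Lau and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, articles on this site are original or translated by this site; please credit the author before reposting.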