Nvidia
Useful resources:
- https://docs.kinetica.com/7.1/install/nvidia_rhel/
- https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/
- Good Nvidia driver installation guide: https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Linux.pdf
What is nvidia-smi?
The nvidia-smi (NVIDIA System Management Interface) command in Linux is a utility provided by NVIDIA to monitor and manage GPU devices. It is part of the NVIDIA driver package and provides detailed information about the status of the NVIDIA GPUs installed on your system.
When you have Nvidia drivers installed, the command nvidia-smi outputs a neat table giving you information about your GPU, CUDA, and driver setup.
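If you only need the GPU name and driver version rather than the whole table, nvidia-smi also has a query mode. The snippet below is a minimal sketch: the function name gpu_summary is made up for illustration, and the optional argument exists only so the function can be tried on a machine without a GPU.

```shell
# Print "name, driver_version" for each GPU, one line per GPU.
# $1 (optional) names the command to run; it defaults to nvidia-smi.
gpu_summary() {
    if ! command -v "${1:-nvidia-smi}" >/dev/null 2>&1; then
        echo "${1:-nvidia-smi} not found: is the NVIDIA driver installed?" >&2
        return 1
    fi
    "${1:-nvidia-smi}" --query-gpu=name,driver_version --format=csv,noheader
}

# Usage on a system with the driver installed:
#   gpu_summary
```

A non-zero return here usually means the driver package (which ships nvidia-smi) is not installed or not on the PATH.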
How to install NVIDIA driver on RHEL8 for Specific GPU model (using .run file)
The following steps use NVIDIA's .run file. For steps on how to install nvidia-driver via dnf/command line, please refer to this link.
Another helpful link: https://www.if-not-true-then-false.com/2021/install-nvidia-drivers-on-centos-rhel-rocky-linux/
Notes:
• The following steps have been tested on Dell Precision 3480 (RTX A500 Laptop GPU) and Dell Precision 7670 (RTX A2000 8GB Laptop GPU).
Steps:
1. First, connect to the system via SSH/PuTTY and check the NVIDIA graphics card model:
$ lspci | grep -i nvidia
In this example output, the NVIDIA graphics card is RTX A2000 8GB Laptop GPU.
2. Search for the driver on Nvidia's website and download it locally.
You can use wget to download the .run file directly from Nvidia’s website to the system. Tip: To get the correct URL, click “Download”, then right-click “Agree & Download” and copy the link address.
$ wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
3. Install prerequisites:
$ dzdo yum install gcc kernel-devel libglvnd-devel elfutils-libelf-devel
4. Then perform the following steps to disable the nouveau driver:
a. Edit the /etc/default/grub file and ensure that the options rd.driver.blacklist=nouveau nouveau.modeset=0 are at the end of the GRUB_CMDLINE_LINUX line:
$ dzdo vi /etc/default/grub
Example:
GRUB_CMDLINE_LINUX="resume=/dev/mapper/cmw--rhel-swap rd.lvm.lv=cmw-rhel/root rd.lvm.lv=cmw-rhel/swap rhgb quiet audit=1 audit_backlog_limit=8192 pti=on page_poison=1 slub_debug=P fips=1 boot=UUID=1125ad64-b4b3-4995-928c-8f8a1fa2c48b rd.driver.blacklist=nouveau nouveau.modeset=0"
b. Save the file and exit.
c. Next, rebuild the GRUB configuration file:
$ dzdo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
d. Create the disable-nouveau.conf file under the /etc/modprobe.d/ directory:
$ dzdo vi /etc/modprobe.d/disable-nouveau.conf
And then insert these two separate lines:
blacklist nouveau
options nouveau modeset=0
Example:
$ dzdo cat /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0
e. Modify the permissions and then regenerate the initramfs file by running dracut. Then reboot.
$ dzdo chmod 644 /etc/modprobe.d/disable-nouveau.conf
$ dzdo dracut -f
$ dzdo reboot
5. Log back in via SSH and then run the following commands:
$ dzdo init 3
$ chmod +x NVIDIA-Linux-x86_64-<version.number>.run
$ dzdo mount -o remount,exec /tmp
$ dzdo yum remove nvidia-driver
6. Then run the installer and answer the prompts:
$ dzdo ./NVIDIA-Linux-x86_64-<version.number>.run
Wait for the installer interface to load, then answer the prompts (four questions, followed by a final confirmation):
- Install 32-bit compatibility Libraries – Yes
- Register with DKMS – No
- Initramfs rebuild – Yes
- Update X Configuration File – Yes
- Installation Complete – Click OK
7. Then confirm the latest driver version was installed successfully (check Driver Version):
$ nvidia-smi
8. Reboot the system and confirm that it boots up and works as expected. Once logged back in, re-run nvidia-smi to confirm.
$ dzdo reboot
$ nvidia-smi
(or)
$ nvidia-smi -q | grep -i "driver version"
9. Clean up: remove the .run installer once completed successfully.
$ dzdo rm NVIDIA-Linux-x86_64-<version.number>.run
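Before the reboot in step 4e, it can be worth double-checking that both nouveau settings from step 4 actually landed in the files. The following is a minimal sketch, not part of the tested procedure; the function names are made up for illustration, and the paths in the usage comments are the ones used in the steps above.

```shell
# Succeeds if the GRUB_CMDLINE_LINUX line in file $1 carries both options.
grub_blacklists_nouveau() {
    line=$(grep '^GRUB_CMDLINE_LINUX=' "$1") || return 1
    case "$line" in
        *rd.driver.blacklist=nouveau*) ;;
        *) return 1 ;;
    esac
    case "$line" in
        *nouveau.modeset=0*) ;;
        *) return 1 ;;
    esac
    return 0
}

# Succeeds if file $1 holds both directives, each on its own line.
modprobe_blacklists_nouveau() {
    grep -qx 'blacklist nouveau' "$1" &&
        grep -qx 'options nouveau modeset=0' "$1"
}

# Usage:
#   grub_blacklists_nouveau /etc/default/grub || echo "GRUB options missing"
#   modprobe_blacklists_nouveau /etc/modprobe.d/disable-nouveau.conf || echo "modprobe.d file incomplete"
# After the reboot, this should print nothing if nouveau is no longer loaded:
#   lsmod | grep nouveau
```

The second check matches whole lines on purpose, so a file where both directives were pasted onto a single line is reported as incomplete.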
How to uninstall the NVIDIA driver after installing it using the .run file
sudo ./NVIDIA-Linux-x86_64-<version.number>.run --uninstall
How to check Nvidia driver version and other information
$ nvidia-smi
Tue Jun 4 13:09:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0             25W /   60W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Alternatively, you can run the following command which will give you the same information (or more) in a non-table format:
$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp                                 : Thu Jun 27 14:55:53 2024
Driver Version                            : 555.42.02
CUDA Version                              : 12.5
Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : NVIDIA RTX A500 Laptop GPU
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-fabe1493-c99a-410b-5941-abeb61f58c80
    Minor Number                          : 0
    VBIOS Version                         : 94.07.7C.00.0B
or
$ nvidia-smi -q | grep -i driver
Driver Version                            : 555.42.02
    Driver Model
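When a script needs just the value of one field, grep alone leaves the label and padding attached. The helper below is a minimal sketch (the name smi_field is made up for illustration) that reads nvidia-smi -q output on stdin and prints the value of the first line whose label matches exactly.

```shell
# Read "Label : value" lines on stdin and print the value for label $1.
# Splits on the first colon only, so values that contain colons
# (such as the timestamp) are returned whole.
smi_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == key) {
                value = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                exit
            }
        }
    '
}

# Usage:
#   nvidia-smi -q | smi_field "Driver Version"
#   nvidia-smi -q | smi_field "CUDA Version"
```

Only the first matching label is printed, so on a multi-GPU system per-GPU fields such as "Product Name" would report the first GPU only.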
NVIDIA Documentation on How to Disable Nouveau
- https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html
- https://docs.nvidia.com/ai-enterprise/deployment-guide-bare-metal/0.1.0/nouveau.html
GSP Firmware
Having the GSP firmware enabled has caused issues on some Dell models that contain specific GPU hardware.
What is GSP?
Some GPUs include a GPU System Processor (GSP) which can be used to offload GPU initialization and management tasks. This processor is driven by firmware files distributed with the driver. The GSP firmware is used by default on GPUs which support it.
Offloading tasks which were traditionally performed by the driver on the CPU can improve performance due to lower latency access to GPU hardware internals.
Firmware files gsp_*.bin are installed in /lib/firmware/nvidia/560.28.03/. Each GSP firmware file is named after a GPU architecture (for example, gsp_tu10x.bin is named after Turing) and supports GPUs from one or more architectures.
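To see which GSP firmware files a given driver version installed, you can list the directory directly. A minimal sketch follows; the function name list_gsp_firmware is made up for illustration, and the version directory in the usage comment is the one quoted above, so it will differ for other driver versions.

```shell
# Print the names of the gsp_*.bin firmware files found in directory $1.
list_gsp_firmware() {
    for f in "$1"/gsp_*.bin; do
        [ -e "$f" ] || continue   # glob matched nothing
        basename "$f"
    done
}

# Usage:
#   list_gsp_firmware /lib/firmware/nvidia/560.28.03
```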
Disabling GSP Mode
The driver can be forced to disable use of GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=0.
The nvidia-smi utility can be used to query the current use of GSP firmware. It will display a valid version if GSP firmware is enabled, or “N/A” if disabled:
$ nvidia-smi -q | grep -iE 'driver version|gsp'
Driver Version                            : 555.42.06
    GSP Firmware Version                  : N/A
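For scripted checks (for example at the end of an imaging run), the same query can be reduced to a single word. This is a minimal sketch; the function name gsp_status and its output wording are made up for illustration. It reads nvidia-smi -q output on stdin.

```shell
# Read nvidia-smi -q output on stdin and report the GSP firmware state:
#   "disabled"            if the version field is N/A
#   "enabled (<version>)" if a firmware version is shown
#   "unknown"             (non-zero return) if the field is missing
gsp_status() {
    version=$(awk -F': *' '/GSP Firmware Version/ { print $2; exit }')
    case "$version" in
        "")  echo "unknown"; return 1 ;;
        N/A) echo "disabled" ;;
        *)   echo "enabled ($version)" ;;
    esac
}

# Usage:
#   nvidia-smi -q | gsp_status
```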
Enabling GSP Mode
The GSP firmware will be used by default for all Turing and later GPUs. The driver can be explicitly configured to use the GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=1.
Source: https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/gsp.html
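To confirm whether the parameter was actually passed at boot, one option is to look for it on the kernel command line. The following is a minimal sketch; the function name gsp_cmdline_setting is made up for illustration, and it only inspects the command line text it is given, not options set elsewhere (such as a file under /etc/modprobe.d).

```shell
# Print the value of nvidia.NVreg_EnableGpuFirmware found in the kernel
# command line string $1, or "unset" if the parameter is absent.
# If the parameter appears more than once, the last occurrence is reported.
gsp_cmdline_setting() {
    value=$(printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^nvidia\.NVreg_EnableGpuFirmware=//p' | tail -n 1)
    echo "${value:-unset}"
}

# Usage:
#   gsp_cmdline_setting "$(cat /proc/cmdline)"
```

A result of 0 here alongside a real version in the GSP Firmware Version field of nvidia-smi -q would suggest the driver is not honoring the parameter.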
Issues with the 560 Driver Not Able to Disable GSP
Nvidia then released the 560 driver, which also ships with GSP enabled. However, starting with the 560 driver, Nvidia made the open kernel module the default installation instead of the proprietary version. From my research, we were not able to disable the GSP firmware when using the open kernel module version of the driver, because it ignores the NVreg_EnableGpuFirmware=0 kernel parameter.
How to check whether the installed Nvidia driver is the open-source or the proprietary version
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 560.28.03 Release Build (dvs-builder@U16-A24-27-4) Thu Jul 18 20:46:24 UTC 2024
GCC version: gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)
$ rpm -qi kmod-nvidia-open-dkms-560.28.03-1.el8.x86_64
Name        : kmod-nvidia-open-dkms
Epoch       : 3
Version     : 560.28.03
Release     : 1.el8
Architecture: x86_64
Install Date: Mon 12 Aug 2024 10:51:53 AM EDT
Group       : Unspecified
Size        : 23911228
License     : NVIDIA License
Signature   : RSA/SHA512, Thu 18 Jul 2024 11:27:31 PM EDT, Key ID 9cd0a493d42d0685
Source RPM  : kmod-nvidia-open-dkms-560.28.03-1.el8.src.rpm
Build Date  : Thu 18 Jul 2024 11:25:41 PM EDT
Build Host  : cf605bc53c42
Relocations : (not relocatable)
URL         : http://www.nvidia.com/object/unix.html
Summary     : NVIDIA driver open kernel module flavor
Description :
This package provides the open-source Nvidia kernel driver modules.
The modules are rebuilt through the DKMS system when a new kernel or modules become available.
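The version file check can be wrapped in a small helper for use in scripts. This is a minimal sketch; the function name nvidia_module_flavor is made up for illustration, and treating any NVRM line that lacks the words "Open Kernel Module" as proprietary is an assumption based on the output shown above.

```shell
# Report which kernel module flavor is loaded, based on the version file.
# $1 (optional) is the file to inspect; defaults to /proc/driver/nvidia/version.
nvidia_module_flavor() {
    if grep -q 'Open Kernel Module' "${1:-/proc/driver/nvidia/version}" 2>/dev/null; then
        echo "open"
    elif grep -q '^NVRM version:' "${1:-/proc/driver/nvidia/version}" 2>/dev/null; then
        echo "proprietary"   # assumption: no "Open" marker means proprietary
    else
        echo "unknown"       # file missing or driver not loaded
        return 1
    fi
}

# Usage:
#   nvidia_module_flavor
```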
To resolve this, you must uninstall the open kernel module and install the proprietary version so that the kernel parameter takes effect and the GSP firmware is actually disabled. To install the proprietary version, use the command sudo dnf module install -y nvidia-driver:latest-dkms so that the GSP firmware gets disabled during the imaging process.
Here are some links as a reference:
- The updated nvidia_driver.sh script (See lines 30-60 for the fix): http://va3cngl01a.cacicorenet.com/linux_image/rhel8/-/blob/main/Dev_wk_files/nvidia_driver.sh
- NVIDIA's README at https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/kernel_open.html says: "Because the two flavors of kernel modules are mutually exclusive, one or the other must be chosen at install time. By default, installation will choose which flavor of kernel modules to install, based on the GPUs detected in the system. If a pre-Turing GPU is detected, installation will default to the proprietary flavor of kernel modules. Otherwise, installation will default to the open flavor of kernel modules."
- The post at https://www.reddit.com/r/linux_gaming/comments/1cp4heq/news_starting_with_nvidia_560_the_open_source/ says: "Starting in the release 560 series, it will be recommended to use the open flavor of NVIDIA Linux Kernel Modules wherever possible (Turing or later GPUs, or Ada or later when using GPU virtualization).
  - If installing from the .run file, installation will detect what GPUs are present and default to installing the open kernel modules if all NVIDIA GPUs in the system can be driven by the open kernel modules. Distribution-specific repackaging of the NVIDIA driver may require additional steps, specific to that packaging, to choose the open flavor.
  - In the release 560 series, it will still be possible to configure the .run file to install the proprietary flavor of kernel modules, with the --kernel-module-type=proprietary command line option. However, in the future, some GPUs may only be supported with the open flavor."
How to List Available Nvidia Module Streams
$ sudo dnf module list nvidia-driver
. . .
Name           Stream         Profiles                  Summary
nvidia-driver  latest         default [d], fm, ks, src  Nvidia driver for latest branch
nvidia-driver  latest-dkms    default [d], fm, ks       Nvidia driver for latest-dkms branch
nvidia-driver  open-dkms [d]  default [d], fm, ks, src  Nvidia driver for open-dkms branch
nvidia-driver  515            default [d], fm, ks, src  Nvidia driver for 515 branch
nvidia-driver  515-dkms       default [d], fm, ks       Nvidia driver for 515-dkms branch
nvidia-driver  515-open       default [d], fm, ks, src  Nvidia driver for 515-open branch
nvidia-driver  520            default [d], fm, ks, src  Nvidia driver for 520 branch
nvidia-driver  520-dkms       default [d], fm, ks       Nvidia driver for 520-dkms branch
nvidia-driver  520-open       default [d], fm, ks, src  Nvidia driver for 520-open branch
nvidia-driver  525            default [d], fm, ks, src  Nvidia driver for 525 branch
nvidia-driver  525-dkms       default [d], fm, ks       Nvidia driver for 525-dkms branch
nvidia-driver  525-open       default [d], fm, ks, src  Nvidia driver for 525-open branch
nvidia-driver  530            default [d], fm, ks, src  Nvidia driver for 530 branch
nvidia-driver  530-dkms       default [d], fm, ks       Nvidia driver for 530-dkms branch
nvidia-driver  530-open       default [d], fm, ks, src  Nvidia driver for 530-open branch
nvidia-driver  535            default [d], fm, ks, src  Nvidia driver for 535 branch
nvidia-driver  535-dkms       default [d], fm, ks       Nvidia driver for 535-dkms branch
nvidia-driver  535-open       default [d], fm, ks, src  Nvidia driver for 535-open branch
nvidia-driver  545            default [d], fm, ks, src  Nvidia driver for 545 branch
nvidia-driver  545-dkms       default [d], fm, ks       Nvidia driver for 545-dkms branch
nvidia-driver  545-open       default [d], fm, ks, src  Nvidia driver for 545-open branch
nvidia-driver  550            default [d], fm, ks, src  Nvidia driver for 550 branch
nvidia-driver  550-dkms       default [d], fm, ks       Nvidia driver for 550-dkms branch
nvidia-driver  550-open       default [d], fm, ks, src  Nvidia driver for 550-open branch
nvidia-driver  555            default [d], fm, ks, src  Nvidia driver for 555 branch
nvidia-driver  555-dkms       default [d], fm, ks       Nvidia driver for 555-dkms branch
nvidia-driver  555-open       default [d], fm, ks, src  Nvidia driver for 555-open branch
nvidia-driver  560            default [d], fm, ks, src  Nvidia driver for 560 branch
nvidia-driver  560-dkms       default [d], fm, ks       Nvidia driver for 560-dkms branch
nvidia-driver  560-open       default [d], fm, ks, src  Nvidia driver for 560-open branch
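Since the list is long, it can help to filter it down to the streams relevant to the GSP fix, that is, the DKMS streams other than the open flavor (such as latest-dkms used above). A minimal sketch follows; the function name dkms_streams is made up for illustration, and it assumes the stream name is the second whitespace-separated column, as in the output above.

```shell
# Read "dnf module list nvidia-driver" output on stdin and print the
# stream names that end in -dkms, skipping the open-dkms stream.
dkms_streams() {
    awk '$1 == "nvidia-driver" && $2 ~ /-dkms$/ && $2 !~ /^open/ { print $2 }'
}

# Usage:
#   sudo dnf module list nvidia-driver | dkms_streams
```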