Nvidia

Useful resources:

https://docs.kinetica.com/7.1/install/nvidia_rhel/
https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/
Good Nvidia driver installation guide: https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Linux.pdf

How to install NVIDIA driver on RHEL8 for Specific GPU model (using .run file)

The following steps use NVIDIA's .run file. For steps on how to install nvidia-driver via dnf on the command line, please refer to this link

Another helpful link: https://www.if-not-true-then-false.com/2021/install-nvidia-drivers-on-centos-rhel-rocky-linux/

Notes: • The following steps have been tested on Dell Precision 3480 (RTX A500 Laptop GPU) and Dell Precision 7670 (RTX A2000 8GB Laptop GPU).

Steps: 1. First connect to the system via SSH/PuTTY and check the NVIDIA graphics card model:

$ lspci | grep -i nvidia

In this example output, the NVIDIA graphics card is RTX A2000 8GB Laptop GPU.
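If the model name needs to be picked out by a script rather than read by eye, a small filter over the lspci output can help. This is a minimal sketch: gpu_models is a hypothetical helper name, and the sample line is illustrative rather than captured from a real system.

```shell
# Print the bracketed model name from each NVIDIA line of lspci output.
# Reads lspci-style text on stdin, so it can be tried without a GPU.
gpu_models() {
    # Keep only NVIDIA lines, then print what sits inside the last [ ... ].
    grep -i nvidia | sed -n 's/.*\[\(.*\)].*/\1/p'
}

# Illustrative input line (not captured from a real system).
sample='01:00.0 3D controller: NVIDIA Corporation GA107GLM [RTX A2000 8GB Laptop GPU] (rev a1)'
printf '%s\n' "$sample" | gpu_models

# On a real system: lspci | gpu_models
```

Lines without a bracketed name produce no output, so an empty result means the model has to be read from the raw lspci line instead.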

2. Search for the driver on NVIDIA's website and download it locally

You can use wget to download the .run file directly from Nvidia’s website to the system. Tip: To get the correct URL link, click “Download” and then right-click on “Agree & Download”.

$ wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
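When the same download is scripted for several driver versions, the URL can be built from the version number. The sketch below assumes that other versions follow the same path layout as the 550.90.07 example above, which is not guaranteed; nvidia_run_url is a hypothetical helper name.

```shell
# Build the .run download URL for a driver version.
# Assumption: other versions use the same path layout as the example above.
nvidia_run_url() {
    version="$1"
    printf 'https://us.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

nvidia_run_url 550.90.07

# To download: wget "$(nvidia_run_url 550.90.07)"
```

If wget returns a 404 for a built URL, fall back to copying the link from the download page as described in the tip.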

3. Install prerequisites:

$ dzdo yum install gcc kernel-devel libglvnd-devel elfutils-libelf-devel
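The installer builds a kernel module against the headers of the running kernel, so it is worth checking that the installed kernel-devel package matches uname -r. The following is a rough sketch of that check; kernel_devel_pkg is a hypothetical helper name, and the rpm query only runs where rpm is available.

```shell
# Name of the kernel-devel package that matches a given kernel release.
# With no argument, the running kernel (uname -r) is used.
kernel_devel_pkg() {
    printf 'kernel-devel-%s\n' "${1:-$(uname -r)}"
}

# Report whether the matching package is installed (needs rpm, so RHEL-like only).
if command -v rpm >/dev/null 2>&1; then
    if rpm -q "$(kernel_devel_pkg)" >/dev/null 2>&1; then
        echo "matching kernel-devel is installed"
    else
        echo "missing: $(kernel_devel_pkg)"
    fi
fi
```

If the package is reported missing, either install that exact version or update the kernel and reboot so the two line up.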

4. Then perform the following steps to disable the nouveau driver: a. Edit the /etc/default/grub file and ensure that the options rd.driver.blacklist=nouveau nouveau.modeset=0 are at the end of the GRUB_CMDLINE_LINUX line:

$ dzdo vi /etc/default/grub

Example:

GRUB_CMDLINE_LINUX="resume=/dev/mapper/cmw--rhel-swap rd.lvm.lv=cmw-rhel/root rd.lvm.lv=cmw-rhel/swap rhgb quiet audit=1 audit_backlog_limit=8192 pti=on page_poison=1 slub_debug=P fips=1 boot=UUID=1125ad64-b4b3-4995-928c-8f8a1fa2c48b rd.driver.blacklist=nouveau nouveau.modeset=0"

b. Save the file and exit. c. Next, rebuild the GRUB configuration file:

$ dzdo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

d. Create the disable-nouveau.conf file under the /etc/modprobe.d/ directory:

$ dzdo vi /etc/modprobe.d/disable-nouveau.conf

And then insert these separate lines:

blacklist nouveau
options nouveau modeset=0

Example:

$ dzdo cat /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

e. Modify the permissions and then regenerate the initramfs file by running dracut. Then reboot.

$ dzdo chmod 644 /etc/modprobe.d/disable-nouveau.conf
$ dzdo dracut -f
$ dzdo reboot
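The manual edit in sub-step a can also be scripted. The sketch below appends the two options to GRUB_CMDLINE_LINUX only when they are missing, and it is demonstrated on a scratch file with made-up content rather than on /etc/default/grub. The function name add_nouveau_args is hypothetical, and the sed expression assumes the line ends with a closing double quote, as in the example above.

```shell
# Append the nouveau options to GRUB_CMDLINE_LINUX in the given file,
# unless they are already there. Works on any copy of /etc/default/grub.
add_nouveau_args() {
    grub_file="$1"
    if grep -q '^GRUB_CMDLINE_LINUX=.*rd\.driver\.blacklist=nouveau' "$grub_file"; then
        return 0    # already present, nothing to do
    fi
    tmp="$(mktemp)"
    # Insert the options just before the closing quote of the line.
    sed '/^GRUB_CMDLINE_LINUX=/ s/"$/ rd.driver.blacklist=nouveau nouveau.modeset=0"/' \
        "$grub_file" > "$tmp" && cat "$tmp" > "$grub_file"
    rm -f "$tmp"
}

# Try it on a scratch copy first (the content here is a made-up example).
scratch="$(mktemp)"
printf '%s\n' 'GRUB_TIMEOUT=5' 'GRUB_CMDLINE_LINUX="rhgb quiet"' > "$scratch"
add_nouveau_args "$scratch"
add_nouveau_args "$scratch"   # second call is a no-op
grep '^GRUB_CMDLINE_LINUX=' "$scratch"
rm -f "$scratch"
```

After the reboot in sub-step e, running lsmod | grep nouveau should print nothing if the driver was disabled successfully.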

5. Log back in via SSH and then run the following commands:

$ dzdo init 3
$ chmod +x NVIDIA-Linux-x86_64-<version.number>.run
$ dzdo mount -o remount,exec /tmp
$ dzdo yum remove nvidia-driver
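The remount above is presumably needed because hardened systems mount /tmp with noexec, which would stop the installer from running the files it extracts there. A small check, sketched below, shows whether the remount is required at all; has_noexec is a hypothetical helper name.

```shell
# Succeeds when a comma-separated mount option list contains noexec.
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# findmnt prints the live options for /tmp; skip quietly where it is unavailable.
if command -v findmnt >/dev/null 2>&1; then
    opts="$(findmnt -no OPTIONS /tmp 2>/dev/null || true)"
    if has_noexec "$opts"; then
        echo "/tmp is noexec: remount with exec before running the installer"
    else
        echo "/tmp allows execution (or is not a separate mount)"
    fi
fi
```

The remount does not persist across reboots, so /tmp returns to its configured options after the reboot in step 8.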

6. Then run the installer and answer the prompts:

$ dzdo ./NVIDIA-Linux-x86_64-<version.number>.run

Wait for the graphical installer, then answer the five prompts (four questions and a final confirmation):

Install 32-bit compatibility Libraries – Yes
Register with DKMS – No
Initramfs rebuild – Yes
Update X Configuration File – Yes
Installation Complete – Click OK

7. Then confirm the latest driver version was installed successfully (check Driver Version):

$ nvidia-smi

8. Reboot the system and confirm that it is booting up and working as expected. Once logged back in, re-run nvidia-smi to confirm.

$ dzdo reboot
$ nvidia-smi
(or)
$ nvidia-smi -q | grep -i "driver version"

9. Clean up: remove the .run installer once completed successfully.

$ dzdo rm NVIDIA-Linux-x86_64-<version.number>.run

How to uninstall the NVIDIA driver after installing it using the .run file

$ sudo ./NVIDIA-Linux-x86_64-<version.number>.run --uninstall

How to check Nvidia driver version and other information

$ nvidia-smi
Tue Jun  4 13:09:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0             25W /   60W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Alternatively, you can run the following command which will give you the same information (or more) in a non-table format:

$ nvidia-smi -q
==============NVSMI LOG==============

Timestamp                                 : Thu Jun 27 14:55:53 2024
Driver Version                            : 555.42.02
CUDA Version                              : 12.5

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : NVIDIA RTX A500 Laptop GPU
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-fabe1493-c99a-410b-5941-abeb61f58c80
    Minor Number                          : 0
    VBIOS Version                         : 94.07.7C.00.0B

or

$ nvidia-smi -q | grep -i driver
Driver Version                            : 555.42.02
    Driver Model
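As the output above shows, grepping for "driver" also matches the Driver Model line. When only the version string is needed, for example in a script, the value can be extracted directly. This is a minimal sketch; driver_version is a hypothetical helper name, and the sample line is copied from the output above.

```shell
# Extract the driver version from `nvidia-smi -q` text supplied on stdin.
driver_version() {
    awk -F': ' '/Driver Version/ { print $2; exit }'
}

# Sample line copied from the output above.
printf '%s\n' 'Driver Version                            : 555.42.02' | driver_version

# On a system with the driver loaded, either of these gives the version:
#   nvidia-smi -q | driver_version
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```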

NVIDIA documentation on how to disable nouveau

https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html
https://docs.nvidia.com/ai-enterprise/deployment-guide-bare-metal/0.1.0/nouveau.html

GSP Firmware

Having GSP firmware enabled has caused issues on some Dell models with specific GPU hardware.

What is GSP?

Some GPUs include a GPU System Processor (GSP) which can be used to offload GPU initialization and management tasks. This processor is driven by firmware files distributed with the driver. The GSP firmware is used by default on GPUs which support it.

Offloading tasks which were traditionally performed by the driver on the CPU can improve performance due to lower latency access to GPU hardware internals.

Firmware files gsp_*.bin are installed in /lib/firmware/nvidia/560.28.03/. Each GSP firmware file is named after a GPU architecture (for example, gsp_tu10x.bin is named after Turing) and supports GPUs from one or more architectures.

Disabling GSP Mode

The driver can be forced to disable use of GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=0.
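One common way to set a module parameter persistently is a modprobe options file, in the same style as the disable-nouveau.conf file used during installation. The sketch below writes such a file into a scratch directory so nothing under /etc is touched. The file name nvidia-gsp.conf and the helper write_gsp_conf are made up for illustration, not NVIDIA conventions.

```shell
# Write a modprobe options file that sets NVreg_EnableGpuFirmware.
# $1 = target directory, $2 = 0 (disable GSP) or 1 (enable GSP).
# The file name nvidia-gsp.conf is a made-up example, not an NVIDIA convention.
write_gsp_conf() {
    conf_dir="$1"
    value="$2"
    printf 'options nvidia NVreg_EnableGpuFirmware=%s\n' "$value" \
        > "$conf_dir/nvidia-gsp.conf"
}

# Dry run into a scratch directory so nothing under /etc is touched.
scratch_dir="$(mktemp -d)"
write_gsp_conf "$scratch_dir" 0
cat "$scratch_dir/nvidia-gsp.conf"
rm -rf "$scratch_dir"
```

On a real system the target directory would be /etc/modprobe.d, and the change would likely need an initramfs rebuild (dracut -f) and a reboot before it takes effect.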

The nvidia-smi utility can be used to query the current use of GSP firmware. It will display a valid version if GSP firmware is enabled, or “N/A” if disabled.

$ nvidia-smi -q | grep -iE 'driver version|gsp'
Driver Version                            : 555.42.06
    GSP Firmware Version                  : N/A
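For scripted checks, the same query can be reduced to a single word. The sketch below reads nvidia-smi -q text on stdin and applies the rule described above (a version means enabled, N/A means disabled); gsp_state is a hypothetical helper name.

```shell
# Report GSP state from `nvidia-smi -q` text supplied on stdin.
gsp_state() {
    awk -F': ' '
        /GSP Firmware Version/ {
            if ($2 == "N/A") print "disabled"; else print "enabled (" $2 ")"
            found = 1
            exit
        }
        END { if (!found) print "unknown" }'
}

# Sample line copied from the output above.
printf '%s\n' '    GSP Firmware Version                  : N/A' | gsp_state

# On a system with the driver loaded: nvidia-smi -q | gsp_state
```

The "unknown" result covers output that has no GSP line at all, which may happen with drivers or GPUs that do not report it.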

Enabling GSP Mode

The GSP firmware will be used by default for all Turing and later GPUs. The driver can be explicitly configured to use the GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=1.

https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/gsp.html

Issues with the 560 Driver Not Able to Disable GSP

The 560 driver also ships with GSP enabled. However, starting with the 560 driver, Nvidia made the open kernel module the default installation instead of the proprietary version. In our testing, the GSP firmware could not be disabled when using the open kernel module version of the driver, because it ignores the NVreg_EnableGpuFirmware=0 kernel parameter.

To resolve this, uninstall the open kernel module and reinstall the proprietary version so that the kernel parameter takes effect and disables the GSP firmware. To install the proprietary version, use the command sudo dnf module install -y nvidia-driver:latest-dkms so that the GSP firmware gets disabled during the imaging process.
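Before reinstalling, it can help to confirm which flavor of kernel module is currently installed. One possible check is the module's license field. The sketch below assumes the open flavor reports "Dual MIT/GPL" and the proprietary flavor reports "NVIDIA"; those strings are an assumption and should be verified against the installed driver. The helper name flavor_from_license is hypothetical.

```shell
# Map a kernel module license string to a driver flavor.
# Assumption: open modules report "Dual MIT/GPL", proprietary ones "NVIDIA".
flavor_from_license() {
    case "$1" in
        "Dual MIT/GPL") echo "open" ;;
        "NVIDIA")       echo "proprietary" ;;
        *)              echo "unknown" ;;
    esac
}

# Query the installed nvidia module where modinfo is available.
if command -v modinfo >/dev/null 2>&1; then
    license="$(modinfo -F license nvidia 2>/dev/null || true)"
    echo "nvidia kernel module flavor: $(flavor_from_license "$license")"
fi
```

A result of "unknown" also covers the case where no nvidia module is installed at all.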

Here are some links as a reference:

The updated nvidia_driver.sh script (See lines 30-60 for the fix): http://va3cngl01a.cacicorenet.com/linux_image/rhel8/-/blob/main/Dev_wk_files/nvidia_driver.sh
The NVIDIA README at https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/kernel_open.html says: "Because the two flavors of kernel modules are mutually exclusive, one or the other must be chosen at install time. By default, installation will choose which flavor of kernel modules to install, based on the GPUs detected in the system. If a pre-Turing GPU is detected, installation will default to the proprietary flavor of kernel modules. Otherwise, installation will default to the open flavor of kernel modules."
Per https://www.reddit.com/r/linux_gaming/comments/1cp4heq/news_starting_with_nvidia_560_the_open_source/: "Starting in the release 560 series, it will be recommended to use the open flavor of NVIDIA Linux Kernel Modules wherever possible (Turing or later GPUs, or Ada or later when using GPU virtualization).
- If installing from the .run file, installation will detect what GPUs are present and default to installing the open kernel modules if all NVIDIA GPUs in the system can be driven by the open kernel modules. Distribution-specific repackaging of the NVIDIA driver may require additional steps, specific to that packaging, to choose the open flavor.
- In the release 560 series, it will still be possible to configure the .run file to install the proprietary flavor of kernel modules, with the --kernel-module-type=proprietary command line option. However, in the future, some GPUs may only be supported with the open flavor."