Nvidia


Useful resources:


What is nvidia-smi?


The nvidia-smi (NVIDIA System Management Interface) command in Linux is a utility provided by NVIDIA to monitor and manage GPU devices. It is part of the NVIDIA driver package and provides detailed information about the status of the NVIDIA GPUs installed on your system.

When the NVIDIA drivers are installed, the nvidia-smi command prints a table summarizing your GPU, CUDA, and driver setup.


How to install the NVIDIA driver on RHEL8 for a specific GPU model (using the .run file)


The following steps use NVIDIA's .run file. For steps on how to install nvidia-driver via dnf on the command line, please refer to this link.

Another helpful link: https://www.if-not-true-then-false.com/2021/install-nvidia-drivers-on-centos-rhel-rocky-linux/


Notes: • The following steps have been tested on Dell Precision 3480 (RTX A500 Laptop GPU) and Dell Precision 7670 (RTX A2000 8GB Laptop GPU).

Steps:

1. Connect to the system via SSH/PuTTY and check the NVIDIA graphics card model:

$ lspci | grep -i nvidia

In this example, the command reports an RTX A2000 8GB Laptop GPU.

2. Search for the driver on NVIDIA's website and download it locally.

You can use wget to download the .run file directly from NVIDIA's website to the system. Tip: To get the correct URL, click “Download” and then right-click “Agree & Download” to copy the link address.

$ wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
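
The download URL in this example follows a regular pattern built from the driver version. As a minimal sketch, assuming other releases are published under the same path layout (this is not guaranteed, so confirm it against the link on NVIDIA's download page), a hypothetical helper named nvidia_run_url can build the URL:

```shell
# Hypothetical helper: builds the .run download URL for a given driver version.
# Assumes the same path layout as the 550.90.07 example in this guide.
nvidia_run_url() {
    version="$1"
    printf 'https://us.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

# Print the URL for the version used in this guide.
nvidia_run_url 550.90.07

# To download on a real system:
#   wget "$(nvidia_run_url 550.90.07)"
```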

3. Install prerequisites:

$ dzdo yum install gcc kernel-devel libglvnd-devel elfutils-libelf-devel

4. Disable the nouveau driver:

a. Edit the /etc/default/grub file and ensure that the options rd.driver.blacklist=nouveau nouveau.modeset=0 are at the end of the GRUB_CMDLINE_LINUX line:

$ dzdo vi /etc/default/grub

Example:

GRUB_CMDLINE_LINUX="resume=/dev/mapper/cmw--rhel-swap rd.lvm.lv=cmw-rhel/root rd.lvm.lv=cmw-rhel/swap rhgb quiet audit=1 audit_backlog_limit=8192 pti=on page_poison=1 slub_debug=P fips=1 boot=UUID=1125ad64-b4b3-4995-928c-8f8a1fa2c48b rd.driver.blacklist=nouveau nouveau.modeset=0"

b. Save the file and exit.

c. Rebuild the GRUB configuration file (the path below is the one used on UEFI systems):

$ dzdo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

d. Create the disable-nouveau.conf file under the /etc/modprobe.d/ directory:

$ dzdo vi /etc/modprobe.d/disable-nouveau.conf

And then insert these separate lines:

blacklist nouveau
options nouveau modeset=0

Example:

$ dzdo cat /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0
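
If you prefer not to edit the file in vi, the same two lines can be written non-interactively. This is a minimal sketch: the helper name nouveau_blacklist_conf is hypothetical, and piping through tee is just one way to write a root-owned file.

```shell
# Hypothetical helper: prints the two lines that belong in disable-nouveau.conf.
nouveau_blacklist_conf() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0'
}

# Show what would be written.
nouveau_blacklist_conf

# To write the file on a real system:
#   nouveau_blacklist_conf | dzdo tee /etc/modprobe.d/disable-nouveau.conf
```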


e. Set the file permissions, regenerate the initramfs by running dracut, and then reboot:

$ dzdo chmod 644 /etc/modprobe.d/disable-nouveau.conf
$ dzdo dracut -f
$ dzdo reboot
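
After the reboot, it can be worth confirming that the kernel was actually booted with both options and that nouveau is not loaded. This is a minimal sketch, assuming the kernel command line is read from /proc/cmdline; the function name cmdline_blocks_nouveau is hypothetical.

```shell
# Hypothetical helper: succeeds only if both nouveau options appear as
# whole words in the kernel command line string passed as $1.
cmdline_blocks_nouveau() {
    case " $1 " in
        *" rd.driver.blacklist=nouveau "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" nouveau.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration on a sample command line.
sample='rhgb quiet rd.driver.blacklist=nouveau nouveau.modeset=0'
if cmdline_blocks_nouveau "$sample"; then
    echo "both nouveau options are present"
fi

# On a live system:
#   cmdline_blocks_nouveau "$(cat /proc/cmdline)" && echo "both nouveau options are present"
#   lsmod | grep -i nouveau    # should print nothing if nouveau is not loaded
```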

5. Log back in via SSH, then switch to the non-graphical runlevel, make the installer executable, remount /tmp with exec, and remove any packaged driver:

$ dzdo init 3
$ chmod +x NVIDIA-Linux-x86_64-<version.number>.run
$ dzdo mount -o remount,exec /tmp
$ dzdo yum remove nvidia-driver
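
The remount in this step matters when /tmp is mounted with the noexec option. That is presumably why the step is included, since the installer would then be unable to execute files it unpacks there. The following is a minimal sketch for checking this beforehand; the function name mount_opts_have_noexec is hypothetical, and findmnt (part of util-linux) is assumed to be available.

```shell
# Hypothetical helper: succeeds if the comma-separated mount option
# string passed as $1 contains the noexec option.
mount_opts_have_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration on a sample option string.
if mount_opts_have_noexec 'rw,nosuid,nodev,noexec,relatime'; then
    echo "noexec is set: remount with exec before running the installer"
fi

# On a live system:
#   mount_opts_have_noexec "$(findmnt -no OPTIONS --target /tmp)" && echo "noexec is set"
```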

6. Then run the installer and answer the prompts:

$ dzdo ./NVIDIA-Linux-x86_64-<version.number>.run

Wait for the installer interface to appear, then answer the prompts (four questions and a final confirmation):

  • Install 32-bit compatibility Libraries – Yes
  • Register with DKMS – No
  • Initramfs rebuild – Yes
  • Update X Configuration File – Yes
  • Installation Complete – Click OK

7. Then confirm the latest driver version was installed successfully (check Driver Version):

$ nvidia-smi

8. Reboot the system and confirm that it boots and works as expected. Once logged back in, re-run nvidia-smi to confirm:

$ dzdo reboot
$ nvidia-smi

or

$ nvidia-smi -q | grep -i "driver version"

9. Clean up: remove the .run installer once completed successfully.

$ dzdo rm NVIDIA-Linux-x86_64-<version.number>.run


How to uninstall the NVIDIA driver after installing it using the .run file


$ sudo ./NVIDIA-Linux-x86_64-<version.number>.run --uninstall

How to check Nvidia driver version and other information


$ nvidia-smi
Tue Jun  4 13:09:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0             25W /   60W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Alternatively, you can run the following command which will give you the same information (or more) in a non-table format:

$ nvidia-smi -q
==============NVSMI LOG==============

Timestamp                                 : Thu Jun 27 14:55:53 2024
Driver Version                            : 555.42.02
CUDA Version                              : 12.5

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : NVIDIA RTX A500 Laptop GPU
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-fabe1493-c99a-410b-5941-abeb61f58c80
    Minor Number                          : 0
    VBIOS Version                         : 94.07.7C.00.0B 

or

$ nvidia-smi -q | grep -i driver
Driver Version                            : 555.42.02
    Driver Model
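
For scripting, the version can be extracted on its own. This is a minimal sketch: the hypothetical function parse_driver_version reads nvidia-smi -q output on standard input and prints only the version string. nvidia-smi also accepts --query-gpu=driver_version --format=csv,noheader, which prints the version directly.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin and prints
# the value of the first "Driver Version" line.
parse_driver_version() {
    awk -F': *' '/^Driver Version/ { print $2; exit }'
}

# Demonstration on a sample line taken from the output in this guide.
printf 'Driver Version                            : 555.42.02\n' | parse_driver_version

# On a live system:
#   nvidia-smi -q | parse_driver_version
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```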


NVIDIA documentation on how to disable nouveau


https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html
https://docs.nvidia.com/ai-enterprise/deployment-guide-bare-metal/0.1.0/nouveau.html


GSP Firmware


Having GSP firmware enabled has been causing issues on some Dell models with specific GPU hardware.

What is GSP?

Some GPUs include a GPU System Processor (GSP) which can be used to offload GPU initialization and management tasks. This processor is driven by firmware files distributed with the driver. The GSP firmware is used by default on GPUs which support it.

Offloading tasks which were traditionally performed by the driver on the CPU can improve performance due to lower latency access to GPU hardware internals.

Firmware files gsp_*.bin are installed in /lib/firmware/nvidia/560.28.03/. Each GSP firmware file is named after a GPU architecture (for example, gsp_tu10x.bin is named after Turing) and supports GPUs from one or more architectures.

Disabling GSP Mode

The driver can be forced to disable use of GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=0.
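
One common way to set a kernel module parameter persistently is an options line in a file under /etc/modprobe.d/. The following is a minimal sketch that assumes this mechanism applies here. The helper name gsp_modprobe_line and the file name nvidia-gsp.conf are hypothetical, and the initramfs likely needs to be regenerated and the system rebooted before the setting takes effect.

```shell
# Hypothetical helper: prints the modprobe options line that sets
# NVreg_EnableGpuFirmware to 0 (GSP disabled) or 1 (GSP enabled).
gsp_modprobe_line() {
    case "$1" in
        0|1) printf 'options nvidia NVreg_EnableGpuFirmware=%s\n' "$1" ;;
        *) echo "usage: gsp_modprobe_line 0|1" >&2; return 1 ;;
    esac
}

# Show the line that disables GSP firmware.
gsp_modprobe_line 0

# To apply on a real system (file name is an assumption):
#   gsp_modprobe_line 0 | dzdo tee /etc/modprobe.d/nvidia-gsp.conf
#   dzdo dracut -f
#   dzdo reboot
```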

The nvidia-smi utility can be used to query the current use of GSP firmware. It displays a valid version if GSP firmware is enabled, or “N/A” if it is disabled:

$ nvidia-smi -q | grep -iE 'driver version|gsp'
Driver Version                            : 555.42.06
    GSP Firmware Version                  : N/A
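
This check is easy to script. The following is a minimal sketch: the hypothetical function gsp_state reads nvidia-smi -q output on standard input and reports whether a GSP firmware version is present.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin and prints
# "disabled" if the GSP firmware version is N/A, "enabled" if a version
# is shown, or "unknown" if no GSP line is found at all.
gsp_state() {
    awk -F': *' '
        /GSP Firmware Version/ {
            found = 1
            value = $2
            sub(/[ \t\r]+$/, "", value)   # drop trailing whitespace
            if (value == "N/A") print "disabled"; else print "enabled"
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Demonstration on the sample line shown in this guide.
printf '    GSP Firmware Version                  : N/A\n' | gsp_state

# On a live system:
#   nvidia-smi -q | gsp_state
```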

Enabling GSP Mode

The GSP firmware will be used by default for all Turing and later GPUs. The driver can be explicitly configured to use the GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=1.

https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/gsp.html


Issue: Unable to Disable GSP with the 560 Driver

The NVIDIA 560 driver also ships with GSP enabled. However, starting with the 560 driver, NVIDIA made the open kernel module the default installation instead of the proprietary one. From my research, the GSP firmware cannot be disabled when using the open kernel module flavor of the driver, because it ignores the NVreg_EnableGpuFirmware=0 kernel parameter.

How to check whether the NVIDIA driver is the open-source or the proprietary version

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  560.28.03  Release Build  (dvs-builder@U16-A24-27-4)  Thu Jul 18 20:46:24 UTC 2024
GCC version:  gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)
$ rpm -qi kmod-nvidia-open-dkms-560.28.03-1.el8.x86_64
Name        : kmod-nvidia-open-dkms
Epoch       : 3
Version     : 560.28.03
Release     : 1.el8
Architecture: x86_64
Install Date: Mon 12 Aug 2024 10:51:53 AM EDT
Group       : Unspecified
Size        : 23911228
License     : NVIDIA License
Signature   : RSA/SHA512, Thu 18 Jul 2024 11:27:31 PM EDT, Key ID 9cd0a493d42d0685
Source RPM  : kmod-nvidia-open-dkms-560.28.03-1.el8.src.rpm
Build Date  : Thu 18 Jul 2024 11:25:41 PM EDT
Build Host  : cf605bc53c42
Relocations : (not relocatable)
URL         : http://www.nvidia.com/object/unix.html
Summary     : NVIDIA driver open kernel module flavor
Description :
This package provides the open-source Nvidia kernel driver modules.
The modules are rebuilt through the DKMS system when a new kernel or modules become available.
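
The first of these checks is easy to script. The following is a minimal sketch: the hypothetical function nvidia_kmod_flavor reads the contents of /proc/driver/nvidia/version on standard input and looks for the phrase "Open Kernel Module" seen in the output above. Treating anything else as proprietary is an assumption.

```shell
# Hypothetical helper: reads /proc/driver/nvidia/version content on stdin.
# Prints "open" if the NVRM line mentions the open kernel module,
# otherwise "proprietary" (an assumption: any other text is treated as such).
nvidia_kmod_flavor() {
    if grep -q 'Open Kernel Module'; then
        echo open
    else
        echo proprietary
    fi
}

# Demonstration on the sample line shown in this guide.
printf 'NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  560.28.03\n' | nvidia_kmod_flavor

# On a live system:
#   nvidia_kmod_flavor < /proc/driver/nvidia/version
```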


To resolve this, uninstall the open kernel module and install the proprietary version so that the kernel parameter takes effect and the GSP firmware is disabled. To install the proprietary version during the imaging process, use the command sudo dnf module install -y nvidia-driver:latest-dkms.


Here are some links as a reference:

  • The updated nvidia_driver.sh script (See lines 30-60 for the fix): http://va3cngl01a.cacicorenet.com/linux_image/rhel8/-/blob/main/Dev_wk_files/nvidia_driver.sh
  • The NVIDIA README at https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/kernel_open.html says: "Because the two flavors of kernel modules are mutually exclusive, one or the other must be chosen at install time. By default, installation will choose which flavor of kernel modules to install, based on the GPUs detected in the system. If a pre-Turing GPU is detected, installation will default to the proprietary flavor of kernel modules. Otherwise, installation will default to the open flavor of kernel modules."
  • The Reddit thread at https://www.reddit.com/r/linux_gaming/comments/1cp4heq/news_starting_with_nvidia_560_the_open_source/ says: "Starting in the release 560 series, it will be recommended to use the open flavor of NVIDIA Linux Kernel Modules wherever possible (Turing or later GPUs, or Ada or later when using GPU virtualization).
    • If installing from the .run file, installation will detect what GPUs are present and default to installing the open kernel modules if all NVIDIA GPUs in the system can be driven by the open kernel modules. Distribution-specific repackaging of the NVIDIA driver may require additional steps, specific to that packaging, to choose the open flavor.
    • In the release 560 series, it will still be possible to configure the .run file to install the proprietary flavor of kernel modules, with the --kernel-module-type=proprietary command line option. However, in the future, some GPUs may only be supported with the open flavor."
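
For the .run installer, the flavor can be chosen explicitly with the --kernel-module-type option quoted above. The following is a minimal sketch: the helper name run_installer_cmd is hypothetical, and it only prints the command line rather than running it. The value open is assumed to be accepted alongside proprietary.

```shell
# Hypothetical helper: prints (does not run) the installer command line
# for a given driver version and kernel module flavor.
run_installer_cmd() {
    version="$1"
    flavor="$2"
    case "$flavor" in
        open|proprietary) ;;
        *) echo "flavor must be open or proprietary" >&2; return 1 ;;
    esac
    printf './NVIDIA-Linux-x86_64-%s.run --kernel-module-type=%s\n' "$version" "$flavor"
}

# Show the command that selects the proprietary flavor for 560.28.03.
run_installer_cmd 560.28.03 proprietary
```

On a real system, the printed command would be run with dzdo or sudo, as in the installation steps earlier in this page.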

How to List Available Nvidia Module Streams


$ sudo dnf module list nvidia-driver
.
.
.
Name                              Stream                            Profiles                                     Summary
nvidia-driver                     latest                            default [d], fm, ks, src                     Nvidia driver for latest branch
nvidia-driver                     latest-dkms                       default [d], fm, ks                          Nvidia driver for latest-dkms branch
nvidia-driver                     open-dkms [d]                     default [d], fm, ks, src                     Nvidia driver for open-dkms branch
nvidia-driver                     515                               default [d], fm, ks, src                     Nvidia driver for 515 branch
nvidia-driver                     515-dkms                          default [d], fm, ks                          Nvidia driver for 515-dkms branch
nvidia-driver                     515-open                          default [d], fm, ks, src                     Nvidia driver for 515-open branch
nvidia-driver                     520                               default [d], fm, ks, src                     Nvidia driver for 520 branch
nvidia-driver                     520-dkms                          default [d], fm, ks                          Nvidia driver for 520-dkms branch
nvidia-driver                     520-open                          default [d], fm, ks, src                     Nvidia driver for 520-open branch
nvidia-driver                     525                               default [d], fm, ks, src                     Nvidia driver for 525 branch
nvidia-driver                     525-dkms                          default [d], fm, ks                          Nvidia driver for 525-dkms branch
nvidia-driver                     525-open                          default [d], fm, ks, src                     Nvidia driver for 525-open branch
nvidia-driver                     530                               default [d], fm, ks, src                     Nvidia driver for 530 branch
nvidia-driver                     530-dkms                          default [d], fm, ks                          Nvidia driver for 530-dkms branch
nvidia-driver                     530-open                          default [d], fm, ks, src                     Nvidia driver for 530-open branch
nvidia-driver                     535                               default [d], fm, ks, src                     Nvidia driver for 535 branch
nvidia-driver                     535-dkms                          default [d], fm, ks                          Nvidia driver for 535-dkms branch
nvidia-driver                     535-open                          default [d], fm, ks, src                     Nvidia driver for 535-open branch
nvidia-driver                     545                               default [d], fm, ks, src                     Nvidia driver for 545 branch
nvidia-driver                     545-dkms                          default [d], fm, ks                          Nvidia driver for 545-dkms branch
nvidia-driver                     545-open                          default [d], fm, ks, src                     Nvidia driver for 545-open branch
nvidia-driver                     550                               default [d], fm, ks, src                     Nvidia driver for 550 branch
nvidia-driver                     550-dkms                          default [d], fm, ks                          Nvidia driver for 550-dkms branch
nvidia-driver                     550-open                          default [d], fm, ks, src                     Nvidia driver for 550-open branch
nvidia-driver                     555                               default [d], fm, ks, src                     Nvidia driver for 555 branch
nvidia-driver                     555-dkms                          default [d], fm, ks                          Nvidia driver for 555-dkms branch
nvidia-driver                     555-open                          default [d], fm, ks, src                     Nvidia driver for 555-open branch
nvidia-driver                     560                               default [d], fm, ks, src                     Nvidia driver for 560 branch
nvidia-driver                     560-dkms                          default [d], fm, ks                          Nvidia driver for 560-dkms branch
nvidia-driver                     560-open                          default [d], fm, ks, src                     Nvidia driver for 560-open branch
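
To pick a stream in a script, the stream names can be pulled out of this listing. The following is a minimal sketch: the hypothetical function list_streams reads dnf module list output on standard input, assumes the column layout shown above, and prints the second column of each nvidia-driver row.

```shell
# Hypothetical helper: reads `dnf module list nvidia-driver` output on stdin
# and prints the stream name (second column) of each nvidia-driver row.
list_streams() {
    awk '$1 == "nvidia-driver" { print $2 }'
}

# Demonstration on a few sample rows in the same layout as the listing.
printf '%s\n' \
    'Name                              Stream                            Profiles                                     Summary' \
    'nvidia-driver                     latest-dkms                       default [d], fm, ks                          Nvidia driver for latest-dkms branch' \
    'nvidia-driver                     560-open                          default [d], fm, ks, src                     Nvidia driver for 560-open branch' |
    list_streams

# On a live system, to list only the DKMS streams:
#   sudo dnf module list nvidia-driver | list_streams | grep -- '-dkms$'
```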