Nvidia
Revision as of 14:10, 25 September 2024
Useful resources:
- https://docs.kinetica.com/7.1/install/nvidia_rhel/
- https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/
- Good Nvidia driver installation guide: https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Linux.pdf
How to install NVIDIA driver on RHEL8 for Specific GPU model (using .run file)
The following steps use NVIDIA's .run file. For steps on how to install nvidia-driver via dnf from the command line, please refer to this link
Another helpful link: https://www.if-not-true-then-false.com/2021/install-nvidia-drivers-on-centos-rhel-rocky-linux/
Notes:
• The following steps have been tested on Dell Precision 3480 (RTX A500 Laptop GPU) and Dell Precision 7670 (RTX A2000 8GB Laptop GPU).
Steps:
1. First, connect to the system via SSH/PuTTY and check the NVIDIA graphics card model:
$ lspci | grep -i nvidia
In this example output, the NVIDIA graphics card is an RTX A2000 8GB Laptop GPU.
2. Search for the driver on Nvidia's website and download it locally
You can use wget to download the .run file directly from Nvidia’s website to the system. Tip: To get the correct URL, click “Download”, then right-click “Agree & Download” and copy the link address.
$ wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
3. Install prerequisites:
$ dzdo yum install gcc kernel-devel libglvnd-devel elfutils-libelf-devel
4. Then perform the following steps to disable the nouveau driver:
a. Edit the /etc/default/grub file and ensure that the options rd.driver.blacklist=nouveau nouveau.modeset=0 are at the end of the GRUB_CMDLINE_LINUX line:
$ dzdo vi /etc/default/grub
Example:
GRUB_CMDLINE_LINUX="resume=/dev/mapper/cmw--rhel-swap rd.lvm.lv=cmw-rhel/root rd.lvm.lv=cmw-rhel/swap rhgb quiet audit=1 audit_backlog_limit=8192 pti=on page_poison=1 slub_debug=P fips=1 boot=UUID=1125ad64-b4b3-4995-928c-8f8a1fa2c48b rd.driver.blacklist=nouveau nouveau.modeset=0"
b. Save the file and exit.
c. Next, rebuild the GRUB configuration file:
$ dzdo grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
d. Create the disable-nouveau.conf file under the /etc/modprobe.d/ directory:
$ dzdo vi /etc/modprobe.d/disable-nouveau.conf
And then insert these separate lines:
blacklist nouveau
options nouveau modeset=0
Example:
$ dzdo cat /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0
e. Set the file permissions, regenerate the initramfs by running dracut, and then reboot:
$ dzdo chmod 644 /etc/modprobe.d/disable-nouveau.conf
$ dzdo dracut -f
$ dzdo reboot
5. Log back in via SSH and then run the following commands:
$ dzdo init 3
$ chmod +x NVIDIA-Linux-x86_64-<version.number>.run
$ dzdo mount -o remount,exec /tmp
$ dzdo yum remove nvidia-driver
6. Then run the installer and answer the prompts:
$ dzdo ./NVIDIA-Linux-x86_64-<version.number>.run
Wait for the installer interface to load, then answer the four questions and acknowledge the final prompt:
- Install 32-bit compatibility Libraries – Yes
- Register with DKMS – No
- Initramfs rebuild – Yes
- Update X Configuration File – Yes
- Installation Complete – Click OK
7. Then confirm the latest driver version was installed successfully (check Driver Version):
$ nvidia-smi
8. Reboot the system and confirm that it boots and works as expected. Once logged back in, re-run nvidia-smi to confirm.
$ dzdo reboot
$ nvidia-smi
(or)
$ nvidia-smi -q | grep -i "driver version"
9. Clean up: remove the .run installer once completed successfully.
$ dzdo rm NVIDIA-Linux-x86_64-<version.number>.run
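Two of the checks in the steps above (identifying the GPU in step 1, and confirming that nouveau is really disabled after the reboot in step 4) can be scripted. The following is only a minimal sketch in POSIX sh; the function names gpu_models, cmdline_has_nouveau_blacklist, and nouveau_loaded are hypothetical helpers made up for this page, not part of any NVIDIA or Red Hat tool.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the bracketed model name from every NVIDIA line of lspci-style
# input on stdin. Assumes the model appears in [square brackets].
gpu_models() {
    grep -i 'nvidia' | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Succeed only if the kernel command line passed as $1 contains both
# options added to GRUB_CMDLINE_LINUX in step 4a.
cmdline_has_nouveau_blacklist() {
    case " $1 " in
        *" rd.driver.blacklist=nouveau "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" nouveau.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if lsmod-style input on stdin lists the nouveau module.
nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}
```

Possible usage on the target system, after the reboot in step 4 and before running the installer:
lspci | gpu_models
cmdline_has_nouveau_blacklist "$(cat /proc/cmdline)" && echo "kernel options present"
lsmod | nouveau_loaded && echo "WARNING: nouveau is still loaded"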
How to uninstall the NVIDIA driver after installing it using the .run file
sudo ./NVIDIA-Linux-x86_64-<version.number>.run --uninstall
How to check Nvidia driver version and other information
$ nvidia-smi
Tue Jun  4 13:09:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0             25W /   60W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Alternatively, you can run the following command which will give you the same information (or more) in a non-table format:
$ nvidia-smi -q
==============NVSMI LOG==============
Timestamp                                 : Thu Jun 27 14:55:53 2024
Driver Version                            : 555.42.02
CUDA Version                              : 12.5
Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : NVIDIA RTX A500 Laptop GPU
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-fabe1493-c99a-410b-5941-abeb61f58c80
    Minor Number                          : 0
    VBIOS Version                         : 94.07.7C.00.0B
or
$ nvidia-smi -q | grep -i driver
Driver Version : 555.42.02
Driver Model
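For scripts, nvidia-smi also has a query mode that prints just the driver version, one line per GPU: nvidia-smi --query-gpu=driver_version --format=csv,noheader. The sketch below shows one way to compare that output against an expected version. The function name all_versions_match is a hypothetical helper, and treating "no output at all" as a failure is a design assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).
# Reads one driver version per line on stdin and succeeds only if there is
# at least one line and every line equals the expected version in $1.
all_versions_match() {
    awk -v want="$1" '
        {
            gsub(/[ \t\r]/, "")        # drop stray whitespace
            if ($0 != want) bad = 1
            n++
        }
        END { exit ((bad || n == 0) ? 1 : 0) }
    '
}
```

Possible usage on a system with the driver installed:
nvidia-smi --query-gpu=driver_version --format=csv,noheader | all_versions_match 550.90.07 && echo "driver version OK"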
NVIDIA documentation on how to disable nouveau
https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html
https://docs.nvidia.com/ai-enterprise/deployment-guide-bare-metal/0.1.0/nouveau.html
GSP Firmware
Having the GSP firmware enabled has been causing issues on some Dell models that contain specific GPU hardware.
What is GSP?
Some GPUs include a GPU System Processor (GSP) which can be used to offload GPU initialization and management tasks. This processor is driven by firmware files distributed with the driver. The GSP firmware is used by default on GPUs which support it.
Offloading tasks which were traditionally performed by the driver on the CPU can improve performance due to lower latency access to GPU hardware internals.
Firmware files gsp_*.bin are installed in /lib/firmware/nvidia/560.28.03/. Each GSP firmware file is named after a GPU architecture (for example, gsp_tu10x.bin is named after Turing) and supports GPUs from one or more architectures.
Disabling GSP Mode
The driver can be forced to disable use of GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=0.
The nvidia-smi utility can be used to query the current use of GSP firmware. It will display a valid version if GSP firmware is enabled, or “N/A” if disabled.
$ nvidia-smi -q | grep -iE 'driver version|gsp'
Driver Version : 555.42.06
GSP Firmware Version : N/A
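As a rough sketch of how the parameter might be applied and then checked: a kernel module parameter can be set persistently with an options line in a file under /etc/modprobe.d/, followed by an initramfs rebuild and reboot (the same mechanism used for nouveau above). The file name nvidia-gsp.conf and the function name gsp_state are assumptions made for this example only. The function reads nvidia-smi -q output and reports only the first GPU listed.

```shell
#!/bin/sh
# One way to set the parameter persistently (file name is an arbitrary choice):
#   echo 'options nvidia NVreg_EnableGpuFirmware=0' | dzdo tee /etc/modprobe.d/nvidia-gsp.conf
#   dzdo dracut -f
#   dzdo reboot

# Hypothetical helper (name invented for this sketch).
# Reads `nvidia-smi -q` output on stdin and prints "disabled" if the first
# "GSP Firmware Version" field is N/A, "enabled" if it holds a version,
# or "unknown" if the field is missing.
gsp_state() {
    awk -F':' '
        /GSP Firmware Version/ {
            v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim surrounding whitespace
            print (v == "N/A" ? "disabled" : "enabled")
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}
```

Possible usage after the reboot:
nvidia-smi -q | gsp_state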
Enabling GSP Mode
The GSP firmware will be used by default for all Turing and later GPUs. The driver can be explicitly configured to use the GSP firmware by setting the kernel module parameter nvidia.NVreg_EnableGpuFirmware=1.
https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/gsp.html
Issues with the 560 Driver: Unable to Disable GSP
Nvidia then released the 560 driver, which also ships with GSP enabled. However, starting with the 560 driver, Nvidia made the open kernel modules the default installation instead of the proprietary ones. From my research, I found that we could not disable the GSP firmware when using the open kernel module flavor of the driver, because it ignores the NVreg_EnableGpuFirmware=0 kernel parameter.
To resolve this, you must uninstall the open kernel modules and install the proprietary flavor so that the kernel parameter takes effect and disables the GSP firmware. To install the proprietary flavor, use the command sudo dnf module install -y nvidia-driver:latest-dkms so that the GSP firmware gets disabled during the imaging process.
Here are some links as a reference:
- The updated nvidia_driver.sh script (See lines 30-60 for the fix): http://va3cngl01a.cacicorenet.com/linux_image/rhel8/-/blob/main/Dev_wk_files/nvidia_driver.sh
- https://download.nvidia.com/XFree86/Linux-x86_64/560.28.03/README/kernel_open.html states: "Because the two flavors of kernel modules are mutually exclusive, one or the other must be chosen at install time. By default, installation will choose which flavor of kernel modules to install, based on the GPUs detected in the system. If a pre-Turing GPU is detected, installation will default to the proprietary flavor of kernel modules. Otherwise, installation will default to the open flavor of kernel modules."
- https://www.reddit.com/r/linux_gaming/comments/1cp4heq/news_starting_with_nvidia_560_the_open_source/ states: "Starting in the release 560 series, it will be recommended to use the open flavor of NVIDIA Linux Kernel Modules wherever possible (Turing or later GPUs, or Ada or later when using GPU virtualization).
- If installing from the .run file, installation will detect what GPUs are present and default to installing the open kernel modules if all NVIDIA GPUs in the system can be driven by the open kernel modules. Distribution-specific repackaging of the NVIDIA driver may require additional steps, specific to that packaging, to choose the open flavor.
- In the release 560 series, it will still be possible to configure the .run file to install the proprietary flavor of kernel modules, with the --kernel-module-type=proprietary command line option. However, in the future, some GPUs may only be supported with the open flavor."
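To tell which flavor of kernel module is installed before relying on NVreg_EnableGpuFirmware=0, one option is to inspect the module's license string with modinfo -F license nvidia. The sketch below assumes that the open flavor reports "Dual MIT/GPL" and the proprietary flavor reports "NVIDIA"; these strings are an assumption based on commonly observed output and should be confirmed on your own driver version. The function name module_flavor is a hypothetical helper.

```shell
#!/bin/sh
# For .run installs, the quoted announcement gives this option for the 560 series:
#   dzdo ./NVIDIA-Linux-x86_64-<version.number>.run --kernel-module-type=proprietary

# Hypothetical helper (name invented for this sketch).
# Maps the license string passed as $1 to a flavor name.
# ASSUMPTION: "Dual MIT/GPL" means open, "NVIDIA" means proprietary.
module_flavor() {
    case "$1" in
        "Dual MIT/GPL") echo open ;;
        "NVIDIA")       echo proprietary ;;
        *)              echo unknown ;;
    esac
}
```

Possible usage on a system with the driver installed:
module_flavor "$(modinfo -F license nvidia)"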
How to List Available Nvidia Module Streams
sudo dnf module list nvidia-driver