
NVIDIA Driver Toolkit

Available as of v1.3.0

nvidia-driver-toolkit is an add-on that allows you to deploy out-of-band NVIDIA GRID KVM drivers to your existing Harvester clusters.

note

The toolkit only includes the correct Harvester OS image, build utilities, and kernel headers that allow NVIDIA drivers to be compiled and loaded from the container. You must download the NVIDIA KVM drivers using a valid NVIDIA subscription. For guidance on identifying the correct driver for your NVIDIA GPU, see the NVIDIA documentation.

The Harvester ISO does not include the nvidia-driver-toolkit container image. Because of its size, the image is pulled from Docker Hub by default. If you have an air-gapped environment, you can download and push the image to your private registry. The Image Repository and Image Tag fields on the nvidia-driver-toolkit screen provide information about the image that you must download.
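For an air-gapped setup, the following is a minimal sketch of mirroring the image with Docker. The registry host is hypothetical, and the repository and tag placeholders must be replaced with the values shown in the Image Repository and Image Tag fields:

# Replace <image-repository>, <image-tag>, and registry.example.com with your own values.
docker pull <image-repository>:<image-tag>
docker tag <image-repository>:<image-tag> registry.example.com/<image-repository>:<image-tag>
docker push registry.example.com/<image-repository>:<image-tag>

After pushing the image, point the Image Repository field at your private registry.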

note

Each new Harvester version will be released with the correct nvidia-driver-toolkit image to ensure that all dependencies required to install the NVIDIA vGPU KVM drivers are available in the image.

To enable the add-on, do the following:

  • Provide the Driver Location: an HTTP URL where the NVIDIA vGPU KVM driver file is hosted (see the sketch after this list).
  • Update the Image Repository and Image Tag if needed.
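If you prefer to configure the add-on declaratively, the following is a minimal sketch of the Addon resource. The namespace, the driver URL, and the key names under valuesContent (driverLocation, image.repository, image.tag) are assumptions; confirm them against the chart values shown in the Harvester UI before applying.

apiVersion: harvesterhci.io/v1beta1
kind: Addon
metadata:
  name: nvidia-driver-toolkit
  namespace: harvester-system
spec:
  enabled: true
  # The value keys below are illustrative assumptions; verify them against
  # the add-on's chart values before applying.
  valuesContent: |
    driverLocation: "http://fileserver.example.com/NVIDIA-Linux-x86_64-vgpu-kvm.run"
    image:
      repository: "<image-repository>"
      tag: "<image-tag>"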

Once the add-on is enabled, an nvidia-driver-toolkit DaemonSet is deployed to the cluster.

On pod startup, the ENTRYPOINT script downloads the NVIDIA driver from the specified Driver Location, installs it, and loads the kernel modules.

The PCIDevices add-on can then leverage the nvidia-driver-toolkit add-on to manage the lifecycle of vGPU devices on nodes with supported GPUs.

Install Different NVIDIA Driver Versions

Available as of v1.7.0

NVIDIA driver versions can vary across cluster nodes. If you want to install a specific driver version on a node, you must annotate the node before starting the nvidia-driver-toolkit add-on.

kubectl annotate nodes {node name} sriovgpu.harvesterhci.io/custom-driver=https://[driver location]
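For example, assuming a node named harvester-node-1 and a driver package hosted on an internal web server (both hypothetical):

kubectl annotate nodes harvester-node-1 sriovgpu.harvesterhci.io/custom-driver=https://fileserver.example.com/NVIDIA-Linux-x86_64-vgpu-kvm.run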

The nvidia-driver-toolkit installs the specified driver version upon starting.

If an NVIDIA driver was previously installed, you must restart the pod to trigger the installation process again.
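A minimal sketch of restarting the pod on that node, assuming the DaemonSet runs in the harvester-system namespace (the pod name is a placeholder; find it with the first command):

# List the toolkit pods and note the one running on the annotated node.
kubectl get pods -n harvester-system -o wide | grep nvidia-driver-toolkit
# Delete that pod; the DaemonSet recreates it and the new pod installs the specified driver.
kubectl delete pod <nvidia-driver-toolkit-pod-on-that-node> -n harvester-system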

Advanced Node Scheduling with Node Affinity

Available as of v1.8.0

Starting with v1.8.0, the nvidia-driver-toolkit uses node affinity instead of nodeSelector for more flexible node scheduling.

Customizing Node Affinity

You can customize the node affinity settings to meet your specific requirements. In the following example, the driver is installed on nodes with specific GPU models.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: sriovgpu.harvesterhci.io/driver-needed
          operator: In
          values:
          - "true"
        - key: gpu.model
          operator: In
          values:
          - "A100"
          - "A40"

Applying Custom Node Affinity

  1. Edit the nvidia-driver-toolkit add-on configuration using either the Harvester UI or the Helm chart values.
  2. Update the affinity section.
  3. Save the changes.

The DaemonSet is updated automatically.
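If you prefer the CLI over the UI, a minimal sketch of editing the add-on, assuming it is deployed in the harvester-system namespace; where exactly the affinity block belongs within the chart values depends on the chart, so verify against the existing configuration before saving:

kubectl edit addons.harvesterhci.io nvidia-driver-toolkit -n harvester-system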