## Deploy remaining services + their NFS mounts

- [x] Jellyfin + architecture selector
- [x] QBitTorrent
- [x] Filebrowser

## [EXTRA] Deploy new slave node on the Proxmox server (slave04)

Decided to add ANOTHER VM as a slave to allow some flexibility between x64 nodes.

- [x] Configure the VM to have Hardware Acceleration [0] [1]
  - [x] Created the VM and installed the OS
  - [x] Set up GPU passthrough for the newly created VM
  - [x] Created a Kubernetes Node
- [x] Done

## Set up the GPU available in the Kubernetes Node

Very much what the title says. Steps below.

- [x] Done

### Install nvidia drivers

> **Note:**
> - Steps were performed on the VM instance (slave04).
> - Snapshots were taken on the Proxmox node, snapshotting the affected VM.
> - `kubectl` commands were run from a computer of mine external to the Kubernetes cluster/nodes, to interact with the cluster.

#### Take snapshot

- [x] Done

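For reference, the snapshot can also be taken from the Proxmox host CLI instead of the web UI. A small sketch, assuming `104` is the VMID of slave04 (placeholder, not the real ID):

```shell
# Run on the Proxmox host, not inside the VM. 104 is a placeholder VMID.
qm snapshot 104 pre-nvidia-driver --description "Before installing the NVIDIA driver"
```
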
#### Repo thingies

Enable the `non-free` repo for Debian.

Note that `non-free` and `non-free-firmware` are different things, so if `non-free-firmware` is already listed but `non-free` is not, add `non-free` (plus `contrib`) to the line.

```md
FROM:
deb http://ftp.au.debian.org/debian/ buster main
TO:
deb http://ftp.au.debian.org/debian/ buster main non-free contrib
```

[0] https://www.wundertech.net/how-to-set-up-gpu-passthrough-on-proxmox/

[1] https://www.virtualizationhowto.com/2023/10/proxmox-gpu-passthrough-step-by-step-guide/

In my case this was already enabled during the installation.

Once the repos are set up, run:

```shell
apt update && apt install nvidia-detect -y
```

##### [Error] Unable to locate package nvidia-detect

Ensure both `non-free` and `contrib` are listed in the repo file (`/etc/apt/sources.list`), then run `apt update` again.

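If you prefer to patch the file from the shell, something along these lines should work. A sketch, assuming a stock `/etc/apt/sources.list` with plain `deb` lines:

```shell
# Back up the file, then append the extra components to every "deb" line
# that doesn't already mention non-free.
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo sed -i '/^deb /{/non-free/!s/$/ contrib non-free/}' /etc/apt/sources.list
sudo apt update
```
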
#### Run nvidia-detect

```shell
nvidia-detect
```

```text
Detected NVIDIA GPUs:
00:10.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)

Checking card:  NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla drivers series.
Your card is also supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-driver
package.
```

### Install nvidia driver

```shell
apt install nvidia-driver
```

We might receive a complaint regarding "conflicting modules". Just restart the VM.

#### Reboot VM

```shell
reboot
```

#### nvidia-smi

Confirm the VM has access to the NVIDIA driver/GPU:

```shell
nvidia-smi
```

```text
Fri Dec 15 00:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:10.0 Off |                  N/A |
|  0%   38C    P8    11W / 160W |      1MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

### Install Nvidia Container Runtime

#### Take snapshot

- [x] Done

#### Install curl

```shell
apt-get install curl
```

#### Add repo

Following the official install guide:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt

```shell
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

```shell
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

### Update Containerd config

#### Select nvidia-container-runtime as the new runtime for Containerd

> No clue if this is strictly required, since I also made more changes to the configuration afterwards.

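For reference, the container toolkit also ships a helper that can write these runtime entries instead of hand-editing the file. I did the edits manually (the `sed` below plus the full config later), but a sketch of the helper approach looks like this:

```shell
# Alternative to the manual edits: let nvidia-ctk patch /etc/containerd/config.toml,
# registering the nvidia runtime and (optionally) making it the default.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
```
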
```shell
sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml
```

#### Restart the Containerd service

```shell
sudo systemctl restart containerd
```

#### Check status from Containerd

Check if Containerd has initialized correctly after restarting the service.

```shell
sudo systemctl status containerd
```

### Test nvidia runtime

#### Pull nvidia cuda image

I used the Ubuntu-based image since I didn't find one specific to Debian.

```shell
sudo ctr images pull docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
```

```text
docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 4.2 s
```

#### Start container

Run `nvidia-smi` inside the image to confirm Containerd already has access to the NVIDIA GPU/drivers:

```shell
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi nvidia-smi
```

```text
Thu Dec 14 23:18:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.3   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:10.0 Off |                  N/A |
|  0%   41C    P8    11W / 160W |      1MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

### Set the GPU available in the Kubernetes Node

At this point we **still** don't have the GPU exposed as an allocatable resource in the Node.

```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

```text
Node                 Available(GPUs)  Used(GPUs)
pi4.filter.home      0                0
slave01.filter.home  0                0
slave02.filter.home  0                0
slave03.filter.home  0                0
slave04.filter.home  0                0
```

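For a quicker look, the same information can be pulled straight from the node status. A sketch using `kubectl` custom columns (the column names are mine; `<none>` means the resource isn't registered yet):

```shell
# Lighter-weight check than the pipeline above: print the nvidia.com/gpu
# capacity/allocatable per node.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
```
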
#### Update the Containerd config

Set the Containerd config with the following settings.

Obviously, take a backup of the config before modifying the file.

```toml
# /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```

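Before restarting, the merged config can be sanity-checked. A sketch, relying on `containerd config dump` printing the effective configuration:

```shell
# Dump the effective containerd configuration and confirm the nvidia runtime
# block and the default_runtime_name made it in.
sudo containerd config dump | grep -A 5 'runtimes.nvidia'
sudo containerd config dump | grep 'default_runtime_name'
```
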
#### Restart containerd (again)

```shell
sudo systemctl restart containerd
```

#### Check status from Containerd

Check if Containerd has initialized correctly after restarting the service.

```shell
sudo systemctl status containerd
```

#### Set some labels to avoid spread

We will deploy the Nvidia CRDs, so we tag the Kubernetes nodes that **won't** have a GPU available, to avoid scheduling GPU-related stuff on them.

```shell
kubectl label nodes slave0{1..3}.filter.home nvidia.com/gpu.deploy.operands=false
```

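A quick way to confirm the labels landed where expected (a sketch; `-L` just adds the label as a column):

```shell
# slave01-03 should show "false" in the column, slave04 (the GPU node) stays empty.
kubectl get nodes -L nvidia.com/gpu.deploy.operands
```
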
#### Deploy nvidia operators

"Why these `--set` flags?"

- Because that's the combination that worked out for me. Don't like it? Want to explore? Try whichever combination works for you.

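The chart comes from NVIDIA's public Helm repository, so if that isn't configured yet, something like this is needed first. A sketch, also assuming the `gpu-operator` namespace used below doesn't exist yet:

```shell
# Assumes helm v3 on the same machine that runs kubectl.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# The install below uses -n gpu-operator, so create the namespace if it's missing.
kubectl create namespace gpu-operator
```
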
```shell
helm install --wait --generate-name \
  nvidia/gpu-operator \
  --set operator.defaultRuntime="containerd" \
  -n gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```

### Check running pods

Check that all the pods are running (or have completed).

```shell
kubectl get pods -n gpu-operator -owide
```

```text
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-4nctr 1/1 Running 0 9m34s 172.16.241.67 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-gc-79d6bb94h6fht 1/1 Running 0 9m57s 172.16.176.63 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-master-64c5nwww4 1/1 Running 0 9m57s 172.16.86.110 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-72wqk 1/1 Running 0 9m57s 172.16.106.5 slave02.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-7snt4 1/1 Running 0 9m57s 172.16.86.111 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-9ngnw 1/1 Running 0 9m56s 172.16.176.5 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-csnfq 1/1 Running 0 9m56s 172.16.241.123 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-k6dxf 1/1 Running 0 9m57s 172.16.247.8 slave01.filter.home <none> <none>
gpu-operator-fcbd9bbd7-fv5kb 1/1 Running 0 9m57s 172.16.86.116 pi4.filter.home <none> <none>
nvidia-cuda-validator-xjfkr 0/1 Completed 0 5m37s 172.16.241.126 slave04.filter.home <none> <none>
nvidia-dcgm-exporter-q8kk4 1/1 Running 0 9m35s 172.16.241.125 slave04.filter.home <none> <none>
nvidia-device-plugin-daemonset-vvz4c 1/1 Running 0 9m35s 172.16.241.127 slave04.filter.home <none> <none>
nvidia-operator-validator-8899m 1/1 Running 0 9m35s 172.16.241.124 slave04.filter.home <none> <none>
```

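To spot anything unhealthy at a glance, filtering by phase also works. A sketch (`Succeeded` is excluded so the completed validator pod isn't flagged as a false positive):

```shell
# List only pods that are neither Running nor Completed successfully.
kubectl get pods -n gpu-operator --field-selector 'status.phase!=Running,status.phase!=Succeeded'
```
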
### Done!

The GPU now shows up as available on slave04:

```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

```text
Node                 Available(GPUs)  Used(GPUs)
pi4.filter.home      0                0
slave01.filter.home  0                0
slave02.filter.home  0                0
slave03.filter.home  0                0
slave04.filter.home  1                0
```

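To actually consume the GPU from a workload, a pod just has to request the `nvidia.com/gpu` resource that the device plugin now advertises. A minimal smoke-test sketch (the pod name is a placeholder, the image is the same CUDA image used above):

```shell
# Minimal smoke test: request 1 GPU and run nvidia-smi once. Names are examples.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs -f pod/gpu-smoke-test
```
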
### vGPU

I could use vGPU and split my GPU among multiple VMs, but it would also mean that the GPU no longer outputs to the physical monitor attached to the Proxmox PC/server, which I would like to avoid.

While that's certainly not a requirement (I only use the monitor in emergencies, or whenever I need to touch the BIOS or install a new OS), I **still** don't own a serial connector, so I will consider switching to vGPU **in the future** (whenever I receive the package from AliExpress and confirm it works).

[//]: # (```shell)
[//]: # (kubectl events pods --field-selector status.phase!=Running -n gpu-operator)
[//]: # (```)

[//]: # (```shell)
[//]: # (kubectl get pods --field-selector status.phase!=Running -n gpu-operator | awk '{print $1}' | tail -n +2 | xargs kubectl events -n gpu-operator pods)
[//]: # (```)

## Jellyfin GPU Acceleration

- [x] Configured Jellyfin with GPU acceleration
- [ ] Apply the same steps for the VM01 previously deployed

## Deploy master node on the Proxmox server

2 Cores + 4GB RAM

## Update rest of the stuff/configs as required to match the new Network distribution