## Deploy remaining services + their NFS mounts

- [x] Jellyfin + architecture selector
- [x] QBitTorrent
- [x] Filebrowser

## [EXTRA] Deploy new slave node on the Proxmox server (slave04)

Decided to add ANOTHER VM as a slave to allow some flexibility between x64 nodes.

- [x] Configure the VM to have Hardware Acceleration [0] [1]
  - [x] Created the VM and installed the OS
  - [x] Set up GPU passthrough for the newly created VM
  - [x] Created a Kubernetes Node
- [x] Done

## Set up the GPU available in the Kubernetes Node

Very much what the title says. Steps below.

- [x] Done

### Install nvidia drivers

> **Note:**
> - Steps were performed on the VM instance (slave04).
> - Snapshots were taken on the Proxmox node, snapshotting the affected VM.
> - `kubectl` commands were run from a computer of mine external to the Kubernetes cluster/nodes, to interact with the cluster.

#### Take snapshot

- [x] Done

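For reference, the snapshot can also be taken from the Proxmox host CLI instead of the web UI. A small sketch, assuming `104` is the VMID of slave04 (placeholder, not the real ID):

```shell
# Run on the Proxmox host, not inside the VM. 104 is a placeholder VMID.
qm snapshot 104 pre-nvidia-driver --description "Before installing the NVIDIA driver"
```
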
#### Repo thingies

Enable the `non-free` repo for Debian.

Note that `non-free` and `non-free-firmware` are different things, so if `non-free-firmware` is already listed but `non-free` is not, add `non-free` (plus `contrib`) to the line.

```md
FROM:
deb http://ftp.au.debian.org/debian/ buster main
TO:
deb http://ftp.au.debian.org/debian/ buster main non-free contrib
```

[0] https://www.wundertech.net/how-to-set-up-gpu-passthrough-on-proxmox/

[1] https://www.virtualizationhowto.com/2023/10/proxmox-gpu-passthrough-step-by-step-guide/

In my case this was already enabled during the installation.

Once the repos are set up, run:

```shell
apt update && apt install nvidia-detect -y
```

##### [Error] Unable to locate package nvidia-detect

Ensure both `non-free` and `contrib` are listed in the repo file (`/etc/apt/sources.list`), then run `apt update` again.

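If you prefer to patch the file from the shell, something along these lines should work. A sketch, assuming a stock `/etc/apt/sources.list` with plain `deb` lines:

```shell
# Back up the file, then append the extra components to every "deb" line
# that doesn't already mention non-free.
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo sed -i '/^deb /{/non-free/!s/$/ contrib non-free/}' /etc/apt/sources.list
sudo apt update
```
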
#### Run nvidia-detect

```shell
nvidia-detect
```

```text
Detected NVIDIA GPUs:
00:10.0 VGA compatible controller [0300]: NVIDIA Corporation GM206 [GeForce GTX 960] [10de:1401] (rev a1)

Checking card:  NVIDIA Corporation GM206 [GeForce GTX 960] (rev a1)
Your card is supported by all driver versions.
Your card is also supported by the Tesla drivers series.
Your card is also supported by the Tesla 470 drivers series.
It is recommended to install the
    nvidia-driver
package.
```

### Install nvidia driver

```shell
apt install nvidia-driver
```

We might receive a complaint regarding "conflicting modules". Just restart the VM.

#### Reboot VM

```shell
reboot
```

#### nvidia-smi

Confirm the VM has access to the NVIDIA driver/GPU:

```shell
nvidia-smi
```

```text
Fri Dec 15 00:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:10.0 Off |                  N/A |
|  0%   38C    P8    11W / 160W |      1MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

### Install Nvidia Container Runtime

#### Take snapshot

- [x] Done

#### Install curl

```shell
apt-get install curl
```

#### Add repo

Following the official install guide:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt

```shell
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

```shell
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

### Update Containerd config

#### Select nvidia-container-runtime as the new runtime for Containerd

> No clue if this is strictly required, since I also made more changes to the configuration afterwards.

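For reference, the container toolkit also ships a helper that can write these runtime entries instead of hand-editing the file. I did the edits manually (the `sed` below plus the full config later), but a sketch of the helper approach looks like this:

```shell
# Alternative to the manual edits: let nvidia-ctk patch /etc/containerd/config.toml,
# registering the nvidia runtime and (optionally) making it the default.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
```
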
```shell
sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml
```

#### Restart the Containerd service

```shell
sudo systemctl restart containerd
```

#### Check status from Containerd

Check if Containerd has initialized correctly after restarting the service.

```shell
sudo systemctl status containerd
```

### Test nvidia runtime

#### Pull nvidia cuda image

I used the Ubuntu-based image since I didn't find one specific to Debian.

```shell
sudo ctr images pull docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
```

```text
docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:0654b44e2515f03b811496d0e2d67e9e2b81ca1f6ed225361bb3e3bb67d22e18: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:7d8fdd2a5e96ec57bc511cda1fc749f63a70e207614b3485197fd734359937e7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:25ad149ed3cff49ddb57ceb4418377f63c897198de1f9de7a24506397822de3e: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:1698c67699a3eee2a8fc185093664034bb69ab67c545ab6d976399d5500b2f44: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:d13839a3c4fbd332f324c135a279e14c432e90c8a03a9cedc43ddf3858f882a7: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:ba7b66a9df40b8a1c1a41d58d7c3beaf33a50dc842190cd6a2b66e6f44c3b57b: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:c5f2ffd06d8b1667c198d4f9a780b55c86065341328ab4f59d60dc996ccd5817: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:520797292d9250932259d95f471bef1f97712030c1d364f3f297260e5fee1de8: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 4.2 s
```

#### Start container

Run `nvidia-smi` inside the image to confirm Containerd already has access to the NVIDIA GPU/drivers:

```shell
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi nvidia-smi
```

```text
Thu Dec 14 23:18:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.3   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:10.0 Off |                  N/A |
|  0%   41C    P8    11W / 160W |      1MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

### Set the GPU available in the Kubernetes Node

At this point we **still** don't have the GPU exposed as an allocatable resource in the Node.

```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

```text
Node                 Available(GPUs)  Used(GPUs)
pi4.filter.home      0                0
slave01.filter.home  0                0
slave02.filter.home  0                0
slave03.filter.home  0                0
slave04.filter.home  0                0
```

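For a quicker look, the same information can be pulled straight from the node status. A sketch using `kubectl` custom columns (the column names are mine; `<none>` means the resource isn't registered yet):

```shell
# Lighter-weight check than the pipeline above: print the nvidia.com/gpu
# capacity/allocatable per node.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
```
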
#### Update the Containerd config

Set the Containerd config with the following settings.

Obviously, take a backup of the config before modifying the file.

```toml
# /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```

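Before restarting, the merged config can be sanity-checked. A sketch, relying on `containerd config dump` printing the effective configuration:

```shell
# Dump the effective containerd configuration and confirm the nvidia runtime
# block and the default_runtime_name made it in.
sudo containerd config dump | grep -A 5 'runtimes.nvidia'
sudo containerd config dump | grep 'default_runtime_name'
```
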
#### Restart containerd (again)

```shell
sudo systemctl restart containerd
```

#### Check status from Containerd

Check if Containerd has initialized correctly after restarting the service.

```shell
sudo systemctl status containerd
```

#### Set some labels to avoid spread

We will deploy the Nvidia CRDs, so we tag the Kubernetes nodes that **won't** have a GPU available, to avoid scheduling GPU-related stuff on them.

```shell
kubectl label nodes slave0{1..3}.filter.home nvidia.com/gpu.deploy.operands=false
```

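A quick way to confirm the labels landed where expected (a sketch; `-L` just adds the label as a column):

```shell
# slave01-03 should show "false" in the column, slave04 (the GPU node) stays empty.
kubectl get nodes -L nvidia.com/gpu.deploy.operands
```
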
#### Deploy nvidia operators

"Why these `--set` flags?"

- Because that's the combination that worked out for me. Don't like it? Want to explore? Try whichever combination works for you.

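The chart comes from NVIDIA's public Helm repository, so if that isn't configured yet, something like this is needed first. A sketch, also assuming the `gpu-operator` namespace used below doesn't exist yet:

```shell
# Assumes helm v3 on the same machine that runs kubectl.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# The install below uses -n gpu-operator, so create the namespace if it's missing.
kubectl create namespace gpu-operator
```
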
```shell
helm install --wait --generate-name \
  nvidia/gpu-operator \
  --set operator.defaultRuntime="containerd" \
  -n gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```

### Check running pods

Check that all the pods are running (or have completed).

```shell
kubectl get pods -n gpu-operator -owide
```

```text
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-feature-discovery-4nctr 1/1 Running 0 9m34s 172.16.241.67 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-gc-79d6bb94h6fht 1/1 Running 0 9m57s 172.16.176.63 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-master-64c5nwww4 1/1 Running 0 9m57s 172.16.86.110 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-72wqk 1/1 Running 0 9m57s 172.16.106.5 slave02.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-7snt4 1/1 Running 0 9m57s 172.16.86.111 pi4.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-9ngnw 1/1 Running 0 9m56s 172.16.176.5 slave03.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-csnfq 1/1 Running 0 9m56s 172.16.241.123 slave04.filter.home <none> <none>
gpu-operator-1702608759-node-feature-discovery-worker-k6dxf 1/1 Running 0 9m57s 172.16.247.8 slave01.filter.home <none> <none>
gpu-operator-fcbd9bbd7-fv5kb 1/1 Running 0 9m57s 172.16.86.116 pi4.filter.home <none> <none>
nvidia-cuda-validator-xjfkr 0/1 Completed 0 5m37s 172.16.241.126 slave04.filter.home <none> <none>
nvidia-dcgm-exporter-q8kk4 1/1 Running 0 9m35s 172.16.241.125 slave04.filter.home <none> <none>
nvidia-device-plugin-daemonset-vvz4c 1/1 Running 0 9m35s 172.16.241.127 slave04.filter.home <none> <none>
nvidia-operator-validator-8899m 1/1 Running 0 9m35s 172.16.241.124 slave04.filter.home <none> <none>
```

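To spot anything unhealthy at a glance, filtering by phase also works. A sketch (`Succeeded` is excluded so the completed validator pod isn't flagged as a false positive):

```shell
# List only pods that are neither Running nor Completed successfully.
kubectl get pods -n gpu-operator --field-selector 'status.phase!=Running,status.phase!=Succeeded'
```
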
### Done!

The GPU now shows up as available on slave04:

```shell
kubectl describe nodes | tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' | grep -e Name -e nvidia.com | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' | sed 's/nvidia.com\/gpu:\?//g' | sed '1s/^/Node Available(GPUs) Used(GPUs)/' | sed 's/$/ 0 0 0/' | awk '{print $1, $2, $3}' | column -t
```

```text
Node                 Available(GPUs)  Used(GPUs)
pi4.filter.home      0                0
slave01.filter.home  0                0
slave02.filter.home  0                0
slave03.filter.home  0                0
slave04.filter.home  1                0
```

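To actually consume the GPU from a workload, a pod just has to request the `nvidia.com/gpu` resource that the device plugin now advertises. A minimal smoke-test sketch (the pod name is a placeholder, the image is the same CUDA image used above):

```shell
# Minimal smoke test: request 1 GPU and run nvidia-smi once. Names are examples.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs -f pod/gpu-smoke-test
```
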
### vGPU

I could use vGPU and split my GPU among multiple VMs, but it would also mean that the GPU no longer outputs to the physical monitor attached to the Proxmox PC/server, which I would like to avoid.

While that's certainly not a requirement (I only use the monitor in emergencies, or whenever I need to touch the BIOS or install a new OS), I **still** don't own a serial connector, so I will consider switching to vGPU **in the future** (whenever I receive the package from AliExpress and confirm it works).

[//]: # (```shell)
[//]: # (kubectl events pods --field-selector status.phase!=Running -n gpu-operator)
[//]: # (```)

[//]: # (```shell)
[//]: # (kubectl get pods --field-selector status.phase!=Running -n gpu-operator | awk '{print $1}' | tail -n +2 | xargs kubectl events -n gpu-operator pods)
[//]: # (```)

## Jellyfin GPU Acceleration

- [x] Configured Jellyfin with GPU acceleration
- [ ] Apply the same steps for the VM01 previously deployed

## Deploy master node on the Proxmox server

2 Cores + 4GB RAM

## Update rest of the stuff/configs as required to match the new Network distribution