pin NVIDIA driver version on #mirage so that future k8s upgrades and node pool scale-ups do not break our workloads through driver auto-upgrades
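note: a minimal sketch of a startup guard that could back up the pin, failing fast when a node comes up with a different driver major version. The pinned value 550 and the helper names (`parse_driver_major`, `read_driver_version`, `check_driver`) are assumptions for illustration, not what is deployed on #mirage; the `nvidia-smi` query flags are the standard ones.

```python
import subprocess

# Assumption: the driver major version we intend to pin to.
PINNED_DRIVER_MAJOR = 550


def parse_driver_major(smi_output: str) -> int:
    """Extract the major version from nvidia-smi output such as '550.54.15'.

    With several GPUs, nvidia-smi prints one line per GPU; only the first
    line is inspected here.
    """
    first_line = smi_output.strip().splitlines()[0]
    return int(first_line.split(".")[0])


def read_driver_version() -> str:
    """Ask nvidia-smi for the installed driver version (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def check_driver(smi_output: str, pinned_major: int = PINNED_DRIVER_MAJOR) -> None:
    """Raise RuntimeError if the driver major version differs from the pin."""
    major = parse_driver_major(smi_output)
    if major != pinned_major:
        raise RuntimeError(
            f"NVIDIA driver major is {major}, expected {pinned_major}"
        )
```

Calling `check_driver(read_driver_version())` at container startup would surface a silent driver bump as an explicit error instead of a broken workload.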
struggle to recover #mirage from an outage triggered by a node reboot: the cloud provider bumped the NVIDIA driver from version 550 to 570, which requires a CUDA update on our images, but our Python code is not compatible with it, so LOADS of things need fixing
fix gated AI model on #mirage this morning: it caused downtime after a k8s node restart because the model could no longer be pulled from Hugging Face; I had to manually accept the model ToS, wtf
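note: a rough preflight sketch that could catch this before the next node restart, by probing a model file on Hugging Face and treating 401/403 as "access denied" (missing token or ToS not accepted). The helper names, the choice of `config.json` as the probe file, and the `main` revision are assumptions; the exact status codes returned for gated models should be checked against the Hub docs.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse access verdict."""
    if status == 200:
        return "ok"
    if status in (401, 403):
        # Assumption: gated models answer 401/403 when the token is missing
        # or the model terms have not been accepted for this account.
        return "denied"
    if status == 404:
        return "missing"
    return "unknown"


def probe_model(repo_id: str, token: Optional[str] = None,
                filename: str = "config.json") -> str:
    """Try to reach one file of the model and report the access verdict."""
    url = f"https://huggingface.co/{repo_id}/resolve/main/{filename}"
    request = urllib.request.Request(url, method="HEAD")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return classify_status(err.code)
```

Running `probe_model` for each model the workload pulls, with the same token the pods use, as a scheduled check would flag a "denied" verdict before a restart turns it into downtime.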