Similar todos
fix broken NVIDIA A100 GPU server at Vultr, which took #crisp mirage down for the whole night because the GPU type was out of stock and no replacement physical node could be allocated

experience 6 hours of total GPU downtime at Vultr, knocking down all #mirage AI services, will migrate to Scaleway cause Vultr are 🤡

Wake up at 4am with alarms due to a large DDoS targeting #crisp

no sleep as #dayjob infra was under heavy DDoS attack

recover from an evening of server hell & Murphy's law at #crisp due to a DigitalOcean backup slowing a MongoDB cluster server so much that it led to full cluster collapse and 30m of #crisp downtime

try to upgrade #crisp Mirage AI Kubernetes cluster version, fail miserably at it because of broken NVIDIA drivers in the cloud provider's GPU image, which stalled the upgrade process, destroy cluster and rebuild all infrastructure from scratch all evening 🥲

Wake up and see that auto recovery is working - when I went to sleep it was blocked by X, and by the time I woke up it had refreshed enough times to start working again - going to deprioritize further work on this for now #watchdog !private

do not sleep the whole night cause something has been DDoSing #crisp help centers hard for the last 6+ hours

spend 1h debugging some k8s complexity-induced issue after a node failure at cloud provider for #crisp mirage gpus

get woken up at 2am because of #crisp downtime

finish migrating #mirage Kubernetes Intel and NVIDIA GPU instances to Scaleway, getting latest-generation NVIDIA L40S + L4 GPUs, running much smoother now! (previously: old A40 and A16)

ran a few dozen backups all night

FINALLY #capgo is back on track after 24h downtime, what a ride, everything decided to break at the same time

Log into my remote monitoring UI and see that my resiliency fixes worked - Watchdog for X has been running continuously for over 24h with no crashes. No more manual restarts needed! #watchdog !private

updated #dotfiles and only had to partially restore my machine 🙄

deploy #crisp mirage ai improvements to production all morning

get woken up at 2am by a total downtime alert for #crisp due to memleak (?) in cloudflare tunnel daemon on server

sleep late this morning to recover from yesterday's DDoS in the early morning #life

contact Hetzner because of random reboots during the night #spectate

provision batch of L4 and L40S GPUs at Scaleway for #mirage since our account got validated and quotas lifted