WIP

Similar todos

Valerian Saliou

@valerian

fix broken NVIDIA A100 GPU server at Vultr, which put #crisp Mirage down for the whole night because the GPUs were out of stock and no replacement physical node could be allocated

2023-11-15 08:02:56 UTC

Valerian Saliou

@valerian

experience 6 hours of total GPU downtime at Vultr, knocking down all #mirage AI services; will migrate to Scaleway because Vultr are 🤡

2024-08-08 06:23:45 UTC

Valerian Saliou

@valerian

Wake up at 4am with alarms due to a large DDoS targeting #crisp

2024-01-08 14:33:52 UTC

Stefan W

@efekto

no sleep as #dayjob infra was under a heavy DDoS attack

2024-10-18 03:17:31 UTC

Valerian Saliou

@valerian

recover from an evening of server hell & Murphy's law at #crisp due to a DigitalOcean backup slowing a MongoDB cluster server so much that it led to full cluster collapse and 30m of #crisp downtime

2023-10-19 09:41:24 UTC

Valerian Saliou

@valerian

try to upgrade #crisp Mirage AI Kubernetes cluster version, fail miserably at it because of broken NVIDIA drivers in the cloud provider's GPU image, which stalled the upgrade process; destroy cluster and rebuild all infrastructure from scratch all evening 🥲

2024-02-10 13:42:42 UTC

Ben Katz

@ben

Wake up and see that auto recovery is working: when I went to sleep it was blocked by X, and by the time I woke up it had refreshed enough times to start working again. Going to deprioritize further work on this for now

#watchdog !private

2024-08-17 04:49:09 UTC

Valerian Saliou

@valerian

do not sleep the whole night because something has been DDoS-ing #crisp help centers hard for the last 6+ hours

2023-11-21 05:41:15 UTC

Valerian Saliou

@valerian

spend 1h debugging some k8s complexity-induced issue after a node failure at the cloud provider for #crisp Mirage GPUs

2024-02-15 01:18:28 UTC

Valerian Saliou

@valerian

get awakened at 2am because of #crisp downtime

2023-09-08 07:06:26 UTC

Valerian Saliou

@valerian

finish migrating #mirage Kubernetes Intel and NVIDIA GPU instances to Scaleway, getting latest-generation NVIDIA L40S + L4 GPUs, running much smoother now! (previously: old A40 and A16)

2024-08-13 14:07:54 UTC

Jeff Triplett ✨

@jefftriplett

ran a few dozen backups all night

2023-01-15 06:11:56 UTC

Martin Donadieu

@martindonadieu

FINALLY #capgo is back on track after 24h downtime, what a ride, everything decided to break at the same time

2023-01-17 00:38:46 UTC

Ben Katz

@ben

Log into my remote monitoring UI and see that my resiliency fixes worked - Watchdog for X has been running continuously for over 24h with no crashes. No more manual restarts needed!

#watchdog !private