Online service reliability is crucial in the digital age. Even robust systems can face unexpected outages that take down whole platforms. Here are some lessons from the biggest ones.
https://robotalp.com/blog/historys-major-downtimes-lessons-from-the-biggest-outages/
Easy answer. Don’t use platforms. Use protocols. Lemmy doesn’t go down, Mastodon doesn’t go down, nostr doesn’t go down, Monero doesn’t go down, Bitcoin doesn’t go down.
Facebook goes down, zoom goes down, AWS goes down.
The reason for that is isolation and redundancy, though. Most incidents and outages are the result of a change, and in the cases you mentioned the impact is limited because not all instances receive updates at the same time. Presumably, the error is noticed in one place and traffic is then served by healthy instances.
By all accounts, these are practices that significant service providers follow too. In fact, AWS typically rolls out updates to us-east-1 before updating other regions, using it as a canary to catch issues early.
With federated services, this is less of a conscious decision and tends to happen only because instance maintainers update on different schedules.
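To make the staggered-update idea concrete, here is a minimal sketch in Python. It is not how Lemmy, Mastodon, or AWS actually deploy; `Instance`, `staged_rollout`, `serving`, and the `is_healthy` callback are all hypothetical names made up for illustration. The point is only that updating one instance at a time and stopping at the first failed check leaves the rest on the old version, still serving traffic.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for one server or fediverse instance.
    name: str
    version: str
    healthy: bool = True


def staged_rollout(instances, new_version, is_healthy):
    """Update instances one at a time; stop at the first failed health check.

    `is_healthy` is an assumed callback standing in for a real probe.
    Returns the names of the instances that received the update.
    """
    updated = []
    for inst in instances:
        inst.version = new_version
        updated.append(inst.name)
        if not is_healthy(inst):
            # The bad change is caught here, so later instances are never touched.
            inst.healthy = False
            break
    return updated


def serving(instances):
    """Names of the instances that can still take traffic."""
    return [inst.name for inst in instances if inst.healthy]


fleet = [Instance("a", "1.0"), Instance("b", "1.0"), Instance("c", "1.0")]
# Pretend version 2.0 is broken everywhere it is installed.
touched = staged_rollout(fleet, "2.0", lambda inst: inst.version != "2.0")
print(touched, serving(fleet))
```

In a federation nobody runs this loop on purpose; the "one at a time" part just falls out of admins updating whenever they get around to it.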
Blue-green deployments and failover are common mitigation strategies, and mature organizations actively employ them. By contrast, these patterns are built into the decentralized design of the fediverse and of other distributed systems such as CDNs.
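For anyone unfamiliar with blue-green, a rough sketch of the idea in Python follows. `BlueGreenRouter` and the `smoke_test` callback are hypothetical names, not any real tool's API: the new version goes to the idle environment, and traffic only switches over if the check passes, so a bad release never becomes the active one.

```python
class BlueGreenRouter:
    """Toy model of a blue-green setup with two environments."""

    def __init__(self, initial_version):
        self.versions = {"blue": initial_version, "green": None}
        self.active = "blue"

    def idle(self):
        # The environment that is not currently taking traffic.
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version, smoke_test):
        """Install `version` on the idle side; switch only if it passes.

        `smoke_test` is an assumed callback standing in for real checks.
        """
        target = self.idle()
        self.versions[target] = version
        if smoke_test(version):
            self.active = target
            return True
        # Check failed: traffic stays on the old, known-good environment.
        return False


router = BlueGreenRouter("1.0")
router.deploy("2.0", lambda v: False)  # broken release, no switch
router.deploy("2.1", lambda v: True)   # good release, switch to green
print(router.active, router.versions[router.active])
```

The old environment stays around after a switch, which is what makes rolling back cheap: you point traffic at it again.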