Architecting for High Availability

# Architecting for High Availability

## CS291A: Scalable Internet Services

---

# High Availability Motivation

Modern web applications power important parts of our lives

* Banking

* Medical

* Communications

Having access to these services (high availability) at any time is increasingly
important.

???

As a user of these services, we expect them to be available at all times.
When they are not, it can be frustrating and even dangerous.
When we build these services, we need to think about how customers will depend on them
and what level of availability we need to provide.

---

# Expressing Availability

Services advertise availability as a "number of nines":

* Three nines: 99.9% uptime, 525.6 minutes of downtime a year (43.8/month)

* Four nines: 99.99% uptime, 52.6 minutes of downtime a year (4.38/month)
    * Desired by many business applications

* Five nines: 99.999% uptime, 5.3 minutes of downtime a year (26.5s/month)
    * Desired by communication companies

Downtime usually only includes unplanned outages. Scheduled outages, no matter
how frequent, are usually not incorporated in marketing material.

[Availability Table](https://sre.google/sre-book/availability-table/#tablea1)

???

What is considered downtime?
* Unplanned outages
* Are scheduled outages considered downtime? Usually not,
  but we can arhitect our system to minimuze downtime during
  certain types of maintenance, like software updates

---

# Measuring Availability SLIs/SLOs/SLAs

Service Level Indicators - SLIa
- In order to provide a HA service we need to agree on a metric
- Come up with a set of metrics that will measure and indicate availability
of your service

Service Level Objectives - SLOs
- Come up with with a list of goals based on metrics that you want to be achieved
by your service

Service Level Agreements - SLAs
- Making sure your clients can rely on your service to keep the agreed level
of availability and metrics

---

# What are possible causes of failures?

![Full network topology](full_stack.png)

---

# Possible causes of failures. What if?

* a server process hangs?
* a server process dies?
* an application server fails?
* a load balancer fails?
* a switch fails?
* a router fails?
* a connection to the Internet fails?
* DNS fails?
* the Internet fails?
* a database fails?
* an entire data center fails?

---

# Single point of failure

Can we identify service components which failure would bring down the whole service?

---

# Handling Application Server Failure

.left-column40.center[
![Single application server](application_server.png)
]
.right-column60[
We've already discussed the components of this sort of failure.

* Process-level isolation reduces hanging, or dead processes by having other
  processes to pick up the slack.

* A process in an infinite loop may add unnecessary load to the server, so it's
  important to have monitoring to detect such issues.

* Having a load-balanced configuration permits failing servers (high load or
  dead) by having a pool of other servers to direct traffic to.
]

???

A load balanced system can handle a server failure by directing traffic to others.
However, if the system is already operating at capacity, the loss of a server could cause a cascading failure.
Part of high availability is ensuring that the system can handle the loss of a server without causing a failure.
To do this we need to monitor the load on the servers and have a method to add capacity when needed.

---

# How do we Handle Load Balancer Failure?

![Single load balancer](single_load_balancer.png)

---

# Redundant Load Balancers

One is the primary, and the other is the failover.

Load balancers communicate via a heartbeat to determine the health of each
other.
]
.right-column60.center[
![Redundant load balancers](redundant_load_balancers.png)
]

> During a failover, what happens to the IP address?

---

# Load Balancer Failover

When the failover load balancer determines it needs to become the primary
(detected via lack of heartbeat), it issues a gratuitous ARP for the load
balancer's IP.

The gratuitous ARP (layer 2) will inform the switch that traffic should be
delivered to the port that leads to the new primary (formerly the failover).

All new packets will be transparently delivered to the new primary load
balancer.

Established flows can be supported depending on how much information sharing
occurs between the load balancers.

???

There are different ways to handle load balancer failure.
One way is to have two load balancers, one primary and one failover using a heartbeat to determine the health of each other.
If this is a proxy load balancer layer 4 or 7, the load balancers may share session information so that if one fails, the other can take over the session.

For a layer 4 packet re-writing load balancer, the load balancers may not need to share session information.
Another way is to have a router that implements ECMP (Equal Cost Multi-Path) routing to distribute traffic to multiple load balancers.
In this case, the load balancers do not need to communicate with each other.
The router will distribute traffic to the load balancers based on a hash of the packet header, like the source IP address.

---

# How do we Handle Switch Failure?

![Redundant load balancers](redundant_load_balancers.png)

---

# Redundant Switches

Use two switches!

Again, both a primary and failover are necessary using a heartbeat between them.

During failover similar issues occur but layer 2 should be stateless so less or
even no synchronization is needed.

---

# How do we Handle Router Failure?

![Redundant switches](redundant_switches.png)

---

# Redundant Routers

Use two!

This includes a similar heartbeat and failover system, but is not as trivial to switch between.

---

# How do we Handle Access to the Internet Failure?

![Redundant routers](redundant_routers.png)

---

# How do we Handle Access to the Internet Failure?

Use two!

Use separate internet service providers (ISPs).

---

# Routing with Two ISPs

> How do we handle routing when we have two ISPs?

## Outgoing Traffic (easy)

We control how we send outbound traffic so we have some options:

* Pick the cheapest or most reliable link
* Pick the _closer_ link

## Incoming Traffic (hard)

We cannot inform clients by which path to reach our IP.

However, we can give hints by using BGP (Border Gateway Protocol) to persuade
networks to prefer one path over another. (Path prepending, Community values)

---

# High Availability

![High Availability Topology](high_availability.png)

---

# Redundancy Everywhere!

So we have redundant systems in place.

> Can anything go wrong?

---

# Disaster Strikes!

![Hurricane Sandy from Satellite](hurricane_sandy_satellite.png)

---

# Hurricane Sandy via Ars Technica

![Hurricane Sandy Headline](hurricane_sandy_article.png)

???

2012 Hurricane took AWS data center offline.

---

# 2017 Hurricanes

![2017 Hurricanes Headline](2017_hurricanes.png)

> Yet another data center, west of Houston, was so well prepared for the storm
  — with backup generators, bunks and showers — that employees’ displaced
  family members took up residence and United States marshals used it as a
  headquarters until the weather passed.

---

# Availability Axiom (Pete Tenereillo)

Source: [http://www.tenereillo.com/GSLBPageOfShame.htm](http://www.tenereillo.com/GSLBPageOfShame.htm)

> The only way to achieve high-availability for browser based clients is to
> include the use of multiple A records.

An "A record" is an IPv4 address associated with a DNS host (AAAA for IPv6
addresses).

## Problem

The order of DNS responses with multiple A records is often not preserved by a
client's DNS server.

## Result

* For performance we want to send the browser to one data center.
* For availability we want to send the browser multiple A records.

We end up having to make a choice between performance and availability.

???

What is the performance problem with multiple A records?
* The backend state would need to be replicated across data centers in near real-time.
  - This is a difficult problem to solve.
  - CAP theorem says we can't have consistency and availability and performance in a distributed system.
---

# Multi-site High Availability

![High Availability in Two Sites](dual_site_availability.png)

What about the database?

---

# Now that we have architecture for high availability

We have redundant systems in place to handle failures

> What if your service is failing due to excessive load?

> How can your service be highly available when under heavy load?

* Rate Limiting

* Client backoffs

* Client retry logic/limits

* Request Deadlines

* Degraded Server Responses

* Server Load Shedding

[Google SRE Book - Handling Overload](https://sre.google/sre-book/handling-overload/)

[Google SRE Book - Handling Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/)
---

# Rate Limiting and Algorithms

- Prevent some users (mostly other services) from consuming too much of your resources
- Provide similar performance to all users
- Respond with status code 429 and appropriate header

## Popular rate limiting algorithms
- Fixed Window
- Sliding Window
- Leaky Bucket

---

# Monitoring and Alerting Tools Examples

- NewRelic
- DataDog