HTTP and Application Servers

class: center, middle

# HTTP and Application Servers

__CS291A__

Dr. Bryce Boe

October 11, 2018

---

# Today's Agenda

* TODO

* Web Servers Introduction

* HTTP Server Architectures

* C10K Problem

* Application Servers

---

# TO-DO

## Should be done:

* Chapters 1, 2, 9-11 in
  [HPBN](https://hpbn.co/) / Chapters 1-6 in
  the [Ruby on Rails Tutorial](https://www.railstutorial.org/book/beginning)

* Familiarity with [git](http://rogerdudler.github.io/git-guide/)

* Have at least 40 user stories for your project

---

# TO-DO

## Before Tuesday's Class:

* Read
  [Dynamic Load Balancing on Web-server Systems](http://www.ics.uci.edu/~cs230/reading/DLB.pdf)
  by Cardellini, Colajanni, and Yu.

## Before Tuesday's Lab

* Complete at least `2 * #TEAM_MEMBERS` stories that deliver value to your
  users (me).

* Complete chapters 7 through 10 in the
  [Ruby on Rails Tutorial](https://www.railstutorial.org/book/beginning)

* Hook [TravisCI](https://travis-ci.org/) up to your repository.

---

# Web Servers Introduction

After this lecture you should understand the trade-offs in the following stack
descriptions:

## NGINX + Passenger

NGINX is an event-driven HTTP server that handles HTTP requests to port 80 and
passes connections to instances of the application through Passenger. Multiple
concurrent connections are supported.

## Puma

Puma is an application server that speaks the HTTP protocol and allows both
thread-based and process-based concurrency.

---

# End to End HTTP

By now, we all should have a reasonable understanding of the HTTP protocol.

Recall that many browsers and clients exist that are able to:

* Open a TCP socket

* Send an HTTP request

* Have the request processed

* Receive the data in a response

* Reuse the socket for multiple requests

The software systems that handle the request are generally divided into two
parts:

* HTTP Servers

* Application Servers

---

# Separation of Responsibilities

> Why not use a single process to handle both the http request and the
> application logic?

The concerns and design goals of HTTP servers are different from those of
application servers.

---

# Server Responsibilities

## HTTP Server

* Provides a high performance HTTP implementation (handles concurrency)

* Extremely stable, and relatively static

* Very configurable and language/framework agnostic

## Application Server

* Written to support a specific language (e.g., Ruby), which may
  hinder performance

* Contains _business logic_ and is extremely dynamic

* Focus on optimizing human resources via abstractions, e.g.,
  model-view-controller (MVC) framework)

---

# HTTP Servers

![Netcraft survey of HTTP servers](netcraft_web_servers.png)

Latest:
https://news.netcraft.com/archives/2017/09/11/september-2017-web-server-survey.html

---

# HTTP Server Responsibilities

* Parse HTTP requests and craft HTTP responses _very_ fast

* Dispatch to the appropriate handler and return response

* Be stable and secure (lots of string parsing)

* Provide clean abstraction for application servers

>  How do web servers provide concurrency?

---

# TCP Server Concurrency Approaches

* Single Threaded (no concurrency)

* Process per request

* Process pool

* Thread per request

* Process/thread worker pool

* Event-driven

For select examples see:
[https://gist.github.com/bboe/6a7b03fcd110c4c6bbe5ec412f523428](https://gist.github.com/bboe/6a7b03fcd110c4c6bbe5ec412f523428)

---

# Single Threaded HTTP Servers

```python
bind() to port 80 and listen()
loop forever
    accept() a socket connection
    while we can read from the socket
        read() a request
        process that request
        write() its response
    close() the socket connection
```

> What happens if another request comes in while we're within the loop?

---

# Single Threaded Server Issues

If a single threaded web service does not process the request quickly, other
clients end up waiting or dropping their connections (head of line blocking).

We are building complex web applications not simple  web sites. As a result:

* The requests are usually more complicated than serving a file from disk.

* It is common to have a web request doing a significant amount of computation
  and business logic.

* It is common to have a web request result in connections to multiple external
  services, e.g., databases, and caching stores.

* These requests can be anything: lightweight or heavyweight, IO intensive or
  CPU intensive.

We can solve these problems if the thread of control that processes the request
is separate from the thread that `listen()`s and `accept()`s new connections.

---

# Process per Request HTTP Servers

Handle each request as a subprocess:

.left-column40[![forking web server](server_forking.png)]
.right-column60[
```python
bind() to port 80 and listen()
loop forever
    accept() a socket connection
*   if fork() == 0  # child process
        while we can read from the socket
            read() a request
            process that request
            write() its response
        close() the socket connection
*       exit()
```
]

---

# Process per Request HTTP Servers

## Strengths

* Simple

* Provides significant isolation between requests

## Weaknesses

* How much memory is required?

* What happens as the CPU load increases?

* How efficient is it to fire up a process on each request?

* How much setup and tear down work is necessary?

---

# Process Pool HTTP Servers

.left-column[![process pool web server](server_process_pool.png)]
.right-column[
Instead of spawning a process for each request, create a pool of N processes at
start-up and have them handle incoming requests when available.

The children processes `accept()` the incoming connections and use shared
memory to coordinate.

The parent process watches the load on its children and can adjust the pool
size as needed.
]

---

# Process Pool HTTP Servers

## Strengths

* Provides isolation between concurrent requests

* Children can die after _M_ requests to minimize memory leakage issues

* Process setup and tear down costs are minimized

* More predictable behavior under high load

## Weaknesses

* More complex than process per request

* Many processes can still mean a large amount of memory consumption

This web server architecture is provided by the Apache 2.x MPM "Prefork" module.

---

# Thread per Request HTTP Servers

Why use multiple processes at all? Instead we can use a single process and
spawn new threads for each request.

.left-column40[![http server thread per request](server_threaded.png)]
.right-column60[
```python
bind() to port 80 and listen()
loop forever
    accept() a socket connection
*   pthread_create()
        while we can read from the socket
            read() a request
            process that request
            write() its response
        close() the socket connection
        pthread_exit()
```
]

---

# Thread per Request HTTP Servers

## Strengths

* Relatively simple

* Reduced memory footprint compared to multi-processed

## Weaknesses

* Worker (request handling code)  must be thread-safe

* Setup and tear down needs to occur for each thread (or shared data
  needs to be thread-safe)

* What about memory leaks?

---

# Process/Thread Worker Pool Servers

.left-column[![http process/thread worker pool](server_worker_pool.png)]
.right-column[
Combination of the two techniques.

Master process spawns processes, each with many threads. Master maintains
process pool.

Processes coordinate through shared memory to `accept()` requests.

Fixed threads per request, scaling is done at the process level.
]

---

# Process/Thread Worker Pool Servers

## Strengths

* Faults isolated between processes, but not threads

* Threads reduce memory footprint

* Tunable level of isolation

* Controlling the number of processes and threads allows for predictable
  behavior under load

## Weaknesses

* Requires thread-safe code

* Uses more memory than an all-thread based approach

This web server architecture is provided by the Apache 2.x MPM "Worker" module.

---

class: center inverse middle

# C10K Problem

---

# C10K Problem

Originally posed in 1999 by Dan Kegel.

> Given a 1 GHz machine with 2GB of RAM, and a gigabit Ethernet card, can we
> support 10,000 simultaneous connections?

## 20,000 clients means each gets:

* 50 KHz of CPU

* 100 KB of RAM

* 50 Kb/second of network

"It shouldn't take any more horsepower than that to take four kilobytes from
the disk and send them to the network once a second for each of twenty thousand
clients."

> What makes managing concurrent connections difficult?

Source: [http://www.kegel.com/c10k.html](http://www.kegel.com/c10k.html)

---

# For each client the server is...

* Reading from the network socket

* Parsing its request

* Opening a file on disk

* Reading the file into memory

* Writing the memory to network

---

# For each client the server is... __blocking on I/O__

* Reading from the network socket (__blocking__)

* Parsing its request

* Opening a file on disk  (__blocking__)

* Reading the file into memory  (__blocking__)

* Writing the memory to network  (__blocking__)

---

# Waiting on I/O

Every time a process/thread is waiting on I/O it is not runnable, and it is not
cost-free:

* Each process/thread is considered every time the scheduler makes a decision

* Memory is occupied by the process, and its last load may have evicted other
  processes' memory from the cache

Massive concurrency slows down all processes/threads.

---

# How can we not wait on I/O?

Blocking system calls cause this problem.

> Can we accomplish our desired tasks without blocking?

## Yes! Asynchronous I/O

* `select()`: Provided a list of file descriptors, block only until at least
  one is ready for I/O (only usable up to 1024 file descriptors on Linux).

* `epoll_*()`: Register to listen for events on file descriptors. Again block
  only until at least one of the registered descriptors is ready for I/O

---

# Select example

Assume we have a list of sockets called fd_list.

```python
loop forever:
    select(fd_list, ...) // block until something has I/O to handle
    for fd in fd_list
        if fd is ready for IO
            handle_io(fd)
        else do nothing
```

## handle_io

* can include socket acceptance

* shouldn't make any blocking calls (use non-blocking variants)

* should avoid excessive computation
    * Use a separate thread, process, or worker pool for such purposes

---

# Event Driven Systems

Systems that operate in such a manner are called event driven system.

Often such systems can accomplish everything using only a single process and
thread, of course more may be needed for CPU-bound segments.

Well used examples:

* NGINX

* lighttpd

* netty (java)

* node.js (JavaScript)

* eventmachine (ruby)

* twisted (python)

---

# Event Driven HTTP Servers

## Strengths

* High performance under high load

* Predictable performance under high load

* No need to be thread-proof (unless specifically adding thread-concurrency)

## Weaknesses

* Poor isolation

* What happens if a bug causes an infinite loop?

* Extensions are hard to implement since they cannot use blocking syscalls

* Very complex

---

# Event Machine (Ruby) Example

Event driven code is dominated by callbacks:

```ruby
EM.run {
  page = EM::HttpRequest.new('http://google.ca/').get
  page.errback { p "Google is down! terminate?" }
  page.callback {
    a = EM::HttpRequest.new('http://google.ca/search?q=em').get
    a.callback { # callback nesting, ad infinitum }
    a.errback  { # error-handling code }
  }
}
```

---

# Callback Hell

.center[![Yo dawg, I heard you like JavaScript](callback_hell.png)]

With all these callbacks, event-drived programming _can_ very easily become
complicated.

---

# HTTP Server Architectures Review

* Sequential (single process and thread)
    * Easy
    * No concurrency

* Process per request
    * Greatest isolation
    * Largest memory footprint

* Thread per request
    * Less isolation
    * Smaller memory footprint

* Process/thread worker pool
    * Tunable compromise between processes and threads

* Event-driven
    * Great performance under high load
    * Difficult to extend
    * Reduced isolation

---

# Application Servers

We are building web applications, so we will require complex server-side logic.

We _can_ extend our HTTP servers to provide this logic through modules, but
there are benefits to separating application servers to distinct process(es).

* Application logic will be dynamic

* Application logic regularly uses high level (slow) languages

* Security concerns are easier (HTTP server can shield app server from
  malformed requests)

* Setup costs can be amortized if the app server is running continuously

When separating the two, our HTTP server should forward requests to the
application server(s).

---

# Inter-server Communication

> How does an HTTP Server communicate with the application server(s)?

---

# Inter-server Communication

## [CGI](https://en.wikipedia.org/wiki/Common_Gateway_Interface)

Spawn a process, pass HTTP headers as ENV variables and utilize STDOUT as the
response.

## [FastCGI](https://en.wikipedia.org/wiki/FastCGI), [SCGI](https://en.wikipedia.org/wiki/Simple_Common_Gateway_Interface)

Modifications to CGI to allow for persistent application server processes
(amortizes setup time).

## HTTP

Communicate via the HTTP protocol to a long-running process. (Essentially a
reverse-proxy configuration).

> Does it make sense to do this?

---

# Application Server Architectures

> What architecture should we use for our application server?

We have the same trade-offs to consider as with HTTP servers
(e.g. threads/processes/workers).

## Up Next

Let's take a quantitative look at various approaches used in actual Ruby
application servers.

We will not consider evented ruby application servers (e.g., EventMachine)
because Rails will not run on such application servers.

---

# Our Test Setup

![Demo App](demo_app.png)

The [Demo App](https://github.com/scalableinternetservices/demo) is a link
sharing website with:

* Multiple communities

* Each community can have many submissions

* Each submission can have a tree of comments

---

# Simulated Users

Using [Tsung](http://tsung.erlang-projects.org/) (erlang-based test framework)
we will simulate multiple users visiting the Demo App web service. Each user
will:

```
Visit the homepage (/)
  Wait randomly between 0 and 2 seconds
Request community creation form
  Wait randomly between 0 and 2 seconds
Submit new community form
Request new link submission form
  Wait randomly between 0 and 2 seconds
Submit new link submission form
  Wait randomly between 0 and 2 seconds
Delete the link
  Wait randomly between 0 and 2 seconds
Delete the community
```

---

# Test Process

There are six phases of testing each lasting 60 seconds:

1. (0-59s) Every second a new simulated user arrives

2. (60-119s) Every second 1.5 new simulated users arrive

3. (120-179s) Every second 2 new simulated users arrive

4. (180-239s) Every second 2.5 new simulated users arrive

5. (240-299s) Every second 3 new simulated users arrive

6. (300-359s) Every second 3.5 new simulated users arrive

__Note__: Each user corresponds to seven requests and a user may wait up to ten
seconds with the delays.

---

# Test Environment

All tests were conducted on a single Amazon EC2 m3-medium instance.

* 1 vCPU

* 3.75 GB RAM

The tests used the `Puma` web application server (unless otherwise specified).

The `database_optimizations` branch of the demo app was used to run the tests:
[https://github.com/scalableinternetservices/demo/tree/database_optimizations](https://github.com/scalableinternetservices/demo/tree/database_optimizations)

---

# Single Thread/Process (Users)

.center[![Single Thread/Process Users](demo_single_users.png)]

---

# Single Thread/Process (Page Load)

.left-column20[
Decrease in performance around 60s (1.5 new users per second)

Mean duration's spike is around 200 seconds.
]
.right-column80[
.center[![Single Thread/Process Page Load](demo_single_page_load.png)]
]

---

# Four Processes (Users)

![Four Processes Users](demo_four_users.png)

---

# Four Processes (Page Load)

.left-column20[
Decrease in performance around 240s (3 new users per second)

Mean duration's spike is just below 18 seconds.
]
.right-column80[
![Four Processes Page Load](demo_four_load.png)
]

---

# Sixteen Processes (Users)

![Sixteen Processes Users](demo_sixteen_users.png)

---

# Sixteen Processes (Page Load)

.left-column20[
Decrease in performance around 240s (3 new users per second)

Mean duration's spike is just below 14 seconds.

Little improvement over 4 processes especially considering up to 4x memory
usage.
]
.right-column80[
![Sixteen Processes Page Load](demo_sixteen_load.png)
]

---

# Threads instead of processes?

> What do you think will happen?

---

# Four Threads (Users)

![Four Threads Users](demo_four_threads_users.png)

---

# Four Threads (Page Load)

.left-column20[
Still decrease in performance around 240s, but more stable until then.

Mean duration's spike is about 14 seconds.
]
.right-column80[
![Four Threads Page Load](demo_four_threads_load.png)
]

---

# 32 Threads (Users)

![32 Threads Users](demo_32_threads_users.png)

---

# 32 Threads (Page Load)

.left-column20[
Decrease in performance beginning around 300s (3.5 new users per second)

Mean duration's spike is under 2 seconds.
]
.right-column80[
![32 Threads Page Load](demo_32_threads_load.png)
]

---

# Side note: Ruby interpreters

There are different versions of the Ruby interpreter. Different workloads may
benefit from using different interpreters.

## MRI (Matz's Ruby Interpreter)

* The reference version

* Written in C

* Has a global interpreter lock (GIL) that prevents true thread-concurrency

## JRuby

* Written in Java

* Does not have GIL

---

# Application Server Options

In this class you will be able to compare the performance of a handful of
application servers:

## via elastic beanstalk configuration

* Puma (worker pool)

* Phusion Passenger (worker pool)

---

# Puma

Originally designed for Rubinius (GIL-less ruby interpreter).

Claims to require less memory than others (for the server itself)

Specifically made to work with thread-based parallelism, but also supports
multiple processes each with a tunable number of threads.

---

# Phusion Passenger

Passenger is a ruby web application server that can be added as a module to
either Apache or NGINX.

Passenger works as a worker pool adjusting the number of processes that
handle requests. Originally did not support threads within the processes.

---

# Puma and Passenger

Both Puma and Phusion Passenger:

* are relatively easy to configure

* enable processes to be forked after ruby/rails is loaded

> Why might you want to wait to load the application prior to forking?

---

# Thread Safety Note

If you can use thread-parallelism, do it! But, making your code thread safe
isn't always obvious.

Things to consider:

* Your code

* Your many dependencies