class: center, middle

# HTTP

## CS291A: Scalable Internet Services

---

# Agenda

* The Life Cycle of a Web Request
* HTTP Requests and Responses
* HTTP Performance (HPBN, chapters 9-11)
* Project 1

---

class: center inverse middle

# The Life Cycle of a Web Request

---

# Two Endpoints

A __web browser__ is a process (at least one) that runs on an operating system. It:

* responds to user input
* renders the display
* uses the network

--

A __web server__ is a process (at least one) that runs on an operating system. It:

* responds to network requests
* loads resources that may come from the file system, a database, or other servers

---

# Core Components of a Web Request

* __Web server__: Opens a TCP socket to listen for requests
* __Browser__: Makes a DNS query to obtain an IP address for www.reddit.com
* __Browser__: Initiates a TCP connection to the IP address
* __Web server__: Accepts the TCP connection
* __Web server__: Adds TLS context to the TCP connection
* __Browser__: Wraps the TCP connection in a TLS session
* __Browser__: Sends an HTTP request over the TLS session
* __Web server__: Parses the request, fetches and sends the requested resources

---

class: center inverse middle

# HTTP

---

# Chrome Network Table

By the end of this lecture, you should understand this output:

.center[![Chrome Network Table](chrome_network_table.png)]

---

# History

> When was HTTP created?

--

The year is 1990. The Internet has existed for ~20 years, email for ~8.

Tim Berners-Lee has the idea to combine __HyperText__ and the __Internet__. He creates the first version of __HTTP__ and __HTML__.

__HyperText__: Links documents together via __HyperLinks__

> When was the first web browser created?

--

* 1990: Tim Berners-Lee writes the first web browser, WorldWideWeb
* 1993: Mosaic, the first widely popular web browser, is created at UIUC
* 1994: Marc Andreessen leaves UIUC and founds Netscape with Jim Clark

---

# The First Web Server

![The First Web Server](first_web_server.png)

---

# HTTP Explained

In its original, simplest version:

* Open a TCP socket (standard port is 80)
* Send an ASCII request for a resource (`GET mypage.html`)
* Response comes back containing only the content of `mypage.html`
* Close the TCP socket

--

HTTP was originally designed to:

* Send HTML (was later extended to send anything)
* Facilitate interaction between a web browser and a web server (now also used heavily for server-to-server communication)

---

class: center inverse middle

# HTTP Request/Response Components

---

# HTTP Request

* __Method/Verb__: What do you want to do (e.g., __GET__, __POST__)
* __Resource__: What is the logical path of the resource you want
* __HTTP Version__: What version of HTTP are you using?
* __Headers__: Standard way to specify or request optional behavior
* __Body__: Content that is being sent to the server (e.g., a file upload)

---

# HTTP Response

* __HTTP Version__: Indicates the HTTP version the server is "speaking"
* __Status Code__: Indicates success, failure, or other possible [response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
* __Headers__: Provides metadata about the response.
* __Body__: The primary payload of the response (for a successful _GET_ request, this contains the resource content)

---

# HTTP Components Example

![Components of HTTP request and response](http_components.png)

---

class: center inverse middle

# HTTP Methods/Verbs

---

# GET

* Request a copy of a _resource_
* The request should have no side-effects (i.e., doesn't change server state)
* __GOOD__: `GET /fluffy_kitty.jpg`
* __BAD__: `GET /users/sign_out`

---

# POST

* Sends data to the server
* Generally used to create a _resource_
* Has side-effects (e.g., creates a resource)
* Not idempotent (i.e., making the same request twice creates two separate but similar resources)
* "Do you want to submit your form again?"

---

# PUT

* Sends data to the server
* Often used to update an existing resource
* Has side-effects
* Should be idempotent (e.g., updating a resource twice should leave it in the same state as updating it once)

---

# DELETE

* Destroys a resource
* Has side-effects
* Should be idempotent
* __GOOD__: `DELETE /session/` (for log out)

---

# HEAD

* Exactly like GET but excludes the body in the response

> When might you want to use HTTP HEAD?

---

# HTTP Method/Verb Misuse

Correct HTTP method/verb usage is not enforced, and methods are often misused.

> What do you think the following is supposed to do?

```http
GET /post/5?action=hide
```

--

> What other problem do you think can occur with the misuse of HTTP methods/verbs?

---

# Other HTTP Methods/Verbs

* CONNECT
  * Establish a connection through a proxy server
* OPTIONS
  * See which methods/verbs are available for a resource
  * Used for [CORS](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) requests
* TRACE
  * Used to reflect the HTTP request back

---

# HTTP Resource

Specifies a logical hierarchy to access a resource:

* __GOOD__: `/gp/product/1565925092/`
* __BAD__: `/index.jsp?page_id=4251`

--

## Query String

* The portion of the resource after a _question mark_ (`?`)
* Used to assist in locating the resource
* Values are assigned using the equals sign (e.g., `sort=top`)
* Multiple key/value pairs are joined with an _ampersand_ (`&`)

Example:

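A minimal sketch of splitting a resource into its path and query string, using Python's standard `urllib.parse`. The URL is hypothetical and chosen only for illustration:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only to illustrate the pieces of a resource.
url = "https://www.example.com/search?q=http&sort=top"

parts = urlsplit(url)
params = parse_qs(parts.query)  # each key maps to a list of values

print(parts.path)   # /search
print(parts.query)  # q=http&sort=top
print(params)       # {'q': ['http'], 'sort': ['top']}
```

Values come back as lists because the same key may legally appear more than once in a query string.
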
---

# HTTP Version

Version strings are often used in protocols to make it easy to evolve the protocol.

With HTTP, different versions behave differently:

* (1991) HTTP 0.9 (retroactively versioned): Single line protocol with no headers. Only __GET__ supported: `GET index.html`
* (1996) HTTP 1.0: Added headers, and a version string
* (1999) HTTP 1.1: Connection keep-alive by default, additional caching mechanisms. __Primary HTTP version used today__
* (2015) HTTP/2: Binary framing, header compression, many other optimizations. Discussed in more detail in a future lecture.
* (2022) HTTP/3: Runs over QUIC (UDP). Lower latency and better behavior under packet loss. Discussed in more detail in a future lecture.

---

class: center inverse middle

# HTTP Headers

---

# HTTP Headers Overview

Headers provide metadata for the request and response.

HTTP header names are case _InSenSiTiVe_.

> What HTTP headers do you know of?

---

# HTTP Headers: Accept (request)

Indicates the desired format of the requested resource.

```http
Accept: text/html
```

I desire the resource in HTML format.

```http
Accept: application/json
```

I desire the resource as a JSON document.

```http
Accept: application/json,application/xml
```

I prefer a JSON document, but an XML document will do.

---

# HTTP Headers: Content-Type (request/response)

Indicates the format of the request or response body.

```http
Content-Type: text/html
```

The body of this request/response is an HTML document.

```http
Content-Type: application/json
```

The body of this request/response is a JSON document.

Other common request content-types:

* `application/x-www-form-urlencoded` (HTTP form)
* `multipart/form-data` (HTTP form with file upload)

---

# HTTP Headers: Accept-* (request)

## Accept-Encoding

Indicates preferred encoding for the response body.

```http
Accept-Encoding: bzip2,gzip
```

I prefer the data compressed via bzip2. If that cannot be done, please gzip the response.

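Clients can also state the preference explicitly with `q` weights (e.g., `gzip;q=0.5`), which the HTTP specification defines for the `Accept-*` headers. A minimal sketch of how a server might rank the listed codings; `parse_accept_encoding` is a hypothetical helper written for this slide, not part of any library:

```python
def parse_accept_encoding(header: str) -> list[str]:
    """Rank codings by descending q weight; ties keep their listed order."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a coding without an explicit weight defaults to q=1
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q > 0:  # q=0 means the client refuses this coding
            ranked.append((-q, position, name.strip().lower()))
    return [name for _, _, name in sorted(ranked)]


print(parse_accept_encoding("bzip2,gzip"))            # ['bzip2', 'gzip']
print(parse_accept_encoding("gzip;q=0.5, br;q=1.0"))  # ['br', 'gzip']
```

Using listed order as the tie-breaker is an assumption of this sketch; a real server is free to pick any coding with the highest weight.
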
## Accept-Language

Indicates the preferred language for the resource.

```http
Accept-Language: es,en-US
```

I prefer Spanish, but will accept US-English.

---

# HTTP Headers: Host (request)

Indicates the DNS hostname associated with the desired resource.

Required in an HTTP 1.1 request.

> Why is the `Host` header required?

--

Compare the responses to the following:

```bash
curl -I http://184.108.40.206 --header 'Host: cs291.com'
curl -I http://220.127.116.11 --header 'Host: twitter.github.io'
```

---

# HTTP Headers: User-Agent (request)

Indicates information about the client (web browser, crawler, tool) to the web server.

```http
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36
```

Can be used to serve different content to different clients. However, try to avoid doing so.

---

# HTTP Headers: Set-Cookie, Cookie

## Set-Cookie (response)

A response header instructing the browser to send the provided cookies in subsequent requests to the server.

## Cookie (request)

A request header containing the data previously set by the server via a `Set-Cookie` header.

--

> What kind of information might one put in a cookie?

--

> What security concerns may exist with cookies?

---

# Other HTTP Headers

## Caching Related Headers

* ETag
* Date
* Last-Modified
* Cache-Control
* Age

## Security Related Headers

* Content-Security-Policy
* Strict-Transport-Security
* X-Frame-Options

`X-` prefixed headers are not part of the official specification and may later become _standardized_.

---

class: center inverse middle

# HTTP Statuses

---

# HTTP Statuses Overview

The HTTP response status indicates the outcome of the request.

Status codes fall into one of five categories:

* 1XX - Informational
* 2XX - Successful
* 3XX - Redirection
* 4XX - Client Error
* 5XX - Server Error

Ref: [HTTP Status Code Definitions](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html)

---

# Common HTTP Statuses

* __200 OK__: The requested resource is being returned.
* __301 Moved Permanently__: The resource has been moved and the browser should always use the new URL provided via the `Location: https://...` header.
* __302 Found__: The resource can be found at another location via the `Location` header.
* __304 Not Modified__: When the request includes a conditional header (e.g., `If-Modified-Since`), this status indicates the content hasn't changed, saving the resource from being fetched, rendered, and sent again.
* __403 Forbidden__: The request is not authorized to access the resource.
* __404 Not Found__: The resource does not exist.
* __500 Internal Server Error__: Something unexpected happened on the server.
* __503 Service Unavailable__: Temporary failure on the server side, possibly caused by an upstream server.

---

# HTTP Request Body

HTTP PUT and POST requests typically have a body associated with them.

HTML _form_ elements usually result in a POST request with an `application/x-www-form-urlencoded` content type.

Example:

```bash
curl --trace-ascii - http://httpbin.org/post \
     --data 'username=bboe&comment=Hi There'
```

```http
POST /post HTTP/1.1
Host: httpbin.org
User-Agent: curl/7.43.0
Accept: */*
Content-Length: 30
Content-Type: application/x-www-form-urlencoded

username=bboe&comment=Hi There
```

---

# HTTP Response Body

The body of a response usually contains the requested resource content.

```http
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 9973

...
```

---

# Try it! Direct Input Server

## Listen for a TCP connection

```bash
nc -l localhost -p 5000
```

## Browse to:

## Paste the following into your terminal

```http
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 111

HTTP is easy!
```

__Note__: Ensure the newline gets copied between the headers and the body.

---

# HTTP Performance

Prior to HTTP 1.1, one TCP connection was used for a single HTTP request.

> What performance flaws exist in this design?

---

# TCP Connection Delay

Establishing a TCP connection requires 1 round-trip.

![TCP round trip visualized](tcp_round_trip.png)

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# Slow Start and Congestion Avoidance

The early phases of a TCP connection are bandwidth constrained.

.center[![TCP congestion mechanisms](tcp_congestion.png)]

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# TCP Congestion Window Size (cwnd)

It takes a fair amount of time to get up-to-speed. Making new connections per HTTP request is not terribly efficient.

.center[![TCP connection window size](tcp_cwnd.png)]

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# HTTP Keep-Alive

HTTP 1.1 officially added support for the `Connection` header, most commonly used as:

`Connection: keep-alive`

In fact, with HTTP 1.1, the default is `keep-alive`.

The alternative, and the way to signal the end of the HTTP session, is:

`Connection: close`

With a keep-alive HTTP session, the server waits some amount of time for an additional request after processing the most recent request.

---

# HTTP Keep-Alive Session

.center[![HTTP Session with Two Requests](http_session.png)]

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# HTTP Pipelining

With HTTP keep-alive, we can make multiple requests on a single TCP connection. Success!

But, we're still waiting for the current response before we can issue the next request.

Why wait?

---

# HTTP Pipelining Session

.center[![HTTP Session with Two Requests](http_pipelining.png)]

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# Think About it

> What sort of issues might occur with HTTP pipelining?

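---

# Back-of-the-Envelope Latency

A rough sketch comparing the three strategies covered so far for two requests. The numbers (56 ms round trip, 40 ms and 20 ms of server processing) are assumed for illustration, and the model ignores TLS, slow start, and transfer time:

```python
def per_request_connections(rtt_ms, processing_ms):
    # Every request pays a TCP handshake round trip plus a
    # request/response round trip, then waits for the server.
    return sum(rtt_ms + rtt_ms + p for p in processing_ms)


def keep_alive(rtt_ms, processing_ms):
    # One handshake; each request still waits for the previous response.
    return rtt_ms + sum(rtt_ms + p for p in processing_ms)


def pipelined(rtt_ms, processing_ms):
    # One handshake; requests are sent back to back, and the server
    # is assumed to process them one after another, in order.
    return rtt_ms + rtt_ms + sum(processing_ms)


rtt_ms = 56               # assumed round-trip time
processing_ms = [40, 20]  # assumed server time per request

print(per_request_connections(rtt_ms, processing_ms))  # 284
print(keep_alive(rtt_ms, processing_ms))               # 228
print(pipelined(rtt_ms, processing_ms))                # 172
```
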
---

# Pipelining Issues

## Head of Line Blocking

Head of line blocking means that a slow request at the front of the queue delays every response behind it, because responses must be returned in request order.

--

## Extra work for the server

Pipelining may require the server to buffer future responses while blocked on the head of the line. These extra resources can exhaust the server. Furthermore, if an error occurs the server may end up doing the same _work_ twice.

--

## Adoption

Many intermediaries (proxies, caches) simply do not support HTTP pipelining, making the feature less appealing.

---

# More Speed

A single web page may have tens of resources.

In practice, obtaining each resource serially over the same TCP connection is too slow.

> What can be done to get more speed?

---

# Concurrent HTTP Sessions

Most browsers will open up to __six__ concurrent TCP connections to the same server.

.center[![Concurrent HTTP Sessions](concurrent_http.png)]

Image Source: “High Performance Browser Networking,” by Ilya Grigorik

---

# Even more speed

According to the HTTP Archive, the average number of resources per crawled website is `~100`.

With up to six connections, each HTTP session must fetch approximately `17` resources. With head-of-line blocking this may still be too slow.

> What can we do?

---

# Domain Sharding

_Domain sharding_ is the process of separating resources to different domains, e.g., `i.ytimg.com`, `s.ytimg.com`.

The web browser will make up to `6` connections for each domain.

Reduces page load time for some workloads (test to see if it's right for you).

---

# Hacks that work

Concurrent TCP sessions and domain sharding are hacks to get more performance out of an existing system (HTTP/1.1).

Using these hacks makes application development and deployment more complicated.

## Other Performance Related Hacks

* CSS/JS concatenation and minification
* Image spriting

## The Future is Here

In a future lecture we'll talk about how HTTP/2 obviates many of these hacks.

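---

# Sharding Arithmetic

A small sketch of the arithmetic behind concurrent connections and domain sharding. It assumes resources spread evenly across connections, which real pages rarely achieve; `requests_per_connection` is a hypothetical helper:

```python
import math


def requests_per_connection(resources, domains=1, connections_per_domain=6):
    # Worst-case number of resources the busiest connection must fetch
    # when resources are divided evenly across all open connections.
    total_connections = domains * connections_per_domain
    return math.ceil(resources / total_connections)


print(requests_per_connection(100))             # 17 (one domain, 6 connections)
print(requests_per_connection(100, domains=2))  # 9  (two shards, 12 connections)
```

Each extra shard also costs a DNS lookup and new TCP handshakes, which is why the benefit depends on the workload.
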
---

# Chrome Network Table Revisited

Does this table make sense now?

.center[![Chrome Network Table](chrome_network_table.png)]