
Relational Databases

CS291A: Scalable Internet Services

1 / 106

Relational Databases

After today you should...

  • understand the importance of relational databases in architecting scalable Internet systems

  • understand how to design with concurrency in mind

  • have a basic idea regarding how to speed up your data layer

    • determining where to start

  • understanding which Rails-specific techniques to use

    • being able to identify the source of slowness in an SQL query

7 / 106

Stateful Requests from Clients

Client requests regularly include data updates

Many stateless application servers responding to requests

Clients can be served by any application server

[Diagram: database connectivity]

8 / 106

Application Server Needs

  • Data that is shared between requests and servers

  • Fast access to data

  • High availability of shared data

  • Intuitive utilization of the data layer

if david.balance > 100
  david.withdrawal(100)
  mary.deposit(100)
end
12 / 106
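
A hedged sketch of how the balance-transfer pseudocode above maps onto a real transaction (ActiveRecord syntax; the david and mary account objects and their withdrawal/deposit methods come from the pseudocode above, not from any actual model):

ActiveRecord::Base.transaction do
  # The guard and both updates succeed or fail as a unit.
  raise ActiveRecord::Rollback unless david.balance > 100

  david.withdrawal(100)
  mary.deposit(100)
end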

Data Layer Options

There are quite a few existing solutions for a stable data layer.

Relational (SQL)

  • MySQL (MariaDB, Percona)
  • PostgreSQL
  • Oracle
  • MS SQL

Non-relational (NoSQL)

  • Cassandra
  • MongoDB
  • Redis
14 / 106

SQL v. NoSQL

Relational Databases...

  • are a general-purpose persistence layer
  • offer more features
  • have a limited ability to scale horizontally

Non-relational Databases...

  • often are more specialized
  • require more from the application layer
  • are better at scaling horizontally

If your needs fit within a relational database management system's (RDBMS) ability to scale, an RDBMS tends to be the best choice. If your scaling needs exceed RDBMS capabilities, go non-relational.

18 / 106
  • If you need ACID properties, you should use a relational database.
    • ACID: Atomicity, Consistency, Isolation, Durability

Database Outline

Today we will discuss:

  • Relational databases
    • Concurrency control
    • SQL query analysis

Later we will discuss:

  • Scaling options for relational databases:
    • Sharding
    • Service Oriented Architectures (SOA)
    • Distinguishing reads from writes
  • A survey of NoSQL options
20 / 106

Database Transactions

Transactions are a database concept that allows a system to guarantee certain semantic properties.

Transactions provide control over concurrency.

Having rigorously defined guarantees means we can build correct systems using these databases.

21 / 106

Database ACID Properties - Refresher

Atomicity

  • Complete everything, or do nothing
  • No partial application of a transaction

Consistency

  • The database should be consistent both at the beginning and end of a transaction
  • Consistency is defined by the integrity constraints (e.g., foreign keys, NOT NULL, etc.)

Isolation

  • A transaction should not see the effects of other uncommitted transactions

Durability

  • Once committed, the transaction's effects should not disappear
26 / 106

Overlapping Concerns

Atomicity and Durability are related and are generally provided by journaling.

Consistency and Isolation are provided by concurrency control (usually implemented via locking).

  • Questions?
28 / 106

Definition: Schedule

A schedule is an abstract model used to describe the execution of transactions that run in a database.

T1        T2        T3
R(X)
W(X)
Commit
          R(Y)
          W(Y)
          Commit
                    R(Z)
                    W(Z)
                    Commit

Time goes top to bottom.

29 / 106

Conflicting Actions

Two actions are said to be in conflict if:

  • the actions belong to different transactions

  • at least one of the actions is a write operation

  • the actions access the same object (read or write)

Conflicting Actions

T1        T2
R(X)
          W(X)
34 / 106

Non-Conflicting Examples

All reads

T1        T2
R(X)
          R(X)

Write to different object

T1        T2
R(X)
          W(Y)
36 / 106

Question

Can we blindly execute transactions in parallel?

Answer

No

  • Dirty Read Problem

  • Incorrect Summary Problem

38 / 106

Dirty Read Problem

A transaction (T2) reads a value written by another transaction (T1) that is later rolled back.

The result of the T2 transaction will put the database in an incorrect state.

T1        T2
W(X)
          R(X)
Cancel
          W(Y)
          Commit
39 / 106

Incorrect Summary Problem

A transaction (T1) computes a summary over the values of all the instances of a repeated data-item. While that occurs, another transaction (T2) updates some instances of that data-item.

The resulting summary will not reflect a correct result for any deterministic order of the transactions (T1 then T2, or T2 then T1).

T1        T2
R(X*)
          W(X^n)
          Commit
R(X*)
W(Y)
Commit
40 / 106

Schedule Types

A schedule is serial if

  • the transactions are executed non-interleaved
  • one transaction finishes before the next one starts

Two schedules are conflict equivalent if

  • they involve the same set of transactions
  • every pair of conflicting actions are ordered in the same way

A schedule is conflict serializable if

  • the schedule is conflict equivalent to a serial schedule

A schedule is recoverable if

  • transactions commit only after all transactions whose changes they read have committed
44 / 106

Not Conflict Serializable

T1        T2
R(A)
W(A)
          R(A)
          W(A)
          R(B)
          W(B)
R(B)
W(B)

When the transactions are serialized, the schedule above is conflict equivalent to neither of:

T1        T2
R(A)
W(A)
R(B)
W(B)
          R(A)
          W(A)
          R(B)
          W(B)

or

T1        T2
          R(A)
          W(A)
          R(B)
          W(B)
R(A)
W(A)
R(B)
W(B)
45 / 106

Why Serializable?

Why is it important that we have a serializable schedule?

Why not simply execute serially?

[Diagram: database connectivity]

46 / 106

Serial Schedule

Having a serial schedule is important for consistent results, for example when keeping track of your bank balance in the database.

A serial execution of transactions is safe but slow.

Most general purpose relational databases default to employing conflict-serializable and recoverable schedules.

How do these RDBMSs employ conflict-serializable and recoverable schedules?

47 / 106

Database Locks

In order to implement a database whose schedules are both conflict serializable and recoverable, locks are used.

  • A lock is a system object associated with a shared resource such as a data item, a row, or a page in memory.

  • A database lock may need to be acquired by a transaction before accessing the object.

  • Locks prevent undesired, incorrect, or inconsistent operations on shared resources by concurrent transactions.

48 / 106

Two Types of Database Locks

Write-lock

  • Blocks writes and reads

  • Also called exclusive lock

Read-lock

  • Blocks writes

  • Also called shared lock (other reads can happen concurrently)

50 / 106

Two-Phase Locking

Two-phase locking (2PL) is a concurrency control method that guarantees serializability.

The two phases are:

  • Acquire Locks

  • Release Locks

Two-phase locking Protocol

  • Each transaction must obtain a shared lock on an object before reading.

  • Each transaction must obtain an exclusive lock on an object before writing.

  • If a transaction holds an exclusive lock on an object, no other transaction can obtain any lock on that object.

  • A transaction cannot request additional locks once it releases any locks.

56 / 106
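
As an illustration only (not how a production lock manager is built), a minimal Ruby sketch of the two phases: locks may be acquired until the first release, after which no further locks may be acquired.

class TwoPhaseTransaction
  def initialize
    @held = {}          # object => :shared or :exclusive
    @shrinking = false  # becomes true after the first release
  end

  def read_lock(object)
    acquire(object, :shared)
  end

  def write_lock(object)
    acquire(object, :exclusive)
  end

  def release(object)
    @shrinking = true   # the shrinking phase has begun
    @held.delete(object)
  end

  private

  def acquire(object, mode)
    raise "2PL violation: cannot acquire a lock after releasing one" if @shrinking
    @held[object] = mode
  end
end

t = TwoPhaseTransaction.new
t.read_lock(:a)
t.write_lock(:b)
t.release(:a)
t.write_lock(:c) # raises "2PL violation: ..."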

Two-Phase Locking: Potential Issue

During the second phase locks can be released as soon as they are no longer needed. This permits other transactions to obtain those locks.

T1           T2
R(A)
W(A)
unlock(A)
             R(A)
...          ...
cancel

What happens to T2 when T1 cancels the transaction?

58 / 106

Strong Strict Two-Phase Locking

2PL can result in cascading rollbacks.

SS2PL allows only conflict serializable schedules.

Strong Strict Two-Phase Locking Protocol

  • Each transaction must obtain a shared lock on an object before reading.

  • Each transaction must obtain an exclusive lock on an object before writing.

  • If a transaction holds an exclusive lock on an object, no other transaction can obtain any lock on that object.

  • All locks held by a transaction are released when the transaction completes.

This approach avoids the problem with cascading rollbacks.

What is the downside to SS2PL?

62 / 106

Concurrency Control in Rails

We have many application servers running our application.

We are using a relational database to ensure that each request observes a consistent view of the database.

What is required in Rails to obtain this consistency?

[Diagram: database connectivity]

64 / 106
  • None of our ACID properties are guaranteed at the application layer.
  • Even with the database providing these guarantees, our application can still have issues.
    • For example, two requests updating the same object
    • How do we handle this?

Concurrency Control in Rails

Rails uses two types of concurrency control:

Optimistic Locking

Assume that database modifications are going to succeed. Throw an exception when they do not.

Pessimistic Locking

Ensure that database modifications will succeed by explicitly avoiding conflict.

66 / 106

Optimistic Locking in Rails

Any ActiveRecord model will automatically utilize optimistic locking if an integer lock_version field is added to the object's table.

Whenever such an ActiveRecord object is read from the database, that object contains its associated lock_version.

When an update for the object occurs, Rails compares the object's lock_version to the most recent one in the database.

If they differ, an ActiveRecord::StaleObjectError is raised.

Otherwise, the data is written to the database, and the lock_version value is incremented.

This optimistic locking is an application-level construct. The database does nothing more than store the lock_version values.

https://api.rubyonrails.org/classes/ActiveRecord/Locking/Optimistic.html

69 / 106

Optimistic Locking Example

c1 = Person.find(1)
c1.name = "X Æ A-12"
c2 = Person.find(1)
c2.gender = "X-Ash-A-12"
c1.save! # Succeeds
c2.save! # raises ActiveRecord::StaleObjectError

Adding the lock_version column:

rails g migration AddLockVersionToPeople lock_version:integer

Strengths

  • Predictable performance
  • Lightweight

Weaknesses

  • Have to write error handling code
  • (or) errors will propagate to your users
72 / 106
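
One hedged way to address the error-handling weakness: rescue the optimistic-locking failure and retry with fresh data. Rails raises ActiveRecord::StaleObjectError; the retry limit below is an arbitrary choice for illustration.

attempts = 0
begin
  person = Person.find(1)   # re-reads the row (and its lock_version) on each attempt
  person.name = "X Æ A-12"
  person.save!
rescue ActiveRecord::StaleObjectError
  attempts += 1
  retry if attempts < 3     # cap retries so repeated conflicts cannot loop forever
  raise
end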

Pessimistic Locking in Rails

Easily done by calling lock along with ActiveRecord's find.

Whenever an ActiveRecord object is read from the database with that option, an exclusive lock is acquired for the object.

While this lock is held, the database prevents other transactions from obtaining the lock, reading from, or writing to the object; they are blocked until the object is unlocked.

Implemented using SQL's SELECT ... FOR UPDATE.

https://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html

73 / 106

Pessimistic Locking Example

Request 1

transaction do
  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  person.save! # works fine
end

Request 2

transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  person.save! # works fine
end

This approach works great, yet it is not commonly used.

What could go wrong?

74 / 106

Pessimistic Locking Issue: Blocking

Request 1

transaction do
  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  my_long_procedure
  person.save!
end

Request 2

transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  my_long_procedure
  person.save!
end
75 / 106

Pessimistic Locking Issue: Deadlock

Request 1

transaction do
  family = Family.lock.find(3)
  family.count += 1
  family.save!
  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  person.save!
end

Request 2

transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  person.save!
  family = Family.lock.find(3)
  family.count += 1
  family.save!
end
76 / 106
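
A common mitigation, sketched with the same models as above (general practice, not something the demo prescribes): have every request acquire locks in one agreed-upon order, e.g. Family before Person, so the circular wait shown above cannot form.

transaction do
  family = Family.lock.find(3)   # always lock Family first...
  person = Person.lock.find(1)   # ...then Person, in every code path

  family.count += 1
  family.save!
  person.name = "X-Ash-A-12"
  person.save!
end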

Pessimistic Locking: Summary

Strengths

  • Failed transactions are incredibly rare or nonexistent

Weaknesses

  • Have to worry about deadlocks and avoiding them

  • Performance is less predictable

78 / 106

Next Step: Finding DB Bottlenecks

We now have a Rails app connected to PostgreSQL and it is slower than you'd like.

You think the bottleneck may be the database.

How do we find out?

79 / 106

Demo App Database Optimizations

Let's use an example from the Demo app!

By changing the way you interact with the Rails ORM (ActiveRecord) you can significantly improve performance.

The following examples are all contained in the database_optimizations branch on GitHub.

[Screenshot: Submissions index view]

80 / 106

Demo App ActiveRecord Time

With the Demo app's master branch deployed on an m3.medium instance with:

  • 20 Communities
  • 400 Submissions
  • 8000 Comments

A request to / resulted in:

Completed 200 OK ActiveRecord: 220.6ms.

Why so slow?

[Screenshot: Submissions index view]

81 / 106

Demo App Investigation

First Step: Find out what Rails is doing.

In development mode, Rails will output the SQL it generates and executes to the application server log.

To (temporarily) enable debugging in production mode, change config/environments/production.rb to contain:

config.log_level = :debug
82 / 106

Submission Index Controller and View

Controller

class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.all
  end
end

View

<% @submissions.each do |submission| %>
  <tr>
    <td><%= link_to(submission.title, submission.url) %></td>
    <td><%= submission.url %></td>
    <td><%= submission.community.name %></td>
    <td>
      <%= link_to("#{submission.comments.size} comments",
                  submission, class: 'btn btn-primary btn-xs') %>
    </td>
  </tr>
<% end %>
83 / 106

Generated SQL Statements

Processing by SubmissionsController#index as HTML
Submission Load (0.5ms) SELECT `submissions`.* FROM `submissions`
Community Load (0.3ms) SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 1
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 2
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 3
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
...
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 20 LIMIT 1 [["id", 20]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 400

That is a lot of SELECT queries!

85 / 106

Number of Queries Bottleneck

We are issuing a ton of SELECT queries. The overhead associated with each is slowing us down.

How can we fix this problem?

Issue fewer queries.

  • Do not ask for the community each time

  • Do not ask for the number of comments each time

87 / 106

Reducing (N+1) Queries in Rails

(Before) Without includes

class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.all
  end
end

(After) With includes

class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.includes(:comments)
                             .includes(:community).all
  end
end

Result

Result: ActiveRecord 39.6ms
Submission Load (0.9ms) SELECT `submissions`.* FROM `submissions`
Comment Load (38.3ms) SELECT `comments`.* FROM `comments` WHERE `comments`.`submission_id` IN (1, 2,... 399, 400)
Community Load (0.4ms) SELECT `communities`.* FROM `communities` WHERE `communities`.`id` IN (1, 2, ...19, 20)
90 / 106
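
Since the view only needs submission.comments.size, a further Rails option (not part of the demo's database_optimizations branch, so treat this as a hedged sketch) is a counter cache, which avoids loading comments entirely:

class Comment < ActiveRecord::Base
  # Keeps submissions.comments_count up to date; comments.size then reads that column.
  belongs_to :submission, counter_cache: true
end

class AddCommentsCountToSubmissions < ActiveRecord::Migration
  def change
    add_column :submissions, :comments_count, :integer, default: 0, null: false
  end
end

Existing rows would need their counters backfilled, e.g. with Submission.reset_counters(id, :comments).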

SQL Explain

Sometimes things are still slow even when the number of queries is minimized.

SQL provides an EXPLAIN statement that can be used to analyze individual queries.

When a query starts with EXPLAIN...

  • the query is not actually executed

  • the produced output will help us identify potential improvements

  • e.g. sequential vs index scan, startup vs total cost

https://www.postgresql.org/docs/9.1/sql-explain.html

91 / 106
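
ActiveRecord exposes this directly: calling explain on a relation runs EXPLAIN on the SQL it generates. For example, with the demo's Submission model (the filter is illustrative):

# From the Rails console: prints the database's query plan for this relation.
puts Submission.where(community_id: 1).explain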

SQL Optimizations

There are three primary ways to optimize SQL queries:

  • Add or modify indexes

  • Modify the table structure

  • Directly optimize the query

95 / 106

SQL Indexes

What is an index?

An index is a fast, ordered, compact structure (often B-tree) for identifying row locations.

When an index is provided on a column that is to be filtered (searching for a particular item), the database is able to quickly find that information.

Indexes can exist on a single column, or across multiple columns. Multi-column indexes are useful when filtering on two columns (e.g., CS classes that are not full).

99 / 106

Adding Indexes in Rails

To add an index on the name field of the Product table, create a migration containing:

class AddNameIndexProducts < ActiveRecord::Migration
  def change
    add_index :products, :name
  end
end
100 / 106
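
A multi-column index (mentioned on the previous slide) is added the same way; the courses table and its department and full columns here are hypothetical:

class AddDepartmentFullIndexToCourses < ActiveRecord::Migration
  def change
    # Composite index for queries that filter on both columns,
    # e.g. CS classes that are not full.
    add_index :courses, [:department, :full]
  end
end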

Related: Foreign Keys

By default, when dealing with relationships between ActiveRecord objects, Rails will validate the constraints in the application layer.

For example, an Employee object should have a Company that they work for.

Assuming the relationship is defined properly, Rails will enforce that when creating an Employee, the associated Company exists.

Many databases have built-in support for enforcing such constraints. With Rails, one can also take advantage of the database's foreign key support via add_foreign_key.

class AddForeignKeyToEmployees < ActiveRecord::Migration
  def change
    add_foreign_key :employees, :companies
  end
end
101 / 106

Optimize the Table Structure

Indexes work best when they can be kept in memory.

Sometimes changing the field type, or index length can provide significant memory savings.

Some options, where appropriate, are:

  • Reduce the length of a VARCHAR index

  • Use a smaller unsigned integer type

  • Use an integer or enum field for statuses rather than a text-based value

102 / 106
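
A hedged sketch of what such changes look like as Rails migrations; the table and column names are hypothetical, and the exact types available depend on the database:

class ShrinkProductsColumns < ActiveRecord::Migration
  def change
    # Shorten an indexed string column so more of its index fits in memory.
    change_column :products, :name, :string, limit: 64

    # Use a smaller integer type for an enum-like status column
    # (for integers, limit is in bytes: 2 maps to smallint on PostgreSQL).
    change_column :products, :status_code, :integer, limit: 2
  end
end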

Directly Optimize the Query

Before

explain select count(*) from txns where parent_id - 1600 = 16340;
select_type: SIMPLE
table: txns
type: index
key: index_txns_on_reverse_txn_id
rows: 439186
Extra: Using where; Using index

After

explain select count(*) from txns where parent_id = 16340 + 1600
select_type: SIMPLE
table: txns
type: const
key: index_txns_on_reverse_txn_id
rows: 1
Extra: Using index
With the arithmetic moved off the column, the database satisfies the query with a single index lookup (type: const, rows: 1) instead of scanning the entire index (type: index, rows: 439186).

104 / 106

Summary

  • understand the importance of relational databases in architecting scalable Internet systems

  • understand how to design with concurrency in mind

  • have a basic idea regarding how to speed up the data layer

    • determine where to start

    • understand which Rails-specific techniques to use

    • identify the source of slowness in an SQL query

105 / 106