After today you should...

- understand the importance of relational databases in architecting scalable Internet systems
- understand how to design with concurrency in mind
- have a basic idea regarding how to speed up your data layer:
  - determining where to start
  - understanding which Rails-specific techniques to use
  - being able to identify the source of slowness in an SQL query
Client requests regularly include data updates. Many stateless application servers respond to requests, and clients can be served by any application server.

This setting requires:

- Data that is shared between requests and servers
- Fast access to data
- High availability of shared data
- Intuitive utilization of the data layer, e.g.:
```ruby
if david.balance > 100
  david.withdrawal(100)
  mary.deposit(100)
end
```
There are quite a few existing solutions for a stable data layer.
If your needs fit within an RDBMS's (relational database management system) ability to scale, an RDBMS tends to be the best choice. If your scaling needs exceed RDBMS capabilities, go non-relational.
Transactions are a database concept that allows a system to guarantee certain semantic properties.
Transactions provide control over concurrency.
Having rigorously defined guarantees means we can build correct systems using these databases.
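For example, in Rails the earlier bank transfer can be wrapped in a transaction so that both steps happen or neither does. This is only a sketch: the `Account` model, its `name` lookup, and its `balance` column are assumptions, not part of the Demo app.

```ruby
# Sketch: Account is a hypothetical ActiveRecord model with a
# balance column. Either both updates commit, or neither does.
ActiveRecord::Base.transaction do
  david = Account.find_by!(name: "david")
  mary  = Account.find_by!(name: "mary")
  raise ActiveRecord::Rollback unless david.balance > 100

  david.update!(balance: david.balance - 100)
  mary.update!(balance: mary.balance + 100)
end
```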
- Atomicity
- Consistency (database constraints, e.g., NOT NULL, are preserved)
- Isolation
- Durability
Atomicity and Durability are related and are generally provided by journaling.

Consistency and Isolation are provided by concurrency control (usually implemented via locking).
A schedule is an abstract model used to describe the execution of transactions that run in a database.
| T1 | T2 | T3 |
|---|---|---|
| R(X) | | |
| W(X) | | |
| Commit | | |
| | R(Y) | |
| | W(Y) | |
| | Commit | |
| | | R(Z) |
| | | W(Z) |
| | | Commit |
Time goes top to bottom.
Two actions are said to be in conflict if:

- the actions belong to different transactions
- at least one of the actions is a write operation
- the actions access the same object (read or write)
Conflict (read and write on the same object):

| T1 | T2 |
|---|---|
| R(X) | |
| | W(X) |

No conflict (both actions are reads):

| T1 | T2 |
|---|---|
| R(X) | |
| | R(X) |

No conflict (different objects):

| T1 | T2 |
|---|---|
| R(X) | |
| | W(Y) |
Can we blindly execute transactions in parallel?

No:

- Dirty Read Problem
- Incorrect Summary Problem
A transaction (T2) reads a value written by another transaction (T1) that is later rolled back.

The result of the T2 transaction will put the database in an incorrect state.
| T1 | T2 |
|---|---|
| W(X) | |
| | R(X) |
| Cancel | |
| | W(Y) |
| | Commit |
A transaction (T1) computes a summary over the values of all the instances of a repeated data-item. While that occurs, another transaction (T2) updates some instances of the data-item.

The resulting summary will not reflect a correct result for any deterministic order of the transactions (T1 then T2, or T2 then T1).
| T1 | T2 |
|---|---|
| R(X*) | |
| | W(X^n) |
| | Commit |
| R(X*) | |
| W(Y) | |
| Commit | |

(Here R(X*) denotes T1 reading through all instances of X, and W(X^n) denotes T2 updating one instance, X^n, in the middle of T1's scan.)
| T1 | T2 |
|---|---|
| R(A) | |
| W(A) | |
| | R(A) |
| | W(A) |
| | R(B) |
| | W(B) |
| R(B) | |
| W(B) | |
The interleaved schedule above is conflict equivalent to neither of the two serial schedules:
| T1 | T2 |
|---|---|
| R(A) | |
| W(A) | |
| R(B) | |
| W(B) | |
| | R(A) |
| | W(A) |
| | R(B) |
| | W(B) |
or
| T1 | T2 |
|---|---|
| | R(A) |
| | W(A) |
| | R(B) |
| | W(B) |
| R(A) | |
| W(A) | |
| R(B) | |
| W(B) | |
Why is it important that we have a serializable schedule?
Why not simply execute serially?
A serializable schedule is important for producing consistent results: for example, when keeping track of a bank balance in the database, the outcome must match some serial ordering of the transactions.
A serial execution of transactions is safe but slow.
Most general-purpose relational databases default to employing conflict-serializable and recoverable schedules.

How do these RDBMSs provide conflict-serializable and recoverable schedules?
In order to implement a database whose schedules are both conflict serializable and recoverable, locks are used.
A lock is a system object associated with a shared resource such as a data item, a row, or a page in memory.
A database lock may need to be acquired by a transaction before accessing the object.
Locks prevent undesired, incorrect, or inconsistent operations on shared resources by concurrent transactions.
Write lock (also called exclusive lock): blocks both reads and writes by other transactions.

Read lock (also called shared lock): blocks writes; other reads can happen concurrently.
Two-phase locking (2PL) is a concurrency control method that guarantees serializability.

The two phases are:

1. Acquire Locks
2. Release Locks

The locking rules are:

- Each transaction must obtain a shared lock on an object before reading.
- Each transaction must obtain an exclusive lock on an object before writing.
- If a transaction holds an exclusive lock on an object, no other transaction can obtain any lock on that object.
- A transaction cannot request additional locks once it releases any locks.
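A minimal sketch of the two phases in code, assuming a hypothetical `LockManager` API (this is not a real Rails or database interface):

```ruby
# Sketch only: LockManager and its methods are hypothetical.
lm = LockManager.new

# Phase 1 (growing): acquire every lock before releasing any.
lm.acquire_shared(:A)     # needed before reading A
lm.acquire_exclusive(:B)  # needed before writing B
# ... perform reads of A and writes of B here ...

# Phase 2 (shrinking): once any lock is released, this
# transaction may not acquire new locks.
lm.release(:A)
lm.release(:B)
```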
During the second phase locks can be released as soon as they are no longer needed. This permits other transactions to obtain those locks.

| T1 | T2 |
|---|---|
| R(A) | |
| W(A) | |
| unlock(A) | |
| | R(A) |
| ... | ... |
| cancel | |
What happens to T2 when T1 cancels the transaction?
2PL can result in cascading rollbacks.

Strong strict two-phase locking (SS2PL) allows only conflict serializable schedules:

- Each transaction must obtain a shared lock on an object before reading.
- Each transaction must obtain an exclusive lock on an object before writing.
- If a transaction holds an exclusive lock on an object, no other transaction can obtain any lock on that object.
- All locks held by a transaction are released when the transaction completes.

This approach avoids the problem with cascading rollbacks.
What is the downside to SS2PL?
We have many application servers running our application.
We are using a relational database to ensure that each request observes a consistent view of the database.
What is required in Rails to obtain this consistency?
Rails uses two types of concurrency control:

- Optimistic: assume that database modifications are going to succeed; throw an exception when they do not.
- Pessimistic: ensure that database modifications will succeed by explicitly avoiding conflict.
Any ActiveRecord model will automatically utilize optimistic locking if an integer `lock_version` field is added to the object's table.

Whenever such an ActiveRecord object is read from the database, that object contains its associated `lock_version`.

When an update for the object occurs, Rails compares the object's `lock_version` to the most recent one in the database.

If they differ, an `ActiveRecord::StaleObjectError` is raised. Otherwise, the data is written to the database, and the `lock_version` value is incremented.

This optimistic locking is an application-level construct. The database does nothing more than store the `lock_version` values.

https://api.rubyonrails.org/classes/ActiveRecord/Locking/Optimistic.html
First, add the column via a migration:

```
rails g migration AddLockVersionToPeople lock_version:integer
```

Then:

```ruby
c1 = Person.find(1)
c1.name = "X Æ A-12"

c2 = Person.find(1)
c2.gender = "X-Ash-A-12"

c1.save! # Succeeds
c2.save! # raises ActiveRecord::StaleObjectError
```
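A common usage pattern (a sketch, not from the Demo app) is to rescue the error, reload, and retry the update:

```ruby
# Sketch: reload-and-retry on conflict. Bound the number of
# retries so heavy contention cannot loop forever.
attempts = 0
begin
  person = Person.find(1)
  person.name = "X Æ A-12"
  person.save!
rescue ActiveRecord::StaleObjectError
  attempts += 1
  retry if attempts < 3
  raise
end
```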
Pessimistic locking is easily done by calling `lock` along with ActiveRecord `find`.

Whenever an ActiveRecord object is read from the database with that option, an exclusive lock is acquired for the object.

While this lock is held, the database prevents others from obtaining the lock, reading from, and writing to the object. The others are blocked until the object is unlocked.

Implemented using SQL's `SELECT ... FOR UPDATE`.
https://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html
Request 1

```ruby
Person.transaction do
  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  person.save! # works fine
end
```

Request 2

```ruby
Person.transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  person.save! # works fine
end
```

Request 2 simply blocks until Request 1's transaction commits, then proceeds with fresh data.
This approach works great, yet it is not commonly used.
What could go wrong?
Request 1

```ruby
Person.transaction do
  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  my_long_procedure
  person.save!
end
```

Request 2

```ruby
Person.transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  my_long_procedure
  person.save!
end
```

Every other request that needs this row is blocked for the full duration of `my_long_procedure`.
Request 1

```ruby
Family.transaction do
  family = Family.lock.find(3)
  family.count += 1
  family.save!

  person = Person.lock.find(1)
  person.name = "X Æ A-12"
  person.save!
end
```

Request 2

```ruby
Person.transaction do
  person = Person.lock.find(1)
  person.name = "X-Ash-A-12"
  person.save!

  family = Family.lock.find(3)
  family.count += 1
  family.save!
end
```

If Request 1 holds the Family lock while Request 2 holds the Person lock, each waits forever on the other: a deadlock.
Downsides:

- Have to worry about deadlocks and avoiding them
- Performance is less predictable
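One standard mitigation (a sketch, not Rails-specific machinery) is to acquire locks in a consistent global order in every request, so no cycle of waiting can form:

```ruby
# Sketch: both requests lock Family before Person (a fixed global
# order), so neither can hold one lock while waiting on the other.
Family.transaction do
  family = Family.lock.find(3)
  person = Person.lock.find(1)

  family.count += 1
  person.name = "X Æ A-12"
  family.save!
  person.save!
end
```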
We now have a Rails app connected to PostgreSQL and it is slower than we'd like.
You think the bottleneck may be the database.
How do we find out?
Let's use an example from the Demo app!
By changing the way you interact with the Rails ORM (ActiveRecord) you can significantly improve performance.
The following examples are all contained in the database_optimizations branch on GitHub.
With the Demo app's master branch deployed on an m3_medium instance, a request to `/` resulted in:

```
Completed 200 OK ActiveRecord: 220.6ms
```
Why so slow?
First Step: Find out what Rails is doing.
In development mode, Rails will output the SQL it generates and executes to the application server log.
To (temporarily) enable debugging in production mode, change `config/environments/production.rb` to contain:

```ruby
config.log_level = :debug
```
Controller

```ruby
class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.all
  end
end
```

View

```erb
<% @submissions.each do |submission| %>
  <tr>
    <td><%= link_to(submission.title, submission.url) %></td>
    <td><%= submission.url %></td>
    <td><%= submission.community.name %></td>
    <td>
      <%= link_to("#{submission.comments.size} comments",
                  submission, class: 'btn btn-primary btn-xs') %>
    </td>
  </tr>
<% end %>
```
```
Processing by SubmissionsController#index as HTML
Submission Load (0.5ms) SELECT `submissions`.* FROM `submissions`
Community Load (0.3ms) SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 1
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 2
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 3
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 1 LIMIT 1 [["id", 1]]
...
SELECT `communities`.* FROM `communities` WHERE `communities`.`id` = 20 LIMIT 1 [["id", 20]]
SELECT COUNT(*) FROM `comments` WHERE `comments`.`submission_id` = 400
```
Too many SELECT queries! We are issuing a ton of SELECT queries, and the overhead associated with each is slowing us down.

How can we fix this problem?

- Do not ask for the community each time
- Do not ask for the number of comments each time
(Before) Without includes

```ruby
class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.all
  end
end
```

(After) With includes

```ruby
class SubmissionsController < ApplicationController
  def index
    @submissions = Submission.includes(:comments)
                             .includes(:community).all
  end
end
```

Result: ActiveRecord 39.6ms

```
Submission Load (0.9ms) SELECT `submissions`.* FROM `submissions`
Comment Load (38.3ms) SELECT `comments`.* FROM `comments` WHERE `comments`.`submission_id` IN (1, 2, ... 399, 400)
Community Load (0.4ms) SELECT `communities`.* FROM `communities` WHERE `communities`.`id` IN (1, 2, ... 19, 20)
```

Three queries now replace the hundreds issued before, cutting ActiveRecord time from 220.6ms to 39.6ms.
Sometimes things are still slow even when the number of queries is minimized.
SQL provides an `EXPLAIN` statement that can be used to analyze individual queries.

When a query starts with `EXPLAIN`...

- the query is not actually executed
- the produced output will help us identify potential improvements (e.g., sequential vs. index scan, startup vs. total cost)
https://www.postgresql.org/docs/9.1/sql-explain.html
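From Rails, a similar plan can be obtained with `ActiveRecord::Relation#explain` (a sketch; the `community_id` filter column is assumed from the Demo app's schema):

```ruby
# Prints the database's query plan for the SQL this relation
# would generate (community_id is assumed to exist on submissions).
puts Submission.where(community_id: 1).explain
```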
There are three primary ways to optimize SQL queries:

1. Add or modify indexes
2. Modify the table structure
3. Directly optimize the query
What is an index?

- An index is a fast, ordered, compact structure (often a B-tree) for identifying row locations.
- When an index is provided on a column that is to be filtered (searching for a particular item), the database is able to quickly find that information.
- Indexes can exist on a single column, or across multiple columns. Multi-column indexes are useful when filtering on two columns (e.g., CS classes that are not full).
To add an index on the `name` field of the `Product` table, create a migration containing:

```ruby
class AddNameIndexProducts < ActiveRecord::Migration
  def change
    add_index :products, :name
  end
end
```
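A multi-column index for the earlier example might look like this (a sketch; the `courses` table and its columns are hypothetical):

```ruby
# Sketch: composite index supporting queries that filter on
# department and status together (e.g., open CS classes).
class AddDepartmentStatusIndexToCourses < ActiveRecord::Migration
  def change
    add_index :courses, [:department, :status]
  end
end
```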
By default, when dealing with relationships between ActiveRecord objects, Rails will validate the constraints in the application layer.

For example, an `Employee` object should have a `Company` that they work for. Assuming the relationship is defined properly, Rails will enforce that when creating an `Employee`, the associated `Company` exists.

Many databases have built-in support for enforcing such constraints. With Rails, one can also take advantage of the database's foreign key support via `add_foreign_key`:

```ruby
class AddForeignKeyToEmployees < ActiveRecord::Migration
  def change
    add_foreign_key :employees, :companies
  end
end
```
Indexes work best when they can be kept in memory.
Sometimes changing the field type or index length can provide significant memory savings. Some options are:

- Reduce the length of a VARCHAR index if appropriate
- Use a smaller unsigned integer type
- Use an integer or enum field for statuses rather than a text-based value
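For instance, a status column could be migrated from text to a small integer (a sketch; the `orders` table and its columns are hypothetical):

```ruby
# Sketch: store status as a 2-byte integer (smallint in PostgreSQL)
# instead of a string; an ActiveRecord enum can then map status
# names to these integer values.
class ChangeStatusTypeOnOrders < ActiveRecord::Migration
  def change
    remove_column :orders, :status, :string
    add_column :orders, :status, :integer, limit: 2, default: 0, null: false
  end
end
```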
Before

```
explain select count(*) from txns where parent_id - 1600 = 16340;
select_type: SIMPLE
table: txns
type: index
key: index_txns_on_reverse_txn_id
rows: 439186
Extra: Using where; Using index
```

After

```
explain select count(*) from txns where parent_id = 16340 + 1600;
select_type: SIMPLE
table: txns
type: const
key: index_txns_on_reverse_txn_id
rows: 1
Extra: Using index
```

Moving the arithmetic off the indexed column lets the database compare `parent_id` directly against a constant, so the index locates the single matching row instead of scanning 439,186 of them.
After today you should...

- understand the importance of relational databases in architecting scalable Internet systems
- understand how to design with concurrency in mind
- have a basic idea regarding how to speed up the data layer
- determine where to start
- understand which Rails-specific techniques to use
- identify the source of slowness in an SQL query