Distributed Transactions & Replication
Transaction Management in R*
Unravels details of logging & messages sent.
Assumptions
- update in place, WAL
- batched force of log records
Desired Characteristics
- guaranteed xact atomicity
- ability to "forget" outcome of commit ASAP
- minimal log writes & message traffic
- optimized performance in no-failure case
- exploitation of completely or partially R/O xacts
- maximize ability to perform unilateral abort
In order to minimize logging and comm:
- rare failures do not deserve extra overhead in normal processing
- hierarchical commit better than 2P
Normal Processing (2PC)
    Coordinator Log        Messages           Subordinate Log
    ---------------        --------           ---------------
                           PREPARE ->
                                              prepare*/abort*
                           <- VOTE YES/NO
    commit*/abort*
                           COMMIT/ABORT ->
                                              commit*/abort*
                           <- ACK
    end
    (* = forced log write; end record is async)
since subords force abort (& commit) before ACKing, they never need to ask coord
about final outcome.
Rule: never need to ask something you used to know; log before ACKing.
Guarantees atomicity.
Total cost:
    subords: 2 forced log writes (prepare, commit), 2 messages (YES, ACK)
    coord: 1 forced log write (commit), 1 async log write (end), 2 messages/subord (PREPARE, COMMIT)
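The normal-case flow above can be sketched in Python. This is an illustrative toy, not R* code: class names and record tags are made up, everything runs sequentially, and for simplicity the decision is sent to every subord (a real NO voter has already aborted unilaterally and can be skipped).

```python
class Log:
    def __init__(self):
        self.records = []
    def force(self, rec):
        # forced write: blocks until rec is on stable storage
        self.records.append(rec)
    def write(self, rec):
        # async write: may be lost in a crash
        self.records.append(rec)

class Subordinate:
    def __init__(self, will_commit=True):
        self.log = Log()
        self.will_commit = will_commit
    def on_prepare(self):
        if self.will_commit:
            self.log.force("prepare")   # force BEFORE voting YES
            return "YES"
        self.log.force("abort")         # a NO voter aborts unilaterally
        return "NO"
    def on_decision(self, decision):
        self.log.force(decision)        # force BEFORE ACKing
        return "ACK"

class Coordinator:
    def __init__(self, subs):
        self.log = Log()
        self.subs = subs
    def run(self):
        votes = [s.on_prepare() for s in self.subs]            # phase 1
        decision = "commit" if all(v == "YES" for v in votes) else "abort"
        self.log.force(decision)                               # the commit point
        acks = [s.on_decision(decision) for s in self.subs]    # phase 2
        if all(a == "ACK" for a in acks):
            self.log.write("end")     # async; coord may now forget the xact
        return decision
```

Note the rule in action: every participant logs its state before sending the message that reveals that state.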
2PC & Failures
Recovery process per site handles xacts committing at crash, as well as incoming
recovery messages.
- on restart, read the log and accumulate info about committing xacts in main memory
- if you discover a local xact in the prepared state, contact the coord to find out what to do
- if you discover a local xact that never got prepared, UNDO it, write an abort record, and forget it
- if a local xact was committing (i.e. this site is the coord), send COMMIT msgs to subords that haven't ACKed. Similarly for aborting.
Upon discovering a failure elsewhere:
    If a coord discovers that a subord is unreachable...
        while waiting for its vote: coord aborts the xact as usual
        while waiting for an ACK: coord hands the xact to the recovery manager
    If a subord discovers that a coord is unreachable...
        if it hasn't sent a YES vote yet: abort ("unilateral abort")
        if it has sent a YES vote: subord hands the xact to the recovery manager
    If a recovery mgr receives an inquiry from a subord in the prepared state
        if main-mem info says the xact is committing or aborting, send COMMIT/ABORT
        if main-mem info says nothing...?
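The restart rules above can be condensed into one decision function (a sketch; state names are illustrative, and the unanswered "no info" case is deferred to the presumption schemes below):

```python
def restart_action(role, state, all_acked=False):
    """What the recovery process does for one xact found in the log at restart."""
    if role == "subord":
        if state == "prepared":
            return "ask coord for outcome"       # outcome unknowable locally
        return "undo, write abort record, forget"  # never got prepared
    # coordinator side: the xact was committing/aborting at the crash
    if state in ("committing", "aborting") and not all_acked:
        return "resend decision to unACKed subords"
    return "write end, forget"
```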
An Aside: Hierarchical 2PC
    If you have a tree-shaped process graph:
        root (which talks to the user) is a coordinator
        leaves are subordinates
        interior nodes are both
    after receiving PREPARE, propagate it to the children
    vote after the children vote; any NO below causes a NO vote
    after receiving a COMMIT message, force-write a commit record, ACK to the parent, and propagate COMMIT to the children. Similarly for ABORT.
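The voting pass of hierarchical 2PC is a simple tree fold; a minimal sketch (the Node class is illustrative):

```python
class Node:
    """One process in the tree-shaped process graph."""
    def __init__(self, local_vote="YES", children=()):
        self.local_vote = local_vote
        self.children = list(children)

def gather_vote(node):
    """Propagate PREPARE down, then vote after the children:
    any NO anywhere below forces a NO vote upward."""
    child_votes = [gather_vote(c) for c in node.children]
    if node.local_vote == "NO" or "NO" in child_votes:
        return "NO"
    return "YES"
```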
Presumed Abort
    recall... if main-mem info says nothing, coord says ABORT
    SO... coord can forget a xact immediately after deciding to abort it! (write abort record, THEN forget)
        the abort record can be an async write
        no ACKs required from subords on ABORT
        no need to remember names of subords in the abort record, nor to write an end record after abort
        if coord sees a subord has failed, it need not pass the xact to the recovery system; it can just ABORT.
Now, look at R/O xacts:
    subords that have only read send READ VOTEs instead of YES VOTEs, release their locks, and write no log records
    logic is: READ & YES = YES, READ & NO = NO, READ & READ = READ
    if all votes are READ, there's no second phase
    the commit record at the coord includes only the YES sites
Tallying up the R/O work:
    nobody writes log records
    nonleaf processes send one message to children
    children send one message (to parent)
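The READ/YES/NO vote-combining rule above can be sketched as:

```python
def combine_votes(votes):
    """Fold subordinate votes per the rule: NO dominates, then YES, then READ.
    READ & YES = YES, READ & NO = NO, READ & READ = READ."""
    if "NO" in votes:
        return "NO"      # any NO aborts the xact
    if "YES" in votes:
        return "YES"     # second phase involves only the YES sites
    return "READ"        # all read-only: no second phase at all
```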
Presumed Commit
    Idea: let's invert the logic above, since commit is the fast path:
        require ACK for ABORT, not COMMIT
        subords force abort records, not commit records
        no information? Presume commit!
    Problem:
        - subord prepares to commit
        - coord crashes
        - on restart, coord aborts the transaction and forgets it
        - subord asks about the transaction; coord says "no info = commit!"
        - subord commits; everyone else does not
    Solution:
        coord records the names of its subords on stable storage before allowing them to prepare (a "collecting" record)
        then it can tell them about aborts on restart
        everything else is analogous (the mirror image) to Presumed Abort
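The two presumptions differ only in how a restarted coordinator answers an inquiry about a forgotten xact; a sketch of that contrast (names illustrative):

```python
def answer_inquiry(scheme, logged_outcome):
    """What the coord tells an inquiring prepared subord after a restart.
    logged_outcome is 'commit'/'abort' if stable storage has info, else None."""
    if logged_outcome is not None:
        return logged_outcome          # answer from the log, as usual
    # No information: the xact was forgotten, so apply the presumption.
    return "abort" if scheme == "presumed-abort" else "commit"
```

The "collecting" record exists precisely so that under presumed commit the coordinator never reaches the `None` branch for a xact it actually aborted.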
Tallying up the R/O work:
    nonleaf writes collecting record (forced) and commit record (async)
    nonleaf sends one message to all children (PREPARE)
    children send one message (to parent)
Performance analysis in the paper:
    PA > 2PC (> = "beats")
    PA > PC for R/O transactions
    for xacts with only one write subord, PC > PA (equal forced log writes, but PA needs an ACK from the subord)
    for xacts with n > 1 write subords, PC >> PA (equal log writes, but PA forces n-1 more of them: subords' commit records are forced under PA, not PC. Also PA sends n extra ACK messages)
    choice between PA and PC could be made on a transaction-by-transaction basis
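A back-of-envelope tally of the committing update-xact costs behind that comparison (counts follow the notes above for n write subordinates; the helper itself is just illustrative):

```python
def cost(scheme, n):
    """Return (forced log writes, messages) for a committing update xact
    with n write subordinates."""
    if scheme == "PA":
        forces = 1 + 2 * n   # coord: commit; each subord: prepare + commit
        msgs = 4 * n         # PREPARE, YES, COMMIT, ACK per subord
    else:                    # PC
        forces = 2 + n       # coord: collecting + commit; each subord: prepare
        msgs = 3 * n         # PREPARE, YES, COMMIT (no ACK on commit)
    return forces, msgs
```

At n = 1 the forced-write counts tie (3 each) and PC saves only the ACK; as n grows, PA forces n-1 more writes and sends n extra messages.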
Gray, et al on Replication
The Upshot: deadlock/reconciliation rates grow as steep polynomials (cubic and worse) in the replication factor.
This gets even worse with disconnected operation (mobile computers).
Background
- eager replication (deadlocks) vs. lazy replication (reconciliations)
- group (update anywhere) vs. master (primary copy)
- scaleup pitfall: replication looks fine on small demos, dies when you scale up
- system delusion: lots of inconsistent versions floating around, too hard to reconcile
them all
Important Observations
- in group mode, every update generates an update at all nodes (i.e. nodes times more work per xact); with every node submitting xacts, this generates nodes^2 more work!
- Eager Group replication:
    - deadlocks grow like nodes^3
    - deadlocks grow like actions^5 (where actions = # of actions per transaction)
    - mobile nodes cannot run when disconnected
- Eager Master replication:
    - reduces deadlocks: behaves like a single-site system with higher TPS
    - deadlocks still grow like actions^5
- Lazy Group replication:
    - a transaction that would wait under eager replication needs to be reconciled here
    - assume waits are rare; deadlock probability is like rare^2 (i.e. very unlikely)
    - so reconciliation is much more common than deadlock
    - reconciliations grow like TPS^2 x (actions x nodes)^3
- Lazy Master replication:
    - like the RPC mechanism of the previous paper (reads should send read-lock requests to the master)
    - deadlock rate at the master grows like (TPS x nodes)^2 x actions^4
    - mobile nodes can't run while disconnected
Intuitions for a Solution
- checkbook example
- Lotus Notes example: convergence semantics (with no new updates & connectivity,
everybody will eventually get the same state). Uses timestamps.
- lost update problem (more recent account update wins, old one is lost)
- solution: commutative operations (e.g. increment/decrement)
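The commutativity point can be made concrete: signed increments/decrements to a balance yield the same final state in every order, so no update is "lost" (the deltas here are hypothetical):

```python
from itertools import permutations

def apply_ops(balance, ops):
    """Apply a set of commutative updates (signed deltas) to a balance."""
    for delta in ops:
        balance += delta     # deposit (+) or withdrawal (-)
    return balance

ops = [+50, -20, +10]        # hypothetical account deltas from different nodes
results = {apply_ops(100, order) for order in permutations(ops)}
# every application order agrees on the final balance
```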
Two-Tier Replication
- The world consists of base nodes and mobile nodes
- mobile nodes contain 2 versions of objects: a (maybe stale) master version, and a
tentative version
- 2 types of xacts:
- base xacts work only on master data, involve at most 1 mobile node
- tentative xacts work only on local tentative data. Only involve data mastered at base
nodes or the local node (no other mobile nodes)
- on reconnect:
- tentative versions are removed
- tentative xacts are rerun as real xacts
- before committing the base xacts, an acceptance criterion is used to make sure the
results are close enough to the original tentative versions
- Features:
- mobile nodes may make tentative updates
- base xacts execute with single-copy serializability
- xact is durable when the base xact completes
- replicas at all sites converge to base system state
- if all xacts commute, there are no reconciliations
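The reconnect step above can be sketched as follows. All names are illustrative, and the acceptance criterion here is a toy numeric-tolerance check on a single balance:

```python
def reconnect(master, tentative_xacts, tolerance=5):
    """Rerun tentative xacts as base xacts against master data.
    Each xact is (update_fn, tentative_result); an xact is rejected if its
    rerun result drifts beyond tolerance from what the mobile user saw."""
    rejected = []
    for update, tentative_result in tentative_xacts:
        candidate = update(master)
        if abs(candidate - tentative_result) <= tolerance:  # acceptance criterion
            master = candidate              # commit as a real base xact
        else:
            rejected.append((update, tentative_result))     # notify mobile user
    return master, rejected

# e.g. a mobile node tentatively deposited 10 against a stale balance of 90
# (tentative result 100); master meanwhile moved to 93, so the rerun yields
# 103 -- within tolerance, so the xact is accepted.
```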
- Their "solution" to the dangers:
- use lazy master with timestamps & commutativity to avoid high deadlock rates
- use 2-tier replication to handle disconnected operation