Distributed Transactions & Replication
Transaction Management in R*
Unravels details of logging & messages sent.
Assumptions
- update in place, WAL
- batched force of log records
Desired Characteristics
- guaranteed xact atomicity
- ability to "forget" outcome of commit ASAP
- minimal log writes & message traffic
- optimized performance in no-failure case
- exploitation of completely or partially R/O xacts
- maximize ability to perform unilateral abort
In order to minimize logging and comm:
- rare failures do not deserve extra overhead in normal processing
- hierarchical commit better than 2P
Normal Processing (2PC)
    Coordinator Log        Messages           Subordinate Log
    ---------------        --------           ---------------
                           PREPARE ->
                                              prepare*/abort*
                           <- VOTE YES/NO
    commit*/abort*
                           COMMIT/ABORT ->
                                              commit*/abort*
                           <- ACK
    end
    (* = forced log write; end record is async)
since subords force abort (& commit) before ACKing, they never need to ask coord
about final outcome.
Rule: never need to ask something you used to know; log before ACKing.
Guarantees atomicity.
Total cost:
    subords: 2 forced log writes (prepare, commit), 2 messages (YES, ACK)
    coord: 1 forced log write (commit), 1 async log write (end), 2 messages/subord (PREPARE, COMMIT)
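The normal-case flow above can be sketched in Python. This is an illustrative toy, not R* code: class names and record tags are made up, everything runs sequentially, and for simplicity the decision is sent to every subord (a real NO voter has already aborted unilaterally and can be skipped).

```python
class Log:
    def __init__(self):
        self.records = []
    def force(self, rec):
        # forced write: blocks until rec is on stable storage
        self.records.append(rec)
    def write(self, rec):
        # async write: may be lost in a crash
        self.records.append(rec)

class Subordinate:
    def __init__(self, will_commit=True):
        self.log = Log()
        self.will_commit = will_commit
    def on_prepare(self):
        if self.will_commit:
            self.log.force("prepare")   # force BEFORE voting YES
            return "YES"
        self.log.force("abort")         # a NO voter aborts unilaterally
        return "NO"
    def on_decision(self, decision):
        self.log.force(decision)        # force BEFORE ACKing
        return "ACK"

class Coordinator:
    def __init__(self, subs):
        self.log = Log()
        self.subs = subs
    def run(self):
        votes = [s.on_prepare() for s in self.subs]            # phase 1
        decision = "commit" if all(v == "YES" for v in votes) else "abort"
        self.log.force(decision)                               # the commit point
        acks = [s.on_decision(decision) for s in self.subs]    # phase 2
        if all(a == "ACK" for a in acks):
            self.log.write("end")     # async; coord may now forget the xact
        return decision
```

Note the rule in action: every participant logs its state before sending the message that reveals that state.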
2PC & Failures
Recovery process per site handles xacts committing at crash, as well as incoming
recovery messages.
- on restart, read the log and accumulate info about committing xacts in main memory
- if you discover a local xact in the prepared state, contact the coord to find out what to do
- if you discover a local xact that never got prepared, UNDO it, write an abort record, and forget it
- if a local xact was committing (i.e. this site is the coord), send COMMIT msgs to subords that haven't ACKed. Similarly for aborting.
Upon discovering a failure elsewhere:
    If a coord discovers that a subord is unreachable...
        while waiting for its vote: coord aborts the xact as usual
        while waiting for an ACK: coord hands the xact to the recovery manager
    If a subord discovers that a coord is unreachable...
        if it hasn't sent a YES vote yet: abort ("unilateral abort")
        if it has sent a YES vote: subord hands the xact to the recovery manager
    If a recovery mgr receives an inquiry from a subord in the prepared state
        if main-mem info says the xact is committing or aborting, send COMMIT/ABORT
        if main-mem info says nothing...?
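The restart rules above can be condensed into one decision function (a sketch; state names are illustrative, and the unanswered "no info" case is deferred to the presumption schemes below):

```python
def restart_action(role, state, all_acked=False):
    """What the recovery process does for one xact found in the log at restart."""
    if role == "subord":
        if state == "prepared":
            return "ask coord for outcome"       # outcome unknowable locally
        return "undo, write abort record, forget"  # never got prepared
    # coordinator side: the xact was committing/aborting at the crash
    if state in ("committing", "aborting") and not all_acked:
        return "resend decision to unACKed subords"
    return "write end, forget"
```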
An Aside: Hierarchical 2PC
    If you have a tree-shaped process graph:
        root (which talks to the user) is a coordinator
        leaves are subordinates
        interior nodes are both
    after receiving PREPARE, propagate it to the children
    vote after the children vote; any NO below causes a NO vote
    after receiving a COMMIT message, force-write a commit record, ACK to the parent, and propagate COMMIT to the children. Similarly for ABORT.
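The voting pass of hierarchical 2PC is a simple tree fold; a minimal sketch (the Node class is illustrative):

```python
class Node:
    """One process in the tree-shaped process graph."""
    def __init__(self, local_vote="YES", children=()):
        self.local_vote = local_vote
        self.children = list(children)

def gather_vote(node):
    """Propagate PREPARE down, then vote after the children:
    any NO anywhere below forces a NO vote upward."""
    child_votes = [gather_vote(c) for c in node.children]
    if node.local_vote == "NO" or "NO" in child_votes:
        return "NO"
    return "YES"
```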
Presumed Abort
    recall... if main-mem info says nothing, coord says ABORT
    SO... coord can forget a xact immediately after deciding to abort it! (write abort record, THEN forget)
        the abort record can be an async write
        no ACKs required from subords on ABORT
        no need to remember names of subords in the abort record, nor to write an end record after abort
        if coord sees a subord has failed, it need not pass the xact to the recovery system; it can just ABORT.
Now, look at R/O xacts:
    subords that have only read send READ VOTEs instead of YES VOTEs, release their locks, and write no log records
    logic is: READ & YES = YES, READ & NO = NO, READ & READ = READ
    if all votes are READ, there's no second phase
    the commit record at the coord includes only the YES sites
Tallying up the R/O work:
    nobody writes log records
    nonleaf processes send one message to children
    children send one message (to parent)
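The READ/YES/NO vote-combining rule above can be sketched as:

```python
def combine_votes(votes):
    """Fold subordinate votes per the rule: NO dominates, then YES, then READ.
    READ & YES = YES, READ & NO = NO, READ & READ = READ."""
    if "NO" in votes:
        return "NO"      # any NO aborts the xact
    if "YES" in votes:
        return "YES"     # second phase involves only the YES sites
    return "READ"        # all read-only: no second phase at all
```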
Presumed Commit
    Idea: let's invert the logic above, since commit is the fast path:
        require ACK for ABORT, not COMMIT
        subords force abort records, not commit records
        no information? Presume commit!
    Problem:
        - subord prepares to commit
        - coord crashes
        - on restart, coord aborts the transaction and forgets it
        - subord asks about the transaction; coord says "no info = commit!"
        - subord commits; everyone else does not
    Solution:
        coord records the names of its subords on stable storage before allowing them to prepare (a "collecting" record)
        then it can tell them about aborts on restart
        everything else is analogous (the mirror image) to Presumed Abort
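The two presumptions differ only in how a restarted coordinator answers an inquiry about a forgotten xact; a sketch of that contrast (names illustrative):

```python
def answer_inquiry(scheme, logged_outcome):
    """What the coord tells an inquiring prepared subord after a restart.
    logged_outcome is 'commit'/'abort' if stable storage has info, else None."""
    if logged_outcome is not None:
        return logged_outcome          # answer from the log, as usual
    # No information: the xact was forgotten, so apply the presumption.
    return "abort" if scheme == "presumed-abort" else "commit"
```

The "collecting" record exists precisely so that under presumed commit the coordinator never reaches the `None` branch for a xact it actually aborted.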
Tallying up the R/O work:
    nonleaf writes collecting record (forced) and commit record (async)
    nonleaf sends one message to all children (PREPARE)
    children send one message (to parent)
Performance analysis in the paper:
    PA > 2PC (> = "beats")
    PA > PC for R/O transactions
    for xacts with only one write subord, PC > PA (equal forced log writes, but PA needs an ACK from the subord)
    for xacts with n > 1 write subords, PC >> PA (equal log writes, but PA forces n-1 more of them: subords' commit records are forced under PA, not PC. Also PA sends n extra ACK messages)
    choice between PA and PC could be made on a transaction-by-transaction basis
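A back-of-envelope tally of the committing update-xact costs behind that comparison (counts follow the notes above for n write subordinates; the helper itself is just illustrative):

```python
def cost(scheme, n):
    """Return (forced log writes, messages) for a committing update xact
    with n write subordinates."""
    if scheme == "PA":
        forces = 1 + 2 * n   # coord: commit; each subord: prepare + commit
        msgs = 4 * n         # PREPARE, YES, COMMIT, ACK per subord
    else:                    # PC
        forces = 2 + n       # coord: collecting + commit; each subord: prepare
        msgs = 3 * n         # PREPARE, YES, COMMIT (no ACK on commit)
    return forces, msgs
```

At n = 1 the forced-write counts tie (3 each) and PC saves only the ACK; as n grows, PA forces n-1 more writes and sends n extra messages.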
Gray, et al on Replication
The Upshot: deadlock/reconciliation rates grow as steep polynomials (cubic and worse) in the replication factor.
This gets even worse with disconnected operation (mobile computers).
Background
- eager replication (deadlocks) vs. lazy replication (reconciliations)
- group (update anywhere) vs. master (primary copy)
- scaleup pitfall: replication looks fine on small demos, dies when you scale up
- system delusion: lots of inconsistent versions floating around, too hard to reconcile
them all
Important Observations
- in group mode, every update generates an update at all nodes (i.e. nodes times more work per xact); with every node submitting xacts, this generates nodes^2 more work!
- Eager Group replication:
    - deadlocks grow like nodes^3
    - deadlocks grow like actions^5 (where actions = # of actions per transaction)
    - mobile nodes cannot run when disconnected
- Eager Master replication:
    - reduces deadlocks: behaves like a single-site system with higher TPS
    - deadlocks still grow like actions^5
- Lazy Group replication:
    - a transaction that would wait under eager replication needs to be reconciled here
    - assume waits are rare; deadlock probability is like rare^2 (i.e. very unlikely)
    - so reconciliation is much more common than deadlock
    - reconciliations grow like TPS^2 x (actions x nodes)^3
- Lazy Master replication:
    - like the RPC mechanism of the previous paper (reads should send read-lock requests to the master)
    - deadlock rate at the master grows like (TPS x nodes)^2 x actions^4
    - mobile nodes can't run while disconnected
Intuitions for a Solution
- checkbook example
- Lotus Notes example: convergence semantics (with no new updates & connectivity,
everybody will eventually get the same state). Uses timestamps.
- lost update problem (more recent account update wins, old one is lost)
- solution: commutative operations (e.g. increment/decrement)
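The commutativity point can be made concrete: signed increments/decrements to a balance yield the same final state in every order, so no update is "lost" (the deltas here are hypothetical):

```python
from itertools import permutations

def apply_ops(balance, ops):
    """Apply a set of commutative updates (signed deltas) to a balance."""
    for delta in ops:
        balance += delta     # deposit (+) or withdrawal (-)
    return balance

ops = [+50, -20, +10]        # hypothetical account deltas from different nodes
results = {apply_ops(100, order) for order in permutations(ops)}
# every application order agrees on the final balance
```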
Two-Tier Replication
- The world consists of base nodes and mobile nodes
- mobile nodes contain 2 versions of objects: a (maybe stale) master version, and a
tentative version
- 2 types of xacts:
- base xacts work only on master data, involve at most 1 mobile node
- tentative xacts work only on local tentative data. Only involve data mastered at base
nodes or the local node (no other mobile nodes)
- on reconnect:
- tentative versions are removed
- tentative xacts are rerun as real xacts
- before committing the base xacts, an acceptance criterion is used to make sure the
results are close enough to the original tentative versions
- Features:
- mobile nodes may make tentative updates
- base xacts execute with single-copy serializability
- xact is durable when the base xact completes
- replicas at all sites converge to base system state
- if all xacts commute, there are no reconciliations
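The reconnect step above can be sketched as follows. All names are illustrative, and the acceptance criterion here is a toy numeric-tolerance check on a single balance:

```python
def reconnect(master, tentative_xacts, tolerance=5):
    """Rerun tentative xacts as base xacts against master data.
    Each xact is (update_fn, tentative_result); an xact is rejected if its
    rerun result drifts beyond tolerance from what the mobile user saw."""
    rejected = []
    for update, tentative_result in tentative_xacts:
        candidate = update(master)
        if abs(candidate - tentative_result) <= tolerance:  # acceptance criterion
            master = candidate              # commit as a real base xact
        else:
            rejected.append((update, tentative_result))     # notify mobile user
    return master, rejected

# e.g. a mobile node tentatively deposited 10 against a stale balance of 90
# (tentative result 100); master meanwhile moved to 93, so the rerun yields
# 103 -- within tolerance, so the xact is accepted.
```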
- Their "solution" to the dangers:
- use lazy master with timestamps & commutativity to avoid high deadlock rates
- use 2-tier replication to handle disconnected operation