2-phase commit #1755
Reference: tahoe-lafs/trac-2024-07-25#1755
[fill this in with a description of the 2-phase commit protocol, and how it improves write consistency / allows smaller writes for MDMF without regressions]
See also #1845.
The difficulty of distributed two-phase commit in general is that if the Transaction Manager fails after telling some of the Resource Managers to prepare but before either telling them to commit or telling them to rollback, then they are stuck in this prepared state (i.e. locked).
(See Gray and Reuter's book "Transaction Processing", and see also Gray-1995-“Consensus On Transaction Commit”.)
The role of Transaction Manager in this future extension of Tahoe-LAFS would be filled by the LAFS storage client (i.e. the LAFS gateway) and the roles of Resource Managers would be filled by LAFS storage servers.
There is, of course, no way for a Resource Manager to tell the difference between their Transaction Manager having failed versus being slow or being temporarily disconnected from the network, other than the passage of time with the absence of a new message (either "commit" or "rollback") from the Transaction Manager.
In general, this can become intractable for large distributed systems with many resources being locked, many Transaction Managers which need to fail over to one another (using Paxos to elect a new leader, I suppose), and frequent write-contention.
But in practical terms, I expect Tahoe-LAFS will be able to use 2-phase-commit ("2PC") nicely, because typically the scope of what is locked, who is doing the locking, and how much write-contention we have to support, are all relatively narrow. That is, for the use cases that we expect to be asked to handle, only a single mutable file/dir is locked at a time, and only one or a small number of computers have the write cap to a single mutable file/dir.
I think we intend to support the use case where a small number of writers have shared write access to a mutable file/dir and may occasionally write at the same time as each other. We do not intend to support the use case where a large or dynamic set of writers have write access to the same resources and there may be continuous write collisions that never pause long enough for the distributed system to stabilize.
(I think this is sufficient because I think people who use Tahoe-LAFS will typically use immutables and single-writer-mutables for most of their state management, and rely on shared-writer-mutables only for the sort of "last link in the chain" that can't be managed any other way.)
Another way that Tahoe-LAFS is less fragile than most distributed 2-phase-commit systems is that we've already long since accepted that inconsistency can happen (different storage servers have different versions of a mutable file), and we have mechanisms (repair) in place to recover from that.
So unlike traditional 2PC, 2PC-for-LAFS doesn't have to bear the burden of preventing inconsistency from ever occurring in the distributed system. 2PC for us is just to help multiple writers to coordinate with one another more efficiently, and to help reduce the rate of inconsistency arising within a single storage server. I.e. to allow upload or modification of a mutable share, which may require multiple messages from LAFS storage client to LAFS storage server, without opening a large time window in which a failure of either end or of the connection between them would leave an inconsistent share on that server.
So anyway, we have to come up with a plan for how storage servers (who are playing the role of Resource Manager) handle the case that the storage client (LAFS gateway, Transaction Manager) has told them to prepare and hasn't yet told them whether they should commit or rollback, and then a lot of time passes. As a first strawman argument, I propose a simple hardcoded, fixed, long timeout. Let's say one hour. If your LAFS client hasn't told you whether to commit or to rollback within an hour of asking you to prepare, then you will unilaterally roll back.
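To make the strawman concrete, here is a minimal sketch of how a storage server might track one prepared write and roll it back unilaterally once the timeout passes. It is not Tahoe-LAFS code: `ShareSlot`, its methods, and `PREPARE_TIMEOUT_SECONDS` are hypothetical names invented for illustration, and the sketch assumes a single in-memory slot per share and a lazily checked timeout (checked on the next request rather than by a background timer).

```python
import time

# Assumption: the one-hour strawman timeout proposed above.
PREPARE_TIMEOUT_SECONDS = 60 * 60


class ShareSlot:
    """Hypothetical per-share state on a storage server (Resource Manager)."""

    def __init__(self, committed=b"", clock=time.monotonic):
        self.committed = committed   # data currently visible to readers
        self.pending = None          # data staged by prepare(), not yet visible
        self.prepared_at = None      # clock value when prepare() succeeded
        self._clock = clock          # injectable so the timeout can be tested

    def _expire_if_stale(self):
        # Unilateral rollback: neither "commit" nor "rollback" arrived in time.
        if (self.pending is not None
                and self._clock() - self.prepared_at >= PREPARE_TIMEOUT_SECONDS):
            self.rollback()

    def prepare(self, new_data):
        self._expire_if_stale()
        if self.pending is not None:
            return False             # another prepared write still holds the lock
        self.pending = new_data
        self.prepared_at = self._clock()
        return True

    def commit(self):
        self._expire_if_stale()
        if self.pending is None:
            return False             # nothing prepared, or it already timed out
        self.committed = self.pending
        self.pending = None
        self.prepared_at = None
        return True

    def rollback(self):
        # Discard the staged write; the committed data is left untouched.
        self.pending = None
        self.prepared_at = None
```

In this sketch a client whose commit arrives after the timeout simply gets `False` back and the previously committed share is unchanged, which matches the idea that a slow or vanished Transaction Manager cannot leave the share locked forever.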
See also [//pipermail/tahoe-dev/2012-November/007854.html this mailing list thread].
Replying to zooko:
I'd suggest a shorter timeout than this, say 5 minutes. This is assuming the variation we discussed at the summit where clients can upload their new file contents to a holding area on each server before actually taking the lock. In that case, the lock timeout only needs to be long enough to allow all the servers to receive a lock request and confirm it, then for the client to receive all those confirmations, send out commit messages, and each server to receive their commit message.
If that takes longer than 5 minutes, something is seriously wrong. (Note that as long as the number of servers responding before the timeout is at least the happiness threshold, the file will still be updated. The fact that other servers may time out does not cause any inconsistency that we can't tolerate.)
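The variation described here (upload to a holding area first, then take a short lock, then commit, and count success against the happiness threshold) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Tahoe-LAFS storage protocol: `Server`, `write_with_2pc`, and `LOCK_TIMEOUT_SECONDS` are hypothetical names, servers are plain in-memory objects, and the holding area keeps a single staged value per server, so contention between several concurrent writers is not modelled.

```python
# Assumption: the 5-minute lock timeout suggested above.
LOCK_TIMEOUT_SECONDS = 5 * 60


class Server:
    """Hypothetical storage server with a holding area and a timed lock."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.committed = None    # contents visible to readers
        self.held = None         # holding area: uploaded but not yet committed
        self.locked_at = None    # time the current lock was taken, if any

    def stage(self, data):
        # Phase 0: upload new contents without taking the lock.
        if self.reachable:
            self.held = data
        return self.reachable

    def lock(self, now):
        # Phase 1: take the lock, unless a live (unexpired) lock exists.
        if not self.reachable or self.held is None:
            return False
        if self.locked_at is not None and now - self.locked_at < LOCK_TIMEOUT_SECONDS:
            return False
        self.locked_at = now
        return True

    def commit(self):
        # Phase 2: make the staged contents visible and release the lock.
        if not self.reachable or self.locked_at is None:
            return False
        self.committed, self.held, self.locked_at = self.held, None, None
        return True

    def rollback(self):
        self.locked_at = None


def write_with_2pc(servers, data, happiness, now=0.0):
    """Stage, lock, then commit; succeed if at least `happiness` servers commit."""
    staged = [s for s in servers if s.stage(data)]
    locked = [s for s in staged if s.lock(now)]
    if len(locked) < happiness:
        for s in locked:
            s.rollback()         # too few locks: release them and give up
        return False
    committed = [s for s in locked if s.commit()]
    return len(committed) >= happiness
```

For example, with three servers of which one is unreachable, `write_with_2pc(servers, b"v2", happiness=2)` succeeds and the unreachable server is simply left for repair to fix later, while the same write with `happiness=3` releases its locks and leaves the committed contents alone.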
I meant to say that the reason I don't like the longer timeout is that if, for example, a client's network connection dropped at just the wrong point, the file would be unavailable for writes by other clients for the duration of the timeout.
I think #1920 is an example of a failure (seen in the wild) of a kind that would be prevented by 2PC.
Adding Cc: warner just because I want him to pay attention to this ticket. Not sure if adding Cc: warner works for that...
I'd like to work on this at the Tahoe summit in November.