repair of mutable files/directories should not increment the sequence number #1209

Open
opened 2010-09-23 23:38:26 +00:00 by gdt · 17 comments

Particularly with my root directory, I often find that 9 shares of seqN are available when 10 are desired. I do a repair, and this results in 10 shares of seqN+1; the 9 old seqN shares are then deleted. The next day there are 9 shares of seqN+1 and 1 of seqN, and the file is again not healthy. This repeats daily.

It seems that the missing seqN shares should simply be generated and placed; then, when servers churn again, it is still likely that 10 shares can be found and no versions become unrecoverable. Perhaps I am missing something, but the current behavior is not stable with intermittently connected servers.

I have observed this problem with directories, but it seems likely that it applies to all ~~im~~mutable files.

tahoe-lafs added the code, minor, defect, 1.8β labels 2010-09-23 23:38:26 +00:00
tahoe-lafs added this to the undecided milestone 2010-09-23 23:38:26 +00:00

gdt: did you mean "mutable" instead of "immutable"? Immutable files don't have a sequence number!

zooko added 1.8.0 and removed 1.8β labels 2010-09-29 12:22:35 +00:00

Sorry, I really meant directories in particular. I have edited the summary and description.

tahoe-lafs changed title from repair of immutable files should not increment the sequence number to repair of directories (all immutable files) should not increment the sequence number 2010-09-29 13:35:29 +00:00

Clarify that the ticket is really about directories, but that it likely applies to all immutable files.

tahoe-lafs changed title from repair of directories (all immutable files) should not increment the sequence number to repair of directories (all immutable files?) should not increment the sequence number 2010-09-29 13:36:31 +00:00

Replying to gdt:

Clarify that the ticket is really about directories, but that it likely applies to all immutable files.

You mean it likely applies to all mutable files, right? :-) Directories are normally mutable, although it is also possible to have immutable directories. But immutable directories don't have sequence numbers. :-)

tahoe-lafs modified the milestone from undecided to soon 2010-10-09 17:02:50 +00:00
tahoe-lafs changed title from repair of directories (all immutable files?) should not increment the sequence number to repair of mutable files/directories should not increment the sequence number 2010-10-09 17:04:13 +00:00
tahoe-lafs modified the milestone from soon to 1.9.0 2011-01-14 18:11:43 +00:00
davidsarah commented 2011-01-14 18:50:53 +00:00

#1210 was a duplicate. Its description was:

If there is 1 share of seqN and 10 shares of seqM, with M > N, the file is not healthy. The fix should be to remove the seqN share and not molest the seqM shares, instead of incrementing the version. This contributes to instability.

tahoe-lafs added major and removed minor labels 2011-01-14 18:50:53 +00:00
tahoe-lafs modified the milestone from 1.9.0 to soon 2011-08-14 01:14:33 +00:00
warner added code-mutable and removed code labels 2012-03-18 01:07:05 +00:00

This is a great idea, especially given one of the failure modes for mutable files: when a share is migrated to a new server, the write-enabler becomes wrong and the share gets "stuck" at some old version, so repair can never get rid of that old share and the file will always appear "unhealthy". In that case, constantly clobbering the perfectly valid most-recent-version shares is obviously wrong.

davidsarah commented 2012-03-29 20:20:43 +00:00

Brian wrote on tahoe-dev:

I haven't looked at that code for a long time, but it occurs to me that what it wants is a checker-results flag that says whether the unhealthy status is due to the presence of multiple versions or not. In terms of the ServerMap object, I think that's just:

len(sm.recoverable_versions()) + len(sm.unrecoverable_versions()) > 1

We only need to do the full download+upload cycle (which increments the version number) if there are multiple versions present. If we're just missing a couple of shares (or some are corrupted and could be replaced), then the number of versions ==1 and we should do a non-incrementing form of repair.

I think we'll need new Repairer code to do that, though. Something to set the new version equal to the existing version, to avoid sending updates to existing correct shares, and something to make sure the generated test-vectors are ok with all that. Not trivial, but not a huge task either.
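To make that concrete, here is a minimal sketch of the check Brian describes, assuming only the two ServerMap methods he names; the wrapper function itself is hypothetical:

```python
# Minimal sketch of the "is more than one version present?" test, using only
# the ServerMap accessors named above; the wrapper function is hypothetical.
def repair_needs_version_bump(sm):
    """Return True if a full download+upload cycle (which increments the
    sequence number) is required, i.e. more than one version is present."""
    num_versions = len(sm.recoverable_versions()) + len(sm.unrecoverable_versions())
    return num_versions > 1
```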

davidsarah commented 2012-03-29 20:22:53 +00:00

Greg Troxel wrote:

More than that - if we have 1 share of M and all shares of N, for N > M, then we really just want to purge (or ignore?) the M share, and not molest the N shares.

The test for this should be like:

  set up 5 servers S
  upload some files
  while 20 iterations
    for s in S
      take s off line
      run repair
      take s back

With the current code, you get repair every time and 100 increments, I think. With your proposal, I think it's still true.

The above test is how the pubgrid feels to me, or used to.
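In Python, Greg's loop might look roughly like the following; the whole harness (set_up_grid, upload_mutable, take_offline, bring_online, repair, seqnum) is assumed for illustration and is not the real Tahoe-LAFS test API:

```python
# Hypothetical sketch of the churn test Greg describes; every helper named
# here is assumed for illustration, not an existing Tahoe-LAFS test API.
def churn_repair_test():
    servers = set_up_grid(num_servers=5)
    filenode = upload_mutable(b"some contents")
    initial_seq = seqnum(filenode)

    for _ in range(20):
        for s in servers:
            take_offline(s)   # file now looks unhealthy (one server's shares missing)
            repair(filenode)  # with the current repairer this bumps the sequence number
            bring_online(s)   # the stale shares on s reappear at the next check

    # 20 iterations over 5 servers = 100 repairs; the goal of this ticket is
    # that the sequence number stays at (or near) initial_seq afterwards.
    assert seqnum(filenode) == initial_seq
```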

davidsarah commented 2012-03-29 20:25:49 +00:00

Brian replied:

On 3/29/12 12:01 PM, Greg Troxel wrote:

More than that - if we have 1 share of M and all shares of N, for N > M, then we really just want to purge (or ignore?) the M share, and not molest the N shares.

Ah, good point. We really only need a new version if there are multiple competing versions with the same sequence number, and if that sequence number is the highest seen. Repair is tricky in that case anyways, since at the Tahoe level we can't do an automatic merge, so we're certainly losing information (if it's just a directory modification, then the directory.py code can re-apply the modification, so that one case might be safe).

Hm, ServerMap.needs_merge() is pretty close already, but it only looks at recoverable versions (it tells you that an update will lose information that would have been recoverable if you'd read the individual versions first.. there are alternate cases where it doesn't matter because the other versions weren't recoverable anyways).

We should add a method to ServerMap that tells us whether a new version is needed or not.

The above test is how the pubgrid feels to me, or used to.

Yup, that test looks right.
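As a sketch of the ServerMap method Brian is asking for, under the criterion from this comment (multiple competing versions at the highest sequence number seen): only recoverable_versions() and unrecoverable_versions() are taken from the ticket; everything else, including the assumption that each version identifier exposes its sequence number as its first element, is hypothetical.

```python
# Hypothetical sketch of a ServerMap helper answering "does repair need a new
# sequence number?"; only recoverable_versions()/unrecoverable_versions() are
# named elsewhere in this ticket, everything else is assumed for illustration.
def needs_new_sequence_number(sm):
    versions = set(sm.recoverable_versions()) | set(sm.unrecoverable_versions())
    if len(versions) <= 1:
        return False  # zero or one version: repair in place, no bump needed
    # assume each version identifier carries its sequence number first
    seqnums = [verinfo[0] for verinfo in versions]
    highest = max(seqnums)
    return seqnums.count(highest) > 1  # competing versions at the top seqnum
```

(davidsarah's counterexample below identifies one more case where a bump is still needed: an unrecoverable version sitting above the highest recoverable one.)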

davidsarah commented 2012-03-29 20:33:59 +00:00

Replying to Brian:

We really only need a new version if there are multiple competing versions with the same sequence number, and if that sequence number is the highest seen.

Counterexample: suppose there is a recoverable version with sequence number S, and an unrecoverable version with sequence number S+1. (There are no versions with sequence number >= S+2.) Then, the new sequence number needs to be S+2, in order for clients not to use the shares from S+1. Deleting the shares with sequence number S+1 is also possible but inferior, since sequence numbers should be monotonically increasing to minimize race conditions.

We should add a method to ServerMap that tells us whether a new version is needed or not.

+1.

tahoe-lafs modified the milestone from soon to 1.10.0 2012-03-29 20:38:22 +00:00
davidsarah commented 2012-03-29 20:56:00 +00:00

Proposed algorithm:

  1. Find the highest recoverable sequence number, R.
  2. Find the highest sequence number for which there exist any shares, S.
  3. If R == S, repair version R in place using the non-incrementing algorithm Brian described above (issuecomment-80210). Otherwise, download version R and upload it as version S+1.

This would also fix #1004, #614, and #1130. IIUC, an implementation of that non-incrementing algorithm (issuecomment-80210) is being worked on in #1382.
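As a sketch, the proposed policy might look like this in a repairer; every helper named here (highest_recoverable_seqnum, highest_seqnum_with_any_shares, repair_in_place, download_version, upload_as) is assumed for illustration rather than an existing Tahoe-LAFS API:

```python
# Hypothetical sketch of the proposed repair policy; all helpers are assumed.
def repair_mutable(filenode, servermap):
    R = highest_recoverable_seqnum(servermap)           # step 1
    if R is None:
        raise ValueError("no recoverable version; cannot repair")
    S = highest_seqnum_with_any_shares(servermap)       # step 2
    if R == S:
        # step 3a: no newer (unrecoverable) version exists, so regenerate the
        # missing or corrupted shares of R without bumping the sequence number
        repair_in_place(filenode, servermap, seqnum=R)
    else:
        # step 3b: shares of a newer, unrecoverable version exist; publish the
        # repaired contents past them so clients never prefer the stale shares
        contents = download_version(filenode, servermap, seqnum=R)
        upload_as(filenode, contents, seqnum=S + 1)
```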


I think davidsarah's proposed algorithm is a good choice. A few comments:

  • If there are shares of a version Q < R, then S = R, not Q. This follows from the algorithm, but a design doc should perhaps make it more obvious: stray shares of a version lower than the highest recoverable version are not a problem.
  • In the case where R is repaired in place, stray shares of lower versions should be removed.
  • In the case where S+1 is uploaded, shares of R, and in fact all shares of versions <= S, should be removed.
  • If R is recoverable and there are shares of some S > R, it's really not clear what should happen. One possibility is to wait a while (days?), keep checking, and hope that enough shares of S appear. But this case is probably very unlikely, and it's unclear what ought to happen, so I would defer that nuance until later.
davidsarah commented 2012-11-20 07:23:58 +00:00

Replying to gdt:

I think davidsarah's proposed algorithm is a good choice. A few comments:

  • if there are shares of a version Q < R, then S = R, not Q. This follows from the algorithm, but in a design doc perhaps should be made more obvious: stray shares of a version less than the highest recoverable version are not a problem.
  • In the case where R is repaired, stray shares of a lower version should be removed.
  • in the case where S+1 is uploaded, shares of R, and actually shares of <=S should be removed.

I agree. To make this more explicit:

  1. Find the highest recoverable sequence number, R. If there is no recoverable sequence number, abort.
  2. Find the highest sequence number for which there exist any shares, S.
  3. If R == S, repair version R using the non-incrementing algorithm described above (issuecomment-80210) and set Recovered := R. Otherwise, download version R, upload it as version S+1, and set Recovered := S+1.
  4. If the client's happiness threshold is met for shares of sequence number Recovered, remove all known shares with sequence numbers < Recovered.

(The reason to do the removal only when the Recovered version is happy is to allow for a partition in which different clients can see different subsets of servers. In that case, removing shares is a bad idea unless we know that the Recovered version has been stored reliably, not just recoverably. Also, we shouldn't remove shares from servers that weren't considered in steps 1 and 2, if they have connected in the meantime.)

  • if R is recoverable and there are shares of S > R, then it's really not clear what should happen. One possibility is to wait for a while (days?), keeping checking, and hoping there are enough S. But this is probably a very unlikely, and it's unclear what ought to happen, so I would defer that nuance to later.

The algorithm says to upload to version S+1 in that case. I think this is the right thing.
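Relative to the earlier sketch, the new part is step 4. A minimal sketch of the happiness-gated cleanup, with share_happiness, known_shares, and delete_share all assumed for illustration:

```python
# Hypothetical sketch of step 4 only: remove stale shares, but only once the
# recovered version is stored happily. All helpers are assumed, not real APIs.
def remove_stale_shares(servermap, recovered_seqnum, happiness_threshold):
    # only clean up once the recovered version is stored *happily*; a client on
    # the wrong side of a partition must never delete the only good shares
    if share_happiness(servermap, seqnum=recovered_seqnum) < happiness_threshold:
        return
    for server, shnum, seqnum in known_shares(servermap):
        # only servers that were part of the original servermap query are
        # touched, and only shares strictly older than the recovered version
        if seqnum < recovered_seqnum:
            delete_share(server, shnum)
```

Gating on happiness rather than mere recoverability is what makes the deletion safe under partition, as the parenthetical above (and daira's later comment) note.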

PRabahy commented 2013-02-18 15:52:21 +00:00

I believe that davidsarah's algorithm should help in most cases. Does it make sense to use a vector clock (http://en.wikipedia.org/wiki/Vector_clock) for even more robustness? I don't believe this should happen often (or ever), but if there is a partial split-brain, where several nodes can connect to more than S storage servers but fewer than half of the total storage servers in the grid, there could still be significant churn. (Did that last sentence make any sense?) I'm probably over-thinking this solution, so feel free to ignore me.


Regarding vector clocks: I wouldn't say overthinking, but I think Tahoe needs to pick a design subspace within the space of all possible network and user behaviors. To date, Tahoe has been designed (judging by what has been done) for the case where all servers are almost always connected. I'm talking about the case where, at any given time, most servers are connected, and most servers are connected almost all the time. In that world, not being connected is still unusual and to be avoided, but when you have 20 servers the odds that one is missing are still pretty high.

So I think this ticket should stay focused on the mostly-connected case. If you want to work on a far more distributed, less-available use case, good luck, but I think it's about 50x the work of fixing /tahoe-lafs/trac-2024-07-25/issues/6271.

davidsarah commented 2013-02-19 06:25:41 +00:00

The vector clock algorithm is designed to ensure causal ordering (assuming cooperative peers) in a general peer-to-peer message passing system. It's inefficient and overcomplicated for this case, and the assumptions it relies on aren't met in any case.

daira commented 2013-06-18 00:49:19 +00:00

Incidentally, the comment:19 algorithm never makes things worse, even when run on an arbitrary partition of the grid, because it only deletes shares of versions lower than a version that is stored happily.

Reference: tahoe-lafs/trac-2024-07-25#1209