bug in repairer causes sporadic hangs in unit tests #616
There is a bug in DownUpConnector._satisfy_reads_if_possible() (source:src/allmydata/immutable/repairer.py@20090112214120-e01fd-7d241072d30b14d3e243829e952e8c8440e6c461#L127). It should put the leftover bytes back into self.bufs and hand the rest to the result, not put all-but-leftover bytes back and hand the rest to the result! In cases where the input chunks arrive in different sizes than the read requests, this bug could cause a read request to get more or fewer bytes than it asked for. That could lead to data corruption (although not irreversibly so -- it would then upload the same sequence of bytes but in different-sized blocks, which would break the integrity checking but not the ciphertext).
Fortunately, in our current code the writes and the read requests are always of the same size (the block size), so this doesn't happen in practice. I've added an assertion in changeset:c59940852b94ba45 just to make it fail safely if it ever did. I have started writing unit tests for DownUpConnector._satisfy_reads_if_possible() -- it turns out we need unit tests in addition to the functional tests I already wrote in source:src/allmydata/test/test_repairer.py.
This explains the sporadic "lost progress" failure in the functional tests. Hm... could it also explain the "lost progress" behavior that Brian and I witnessed on the testgrid when this code was newly committed to trunk? I hope not, because that would mean I am wrong about the writes and reads always having the same sizes. But I'm pretty sure I am right about that.
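For illustration, here is a minimal sketch of the intended buffer-splitting behavior. This is not the actual DownUpConnector code; the bufs list and the satisfy_read name are hypothetical stand-ins. The idea is: when enough bytes are buffered to satisfy a pending read, the requested prefix goes to the reader and only the leftover tail goes back into the buffer. The bug was, roughly, swapping those two slices, which is harmless only when every write happens to be exactly the size of every read.

    # Minimal sketch of the intended read-satisfaction logic, assuming a
    # hypothetical "bufs" list of byte chunks. Not the real DownUpConnector
    # code, only an illustration of the fix described above.
    def satisfy_read(bufs, bytes_requested):
        """Return exactly bytes_requested bytes if enough are buffered,
        leaving any leftover in bufs; return None if not enough yet."""
        buffered = sum(len(chunk) for chunk in bufs)
        if buffered < bytes_requested:
            return None  # wait for the next write to supply more data
        data = b"".join(bufs)
        result = data[:bytes_requested]    # the bytes the read asked for
        leftover = data[bytes_requested:]  # the *leftover* goes back into bufs
        bufs[:] = [leftover] if leftover else []
        return result

    # The buggy version effectively returned the wrong slice and re-buffered
    # the other, which only matters when write and read sizes differ.
    bufs = [b"abcdef"]
    assert satisfy_read(bufs, 4) == b"abcd"
    assert bufs == [b"ef"]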
As mentioned in #611, we disabled the repair-from-corruption tests, and have only rarely seen lost-progress in the remaining repair-from-deletion test.
Zooko fixed one bug in the repairer which would have caused lost-progress, but didn't see any other obvious ones.
I've seen lost-progress in repair-from-deletion twice now (after zooko's fix), but it's pretty rare (and therefore hard to analyze). Since repair-from-deletion is supposed to be deterministic, the only entropy source remaining is the order in which download reads and upload writes are interleaved, which means it's going to be a long hard struggle to capture enough information for analysis.
So we're going to push this one out to 1.3.1. We'd like to have a perfect repairer in 1.3.0, but we also want to have 1.3.0 soon, and a repairer which hangs once out of every thousand uses might be good enough for that.
fixed in 1.3.0 by changeset:d7dbd6675efa2f25