leasedb: NonExistentShareError: can't find [share] in shares table #1921
The attached incident was seen when running a test with a large number of uploads using the OpenStack cloud backend. The most relevant part seems to be:
I'll avoid speculation in the ticket description, but I don't think this is specific to OpenStack.
Attachment incident-2013-02-20--23-06-20Z-jprthci.flog.bz2 (9798 bytes) added
Okay, now for the speculation:
Notice that the accounting crawler is running concurrently with a PUT. The accounting crawler is what is causing the list requests for prefixes 'lp', 'lq', etc.
What triggers the incident is that the crawler sees a share that unexpectedly disappeared, i.e. it has a leasedb entry but no chunk object(s):
Almost immediately afterward, the concurrent PUT request fails. That request is for the same share that the crawler saw "disappear". It seems that what happened is a race between the crawler and the share creation:
I think this will only happen when the accounting crawler is looking at a prefix while a share is being uploaded to it (that is not difficult to reproduce if you do enough uploads).
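To make the suspected interleaving concrete, here is a minimal sketch in Python (not Tahoe-LAFS code; the names `leasedb`, `backend_objects`, `upload_share`, and `accounting_crawler` are all hypothetical) of how a crawler that checks the backend for every leasedb entry can race with an upload that has registered a share in the leasedb but not yet written its chunk objects:

```python
# Hypothetical sketch of the suspected race; names do not match the real
# Tahoe-LAFS classes.  The leasedb row is created before the chunk objects
# are written, so the crawler can observe the gap in between.

import threading
import time

leasedb = {}             # share id -> lease info (stands in for the SQLite leasedb)
backend_objects = set()  # chunk objects that actually exist in the cloud container
lock = threading.Lock()

def upload_share(share_id):
    with lock:
        leasedb[share_id] = "lease"      # step 1: register the share in the leasedb
    time.sleep(0.1)                      # simulated network delay writing to the cloud
    backend_objects.add(share_id)        # step 2: write the chunk object(s)

def accounting_crawler():
    with lock:
        entries = list(leasedb)
    for share_id in entries:
        if share_id not in backend_objects:
            # The real crawler concluded the share had "disappeared" and
            # dropped its leasedb entry, after which the in-flight PUT for
            # the same share failed.
            print("crawler: share", share_id, "looks like it disappeared")
            with lock:
                leasedb.pop(share_id, None)

t = threading.Thread(target=upload_share, args=("SI/0",))
t.start()
time.sleep(0.05)           # the crawler wakes up in the middle of the upload
accounting_crawler()
t.join()
print("leasedb after upload:", leasedb)  # the new share's entry is gone
```

Running this prints the crawler's "disappeared" message and shows the new share's leasedb entry gone by the time the upload finishes, which matches the failure mode in the attached incident.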
[Sigh, I just noticed that I leaked the Auth-Token. It's fine: the Auth-Token is only valid for 24h, and I'll delete the container contents after that anyway. But that does need fixing.]
Replying to davidsarah:
This has been fixed; header values are no longer logged.
I believe this was fixed in commits 4cd54e36 ("leasedb/accounting crawler: only treat stable shares as disappeared or unleased") and 9ebc0e8b ("OpenStack: fix a type error introduced by the fix to #1921") on the 1909-cloud-openstack branch. Note that this is not on the 1818-leasedb branch, and in general I need to review all patches on 1909-cloud-openstack / 1819-cloud-merge to see whether they are applicable to 1818-leasedb. I have just noted this on #1818.

The problem was indeed not specific to OpenStack (or the cloud backend). The leasedb design doc had the correct design, which was for the accounting crawler to treat shares in states other than STABLE as leased, but that requirement had been missed in the implementation. That reminds me that we need better tests for edge cases in the accounting crawler.
I'll leave this ticket open until the fix is on the 1818-leasedb branch.
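The requirement described above (the accounting crawler treats shares in states other than STABLE as leased, and only ever declares stable shares disappeared or unleased) amounts to a state guard in the crawler. The following is a sketch with hypothetical names, not the actual crawler code; only the STABLE state is named in this ticket, and the other state names are assumptions:

```python
# Hypothetical sketch of the guard described above: shares whose leasedb
# state is not STABLE (e.g. still being uploaded) are treated as leased and
# are never declared "disappeared".  Names are illustrative only.

from enum import Enum

class ShareState(Enum):
    COMING = 1    # upload in progress; chunk objects may not exist yet (assumed name)
    STABLE = 2    # upload complete; chunk objects are expected to exist
    GOING = 3     # deletion in progress (assumed name)

def classify_share(state, chunks_present, leases):
    """Return what the crawler should do with one leasedb entry."""
    if state is not ShareState.STABLE:
        # Unstable shares are treated as leased; leave them alone.
        return "keep"
    if not chunks_present:
        return "mark-disappeared"
    if not leases:
        return "mark-unleased"
    return "keep"

# A share that is still being uploaded is never treated as disappeared,
# even though its chunk objects are not visible in the backend yet.
assert classify_share(ShareState.COMING, chunks_present=False, leases=[]) == "keep"
assert classify_share(ShareState.STABLE, chunks_present=False, leases=["x"]) == "mark-disappeared"
assert classify_share(ShareState.STABLE, chunks_present=True, leases=[]) == "mark-unleased"
```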
I saw another instance of this after the patch that was supposed to have fixed it :-( I haven't investigated that yet.
#1987 may be the same bug as this. I'm not marking them as duplicates because I'm not sure of that yet.
If this is caused by a race between accounting-crawler and leasedb updates, then it would be fixed by #1833.
Replying to zooko:
#1833 proposes a long-running process that is scheduled to delete shares. (They can't be deleted synchronously because an HTTP DELETE is not synchronous.) This would presumably be based on, or directly copied from, the current accounting crawler code, and so would probably have the same race conditions with respect to other share operations.
Replying to daira (comment 11):
You're right, there could still be race conditions. However, wouldn't there be fewer possible race conditions, because now the decref (expire-lease or remove-lease) operation is atomic with the decide-to-rm operation, where with the current lease-crawler design, those two operations can race with each other?
Nitpick: "synchronous with" != "atomic with". Two things can be synchronous, yet an error can cause one of them to happen and not the other. The decref would only be synchronous with, not atomic with, the decide-to-rm operation if #1833 were implemented. (This is why Noether's failure handling model is better: synchronous operations are always atomic! :-)
I don't know for sure whether #1833 would fix this particular race condition, but my intuition is that it wouldn't because the race isn't between the decref and the decide-to-rm.
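A tiny illustration of that distinction (not Tahoe-LAFS code; `FlakyBackend` and its methods are hypothetical): two operations issued back-to-back are synchronous, but if the second one fails after the first has already taken effect, the pair was not atomic:

```python
# Illustration of "synchronous but not atomic": remove_lease() has already
# taken effect by the time delete_share() fails, leaving an intermediate
# state.  Hypothetical names; not Tahoe-LAFS code.

class FlakyBackend:
    def __init__(self):
        self.leases = {"SI/0"}
        self.shares = {"SI/0"}

    def remove_lease(self, share_id):
        self.leases.discard(share_id)                  # decref: succeeds

    def delete_share(self, share_id):
        raise IOError("cloud container unreachable")   # removal step: fails

backend = FlakyBackend()
try:
    backend.remove_lease("SI/0")   # synchronous step 1
    backend.delete_share("SI/0")   # synchronous step 2, which fails...
except IOError:
    pass

# ...so the lease is gone while the share object remains: not atomic.
print(backend.leases)  # set()
print(backend.shares)  # {'SI/0'}
```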
OK, my 4th try at posting.
I would like to test this bug by uploading thousands of files to the grid. Please verify which server/client versions I should use.
This bug is relevant only to the cloud branch. To check out that branch, install git, then clone the Tahoe-LAFS repository and check out the cloud branch.

Then you would need to set up an account at an OpenStack provider such as Rackspace or HP Cloud (if you want to reproduce the bug as originally discovered) or some other supported cloud service. There is some documentation on that here, but you'll probably need additional help (ask in the #tahoe-lafs channel on Freenode), since that documentation isn't complete.
By the way, if you're just starting to contribute to Tahoe, then your efforts are certainly welcomed, but I would recommend looking at some other bug than this hard-to-provoke race condition in experimental code!
#2204 is possibly a duplicate. (Note that that's on the most recent HEAD of the 1819-cloud-merge branch, so if it's a duplicate then this bug is not fixed.)
Milestone renamed
renaming milestone
I believe this is a different issue from #2204. I've fixed #2204 in the branch for #2237. It had to do with recovery from a lost leasedb and apparently nothing to do with a race condition as described here (nor with other interactions with the bucket crawler). The fix was contained entirely within the write implementation provided by ShareSet.

Moving open issues out of closed milestones.
The established line of development on the "cloud backend" branch has been abandoned. This ticket is being closed as part of a batch-ticket cleanup for "cloud backend"-related tickets.
If this is a bug, it is probably genuinely no longer relevant. The "cloud backend" branch is too large and unwieldy to ever be merged into the main line of development (particularly now that the Python 3 porting effort is significantly underway).
If this is a feature, it may be relevant to some future efforts - if they are sufficiently similar to the "cloud backend" effort - but I am still closing it because there are no immediate plans for a new development effort in such a direction.
Tickets related to the "leasedb" are included in this set because the "leasedb" code lives on the "cloud backend" branch and is fairly well intertwined with it. If there is interest in changing the lease implementation at some future time, then that effort will essentially have to be restarted as well.