Tahoe process gone wild #1824
I woke up this morning (~07:20 PDT) to find my machine sluggish to respond. It turned out to be my Volunteer Grid 1 tahoe node, having grown to 24GB virtual, 12GB (of 16GB) real. It was not responsive to the web port. It was using about 0.3% CPU.
It appears to have hung in the middle of my regularly scheduled repair/lease process, which started 14 hours ago; here is the output from that. Note that this cronjob mail didn't come to me until I killed the process, so I suspect it hung in the middle of the third repair, and the tracebacks are merely due to the death of the TCP server.
Further details will be added later; I have no time to hunt for logs now.
When you get back to this, please attach any incident report files from the relevant time period, and please paste in the output of tahoe --version. By the way, I didn't know Volunteer Grid 1 was still in operation! Does it work reliably for you? How many storage servers and how many users does it have?
Replying to kpreid:
This confusing report is probably due to #766, and probably unrelated to the main issue in this ticket.
There are no recent incident report files. There are no recent twistd.log entries. My tahoesvc.log is more interesting, and is full of repetitions (0.11 second intervals) of the following sequence, followed at the end by the startup log after I killed it.
Going back to the beginning of this error (85 megabytes of logs ago), I find this as immediately preceding the spew, and nothing recent before it:
OK, so: None. Wasn't there a previous ticket about that?
This tahoe-dev thread might be the same problem: https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007829.html
Marking this as Priority: Critical, because it seems more important to me than most of the tickets marked Priority: Major.
Iantcho posted an incident report (trigger: he pushed the button to generate an incident report):
https://tahoe-lafs.org/pipermail/tahoe-dev/2012-November/007836.html
Had this happen again today. However, slightly different logs — there was nothing immediately before the repeating message by timestamp. Here is the first repeating message and a couple of its predecessors:
Thank you, kpreid! That last exception showing that rref was None is a smoking gun that may be useful!
Okay, I just (re-?)noticed that the "rref is None" exception was reported by kpreid earlier (comment:89750), and that davidsarah then wrote:
Replying to davidsarah:
Why do you say there is an infinite deferred loop? Your proposed approach sounds good to me!
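To make the question concrete, here is a hedged, generic illustration of the kind of pattern "infinite deferred loop" usually refers to; this is not Tahoe's actual code, and fetch_share, start, and retry_forever are made-up names. An errback that unconditionally reschedules the same failing operation makes the reactor repeat an identical traceback at a fixed interval while the process otherwise appears idle, which is at least consistent with the 0.11-second repetition seen in the logs.

```python
# Hypothetical illustration only; none of these names exist in Tahoe-LAFS.
from twisted.internet import defer, reactor

def fetch_share():
    # Stand-in for an operation that keeps failing the same way every time.
    return defer.fail(RuntimeError("share validation failed"))

def start():
    d = fetch_share()
    d.addErrback(retry_forever)
    return d

def retry_forever(failure):
    print("retrying after:", failure.value)
    # No termination condition: the same failing call is scheduled again.
    reactor.callLater(0.11, start)

if __name__ == "__main__":
    reactor.callLater(0, start)
    reactor.run()   # repeats the retry indefinitely; Ctrl-C to stop
```

Whether anything like this actually exists in retrieve.py is what the rest of this ticket tries to establish.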
#1939 is a possible duplicate.
kpreid, can you check whether the incident logs on #1939 are consistent with this ticket?
They aren't, I think. There's no major spam and the errors don't look similar.
Replying to kpreid:
Okay, thanks for looking at that.
Replying to davidsarah:
#1636, which was fixed in Tahoe-LAFS 1.9.2. kpreid: what version of Tahoe-LAFS generated the output you show here?
The traceback contains a mention of "1.9.0.post28". The same string is still in my build/lib/allmydata/_version.py, if that's the right place to look. I don't know offhand whether I've updated at all since I had this error, but if I haven't, the git revision is changeset:dd2e7a1127353a60 plus the patch at https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1628#comment:89755.
(Why yes, I'm not especially keeping my Tahoe in tip-top shape these days...)
Replying to kpreid:
Great! Then I believe this issue to be fixed in 1.9.2, because it is an instance of #1636. Daira, what did you mean in comment:89751 by "the unhandled exception is causing an infinite deferred loop. Let's fix that separately, even if we change this particular case to be handled."? Do you think we can close this ticket as a dup of #1636 now?
kpreid: you could try upgrading to 1.9.2 (or the shiny new 1.10 pre-release!) and then let us know if this problem or anything like it ever recurs. Thanks for the bug report!
The tracebacks in comment:89750 and comment:89761 do look the same as in #1636.
What I meant was: why was the traceback repeating every 0.11 seconds in comment:89750? There's no mention of that symptom on #1636, and it's difficult to tell whether the patches to retrieve.py in /tahoe-lafs/trac-2024-07-25/issues/6668#comment:89757 would have incidentally fixed it.
Replying to daira:
Maybe it wasn't a loop, but simply that kpreid's deep-check-and-repair job was going to a new file every 0.11 seconds.
That seems unlikely: I have a total of 15 files.
I looked at the code a bit and I didn't notice a way for a loop to happen. The stack trace from the original report features this line:
But in the 1.9.2 version of retrieve.py (source:1.9.2/src/allmydata/mutable/retrieve.py?rev=5480#L725) and in the current version of retrieve.py (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c), the same code is no longer there.
Hm... let's see. There is no way that I can see that _validation_or_decoding_failed could lead to a loop in 1.9.0 (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L720). Is there any way that there could be a loop that calls out to _validation_or_decoding_failed over and over? Well, this stuff here (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L612) is complicated. Could it have a loop? No, I don't see it. Any errback from the call to _validation_or_decoding_failed should... Wait, what happens when you get an exception from your errback? I assume it just errbacks again? Hm, this code is very confusing because there are two variables in different scopes named dl. Anyway, I think the intent is that the errback goes here, where it terminates and doesn't loop: loop() (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d9c1064d42a322b58d3243923d95fd56235d7c89#L252).
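On the "exception from your errback" question, here is a minimal sketch of generic Twisted Deferred behaviour (not Tahoe's retrieve.py; handle_bad_share below is a made-up stand-in). An exception raised inside an errback does not re-enter that errback; it just becomes the new Failure handed to the next errback in the chain, or gets logged as an unhandled error if there is none.

```python
# Minimal sketch, generic Twisted only; handle_bad_share is a hypothetical
# stand-in, not the real function from retrieve.py.
from twisted.internet import defer
from twisted.python.failure import Failure

def handle_bad_share(failure):
    # An errback that itself blows up while handling the original failure.
    raise RuntimeError("secondary failure raised inside the errback")

d = defer.Deferred()
d.addErrback(handle_bad_share)

seen = []
d.addErrback(seen.append)   # receives a Failure wrapping the RuntimeError

d.errback(Failure(ValueError("original validation failure")))
print(seen[0].value)        # RuntimeError(...): the error chains, it doesn't loop
```

So by itself, a raising errback produces one new failure per firing, not a loop; a loop needs something that re-triggers the whole operation.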
The modern code that replaced this (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L614) doesn't have the two variables named dl. It still looks like you'll get an exception from the errback (named _handle_bad_shares in this version). Hm, in the current version that exception will then get ignored by line 642 (source:git/src/allmydata/mutable/retrieve.py?annotate=blame&rev=d8c536847b1ea577f9ac5b6aa98c7ce5d1961c8c#L642), dl = deferredutil.gatherResults(ds), since deferredutil.gatherResults() always sets consumeErrors=True. Could that lead to a loop?
Daira: help!
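For reference, here is a minimal sketch of the consumeErrors=True semantics, using Twisted's own defer.gatherResults rather than Tahoe's deferredutil wrapper (which, per the comment above, always sets consumeErrors=True). With consumeErrors=True the per-Deferred failure is not reported as "Unhandled error in Deferred", and the aggregate Deferred errbacks with a FirstError, so the original exception is only truly lost if nothing handles that aggregate errback either.

```python
# Minimal sketch, generic Twisted only; not Tahoe's deferredutil.gatherResults.
from twisted.internet import defer
from twisted.python.failure import Failure

good, bad = defer.Deferred(), defer.Deferred()
dl = defer.gatherResults([good, bad], consumeErrors=True)

seen = []
dl.addErrback(seen.append)   # the aggregate failure still arrives here

good.callback("share 0 ok")
bad.errback(Failure(ValueError("share 1 failed validation")))
print(seen[0].value)         # defer.FirstError wrapping the ValueError
```

Whether the exception is effectively ignored therefore depends on what, if anything, is attached to dl afterwards.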