Hanging on dead reference? #1875
Reference: tahoe-lafs/trac-2024-07-25#1875
Symptoms:
1. I left a `tahoe backup --verbose $LOCAL_PATH tahoe:$GRID_PATH` process running last night. This was a child of a logging script I wrote called `logcmd`; please see Footnote 1 below for important stdio buffering details.
2. I ran `tahoe ls tahoe:`. It appeared to hang.
3. I hit `^C`, then reran it, and it appeared to hang again, so I killed that.
4. I ran `tahoe list-aliases`, to verify that it does not hang (see Footnote 2).

After those steps, I did these things, but do not remember the order:
- I ran `tahoe ls tahoe:` a third time and it gave an `UnrecoverableFileError`.
- I looked at the `backup` terminal and saw an embedded exception (inside a string literal of another exception) mentioning `UploadUnhappinessError`, `PipelineError`, and `DeadReferenceError`.

After all of the above, I tried `tahoe ls alias:` again and it immediately gave the correct listing of names I expected.

Hypothesis 1:
In this case, all of the following are hypothesized to be true:
- `logcmd` (see Footnote 1) kept an exception traceback in memory instead of flushing it to the terminal.
- `logcmd` did not detect that the `backup` process had exited. (Otherwise it would flush the output.)
- [...] `tahoe backup` process.
- [...] `tahoe ls` processes to hang.
- [...] `tahoe ls` to exit with the `UnrecoverableFileError`.
- [...] `tahoe ls` invocation succeeded.

This hypothesis would fit especially if my laptop disabled networking after a period of inactivity, or if the network was disabled by an access point and my laptop did not automatically renew a DHCP lease, and networking resumed when I started poking it in the morning.
One mark of evidence against this is that I had successfully browsed for a bit before the above commands.
Hypothesis 2:
Assume the following:
- `logcmd` (see Footnote 1) did not hold onto exception output for any notable period of time, but flushed the traceback soon after it was generated.
- Running `tahoe ls` was related to, or even a cause of, the exception in the `tahoe backup` process.
- `tahoe` or `foolscap` will not time out on its own, but requires other activity before an exception is triggered.

If `logcmd` did not introduce stdio buffering problems, then it seems unlikely that the `tahoe backup` exception would have appeared just as I was running `tahoe ls` commands, given that it had been running for ~6 hours.

In other words, there's a strong correlation between the `tahoe ls` invocations and the `tahoe backup` exception. The hypothesis is that the former somehow trigger the latter.

The last bullet-point implies that some kinds of networking errors (maybe `DeadReferenceError` or something about pipelining) do not time out, but instead require some other activity before an exception is raised. If this hypothesis is true, I consider this a bug.

Footnote 1: The `backup` process was a child of a logging-utility Python script I wrote, named `logcmd`, which generally has these features (sorry, no public source yet):
- It runs the child with `subprocess.Popen(...)`.
- It loops on `Popen.poll` and continues running as long as there is no child `returncode`, or as long as that code indicates the child has been stopped or resumed. (See the `os.WIF*` docs.)
- It calls `select()` on the child's stdout and stderr with no timeout.
- It calls `file.readline` on the streams that `select()` returns.

Footnote 2: This was based on a belief that `list-aliases` does no networking; I wanted to distinguish between networking errors and some more general tahoe hanging issue.

Commands and Output:
Here are some sanitized cut'n'pastes of the commands described above.
tahoe backup
The end of the `stderr` stream of `tahoe backup` is:
Using Python to print the packed string literal shows:
tahoe ls
The first two invocations just show a `KeyboardInterrupt` where I `^C`'d them. The third invocation looks like this:
The fourth invocation looks like:
I've published the source code of `logcmd` to a GitHub project. The version actually used in this ticket is linked here.
There are no unit tests yet, and I've discovered one bug: buffered output may not be written when a process exits.
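The bug mentioned above (buffered output not written when a process exits) is a common failure mode of loops that stop as soon as `Popen.poll()` reports an exit code. Here is a minimal sketch of one way to avoid it. This is not the actual `logcmd` code: the function name `run_and_log`, the `sink` parameter, and the one-second `select()` timeout are all assumptions made for illustration, and `select()` on pipes works only on POSIX systems.

```python
import os
import select
import subprocess
import sys


def run_and_log(argv, sink=None):
    """Run argv and copy the child's stdout and stderr to `sink` (a binary stream).

    Hypothetical helper for illustration; not taken from logcmd.
    """
    if sink is None:
        sink = sys.stdout.buffer

    with subprocess.Popen(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE) as child:
        open_fds = [child.stdout.fileno(), child.stderr.fileno()]

        # Loop until both pipes reach EOF instead of until poll() returns an
        # exit code: the child can exit while unread data is still in a pipe.
        while open_fds:
            # A timeout (assumed: 1 second) keeps a silent child from
            # blocking this loop indefinitely.
            ready, _, _ = select.select(open_fds, [], [], 1.0)
            for fd in ready:
                # os.read on the raw descriptor avoids data sitting in a
                # Python-level read buffer where select() cannot see it.
                chunk = os.read(fd, 4096)
                if chunk:
                    sink.write(chunk)
                    sink.flush()
                else:
                    open_fds.remove(fd)  # EOF on this pipe

    # Leaving the `with` block waits for the child, so returncode is set.
    return child.returncode
```

Reading until EOF rather than until exit is the design choice that matters here: whatever the child wrote before dying, including a final traceback, is copied out before the wrapper returns.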
Was this on your LeastAuthority account? If so, the storage server crashed at around 03:19 UTC (2012-11-22), and I restarted it at 04:39 UTC.
I suspect that the storage server crashed because it ran out of memory while processing a large file. It is possible that the reads performed by the `tahoe ls` took memory in addition to that needed for the backup, pushing the total usage over the threshold that caused a crash (or alternatively, the backup may just have happened to process a large file at about the time of the `ls`).
`tahoe list-aliases` indeed does no networking.
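The comparison described in Footnote 2 (a command that does no networking against one that does) could be automated with a timeout, so that a hang is reported rather than waited on indefinitely. A minimal Python sketch, assuming `tahoe` is on the `PATH`; the helper name `probe` and the 30-second limit are made up for illustration and are not part of Tahoe-LAFS:

```python
import subprocess


def probe(argv, timeout_s):
    """Run argv and classify the result.

    Returns ("ok", 0), ("failed", <nonzero exit code>), or ("timed out", None).
    Hypothetical helper for illustration only.
    """
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return ("timed out", None)
    status = "ok" if done.returncode == 0 else "failed"
    return (status, done.returncode)


# Example usage (assumed 30-second limit):
#   probe(["tahoe", "list-aliases"], 30)    # no networking involved
#   probe(["tahoe", "ls", "tahoe:"], 30)    # needs the grid
```

If the first call returns promptly and the second times out, that points at the network or the grid rather than at a general hang in the `tahoe` client.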