intermittent test_system failure #1768
The Windows buildslave reported a failure in test_system today, in
changeset:b2dcbbb6, which went away upon a rebuild.
https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/FreeStorm%20WinXP-x86%20py2.6/builds/103
Relevant console output:
and corresponding test.log contents:
The "Unhandled Error" (as opposed to the usual "Unhandled Error in
Deferred") means that something called twisted's
log.err()
with a Failure or Exception, but nothing else (in particular not a
why= argument). This is a pretty common (perhaps unfortunate)
pattern in our code: tahoe has roughly 47 such calls, and Foolscap
has about 29. So we don't really know what caught the exception
(and therefore what ought to be handling it differently). The
exception itself was a DeadReferenceError, caused by a connection
being replaced while a remote call was outstanding. And most of the
remote calls were
get_version
, which we do immediately aftera getReference or connectTo completes (with one extra
RIUploadHelper upload()
call). The Dirty Reactor Errorindicates that something failed to shut down a Tub in the error
path.
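To make that pattern concrete, here is a minimal sketch (assumed names, not Tahoe's actual code) of a get_version call issued right after a connection completes, with the Tub stopped on the error path as well so the reactor is left clean:

```python
from twisted.python import log

def _got_version(version):
    log.msg("server version: %s" % (version,))

def _connected(rref, tub):
    # issue get_version immediately after getReference/connectTo completes
    d = rref.callRemote("get_version")
    d.addCallback(_got_version)
    # a bare log.err here is what produces the anonymous "Unhandled Error";
    # passing a why= string at least identifies the caller in the log
    d.addErrback(log.err, "get_version failed right after connect")
    # stop the Tub on both paths so an error doesn't leave a dirty reactor
    d.addBoth(lambda ign: tub.stopService())
    return d
```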
I'm not really sure what's going on here. It feels like the client is
being bounced while it's in the middle of making some connections, as if
we're not waiting for the client to finish starting up before we shut it
down again. It also feels like there are some log.err() calls that
should be replaced with the less-serious log.msg(), since they don't
really justify flunking the tests. (Any log.err will flunk a test,
unless the test specifically checks for and discards them.)
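For context, this is the trial mechanism being referred to; a hedged sketch (the test body is illustrative, not an actual Tahoe test):

```python
from twisted.trial import unittest
from foolscap.api import DeadReferenceError  # assumed import path

class BounceTest(unittest.TestCase):
    def test_bounce_client(self):
        # ... exercise the client bounce that may log a DeadReferenceError ...
        # discard the expected logged error; any other log.err() still
        # flunks the test
        self.flushLoggedErrors(DeadReferenceError)
```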
Throwing a log.err at the end of a Deferred chain when you don't
really care about it anymore is pretty convenient and safe (you
won't be losing information about unexpected errors). But it's
still a sort of "XXX this point is never reached" marker.
Having them get invoked during tests might suggest that control is
actually reaching those points.
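A tiny sketch of that idiom (the Deferred here is just a stand-in):

```python
from twisted.internet import defer
from twisted.python import log

def fire_and_forget(d):
    # terminal errback: unexpected errors aren't lost, but in practice this
    # acts as an "XXX this point is never reached" marker -- if it shows up
    # in a test log, control really did get here
    d.addErrback(log.err)

fire_and_forget(defer.succeed(None))
```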
Another probable instance of this error on the same buildslave (build 170):
This may be related to or a duplicate of #1757. (Alternatively, if they are distinct bugs, then comment:89018 could be an instance of #1757 instead of this bug.)
Hm, that stack trace in the final ERROR in comment:89018 shows that
connectionDropped is getting fired twice:
http://foolscap.lothar.com/trac/browser/foolscap/pb.py?annotate=blame&rev=900c2b4a7aaf9370a76d97a59ebe8943d2dac353#L1070
alternate hosting of same file:
https://github.com/warner/foolscap/blob/900c2b4a7aaf9370a76d97a59ebe8943d2dac353/foolscap/pb.py#l1070
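Purely as illustration (this is not foolscap's actual code), a one-shot guard is the sort of thing that would keep a second connectionDropped from re-running the disconnect handlers:

```python
class ConnectionWatcher:
    # hypothetical stand-in for the broker-side bookkeeping in pb.py
    def __init__(self):
        self._dropped = False
        self._watchers = []

    def notifyOnDrop(self, cb):
        self._watchers.append(cb)

    def connectionDropped(self, why):
        if self._dropped:
            return  # firing twice is exactly the symptom suspected here
        self._dropped = True
        watchers, self._watchers = self._watchers, []
        for cb in watchers:
            cb(why)
```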
This reminds me of #653, in which we determined that most likely there is a bug in foolscap which causes notifyOnDisconnect to not get called sometimes when it should get called. Putting these two together, it makes me think that foolscap can currently err on both sides — double-invoking a "respond to disconnect" event sometimes (but we've seen this only on Windows) and zero-invoking it other times.

As I've mentioned before, I really think "responding to disconnect events" is a losing game. To me it smells like "clean shutdown logic" in an app. It tends to be buggy, it is labor-intensive to implement and debug, and it can never be 100% correct (because shutdowns of an app are sometimes hard and because disconnections of a network are sometimes undetectable).
So my preference for "crash-only programming" (in which you don't expend engineering effort trying to design and implement the "clean shutdown" case) is perhaps related to my preference for "crash-only networking", in which you assume that your application won't get a reliable notification of disconnect, and you don't expend engineering effort trying to deliver one.
So I would be somewhat more interested in removing "respond to disconnect events" features from foolscap and from Tahoe-LAFS (see #816, #1975) than in debugging this. However, I'm definitely not very happy with the current situation, where the unit tests sometimes spuriously (??) fail on Windows due to this probable bug in foolscap.
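For reference, the notifyOnDisconnect API mentioned above is used roughly like this (hedged sketch; the handler and the surrounding function are hypothetical):

```python
from twisted.python import log

def watch_server(rref, serverid):
    def _disconnected():
        log.msg("lost connection to %s" % (serverid,))
    # foolscap returns a marker that can later be passed to
    # dontNotifyOnDisconnect() to deregister the callback
    marker = rref.notifyOnDisconnect(_disconnected)
    return marker
```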
I think #1757 was a duplicate of this.
In tests there should be no unexpected disconnections anyway, right? So is the cause of the disconnection just Windows networking being flaky, or do similar errors happen on other platforms?
Our Windows buildslave demonstrated this again:
https://tahoe-lafs.org/buildbot-tahoe-lafs/builders/Marcus%20Cygwin%20WinXP/builds/140/
The relevant excerpt from stdout is:
And the relevant excerpt from trial log is:
Replying to daira:
I don't think it happens on any other platform.
The reason that I'm looking at this right now is that Nathan has submitted a patch (#1659), but that patch needs to be tested on Windows, and the buildbot is currently not running tests on Windows, possibly due to this issue.
This may be the same error: https://gist.github.com/merickson/27bbaf793c2eff7bae49
This is a similar failure, on Ubuntu 12.04 Precise, not Windows: https://travis-ci.org/tahoe-lafs/tahoe-lafs/jobs/142880571