"client node probably started" #71
Reference: tahoe-lafs/trac-2024-07-25#71
It would be nice if we could remove the "probably" from that message. How about doing a Foolscap "Hi there" with it? (That was Sam Stoller's suggestion.)
The "probably" is there because the runner process has no clear way of knowing if the new process dies right away or continues running.
I've got some code in buildbot which watches the logfile and looks for the message that indicates startup has been successful. Perhaps we could snarf it for this purpose.
What do you mean by a 'Foolscap "Hi there"' message?
-Brian
If the runner process has some positive indication that the new process started up long enough to perform some action (such as writing a message to the log or connecting back to the runner process with Foolscap and saying "Hi there"), then the runner process should inform the user that the process has started, without the "probably".
So, yes, snarfing that code from buildbot would be fine with me.
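A rough sketch of the logfile-watching idea discussed above, not the buildbot code itself: the runner polls the node's log until a success message appears or a timeout expires. The function name wait_for_log_marker, the marker text, and the timeout values are assumptions made for illustration only.

```python
import time


def wait_for_log_marker(logfile, marker, start_offset=0, timeout=5.0, poll_interval=0.1):
    """Return True once `marker` appears in `logfile`, or False on timeout.

    `start_offset` lets the caller skip bytes written by an earlier run,
    so that a stale success message is not mistaken for a fresh one.
    """
    needle = marker.encode("utf-8")
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(logfile, "rb") as f:
                f.seek(start_offset)
                if needle in f.read():
                    return True
        except FileNotFoundError:
            pass  # the new process has not created its log yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With something like this, the runner could print "client node started" only when the function returns True, and fall back to a warning (or the old "probably") when it returns False.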
I'm working on the foolscap approach. I believe it's possible to connect to the node and call get_version, so I'll use that if possible. (I've started by modifying the runner tests to start a node and fail if "probably started" appears in the output.)
This would fit nicely into the theme of v0.6.1: documentation, packaging, user-friendliness, etc.
I'd advise the logfile-scanning approach. Benefits:
who is starting the node, at exactly the time and place they need to see it
installed). Logfile writing is the only requirement
Downsides:
I think I have a good-enough solution for this in buildbot, but I think it involves
limited functionality
Bumping to v0.7 milestone.
Nejucomo: if you aren't planning to fix this ticket, would you please take your name off the "assigned" field?
I built a prototype of this, watching twistd.log until the introducer has been contacted. I suspect it will have interactions with windows though (forking), and it probably breaks the 'start -m' (multiple nodes) functionality.
I plan to make it work better once I've gotten more progress down on #197.
Attachment prototype.diff (10798 bytes) added
prototype implementation
I forget exactly how many people I have watched going through the Tahoe install and launch process. About half a dozen. Every single one has exclaimed at "Client node probably started." I just watched another person do it, and they too exclaimed in exactly the same way, so let's say it's seven out of seven.
I will add another vote that "probably" is not a very reassuring word choice. While things seem to be working, I am still unclear as to why I've been told that tahoe has only "probably started".
Incidentally, I just learned that modern twistd can be run as a library. See http://divmod.org/trac/browser/trunk/Axiom/axiom/scripts/axiomatic.py for an example. This would make it easier to avoid the extra subprocess, and might make it easier to provide a more confident answer to this ticket.
In general, if we can instantiate the Client before the fork, then the parent process can be sure that:
To feel confident that the Client actually got started, we'll need to establish some form of communication between the "tahoe start" parent and the actual node process, whether that means tailing the logfile or connecting to the control.furl.
Jeremy Visser has packaged Tahoe-LAFS v1.6.1 for Ubuntu Lucid. He tried to test his package by following these instructions: http://allmydata.org/source/tahoe-lafs/trunk/docs/running.html but he got stuck and gave up on testing it (until I reminded him to try again). So I asked why he had given up:
It looks like http://twistedmatrix.com/trac/ticket/823 would solve this ticket with its --wait option.
See also #602 which is about "probably not started" not being sufficiently detailed and #529 which is about detecting problems on startup and failing loudly instead of quietly, and #371 which is about a common problem on startup.
We could just adapt the approach suggested in the twisted ticket (and implemented in this patch) rather than waiting for twisted to adopt it. That would also allow us to receive arbitrary messages from the child process and print them, addressing /tahoe-lafs/trac-2024-07-25/issues/5130#comment:53 for example.
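As a minimal sketch of a parent receiving messages from the child and printing them: the child's stdout pipe, the READY marker line, and the name start_and_wait are all assumptions chosen for illustration, and this is not the mechanism from the Twisted patch.

```python
import subprocess
import sys

READY_LINE = "READY"  # hypothetical marker the child prints once it is up


def start_and_wait(argv, ready_line=READY_LINE):
    """Start `argv` as a child and collect its stdout lines until it reports ready.

    Returns (started, messages). `started` is False if the child closed its
    stdout (for example, because it died) before printing `ready_line`.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    messages = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        if line == ready_line:
            # The pipe is left open here; a real runner would need to decide
            # how the long-running child detaches from it.
            return True, messages
        messages.append(line)
    proc.wait()  # reap the child that exited without reporting readiness
    return False, messages


if __name__ == "__main__":
    child = [sys.executable, "-c", "print('contacting introducer'); print('READY')"]
    started, messages = start_and_wait(child)
    for m in messages:
        print("node:", m)
    print("client node started" if started else "client node failed to start")
```

The point of the design is that the parent only claims success after the child has said so itself, and any error text the child emits before exiting can be relayed to the user.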
It would be nice to contribute to Twisted. We could either do so directly, by contributing patches and code review for Twisted #823 and then waiting for it to be deployed and then using it in Tahoe-LAFS, or at least work on a patch within Tahoe-LAFS while being sure to carefully cross-link it with the relevant Twisted tickets and to try to get a similar patch committed to Twisted.
I think this is resolved by changeset:ac3b26ecf29c08cb. Anyone want to confirm?
I ran tahoe start and it didn't print out any uncertainty-inducing messages:
Hm, news-needed.
Hey, does this mean that we can start running all these tests on cygwin and/or windows now?
test_runner.py: source:trunk/src/allmydata/test/test_runner.py?annotate=blame&rev=4800#L38
It looks like both of these conditions which force tests to be skipped are now irrelevant and all tests should be runnable, but I'm not sure.
Replying to zooko:
Possibly, I will investigate that.
Replying to zooko:
Apparently not.
The cygwin part of this is #908, and is due to a bug in twisted.internet.utils on cygwin that apparently causes it to hang. (I haven't tested it with recent cygwin, but it wouldn't have been affected by changeset:ac3b26ecf29c08cb.)
For native Windows, we currently skip the test_runner.RunNode tests because of #27 (twistd doesn't daemonize on windows). That is, tahoe start behaves like tahoe run on Windows, which is too different for the tests to work. It looks non-trivial to make them work without fixing either #27 or #1121 (test 'tahoe run').
Thanks for investigating!
Replying to davidsarah (comment:31):
More precisely: tahoe start now behaves like tahoe run. Prior to changeset:ac3b26ecf29c08cb, it used os.system (source:src/allmydata/scripts/startstop_node.py@4641#L96) to run twistd, which put the node in a different process to the tahoe command, although that process did not then daemonize. Since changeset:ac3b26ecf29c08cb, it runs the node in the same process as the tahoe command. Hmm, is that a regression?
(Here is the code for twistd on Windows, and here is for Unix.)
Replying to davidsarah (comment:34):
I don't think anybody benefited from or cared about the fact that it used to run it in a separate process. It just made it harder to kill it on Windows.