'tahoe start -m' no longer works #1262

Closed
opened 2010-11-17 22:35:42 +00:00 by warner · 22 comments

The recent import-and-call-twistd change (changeset:ac3b26ecf29c08cb) unfortunately breaks
the -m/--multiple option, which is what lets me do
./bin/tahoe start -m ../MY-TESTNET/node-* to spin up a 5-node
local test grid in a single command. It also breaks tahoe restart -m,
which I use even more frequently as I iterate over a fix.

The symptom is that the first node-startup terminates the 'tahoe start'
command, so that none of the subsequent nodes are started. The call to
twistd.run() is terminating the program, so it never makes it to the
second pass through the loop.

I really prefer import-and-call over fork-and-exec, so I want to keep it
and find some other way to fix -m. The fix will probably be to make
the -m path use fork() just inside the loop over all target
directories, something like:

for d in args:
  if os.fork() == 0:
    os.chdir(d)
    setup_sys_argv()
    twistd.run()
    # this line is never reached
  else:
    # we're the parent, just keep looping
    pass
  os.exit(0)

One downside is that any errors which occur in twistd.run() (like
ImportErrors, or problems in tahoe.cfg) will be eaten by the
child, so the overall 'tahoe start' exit code won't reflect the error.
The stderr would still be emitted (although interleaved painfully).
Another potential downside is that I've heard of one OS (freebsd?
OS-X?) which allows fork()-then-exec() but forbids
fork()-then-anything-else, and this would fail badly on such a system.

The recent import-and-call-twistd change (changeset:ac3b26ecf29c08cb) unfortunately breaks the -m/--multiple option, which is what lets me do `./bin/tahoe start -m ../MY-TESTNET/node-*` to spin up a 5-node local test grid in a single command. It also breaks `tahoe restart -m`, which I use even more frequently as I iterate over a fix. The symptom is that the first node-startup terminates the 'tahoe start' command, so that none of the subsequent nodes are started. The call to `twistd.run()` is terminating the program, so it never makes it to the second pass through the loop. I really prefer import-and-call over fork-and-exec, so I want to keep it and find some other way to fix -m. The fix will probably be to make the -m path use fork() just inside the loop over all target directories, something like: ``` for d in args: if os.fork() == 0: os.chdir(d) setup_sys_argv() twistd.run() # this line is never reached else: # we're the parent, just keep looping pass os.exit(0) ``` One downside is that any errors which occur in twistd.run() (like `ImportErrors`, or problems in tahoe.cfg) will be eaten by the child, so the overall 'tahoe start' exit code won't reflect the error. The stderr would still be emitted (although interleaved painfully). Another potential downside is that I've heard of one OS (freebsd? OS-X?) which allows fork()-then-exec() but forbids fork()-then-anything-else, and this would fail badly on such a system.
warner added the
code-nodeadmin
major
defect
1.8.0
labels 2010-11-17 22:35:42 +00:00
warner added this to the undecided milestone 2010-11-17 22:35:42 +00:00
Author

Attachment start-m.diff (2026 bytes) added

preliminary patch: use fork() to start/restart all nodes but the last

**Attachment** start-m.diff (2026 bytes) added preliminary patch: use fork() to start/restart all nodes but the last
Author

That patch only uses fork() when there are multiple targets, and doesn't use
fork on the last target (i.e. it calls twistd.run() inline, as it did
before). So it ought to behave exactly the same when start/restart is invoked
with just a single target. It seems to work for my testgrid.

Writing a unit test might be a slight nuisance: I think it'd require a fork
or a spawn within the test case. But we're probably already doing that
somewhere, so maybe we've got a good starting point.

That patch only uses fork() when there are multiple targets, and doesn't use fork on the last target (i.e. it calls `twistd.run()` inline, as it did before). So it ought to behave exactly the same when start/restart is invoked with just a single target. It seems to work for my testgrid. Writing a unit test might be a slight nuisance: I think it'd require a fork or a spawn within the test case. But we're probably already doing that somewhere, so maybe we've got a good starting point.

test needed! regression!

test needed! regression!
zooko modified the milestone from undecided to 1.8.1 2010-11-18 02:18:28 +00:00

This is a regression from 1.8.0, and a serious one, but there is no unit test. I will see if I can figure out how to write a unit test for this.

This is a regression from 1.8.0, and a serious one, but there is no unit test. I will see if I can figure out how to write a unit test for this.

Ran out of time for tonight. If anybody wants to have a crack at unit-testing this patch, that would be great!

Ran out of time for tonight. If anybody wants to have a crack at unit-testing this patch, that would be great!
davidsarah commented 2010-11-20 21:11:53 +00:00
Owner

os.fork() is unsupported on Windows. Either we need to:

  • spawn a subprocess, e.g. using twisted.internet.utils.getProcessOutputAndValue, which does work on Windows; or
  • document that -m is only supported on other platforms (which would still be a regression relative to 1.8.0) and remove that option from the help on Windows.
`os.fork()` is unsupported on Windows. Either we need to: * spawn a subprocess, e.g. using `twisted.internet.utils.getProcessOutputAndValue`, which does work on Windows; or * document that `-m` is only supported on other platforms (which would still be a regression relative to 1.8.0) and remove that option from the help on Windows.
davidsarah commented 2010-11-20 21:14:23 +00:00
Owner

Doc for twisted.internet.utils: http://twistedmatrix.com/documents/10.1.0/api/twisted.internet.utils.html . We use it already in test_runner.

Doc for `twisted.internet.utils`: <http://twistedmatrix.com/documents/10.1.0/api/twisted.internet.utils.html> . We use it already in `test_runner`.

I would be reluctant to adopt a patch which makes the functionality different on different platforms.

I would be reluctant to adopt a patch which makes the functionality different on different platforms.

The 1.8.1 release is past due, and it is blocked by this issue and by #1264.

Perhaps we should go ahead and release 1.8.1 as-is, with the -m option broken (and, re #1264, with download performance when K is large worse than v1.7).

The 1.8.1 release is past due, and it is blocked by this issue and by #1264. Perhaps we should go ahead and release 1.8.1 as-is, with the `-m` option broken (and, re #1264, with download performance when `K` is large worse than v1.7).
Author

I'm cool with leaving this regression until later.

FWIW, I've been happy with the os.fork() patch locally. Also, -m is really just for developers (since normal users should only be running one node per disk, and -m is really just a shortcut to spin up a whole local testgrid at once). I'd be surprised (but pleased) if anyone other than me had ever used it. So documenting that -m doesn't work on non-os.fork() platforms would be fine with me (probably best done by conditionally inserting the argument into the Usage class depending upon hasattr(os, "fork")).

I'm cool with leaving this regression until later. FWIW, I've been happy with the os.fork() patch locally. Also, `-m` is really just for developers (since normal users should only be running one node per disk, and `-m` is really just a shortcut to spin up a whole local testgrid at once). I'd be surprised (but pleased) if anyone other than me had ever used it. So documenting that `-m` doesn't work on non-`os.fork()` platforms would be fine with me (probably best done by conditionally inserting the argument into the `Usage` class depending upon `hasattr(os, "fork")`).
Author

Incidentally, it was OS-X (10.6 "Snow Leopard") which sometimes complains about fork-without-immediate-exec. See here (in particular the part that says "Warning: When launching separate processes using the fork function, you must always follow a call to fork with a call to exec or a similar function. Applications that depend on the Core Foundation, Cocoa, or Core Data frameworks (either explicitly or implicitly) must make a subsequent call to an exec function or those frameworks may behave improperly."). Also see the Twisted discussion here and maybe the Adium discussion here. Something to do with protecting apps against weird thread behavior.

I think the consensus is that programs launched from the Dock under 10.6 are
not allowed to use most OS-X-specific APIs (Cocoa, Quartz, etc) after a
fork but before an exec. Fortunately, Tahoe isn't
launched from the Dock (until someone builds a launcher app), and I don't
think it's using any of those OS-X APIs at all.

So I don't think we need to worry about this. I've been testing the patch on
my OS-X 10.6 box without problems.

Incidentally, it was OS-X (10.6 "Snow Leopard") which sometimes complains about fork-without-immediate-exec. See [here](http://developer.apple.com/library/ios/#documentation/cocoa/conceptual/Multithreading/AboutThreads/AboutThreads.html) (in particular the part that says "Warning: When launching separate processes using the fork function, you must always follow a call to fork with a call to exec or a similar function. Applications that depend on the Core Foundation, Cocoa, or Core Data frameworks (either explicitly or implicitly) must make a subsequent call to an exec function or those frameworks may behave improperly."). Also see the Twisted discussion [here](http://permalink.gmane.org/gmane.comp.python.twisted/21758) and maybe the Adium discussion [here](http://trac.adium.im/ticket/13976#comment:28). Something to do with protecting apps against weird thread behavior. I think the consensus is that programs launched from the Dock under 10.6 are not allowed to use most OS-X-specific APIs (Cocoa, Quartz, etc) after a `fork` but before an `exec`. Fortunately, Tahoe isn't launched from the Dock (until someone builds a launcher app), and I don't think it's using any of those OS-X APIs at all. So I don't think we need to worry about this. I've been testing the patch on my OS-X 10.6 box without problems.
Author

Attachment start-m-with-test.diff (10761 bytes) added

better patch: includes test, removes -m from --help when unavailable

**Attachment** start-m-with-test.diff (10761 bytes) added better patch: includes test, removes -m from --help when unavailable
davidsarah commented 2010-11-26 05:23:18 +00:00
Owner

_node_has_started -> _nodes_have_started. Otherwise LGTM.

`_node_has_started` -> `_nodes_have_started`. Otherwise LGTM.
Author

Cool, pushed in changeset:f3adb037ae0d22eb. One thing I caught at the last minute: the new test needed a distinct workdir ("test_multiple_clients" instead of the cut-and-paste leftover "test_client"), otherwise the two tests could interfere with each other. (I observed a test_multiple_clients failure at the very end, where it threw an exception while trying to delete a non-existent hotline file: the whole c2/ directory was missing).

Cool, pushed in changeset:f3adb037ae0d22eb. One thing I caught at the last minute: the new test needed a distinct workdir ("test_multiple_clients" instead of the cut-and-paste leftover "test_client"), otherwise the two tests could interfere with each other. (I observed a test_multiple_clients failure at the very end, where it threw an exception while trying to delete a non-existent hotline file: the whole c2/ directory was missing).
warner added the
fixed
label 2010-11-26 23:17:17 +00:00
Author

Drat, there was a problem with the windows buildslave (--multiple was being removed for 'tahoe create-client' too, which can handle it just fine), and the fix was getting messy, so I reverted that patch in changeset:f36bda278014589a.

If David-Sarah's work to come up with an os.fork replacement for windows works out, that will make things much easier (the actual 'tahoe start' code is working fine, it's just the "don't advertise --multiple on platforms that can't accept it" code that broke).

If not, I'll look into refactoring BasedirMixin to make this easier to get right. For reference, we can accept --multiple on all platforms for tahoe create-client and other create-node commands. We do not accept --multiple ever for tahoe run. And we accept --multiple only on platforms that have os.fork (i.e. on non-windows) for tahoe start and tahoe restart.

Drat, there was a problem with the windows buildslave (`--multiple` was being removed for 'tahoe create-client' too, which can handle it just fine), and the fix was getting messy, so I reverted that patch in changeset:f36bda278014589a. If David-Sarah's work to come up with an os.fork replacement for windows works out, that will make things much easier (the actual '`tahoe start`' code is working fine, it's just the "don't advertise `--multiple` on platforms that can't accept it" code that broke). If not, I'll look into refactoring `BasedirMixin` to make this easier to get right. For reference, we can accept `--multiple` on all platforms for `tahoe create-client` and other create-node commands. We do not accept `--multiple` ever for `tahoe run`. And we accept `--multiple` only on platforms that have `os.fork` (i.e. on non-windows) for `tahoe start` and `tahoe restart`.
warner removed the
fixed
label 2010-11-27 01:17:26 +00:00
warner reopened this issue 2010-11-27 01:17:26 +00:00
Author

For reference, here's one of the windows buildslave errors which happened with changeset:f3adb037ae0d22eb:

[ERROR]: allmydata.test.test_runner.CreateNode.test_client

Traceback (most recent call last):
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 281, in test_client
    self.do_create("client")
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 200, in do_create
    rc, out, err = self.run_tahoe(argv)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 189, in run_tahoe
    rc = runner.runner(argv, stdout=out, stderr=err)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\runner.py", line 65, in runner
    config.parseOptions(argv)
  File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 231, in parseOptions
    self.subOptions.parseOptions(rest)
  File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 237, in parseOptions
    self.parseArgs(*args)
  File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\common.py", line 80, in parseArgs
    if self.allow_multiple and self['multiple']:
exceptions.KeyError: 'multiple'
For reference, here's one of the windows buildslave errors which happened with changeset:f3adb037ae0d22eb: ``` [ERROR]: allmydata.test.test_runner.CreateNode.test_client Traceback (most recent call last): File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 281, in test_client self.do_create("client") File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 200, in do_create rc, out, err = self.run_tahoe(argv) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\test\test_runner.py", line 189, in run_tahoe rc = runner.runner(argv, stdout=out, stderr=err) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\runner.py", line 65, in runner config.parseOptions(argv) File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 231, in parseOptions self.subOptions.parseOptions(rest) File "C:\Python26\lib\site-packages\twisted\python\usage.py", line 237, in parseOptions self.parseArgs(*args) File "c:\users\buildslave\builds\tahoe-lafs\dcoder_win7-64_py2.6\build\src\allmydata\scripts\common.py", line 80, in parseArgs if self.allow_multiple and self['multiple']: exceptions.KeyError: 'multiple' ```
Author

Sigh. Alright, Zooko has convinced me that keeping -m working (at least for tahoe start -m) is not worth needing to manage different --help texts for windows. I've agreed to remove -m support and write a local shell script or something to do the equivalent function (zooko suggests function tahoestartm { for NODE in ${*} ; do tahoe start $NODE ; done } ).

I'll be bitter about it, though :-). I should keep a list of useful features that I've had to remove because they weren't easy to implement under windows, especially when nobody would ever have noticed that they were missing under windows. I'm pretty sure that I'm the only person who has ever used tahoe start -m, and it will take me some time to retrain my fingers to type something else.

The code will be cleanest if we remove -m from everything (did you know it was accepted by tahoe create-client, tahoe create-keygen, and tahoe create-stats-gatherer? yeah, didn't think so :-). I don't think anyone will miss it.

I'll put together a patch to remove -m from everything and attach it here.

Sigh. Alright, Zooko has convinced me that keeping `-m` working (at least for `tahoe start -m`) is not worth needing to manage different `--help` texts for windows. I've agreed to remove `-m` support and write a local shell script or something to do the equivalent function (zooko suggests `function tahoestartm { for NODE in ${*} ; do tahoe start $NODE ; done } `). I'll be bitter about it, though :-). I should keep a list of useful features that I've had to remove because they weren't easy to implement under windows, especially when nobody would ever have noticed that they were missing under windows. I'm pretty sure that I'm the only person who has ever used `tahoe start -m`, and it will take me some time to retrain my fingers to type something else. The code will be cleanest if we remove `-m` from everything (did you know it was accepted by `tahoe create-client`, `tahoe create-keygen`, and `tahoe create-stats-gatherer`? yeah, didn't think so :-). I don't think anyone will miss it. I'll put together a patch to remove -m from everything and attach it here.
Author

Attachment remove-m.diff (9680 bytes) added

remove --multiple/-m from all CLI commands, clean up basedir processing

**Attachment** remove-m.diff (9680 bytes) added remove --multiple/-m from all CLI commands, clean up basedir processing
secorp commented 2010-11-27 17:27:56 +00:00
Owner

I use the -m option frequently from the command line to start/stop/create-client etc. I can get around it, but I too will briefly mourn its passing.

I use the -m option frequently from the command line to start/stop/create-client etc. I can get around it, but I too will briefly mourn its passing.
zooko self-assigned this 2010-11-27 21:23:00 +00:00

Looks good! Hooray for less code!

Looks good! Hooray for less code!
Brian Warner <warner@lothar.com> commented 2010-11-27 21:39:43 +00:00
Owner

In changeset:69b42c6cb7ec5d4a:

remove --multiple/-m option from all CLI commands: closes #1262

I personally used "tahoe start/restart -m ../MY-TESTNET/node*" all the time,
to spin up or update a local testgrid while iterating over new code. However,
with the recent switch from "subprocess.Popen(/bin/twistd)" to "import and
call twistd.run()" in scripts/startstop_node.py (yay fewer processes!),
"start -m" broke, and fixing it requires os.fork, which is unavailable on
windows (boo windows!). And I was probably the only one using -m. So in the
interests of uniformity among platforms and simpler code (yay negative code
days!), we're just removing -m from everything. I will start using a little
shell script or something to simulate the removed functionality.

This patch also cleans up CLI-function calling a bit: get the basedir from
the config dict (instead of sometimes from a separate argument), and always
return a numeric exit code.
In changeset:69b42c6cb7ec5d4a: ``` remove --multiple/-m option from all CLI commands: closes #1262 I personally used "tahoe start/restart -m ../MY-TESTNET/node*" all the time, to spin up or update a local testgrid while iterating over new code. However, with the recent switch from "subprocess.Popen(/bin/twistd)" to "import and call twistd.run()" in scripts/startstop_node.py (yay fewer processes!), "start -m" broke, and fixing it requires os.fork, which is unavailable on windows (boo windows!). And I was probably the only one using -m. So in the interests of uniformity among platforms and simpler code (yay negative code days!), we're just removing -m from everything. I will start using a little shell script or something to simulate the removed functionality. This patch also cleans up CLI-function calling a bit: get the basedir from the config dict (instead of sometimes from a separate argument), and always return a numeric exit code. ```
tahoe-lafs added the
fixed
label 2010-11-27 21:39:43 +00:00
Author

Attachment many (108 bytes) added

shell script to replace "-m": run like "many ./bin/tahoe start ../GRID/node*"

**Attachment** many (108 bytes) added shell script to replace "-m": run like "many ./bin/tahoe start ../GRID/node*"
108 B
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1262
No description provided.