Hanging on dead reference? #1875

Open
opened 2012-11-22 15:07:08 +00:00 by nejucomo · 4 comments

**Symptoms**:

I left a `tahoe backup --verbose $LOCAL_PATH tahoe:$GRID_PATH` process running last night. This was a child of a logging script I wrote called `logcmd`; please see Footnote 1 below for important stdio buffering details.

  1. In the morning, the output appeared to be hanging, but I wasn't certain.
  2. In a separate terminal, I ran `tahoe ls tahoe:`. It appeared to hang.
  3. I killed it with `^C` then reran it, and it appeared to hang, so I killed that too.
  4. I examined the backup process terminal and saw no updates. (See Footnote 2.)
  5. I ran `tahoe list-aliases` to verify that it does not hang.

After those steps, I did these things, but I do not remember the order:

  • I ran `tahoe ls tahoe:` a third time and it gave an `UnrecoverableFileError`.
  • I examined the `backup` terminal and saw an embedded exception (inside a string literal of another exception) mentioning `UploadUnhappinessError`, `PipelineError`, and `DeadReferenceError`.

After all of the above, I tried `tahoe ls tahoe:` again and it immediately gave the correct listing of names I expected.

**Hypothesis 1**:

In this case, all of the following are hypothesized to be true:

  • The stdio buffering and process management scheme of `logcmd` (see Footnote 1) kept an exception traceback in memory instead of flushing it to the terminal.
  • Also, `logcmd` did not detect that the `backup` process had exited. (Otherwise it would have flushed the output.)
  • Some networking issue triggered the exception in the `tahoe backup` process.
  • The same networking issue caused the first two `tahoe ls` processes to hang.
  • The same networking issue, or a slightly different one, caused the third invocation of `tahoe ls` to exit with the `UnrecoverableFileError`.
  • The networking issue or issues (possibly more than one distinct networking state) resolved, and the fourth `tahoe ls` invocation succeeded.

This hypothesis would fit especially well if my laptop disabled networking after a period of inactivity, or if the network was disabled by an access point and my laptop did not automatically renew a DHCP lease, *and* when I started poking at it in the morning networking resumed.

One mark of evidence against this is that I had successfully browsed for a bit before running the above commands.

**Hypothesis 2**:

Assume the following:

  • `logcmd` (see Footnote 1) did not hold onto exception output for any notable period of time, but flushed the traceback soon after it was generated.
  • Running `tahoe ls` was related to, or even a cause of, the exception in the `tahoe backup` process.
  • Some networking condition in `tahoe` or `foolscap` will not time out on its own, but requires other activity before an exception is triggered.

If `logcmd` did not introduce stdio buffering problems, then it seems unlikely that the `tahoe backup` exception would have appeared *just as* I was running `tahoe ls` commands, given that it had been running for ~6 hours.

In other words, there's a strong correlation between the `tahoe ls` invocations and the `tahoe backup` exception. The hypothesis is that the former somehow triggered the latter.

The last bullet point implies that some kinds of networking errors (maybe `DeadReferenceError` or something about pipelining) do not time out, but instead require some other activity before an exception is raised. If this hypothesis is true, I consider this a bug.
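
To make that last assumption concrete at the transport level, here is a generic socket sketch (not Tahoe or foolscap code; the peer address and port are placeholders). A TCP connection whose peer vanishes silently, e.g. because the laptop's network went away so no FIN or RST ever arrives, produces no error on its own; a blocked read just waits until some other activity, a keepalive, or an explicit timeout surfaces the failure:

```
import socket

# Placeholder peer; stands in for a storage-server connection.
s = socket.create_connection(("storage.example", 1234))
# Without one of these, recv() can hang indefinitely after a silent
# network outage (no FIN/RST is ever delivered):
s.settimeout(120)                                        # app-level timeout
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)  # TCP keepalive
try:
    data = s.recv(4096)
except socket.timeout:
    pass  # surface the failure: treat the reference as dead
```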

**Footnote 1**: The `backup` process was a child of a logging-utility Python script I wrote, named `logcmd`, which generally has these features (sorry, no public source yet; a rough sketch follows the list):

  • It uses `subprocess.Popen(...)`.
  • It loops on `Popen.poll` and continues running as long as there is no child `returncode`, *or* as long as that code indicates the child has been stopped or resumed. (See the [os.WIF* docs](http://docs.python.org/2/library/os.html#os.WIFCONTINUED).)
  • Inside the loop it does a `select()` on the child's stdout and stderr with *no timeout*.
  • It then reads using `file.readline` on the select returns.
  • It buffers read data, then splits on `'\n'` and writes all but the last chunk of the split (which may be an incomplete line) to multiple destinations.
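
A minimal sketch of that loop follows. This is not the actual `logcmd` source; the function signature and names are assumptions made for illustration:

```
import select
import subprocess

def logcmd(argv, destinations):
    # universal_newlines gives text-mode pipes on both py2 and py3.
    child = subprocess.Popen(argv, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    buffers = {child.stdout: '', child.stderr: ''}
    # The real logcmd also kept looping while os.WIF* said the child was
    # merely stopped or resumed; `poll() is None` covers the common case.
    while child.poll() is None:
        # select() with no timeout: block until a pipe is readable.
        readable, _, _ = select.select([child.stdout, child.stderr], [], [])
        for f in readable:
            # Mixing select() with buffered readline() is the suspected
            # stdio hazard: readline() can hold data in its own buffer
            # where a later select() will not see it.
            buffers[f] += f.readline()
            chunks = buffers[f].split('\n')
            buffers[f] = chunks[-1]  # possibly-incomplete final line
            for line in chunks[:-1]:
                for dest in destinations:
                    dest.write(line + '\n')
    # Anything still in `buffers` here is silently dropped, which matches
    # the flush-on-exit bug reported in a later comment.
    return child.returncode
```

Note that the no-timeout `select()` plus line buffering is exactly where the lost-traceback scenario of Hypothesis 1 could arise.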

**Footnote 2**: This was based on a belief that `list-aliases` does no networking; I wanted to distinguish between networking errors and some more general tahoe hanging issue.

nejucomo added the unknown, normal, defect, 1.9.2 labels 2012-11-22 15:07:08 +00:00
nejucomo added this to the undecided milestone 2012-11-22 15:07:08 +00:00
nejucomo (Author) commented:

**Commands and Output**:

Here are some sanitized cut'n'pastes of the commands described above.

**`tahoe backup`**

The end of the `stderr` stream of `tahoe backup` is:

```
Traceback (most recent call last):
  File "/home/n/virtualenvs/default/bin/tahoe", line 8, in <module>
    load_entry_point('allmydata-tahoe==1.9.2', 'console_scripts', 'tahoe')()
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/cli.py", line 569, in backup
    rc = tahoe_backup.backup(options)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks\x0a    current.result = callback(current.result, *args, **kw)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/util/pipeline.py\", line 121, in _call_finished\x0a    d.errback(self.failure)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 388, in errback\x0a    self._startRunCallbacks(fail)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 455, in _startRunCallbacks\x0a    self._runCallbacks()\x0a--- <exception caught here> ---\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py\", line 542, in _runCallbacks\x0a    current.result = callback(current.result, *args, **kw)\x0a  File \"/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/immutable/encode.py\", line 501, in _remove_shareholder\x0a    raise UploadUnhappinessError(msg)\x0aallmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.: [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>: <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.ipb.DeadReferenceError'>: Connection was lost (to tubid=(*REDACTED*)) (during method=RIBucketWriter:write)\x0a])>\x0a]\x0a"

```

Using Python to print the packed string literal shows the following (a small decoding sketch appears after the output):

```
Traceback (most recent call last):
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/util/pipeline.py", line 121, in _call_finished
    d.errback(self.failure)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 388, in errback
    self._startRunCallbacks(fail)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/n/virtualenvs/default/lib/python2.7/site-packages/allmydata/immutable/encode.py", line 501, in _remove_shareholder
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.: [Failure instance: Traceback (failure with no frames): <class 'allmydata.util.pipeline.PipelineError'>: <PipelineError error=([Failure instance: Traceback (failure with no frames): <class 'foolscap.ipb.DeadReferenceError'>: Connection was lost (to tubid=(*REDACTED*)) (during method=RIBucketWriter:write)
])>
]
```
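
For reference, one way to do that unpacking (a sketch; `packed` below is a placeholder, not the real data; paste the quoted literal verbatim, quotes included):

```
import ast

# Placeholder standing in for the quoted literal copied from above.
packed = '"line one\\x0aline two\\x0a"'
print(ast.literal_eval(packed))
```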

**`tahoe ls`**

The first two invocations just show a `KeyboardInterrupt` where I `^C`'d them. The third invocation looks like this:

```
$ tahoe ls tahoe:
Error during GET: 410 Gone
UnrecoverableFileError: the directory (or mutable file) could not be retrieved, because there were insufficient good shares. This might indicate that no servers were connected, insufficient servers were connected, the URI was corrupt, or that shares have been lost due to server departure, hard drive failure, or disk corruption. You should perform a filecheck on this object to learn more.
```

The fourth invocation looks like this:

```
$ tahoe ls tahoe:
(*REDACTED: The names listed are what I expect. *)
```
nejucomo (Author) commented:

I've published the source code of `logcmd` to a [github project](https://github.com/nejucomo/logcmd). The version actually used in this ticket is [linked here](https://github.com/nejucomo/logcmd/blob/2cd6ada8a71a4674b4c38ffdb25a9915a7bc4cf7/logcmd.py).

There are no unit tests yet, and I've discovered one bug: buffered output may not be written when a process exits.

davidsarah (Owner) commented 2012-11-22 23:32:58 +00:00

Was this on your LeastAuthority account? If so, the storage server crashed at around 03:19 UTC (2012-11-22), and I restarted it at 04:39 UTC.

I suspect that the storage server crashed because it ran out of memory while processing a large file. It is possible that the reads performed by the `tahoe ls` required memory in addition to that needed for the backup, pushing total usage over the threshold that caused the crash (alternatively, the backup may just have happened to be processing a large file at about the time of the `ls`).

davidsarah (Owner) commented 2012-11-22 23:34:02 +00:00

`tahoe list-aliases` indeed does no networking.

tahoe-lafs added the code, 1.9.0-s3branch labels and removed the unknown, 1.9.2 labels 2012-12-11 22:16:57 +00:00
Reference: tahoe-lafs/trac-2024-07-25#1875