mplayer triggers two bugs in Tahoe's new downloader #1154

Closed
opened 2010-08-03 18:28:54 +00:00 by francois · 10 comments
francois commented 2010-08-03 18:28:54 +00:00
Owner

I tried to play a movie hosted on a Tahoe-LAFS grid with 'mplayer (@@http://127.0.0.1:3456/file/URI-blah/@@named=/test.avi@@)'. The error seems to be related to mplayer seeking at the end of the file for downloading the AVI index before seeking back to the beginning for the actual rendering.

Preventing mplayer from reading the index with the '--noidx' parameter avoid this error.

2010-08-03 20:06:40+0200 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run
	    self.mainLoop()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
	    self.runUntilCurrent()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	--- <exception caught here> ---
	  File "/usr/lib/pymodules/python2.6/foolscap/eventual.py", line 26, in _turn
	    cb(*args, **kwargs)
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 197, in loop
	    self._do_loop()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 253, in _do_loop
	    wanted, needed = self._desire()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 611, in _desire
	    % (want_it.dump(), need_it.dump(), gotta_gotta_have_it.dump()))
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/util/spans.py", line 143, in dump
	    return "len=%d: %s" % (len(self),
	exceptions.TypeError: __len__() should return an int
2010-08-03 20:06:51+0200 [HTTPChannel,198,127.0.0.1] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run
	    self.mainLoop()
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1182, in mainLoop
	    self.doIteration(t)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 140, in doSelect
	    _logrun(selectable, _drdw, selectable, method, dict)
	--- <exception caught here> ---
	  File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 84, in callWithLogger
	    return callWithContext({"system": lp}, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 69, in callWithContext
	    return context.call({ILogContext: newCtx}, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 59, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 37, in callWithContext
	    return func(*args,**kw)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 156, in _doReadOrWrite
	    self._disconnectSelectable(selectable, why, method=="doRead")
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/posixbase.py", line 191, in _disconnectSelectable
	    selectable.readConnectionLost(f)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 508, in readConnectionLost
	    self.connectionLost(reason)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 513, in connectionLost
	    abstract.FileDescriptor.connectionLost(self, reason)
	  File "/usr/lib/python2.6/dist-packages/twisted/internet/abstract.py", line 64, in connectionLost
	    self.producer.stopProducing()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/segmentation.py", line 146, in stopProducing
	    self._cancel_segment_request.cancel()
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 27, in cancel
	    self._f(self)
	  File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 468, in _cancel_request
	    if self._active_segment.segnum not in segnums:
	exceptions.AttributeError: 'NoneType' object has no attribute 'segnum'
I tried to play a movie hosted on a Tahoe-LAFS grid with 'mplayer (@@http://127.0.0.1:3456/file/URI-blah/@@named=/test.avi@@)'. The error seems to be related to mplayer seeking at the end of the file for downloading the AVI index before seeking back to the beginning for the actual rendering. Preventing mplayer from reading the index with the '--noidx' parameter avoid this error. ``` 2010-08-03 20:06:40+0200 [-] Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging reactor.run() File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run self.mainLoop() File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop self.runUntilCurrent() File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent call.func(*call.args, **call.kw) --- <exception caught here> --- File "/usr/lib/pymodules/python2.6/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 197, in loop self._do_loop() File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 253, in _do_loop wanted, needed = self._desire() File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/share.py", line 611, in _desire % (want_it.dump(), need_it.dump(), gotta_gotta_have_it.dump())) File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/util/spans.py", line 143, in dump return "len=%d: %s" % (len(self), exceptions.TypeError: __len__() should return an int ``` ``` 2010-08-03 20:06:51+0200 [HTTPChannel,198,127.0.0.1] Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.6/dist-packages/twisted/application/app.py", line 348, in runReactorWithLogging reactor.run() File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1170, in run self.mainLoop() File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1182, in mainLoop self.doIteration(t) File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 140, in doSelect _logrun(selectable, _drdw, selectable, method, dict) --- <exception caught here> --- File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 84, in callWithLogger return callWithContext({"system": lp}, func, *args, **kw) File "/usr/lib/python2.6/dist-packages/twisted/python/log.py", line 69, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/usr/lib/python2.6/dist-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/usr/lib/python2.6/dist-packages/twisted/internet/selectreactor.py", line 156, in _doReadOrWrite self._disconnectSelectable(selectable, why, method=="doRead") File "/usr/lib/python2.6/dist-packages/twisted/internet/posixbase.py", line 191, in _disconnectSelectable selectable.readConnectionLost(f) File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 508, in readConnectionLost self.connectionLost(reason) File "/usr/lib/python2.6/dist-packages/twisted/internet/tcp.py", line 513, in connectionLost abstract.FileDescriptor.connectionLost(self, reason) File "/usr/lib/python2.6/dist-packages/twisted/internet/abstract.py", line 64, in connectionLost self.producer.stopProducing() File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/segmentation.py", line 146, in stopProducing self._cancel_segment_request.cancel() File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 27, in cancel self._f(self) File "/home/francois/dev/tahoe-lafs-ticket798-1.7.1-r4653/src/allmydata/immutable/downloader/node.py", line 468, in _cancel_request if self._active_segment.segnum not in segnums: exceptions.AttributeError: 'NoneType' object has no attribute 'segnum' ```
tahoe-lafs added the
code
major
defect
n/a
labels 2010-08-03 18:28:54 +00:00
tahoe-lafs added this to the undecided milestone 2010-08-03 18:28:54 +00:00
warner was assigned by tahoe-lafs 2010-08-03 18:28:54 +00:00
francois commented 2010-08-03 18:36:43 +00:00
Author
Owner

This ticket is related to #798.

This ticket is related to #798.

ooh, good bugs. I'll look into them.

ooh, good bugs. I'll look into them.
warner modified the milestone from undecided to 1.8.0 2010-08-03 21:38:59 +00:00
tahoe-lafs added
code-network
critical
1.8β
and removed
code
major
n/a
labels 2010-08-04 01:36:31 +00:00

I figured out the first one: it turns out that webapi GETs that use a Range header were completely broken: the root cause was that python's *len* method is, apparently, not allowed to return a long (who knew?). I just pushed the patch (changeset:8844655705e4fb76).

I'm still trying to figure out the stopProducing exception: it feels like the HTTP client dropped the connection at an unexpected time (when no actual segments were being fetched). There's a simple fix, but I want to build a reproducible test case first.

I figured out the first one: it turns out that webapi GETs that use a Range header were completely broken: the root cause was that python's `*len*` method is, apparently, not allowed to return a `long` (who knew?). I just pushed the patch (changeset:8844655705e4fb76). I'm still trying to figure out the `stopProducing` exception: it feels like the HTTP client dropped the connection at an unexpected time (when no actual segments were being fetched). There's a simple fix, but I want to build a reproducible test case first.
francois commented 2010-08-04 20:30:39 +00:00
Author
Owner

Hi Brian, Thanks for looking into this!

I've attached the full debug log (mplayer-with-idx.log) as well as the mplayer console output (mplayer-console.log) to this ticket. I hope this helps.

BTW, during this test the file download doesn't stop when mplayer exits. It is still completely downloaded.

And, when mplayer is run a second time, after Tahoe finished downloading the whole file, it plays the movie just well and even allows seeking through it.

Hi Brian, Thanks for looking into this! I've attached the full debug log (mplayer-with-idx.log) as well as the mplayer console output (mplayer-console.log) to this ticket. I hope this helps. BTW, during this test the file download doesn't stop when mplayer exits. It is still completely downloaded. And, when mplayer is run a second time, after Tahoe finished downloading the whole file, it plays the movie just well and even allows seeking through it.
francois commented 2010-08-04 20:31:00 +00:00
Author
Owner

Attachment mplayer-with-idx.log (570920 bytes) added

**Attachment** mplayer-with-idx.log (570920 bytes) added
francois commented 2010-08-04 20:31:09 +00:00
Author
Owner

Attachment mplayer-console.log (1253 bytes) added

**Attachment** mplayer-console.log (1253 bytes) added

Hm, it looks like mplayer-with-idx.log is actually using the old downloader: the log messages don't match the new code.

I've found a few places where that second exception could possibly be raised, but I'm still trying to find a way to reproduce it for a test.

Hm, it looks like `mplayer-with-idx.log` is actually using the old downloader: the log messages don't match the new code. I've found a few places where that second exception could possibly be raised, but I'm still trying to find a way to reproduce it for a test.

Oh, nevermind, I think I figured it out. There's actually three bugs overlapping here:

  1. the Spans/DataSpans classes used *len* methods that returned longs instead of ints, causing an exception during download. (my changeset:8844655705e4fb76 fix was incorrect: it turns out that *nonzero* is not allowed to return a long either).
  2. there is a lost-progress bug in DownloadNode, where a failure in one segment-fetch will cause all other pending segment-fetches to hang forever
  3. a stopProducing that occurs during this hang-forever period causes an exception, because there is no active segment-fetch in place

The bug1 fix is easy: replace self.*len* with self.len and make *nonzero* always return a bool. The bug3 fix is also easy: DownloadNode._cancel_request should tolerate self._active_segment being None.

The bug2 fix is not trivial but not hard. The start-next-fetch code in DownloadNode should be factored out, and DownloadNode.fetch_failed code should invoke it after sending errbacks to the requests which failed. This will add a nice property: if you get unrecoverable bit errors in one segment, you might still be able to get valid data from other segments (as opposed to giving up on the whole file because of a single error). I think there are some other changes that must be made to really get this property, though.. when we get to the point where we sort shares by "goodness", we'll probably clean this up. The basic idea will be that shares with errors go to the bottom of the list but are not removed from it entirely: if we really can't find the data we need somewhere else, we'll give the known-corrupted share a try, in the hopes that there are some uncorrupted parts of the share.

I've got a series of test cases to exercise these three bugs.. I just have to build them in the right order to make sure that I'm not fixing the wrong one first (and thus hiding one of the others from my test).

Oh, nevermind, I think I figured it out. There's actually three bugs overlapping here: 1. the `Spans/DataSpans` classes used `*len*` methods that returned `long`s instead of `int`s, causing an exception during download. (my changeset:8844655705e4fb76 fix was incorrect: it turns out that `*nonzero*` is not allowed to return a `long` either). 1. there is a lost-progress bug in `DownloadNode`, where a failure in one segment-fetch will cause all other pending segment-fetches to hang forever 1. a `stopProducing` that occurs during this hang-forever period causes an exception, because there is no active segment-fetch in place The bug1 fix is easy: replace `self.*len*` with `self.len` and make `*nonzero*` always return a `bool`. The bug3 fix is also easy: `DownloadNode._cancel_request` should tolerate `self._active_segment` being `None`. The bug2 fix is not trivial but not hard. The start-next-fetch code in `DownloadNode` should be factored out, and `DownloadNode.fetch_failed` code should invoke it after sending errbacks to the requests which failed. This will add a nice property: if you get unrecoverable bit errors in one segment, you might still be able to get valid data from other segments (as opposed to giving up on the whole file because of a single error). I think there are some other changes that must be made to really get this property, though.. when we get to the point where we sort shares by "goodness", we'll probably clean this up. The basic idea will be that shares with errors go to the bottom of the list but are not removed from it entirely: if we really can't find the data we need somewhere else, we'll give the known-corrupted share a try, in the hopes that there are some uncorrupted parts of the share. I've got a series of test cases to exercise these three bugs.. I just have to build them in the right order to make sure that I'm not fixing the wrong one first (and thus hiding one of the others from my test).

Oh, I should mention what I'm guessing mplayer was doing. I think it issued two simultaneous parallel webapi GET requests: one for the index near the end of the file, and another for the first block of the file. They both would have been delivered to the same FileNode instance, creating two Segmentation requests (one for each read() call), creating two calls to DownloadNode.get_segment(). The second one would wait for the first one to finish, since to keep the memory footprint low, DownloadNode only works on one segment at a time. The first segment-fetch failed because of the *len* bug, leaving the second fetch hanging (because of the lost-progress bug). When mplayer got an error on the index GET, I believe it gave up on the other GET, dropping the HTTP connection, causing connectionLost and stopProducing, causing a DownloadLoad._cancel_request when no segment-fetch was active, triggering the third bug.

Oh, I should mention what I'm guessing mplayer was doing. I think it issued two simultaneous parallel webapi GET requests: one for the index near the end of the file, and another for the first block of the file. They both would have been delivered to the same `FileNode` instance, creating two `Segmentation` requests (one for each `read()` call), creating two calls to `DownloadNode.get_segment()`. The second one would wait for the first one to finish, since to keep the memory footprint low, `DownloadNode` only works on one segment at a time. The first segment-fetch failed because of the `*len*` bug, leaving the second fetch hanging (because of the lost-progress bug). When mplayer got an error on the index GET, I believe it gave up on the other GET, dropping the HTTP connection, causing `connectionLost` and `stopProducing`, causing a `DownloadLoad._cancel_request` when no segment-fetch was active, triggering the third bug.

Should be fixed now, in changeset:43c5032105288a58 and changeset:f6f9a97627d210a6.

Should be fixed now, in changeset:43c5032105288a58 and changeset:f6f9a97627d210a6.
warner added the
fixed
label 2010-08-05 19:23:36 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1154
No description provided.