add streaming (on-line) upload to HTTP interface #320

warner · 2008-02-27T20:12:02Z

warner commented

2008-02-27 20:12:02 +00:00

In 0.8.0, the upload interfaces visible to HTTP all require the file to be
completely present on the tahoe node before any upload work can be
accomplished. For a FUSE plugin (talking to a local tahoe node) that provides
an open/write/close POSIX-like API to some application, this means that the
write() calls all finish quickly, while the close() call takes a long time.

Many applications cannot handle this. These apps enforce timeouts on the
close() call on the order of 30-60 seconds. If these apps can handle network
filesystems at all, my hunch is that they will be more tolerant of delays in
the write() calls than in the close().

This effectively imposes a maximum file size on uploads, determined by the
link speed times the close() timeout. Using the helper can improve this by a
factor of 'N/k' relative to non-assisted uploads. The current FUSE plugin has
a number of unpleasant workarounds that involve lying to the close() call
(pretending that the file has been uploaded when in fact it has not), which
have a bunch of knock-on effects (like how to handle the subsequent open+read
of the file that we've supposedly just written).
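As a rough worked example of that size limit, the sketch below multiplies link speed by the close() timeout. It assumes that a non-assisted upload pushes all N shares (N/k times the file size) over the slow link, while a helper-assisted upload pushes only the ciphertext. The 1 Mbit/s uplink and the k=3, N=10 encoding are illustrative assumptions; only the 30-60 second timeout comes from the text above.

```python
def max_upload_bytes(link_bytes_per_sec, close_timeout_sec, k=3, n=10, use_helper=False):
    """Largest file that fits through the slow link before close() times out.

    Assumption: without a helper the node sends N/k bytes of shares per byte
    of file; with a helper it sends one byte of ciphertext per byte of file.
    """
    budget = link_bytes_per_sec * close_timeout_sec  # bytes we can send in time
    if use_helper:
        return budget
    return budget * k / n


# Hypothetical 1 Mbit/s uplink (125000 bytes/sec) and a 30-second timeout.
print(max_upload_bytes(125000, 30, use_helper=True))
print(max_upload_bytes(125000, 30))
```

Under these assumptions the helper raises the ceiling by exactly N/k, which is the factor mentioned above.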

To accommodate this better, we need to move the slow part of upload from
close() into write(). That means that whatever slow DSL link we're traversing
(either ciphertext to the helper or shares to the grid) needs to get data
during write().

This requires a number of items:

* an HTTP interface that will accept partial data.
  - twisted.web doesn't deliver the Request to the Resource until the body
    has been fully received, so to continue using twisted.web we must
    either hack it or add something application-visible (like "upload
    handles" which accept multiple PUTs or POSTs and then a final "close"
    action).
  - twisted.web2 offers streaming uploads, but 1) it isn't released yet, 2)
    all the Twisted folks I've spoken to say we shouldn't use it yet, and
    3) it doesn't work with Nevow. To use it, we would probably need to
    include a copy of twisted.web2 with Tahoe, which either means renaming
    it to something that doesn't conflict with the twisted package, or
    including a copy of twisted as well.
* some way to use randomly-generated encryption keys instead of CHK-based
  ones. At the very least we must make sure that we can start sending data
  over the slow link before we've read the entire file. The FUSE interface
  (with open/write/close) doesn't give the FUSE plugin knowledge of the
  full file before the close() call. Our current helper remote interface
  requires knowledge of the storage index (and thus the key) before the
  helper is contacted. This introduces a tension between de-duplication and
  streaming upload.
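The tension in the last item can be seen in a minimal sketch. The content-derived key below is a simplified stand-in (a truncated SHA-256 of the plaintext), not Tahoe's actual CHK derivation, and both function names are hypothetical. The point is only that a content-derived key needs every byte of the file, while a random key is available before the first write().

```python
import hashlib
import os


def content_derived_key(chunks):
    """Simplified stand-in for a CHK-style key: it depends on the whole file."""
    hasher = hashlib.sha256()
    for chunk in chunks:  # every chunk must be seen before the key exists
        hasher.update(chunk)
    return hasher.digest()[:16]


def random_key():
    """A random 16-byte key: usable immediately, but never converges."""
    return os.urandom(16)


same_a = content_derived_key([b"hello ", b"world"])
same_b = content_derived_key([b"hello world"])
print(same_a == same_b)               # identical content gives identical keys
print(random_key() == random_key())   # random keys do not de-duplicate
```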

I've got more notes on this stuff.. will add them later.

warner added this to the 0.9.0 (Allmydata 3.0 final) milestone 2008-02-27 20:12:02 +00:00

warner commented

2008-03-05 18:35:12 +00:00

Since it looks like twisted.web2 won't be ready for production use for a
while (if ever), and the hacks we'd have to make to twisted.web1 would be
effectively the same as rewriting twisted.web2, we decided to go with the
application-visible approach. This means upload handles.

Rob was more comfortable with server-generated handles than with
client-generated ones, so the web-API I'm planning to build will use a series
of POSTs like so:

* POST /upload/open?key=KEYSPEC
  - if KEYSPEC is "CHK", then the server will buffer plaintext until the
    close, then compute the CHK encryption key. This defeats streaming.
  - if KEYSPEC is "random", then the server will generate a random
    encryption key. This enables streaming.
  - if KEYSPEC is a 32-character hexadecimal string, the server will use
    the equivalent binary form as the encryption key. This enables
    streaming.
  - if KEYSPEC is a 26-character base32-encoded string, the server will use
    the equivalent binary form as the encryption key. This enables
    streaming. This is the same form as the output of 'tahoe dump-cap'.
  - the response body of the /upload/open call is an upload handle, composed
    entirely of URL-safe ASCII characters. All further calls will use it.
* POST /upload/$HANDLE
  - the body of the POST will be one chunk of file data. All chunks will be
    written in order. No seek calls are supported at this time. The
    Content-Type of the POST can be anything except one of the usual HTML
    form encoding types (multipart/form-data or
    application/x-www-form-urlencoded), to prevent the twisted.web request
    handler from attempting to parse the chunk.
  - The POST will stall if necessary to prevent too much storage from being
    consumed in the client. If the upload is occurring in a streaming
    fashion, this will attempt to push the chunk over the slow link before
    returning, to accomplish the goal of moving the upload time from the
    close() call to the write() calls.
  - The response body will be empty
* POST /upload/$HANDLE?close=true
  - the last chunk should be accompanied by ?close=true . This chunk may be
    empty.
  - the POST will stall until the upload has completed
  - the response body will contain the URI of the uploaded file
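Seen from the application side, the sequence above could look roughly like the sketch below. This API was only a proposal, so everything here is hypothetical: `UploadSession`, the `post(path, body)` transport callable, and the in-memory `FakeNode` (with its made-up handle and URI strings) exist only to show the order of calls.

```python
class UploadSession:
    """Client side of the proposed open / chunk / close POST sequence."""

    def __init__(self, post, keyspec="random"):
        # `post(path, body) -> bytes` is a hypothetical transport callable.
        self._post = post
        self.handle = post("/upload/open?key=%s" % keyspec, b"").decode("ascii")

    def write(self, chunk):
        # One POST per chunk, sent in order; the response body is empty.
        self._post("/upload/%s" % self.handle, chunk)

    def close(self, last_chunk=b""):
        # The final (possibly empty) chunk carries ?close=true and the
        # response body is the URI of the uploaded file.
        path = "/upload/%s?close=true" % self.handle
        return self._post(path, last_chunk).decode("ascii")


class FakeNode:
    """In-memory stand-in for the tahoe node, for demonstration only."""

    def __init__(self):
        self.uploads = {}
        self.log = []

    def post(self, path, body):
        self.log.append(path)
        if path.startswith("/upload/open"):
            handle = "h%d" % len(self.uploads)  # made-up handle format
            self.uploads[handle] = bytearray()
            return handle.encode("ascii")
        name, _, query = path[len("/upload/"):].partition("?")
        self.uploads[name] += body
        if query == "close=true":
            return ("URI:FAKE:%d" % len(self.uploads[name])).encode("ascii")
        return b""


node = FakeNode()
session = UploadSession(node.post)
session.write(b"hello ")
session.write(b"world")
print(session.close())
```

A FUSE plugin would map open() to the constructor, each write() to `write`, and close() to `close`, so the time spent waiting moves into the write() calls whenever the key choice allows streaming.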

This API is all the application needs to know about, but to make streaming
work, we need a bit more under the hood. The largest current challenge is
that immutable lease requests must be accompanied by an accurate size value,
so we can't start encoding until we know the size of the file. That means we
can only get streaming with a helper. We need a new helper protocol that will
start with a storage index and then push ciphertext to the helper (instead of
having the helper pull ciphertext), then tell the helper that we're done. At
that point, the helper knows the size of the file, so it can encode and push.
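The push-style flow described in this paragraph can be sketched as a tiny state machine. The class and method names are hypothetical and do not correspond to the existing helper remote interface; the sketch only shows that the helper learns the file size at the final "done" message and cannot encode before that.

```python
class PushHelper:
    """Toy helper that receives pushed ciphertext (hypothetical interface)."""

    def __init__(self):
        self._pending = {}

    def start(self, storage_index):
        # The client announces the storage index before any data arrives.
        self._pending[storage_index] = []

    def push(self, storage_index, ciphertext):
        # Ciphertext is pushed by the client instead of pulled by the helper.
        self._pending[storage_index].append(ciphertext)

    def finish(self, storage_index):
        # Only now is the total size known, so encoding could begin here.
        data = b"".join(self._pending.pop(storage_index))
        return len(data)


helper = PushHelper()
helper.start("si-example")
helper.push("si-example", b"abc")
helper.push("si-example", b"defg")
print(helper.finish("si-example"))
```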

So I'm going to build these two protocols: the POST /upload one and the
push-to-helper one, since that will enable streaming in our current
most-important use case. Later, we can investigate a different storage-server
protocol that will let us declare a maximum size, then push data until we're
done, then reset the size to the correct value. With that one in place, we
will be able to stream without a helper. Note, however, that CHK (computed by
the tahoe node) always disables streaming.

zooko commented

2008-03-05 20:24:58 +00:00

Ugh -- I was excited about making Tahoe do streaming using the simple old RESTful API. I'm not very excited about changing the wapi to facilitate streaming. If we're going in the direction of extending the wapi to enable more sophisticated file semantics, then we should probably head in the direction of making it be a subset of WebDAV.

http://webdav.org/

Basically, there is value in better streaming performance with the current simple wapi, and there is value in a more complex API that allows things like seek() and versioning (i.e. WebDAV), but extending the wapi to do this chunked streaming is a "sour spot" in the trade-off which uglifies the wapi and enables only a little bit of added functionality.

Another reason that I'm unhappy about this decision is that code to handle the current wapi in streaming fashion already exists and works:

http://twistedmatrix.com/trac/browser/branches/web2-new-stream-1937-2

Brian wrote "twisted.web2 won't be ready for production use for a while", but I'm skeptical about what this "ready for production use" actually means concretely -- I think it has more to do with the Twisted project not having working release automation and volunteers to do release management than with there actually being bugs that would prevent that code from sufficing for this ticket.

:-(

warner commented

2008-03-06 02:17:51 +00:00

After much discussion and prioritizing, we've decided to back down from this
goal, and put this project on hold for a month or more.

The problem that we hoped to solve with this feature was that native apps
that use Tahoe through a FUSE plugin could behave badly if the close() call
took a long time to finish. A secondary goal was to make the OS's built-in
progress bar (for drag-and-drop copies) more accurate. There are three basic
approaches we can take:

1. We do streaming, write() takes a while, close() is fast. However, if
   we're using a helper, close() still only proceeds at about 3MBps, so it
   isn't instantaneous, and if some Windows app has a 30-second timeout on
   close(), this still limits us to 90MB files. Also this kind of streaming
   means that we must give up convergence. Progress bars are fairly
   accurate. Close means close.
2. No streaming, write() is fast, close() is slow. Apps have problems.
   Progress bar is wrong. Close means close.
3. No streaming, and the FUSE plugin quietly implements an asynchronous
   write cache. write() is fast, close() is fast, apps are happy, progress
   bar is wrong, close means "we'll work on it".

We decided that approach 3 was the way to go. We plan to implement sync() in
the FUSE layer to block until the write cache is empty (at least on systems
where it exists.. we aren't yet sure if the SMB protocol that windows-FUSE
uses provides such a call). Backup apps are likely to use something like
sync() to be sure the data is really flushed out, and therefore they ought to
be safe (although they might enforce some other sort of timeout on sync(),
who knows).
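A minimal sketch of approach 3, assuming a single background worker and a hypothetical `upload(name, data)` callable that does the slow work. `close_file` returns immediately and `sync` blocks until the cache is empty. None of these names come from the actual FUSE plugin.

```python
import queue
import threading


class WriteCache:
    """Asynchronous write cache: close is fast, sync waits for the uploads."""

    def __init__(self, upload):
        self._upload = upload            # hypothetical slow callable(name, data)
        self._pending = queue.Queue()
        self.failures = []               # (name, exception) for failed uploads
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def close_file(self, name, data):
        # Returns at once: close means "we'll work on it".
        self._pending.put((name, data))

    def sync(self):
        # Block until every queued upload has been attempted.
        self._pending.join()

    def _drain(self):
        while True:
            name, data = self._pending.get()
            try:
                self._upload(name, data)
            except Exception as exc:     # keep the worker alive on errors
                self.failures.append((name, exc))
            finally:
                self._pending.task_done()


done = []
cache = WriteCache(lambda name, data: done.append(name))
cache.close_file("letter.txt", b"draft")
cache.sync()
print(done)
```

A real cache would also need to persist pending data across restarts and report the failures somewhere visible, which this sketch leaves out.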

We'll use a separate progress indication mechanism (a toolbar icon?) to let
the user know that the write cache is non-empty, and that therefore they
should not shut down their computer quite yet. The FUSE plugin should be able
to display status information about its cache and an ETA of how long it will
take to finish pushing.

This also ties in to the dirnode batching. If we're batching directory
additions to make them go faster, we're doing write caching anyways, and have
already committed to making the close() call lie about its completion status.

We may consider exposing tahoe's current-operation progress information in a
machine-readable format to the FUSE plugin, so it can include that status in
its own. To make this accurate, we need to add some sort of "task-id" (a
unique number) to each webapi request. These task-ids can then be put in the
JSON status output web page, so the FUSE plugin can correlate the tasks.
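The correlation step might look like the sketch below. The JSON layout (an "active" list whose entries carry "task-id" and "progress") is invented purely for illustration, since the ticket does not define the format of the status output.

```python
import json


def progress_for_tasks(status_json, my_task_ids):
    """Pick out the status entries that belong to our own webapi requests.

    Assumes a hypothetical document shaped like
    {"active": [{"task-id": 7, "progress": 0.5}, ...]}.
    """
    status = json.loads(status_json)
    wanted = set(my_task_ids)
    return {
        op["task-id"]: op["progress"]
        for op in status["active"]
        if op["task-id"] in wanted
    }


example = json.dumps({"active": [
    {"task-id": 7, "progress": 0.5},
    {"task-id": 9, "progress": 0.1},
]})
print(progress_for_tasks(example, [7]))
```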

warner commented

2008-03-06 02:35:16 +00:00

re: twisted.web2 not being ready for a while:

When we asked the twisted.web IRC folks last week, we identified the
following problems:

1. nevow is incompatible with twisted.web2, and nobody expressed interest in
   fixing nevow, despite the offer of money
2. twisted.web2 has not been released yet, and nobody expressed interest in
   releasing it, despite the offer of money. web2 is in a strange place,
   where its existence is inhibiting work on web1, and the existence of web1
   is inhibiting work on web2.
3. despite the streaming code in twisted.web2 looking functional and (in my
   mind) well-designed, the consensus among the twisted folks was that it
   wasn't worth using, and that the code from that web2-new-stream branch
   might be better. The fact that there exist two functional streaming
   mechanisms and that the twisted community hasn't settled upon either of
   them makes me even less confident that web2 will be released any time
   soon. (it feels like they're arguing about things that don't need
   fixing). I may be completely wrong about this one, though.

Using an unreleased copy of twisted.web2 is difficult, because python's
import mechanism makes it hard to have your twisted.internet come from one
place and your twisted.web2 come from somewhere else. (setuptools "namespace
packages" are one attempt to solve this, as is the divmod "combinator", and
both appear to be pretty ugly hacks).

So I think the easiest approach would be to make a private copy of web2 in
the allmydata tree, perhaps under allmydata.tw_web2 . To do this, we'd have
to touch most of the 103 .py files and change their import statements to pull
from allmydata.tw_web2.FOO instead of twisted.web2.FOO . This would make it
difficult to apply later upstream patches, although we might get lucky and
'darcs replace' could do much of the work for us. However, I don't trust
'darcs replace' to do this correctly in the long term: I think each upstream
update would need to be applied by hand and the results carefully inspected.
We'd have to play darcs games (i.e. maintain a separate web2-tracking repo
and merge its contents into the tahoe one with some directory-renaming
patches) to enable ongoing updates. And we'd have to add 876kB of an external
library to the Tahoe source tree, which is already much larger than I'd
prefer.

The best outcome would be if the twisted folks made up their mind about web2,
made a release, and then made a release of Twisted that included it. Then we
could simply declare a dependency upon Twisted-2.6.0 or Twisted-8.0 or
whatever they're going to call it this week and we'd be done. But that's
certainly not going to happen before we ship 1.0 in a week, and I don't
believe it is going to happen within the next three months either.

So, I'm glad that we were able to decide to punt on the streaming features,
because I didn't see a happy way to implement them in a single PUT or POST,
and I too did not like the multiple-POST app-visible approach described
above.

zooko commented

2008-03-06 20:20:59 +00:00

Brian, your summary is good. One thing you overlooked is the option of shipping our own entire twisted including twisted.web2, thus avoiding renaming issues.

Also, please be more specific about what you fear might go wrong with using darcs replace. On IRC you said that a potential problem is that the token might not match other uses, for example the token "twisted.web2" wouldn't match "from twisted import web2". This is a valid concern, but I want to be clear that there is nothing buggy or vague or complicated about darcs's replace-token functionality -- you just have to spell out all tokens that you want replaced. There are no funny merge edge cases or anything with token-replace patches.

warner commented

2008-03-06 23:24:34 +00:00

Thanks! Yes, we could ship all of twisted with tahoe, at a cost of 853 files,
89 directories, and 7.8MB of python code (roughly 8 times larger than Tahoe
itself: 97 files, 7 directories, and 1.2MB in src/allmydata/). In addition,
we would be making it more difficult for users (and developers!) to use any
other version of twisted along with Tahoe.

We are effectively doing this for/to our Mac and Windows users, by virtue of
using py2app/py2exe, for the goal of making a single-file install. For that
purpose, I think it's a win, and I wouldn't mind having a custom version of
twisted in those application bundles. But for developers I think it would be
a loss.

re: 'darcs replace'. My first concern is the set of filenames on which the
operations are performed. I believe that darcs requires you to enumerate the
filenames when you perform the replace command, and later patches could add
files that contain tokens that you want to replace. The 'darcs replace' that
renames twisted.web2 with allmydata.tw_web2 in foo.py, performed in January
when we first started the process, will not catch the tokens in the new
bar.py that got added in a later version of web2 released in June.

My second (weaker) concern is the variety of forms that the import statement
might take:

* from twisted.web2 import stream
* import stream (ok, this one doesn't require rewriting)
* from twisted.web2.dav import noneprops
* import twisted.web2.dav.element.xmlext as ext
* from twisted import web2 (although I can't find any instances of this)

This mainly depends upon the regexp that darcs uses to define a 'token',
versus non-token boundaries. I think that if you just do 'darcs replace
twisted.web2 allmydata.tw_web2' then it declares '.' to be a yes-token
character, which means it can't be a token-boundary, which means that it
won't be replaced in 'twisted.web2.dav'. But there may be a way to explicitly
tell darcs what you want to use as an is-a-token regexp.
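The token-boundary worry can be illustrated with a small regexp model. This is not darcs itself: `replace_token` is a hypothetical helper, and the two character classes are assumptions chosen to show the difference between treating '.' as part of a token and treating it as a boundary.

```python
import re


def replace_token(text, old, new, token_chars=r"A-Za-z0-9_.\-"):
    """Replace `old` only where it stands alone as a whole token.

    `token_chars` is the body of a regexp character class; a match is
    rejected if a token character sits directly before or after it.
    """
    pattern = r"(?<![%s])%s(?![%s])" % (token_chars, re.escape(old), token_chars)
    return re.sub(pattern, lambda match: new, text)


line = "from twisted.web2.dav import noneprops"

# With '.' counted as a token character, 'twisted.web2.dav' is one token,
# so the shorter token 'twisted.web2' does not match inside it.
print(replace_token(line, "twisted.web2", "allmydata.tw_web2"))

# With '.' treated as a boundary, the prefix is rewritten.
print(replace_token(line, "twisted.web2", "allmydata.tw_web2",
                    token_chars=r"A-Za-z0-9_"))
```

Neither setting touches 'from twisted import web2', which is the form that would have to be spelled out as a separate replacement.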

Once 2.5 and relative imports are more common, there could be other forms,
although again it is unlikely that we'd see 'from ..web2.dav import
noneprops', since that would be a dumb equivalent of 'from dav import
noneprops'.

I don't believe web2 does dynamically-computed import statements, but I think
nevow does (using twisted.python.reflect.namedAny, for example). These would
also be likely missed by 'darcs replace'.

I haven't used 'darcs replace' enough to be comfortable with it, but I agree
that there is nothing buggy or magical about it.

warner commented

2008-05-09 00:10:40 +00:00

this isn't going to happen for 1.1.0

warner modified the milestone from 1.1.0 to undecided

2008-05-09 00:10:40 +00:00

warner commented

2008-06-01 22:08:58 +00:00

We've discussed some of the storage-server protocol changes that would support this, in http://allmydata.org/pipermail/tahoe-dev/2008-May/000630.html

Also #392 (pipeline upload segments) is related.

zooko commented

2008-09-24 13:51:29 +00:00

I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html

zooko commented

2009-02-17 22:50:47 +00:00

Argh! The lack of this feature just caused me to lose data!

My drive is nearly full on my Macbook Pro. I tried to back up a file to Tahoe so that I could delete that file to make room. While it was uploading, I started editing a very difficult, delicate, emotional letter to the OSI license-discuss mailing list about the Transitive Grace Period Public Licence.

Tahoe tried to make a temporary copy of the large file in order to hash it before uploading it, thus running my system out of disk space and causing the editor that I was using to crash and lose some of the letter I was composing. How frustrating!

The biggest reason why Tahoe doesn't already do streaming uploads is that we liked the "hash it before uploading it" approach as a way to achieve convergence, so that successive uploads of the same file by the same person would not waste upload bandwidth and storage space. Now that we have backupdb, that same goal can be handled much more efficiently (most of the time) by backupdb. Hopefully now we can move to proper streaming upload.

zooko commented

2009-04-22 18:03:20 +00:00

#684 is about the part of this in which the client can specify what encryption key to use. There is a patch submitted by Shawn Willden.

zooko changed title from ~~add streaming upload to HTTP interface~~ to add streaming (on-line) upload to HTTP interface

2009-09-10 16:11:39 +00:00

zooko commented

2009-12-04 18:50:32 +00:00

If you love this ticket, you might also like #809 (Measure how segment size affects upload/download speed.) and #398 (allow users to disable use of helper: direct uploads might be faster).

davidsarah commented

2009-12-12 20:56:10 +00:00

#684 (specifying the encryption key) is wontfixed, but I don't think it would be necessary for this ticket if random keys were used.

zooko modified the milestone from eventually to 2.0.0

2010-02-23 03:09:22 +00:00

jsgf commented

2010-03-11 18:21:59 +00:00

Are uploads using a helper streaming?

zooko commented

2010-05-15 04:42:19 +00:00

Replying to jsgf:

Are uploads using a helper streaming?

Currently the Tahoe-LAFS gateway (storage client) receives the entire file plaintext, writes it out to a temp file on disk (while computing the secure hash of it), then generates an encryption key (using that secure hash), then reads it back from the temp file on disk, encrypting as it goes. This is all the same whether you're using an immutable upload helper or not. The difference is that without the immutable upload helper you also do erasure coding during this second pass while you are doing encryption. With the immutable upload helper you just do the encryption, streaming the ciphertext to the immutable upload helper, which does the erasure coding.
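The two passes described above can be sketched roughly as follows. This is an illustration only, not the Tahoe-LAFS code: the function names, the segment size, the use of plain SHA-256, and the truncation of the digest to a 16-byte key are all assumptions, and the encryption and erasure-coding steps are left out.

```python
import hashlib
import io
import tempfile

SEGMENT_SIZE = 128 * 1024  # assumed segment size, for illustration only


def spool_and_hash(plaintext_stream):
    """Pass 1: copy the whole plaintext to a temp file while hashing it."""
    hasher = hashlib.sha256()  # stand-in for the real secure hash
    spool = tempfile.TemporaryFile()
    while True:
        chunk = plaintext_stream.read(SEGMENT_SIZE)
        if not chunk:
            break
        hasher.update(chunk)
        spool.write(chunk)
    spool.seek(0)
    # Hypothetical key derivation: the key depends on the entire content,
    # so it is not known until the last byte has arrived.
    key = hasher.digest()[:16]
    return spool, key


def read_segments(spool):
    """Pass 2: read the spooled plaintext back, one segment at a time."""
    while True:
        segment = spool.read(SEGMENT_SIZE)
        if not segment:
            break
        yield segment  # a real gateway would encrypt each segment here


def two_pass_upload(plaintext_stream):
    """Run both passes; return the derived key and the bytes re-read."""
    spool, key = spool_and_hash(plaintext_stream)
    with spool:
        total = sum(len(segment) for segment in read_segments(spool))
    return key, total


if __name__ == "__main__":
    key_a, size_a = two_pass_upload(io.BytesIO(b"x" * 300000))
    key_b, size_b = two_pass_upload(io.BytesIO(b"x" * 300000))
    print(key_a == key_b, size_a, size_b)
```

Because the key is a function of the whole file, nothing can be encrypted or sent until pass 1 has finished, which is why the temp file needs as much free disk space as the upload itself.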

zooko commented

2010-05-16 05:27:25 +00:00

#294 (make the option of random-key encryption available through the wui and cli) was about a related issue. In order to do streaming upload the Tahoe-LAFS gateway will of course have to do random-key encryption. However, I don't think users actually need to have a switch to control random-key encryption as such, so I've closed #294 and marked it as a duplicate of this ticket.
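A minimal sketch of why random-key encryption fits streaming, under stated assumptions: `streaming_upload` and its `send` callback are hypothetical names, the 16-byte key length is illustrative, and no actual encryption is performed. The point is only that the key exists before the first byte arrives, so each chunk can be handed onward immediately.

```python
import os


def streaming_upload(chunks, send):
    """Choose a random key up front, then forward each chunk as it arrives."""
    key = os.urandom(16)  # independent of the file content
    sent = 0
    for chunk in chunks:
        send(key, chunk)  # hypothetical: encrypt under `key` and push onward
        sent += len(chunk)
    return key, sent


received = []
key_a, sent_a = streaming_upload(
    [b"first ", b"second"], lambda key, chunk: received.append(chunk)
)
key_b, sent_b = streaming_upload([b"first ", b"second"], lambda key, chunk: None)
print(sent_a, key_a != key_b)
```

The trade-off is that two uploads of the same content get different keys, so convergence is lost; an earlier comment in this thread suggests backupdb can cover that goal most of the time.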

zooko commented

2010-05-16 06:05:06 +00:00

I intend to have a go at this for Tahoe-LAFS v1.8. The part that I'm likely to have the most trouble with is getting access to the first part of the file which has been uploaded from e.g. the web browser to the twisted.web web server before the entire file has been uploaded. There is a longstanding, stale twisted ticket which is in the context of the now abandoned twisted.web2 project:

http://twistedmatrix.com/trac/ticket/1937 # in twisted.web2, change "stream" to use newfangled not yet defined stream api

There may be some other way to get access to the data incrementally before the entire file has been completely uploaded. Help?

zooko commented

2010-05-16 06:06:32 +00:00

Other tickets that we would hopefully also be able to close as part of this work:

#1032 Display active HTTP upload operations on the status page
#951 uploads aren't cancelled by closing the web page
#952 multiple simultaneous uploads of the same file

zooko modified the milestone from 2.0.0 to 1.8.0

2010-05-16 18:28:51 +00:00

zooko self-assigned this 2010-05-16 18:28:51 +00:00

davidsarah commented

2010-05-16 23:35:42 +00:00

In the case of the SFTP frontend, there is no problem with getting at the upload stream, unlike HTTP. So we could implement streaming upload immediately for SFTP at least in some cases (see #1041 for details), if the uploader itself supported it.

Perhaps we should leave this ticket for the issue of getting at the upload stream of an HTTP request in twisted.web (which is what most of the above comments are about), and open a ticket for streaming support in the new uploader. It looks like the current IUploadable interface isn't really suited to streaming (for example it has a get_size method, and it pulls the data when a "push" approach would be more appropriate), so there is some design work to do on that new ticket that is independent of HTTP.
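To make the pull-versus-push distinction concrete, here is a rough sketch. Neither class is the real IUploadable or a proposed replacement: `PullUploadable` is a simplified stand-in that keeps only the `get_size` and read-on-demand aspects mentioned above, and `PushUploadTarget`, its constructor arguments, and its `on_segment` callback are hypothetical.

```python
class PullUploadable:
    """Pull style: the uploader asks for the size, then reads on demand."""

    def __init__(self, data):
        self._data = data
        self._offset = 0

    def get_size(self):
        # Requires the whole file to be known before the upload starts.
        return len(self._data)

    def read(self, length):
        chunk = self._data[self._offset:self._offset + length]
        self._offset += len(chunk)
        return chunk


class PushUploadTarget:
    """Push style (hypothetical): the producer calls write() as data arrives."""

    def __init__(self, segment_size, on_segment):
        self._segment_size = segment_size
        self._on_segment = on_segment
        self._buffer = bytearray()
        self._total = 0

    def write(self, data):
        self._buffer.extend(data)
        self._total += len(data)
        # Emit every full segment as soon as enough bytes have accumulated.
        while len(self._buffer) >= self._segment_size:
            segment = bytes(self._buffer[:self._segment_size])
            del self._buffer[:self._segment_size]
            self._on_segment(segment)

    def close(self):
        # Flush the final short segment; the size is only known now.
        if self._buffer:
            self._on_segment(bytes(self._buffer))
            self._buffer.clear()
        return self._total
```

In the push form the total size is a result of `close()` rather than an input, which is the property a frontend such as SFTP or a FUSE write() path would need.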

zooko commented

2010-07-24 05:36:58 +00:00

Although I would dearly love to get this ticket fixed, I think we have enough other important issues in front of us for v1.8.0, so I'm moving this into the "soon" Milestone. If you think you can fix this in the next couple of weeks, move it back into the "1.8" Milestone, but then you either have to move an equivalent mass of tickets out of "1.8" or you have to commit to spending an extra strong dose of volunteer energy to get this fixed. ;-)

zooko modified the milestone from 1.8.0 to soon

2010-07-24 05:36:58 +00:00

davidsarah commented

2011-01-03 05:08:55 +00:00

Replying to davidsarah:

Perhaps we should leave this ticket for the issue of getting at the upload stream of an HTTP request in twisted.web (which is what most of the above comments are about), and open a ticket for streaming support in the new uploader.

That ticket is #1288.

zooko commented

2011-01-27 07:25:36 +00:00

The correct ticket in the Twisted issue tracker is: http://twistedmatrix.com/trac/ticket/288 (no way to access the data of an upload which is in-progress), not http://twistedmatrix.com/trac/ticket/1937 (in twisted.web2, change "stream" to use newfangled not yet defined stream api).

There is a preliminary patch by exarkun attached to Twisted ticket 288.

tahoe-lafs modified the milestone from soon to eventually

2012-12-06 21:38:08 +00:00
