download a subtree as an archive #1029

Open
opened 2010-05-01 21:24:47 +00:00 by nejucomo · 4 comments

For some use cases it may be useful to retrieve an entire directory tree as an archive. Perhaps the wapi call would look like:

GET /uri/$DIRCAP?t=archive&archive_format=tgz

to retrieve a gzipped tarball.
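
To make the proposal concrete, here is a minimal sketch of how a client might build that request URL. The `t=archive` action is only the proposal in this ticket, not an existing webapi call; the helper name `archive_url`, the gateway address, and the placeholder dircap are assumptions for illustration.

```python
from urllib.parse import quote, urlencode


def archive_url(gateway, dircap, archive_format="tgz"):
    """Build the URL for the *proposed* t=archive request (hypothetical)."""
    # Query parameters exactly as written in the proposal above.
    query = urlencode({"t": "archive", "archive_format": archive_format})
    # Percent-encode the cap so its colons are safe inside a URL path.
    return f"{gateway.rstrip('/')}/uri/{quote(dircap, safe='')}?{query}"


# Gateway address and dircap are placeholders, not real values.
print(archive_url("http://127.0.0.1:3456", "URI:DIR2:example"))
```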

Issues:

Should the action parameter be t= or some other name, such as output=?

How will the browser name this file?

What if the directory structure contains loops?

What if the full directory tree is huge?

nejucomo added the code-frontend-web, minor, enhancement, 1.6.1 labels 2010-05-01 21:24:47 +00:00
nejucomo added this to the undecided milestone 2010-05-01 21:24:47 +00:00
davidsarah commented 2010-05-01 23:46:15 +00:00
Owner

My suggested answers:

The action parameter should be t=, because:

  • different GET ...?t= actions retrieve different kinds of information about the referenced object(s), which is the case here;
  • this better fits the existing structure of the webapi code, which dispatches on t= first.

The format parameter name doesn't need to be as long as archive_format; it could just be format or output.

The filename should be the last component of the path to the directory if given, otherwise the short base32 SI of the directory. The filetype should be given by the format parameter. It should be possible to override the filename+type using @@named.
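
A rough sketch of that naming rule, purely for illustration. The function name, its arguments, and the way the override is passed in are assumptions; in particular, `named` only stands in for the `@@named` mechanism and does not reproduce how the webapi parses it.

```python
def archive_filename(path_components, short_si, fmt, named=None):
    """Pick a download filename following the rule suggested above.

    path_components: names in the path below the root cap (may be empty)
    short_si: short base32 SI of the directory (hypothetical input)
    fmt: value of the format parameter, e.g. "tgz"
    named: explicit override, standing in for @@named
    """
    if named:
        # An explicit override replaces both the name and the type.
        return named
    # Last path component if a path was given, otherwise the short SI.
    base = path_components[-1] if path_components else short_si
    return f"{base}.{fmt}"
```

For example, `archive_filename(["photos", "2010"], "abcde", "tgz")` gives `2010.tgz`, while an empty path falls back to `abcde.tgz`.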

Loops should cause an error. Since the response may already have been started when the loop is detected, this can't be an HTTP error response -- see #822 for possible ways of dealing with that. The gateway will have to remember the SIs of already-seen directories in order to detect loops. (In theory it should be sufficient to remember only mutable directories. We should already be doing that for recursive operations, but I'm not sure we are.)
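
A minimal sketch of loop detection during a recursive walk. The in-memory `dirs` mapping and the names `walk` and `LoopError` are hypothetical stand-ins for real directory nodes. One design choice to note: this version remembers only the SIs of ancestors on the current path, so a directory linked from two different places is not reported as a loop. Remembering every already-seen SI, as described above, would be simpler but would also reject such shared subdirectories.

```python
class LoopError(Exception):
    """Raised when a directory is reached again from inside itself."""


def walk(dirs, si, path=(), ancestors=frozenset()):
    """Yield (path, data) for every file below directory `si`.

    dirs maps a directory SI to {name: ("file", data) | ("dir", child_si)}.
    """
    if si in ancestors:
        # This SI is already one of our own ancestors: a true loop.
        raise LoopError("/".join(path) or "/")
    ancestors = ancestors | {si}
    for name, (kind, value) in sorted(dirs[si].items()):
        if kind == "dir":
            yield from walk(dirs, value, path + (name,), ancestors)
        else:
            yield path + (name,), value
```

Because this is a generator, files found before the loop will already have been yielded when `LoopError` is raised, which mirrors the problem that the HTTP response may already have started.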

The directory tree potentially being huge does not present any opportunities for malicious DoS that aren't already present. To avoid these, don't share a gateway with potential DoS-attackers. It does increase the risk of accidental DoS. OTOH, the client can always abort the HTTP request.

tahoe-lafs added major and removed minor labels 2010-05-01 23:46:15 +00:00
tahoe-lafs changed title from Download a dircap as an archive. to download a subtree as an archive 2010-05-01 23:46:15 +00:00
davidsarah commented 2010-05-02 02:48:15 +00:00
Owner

#1030 is a CLI interface to this functionality.

Reasons to implement this ticket as a webapi operation rather than directly in the CLI:

  • it requires only one request to the gateway rather than many requests;
  • it allows the gateway to make storage protocol requests in parallel (the CLI could retrieve files in parallel, but all the existing CLI commands are implemented synchronously using a single thread);
  • if #204 ("virtual CDs") were implemented, this would be a more efficient way of obtaining all or part of the contents of a CD;
  • providing this function in the webapi also allows the WUI and any future JavaScript UI to use it.
Author

On directory loops: Some formats, such as tar, allow symlinks. Would it be possible to translate directory loops into symlinks appropriately?

davidsarah commented 2010-05-02 19:45:03 +00:00
Owner

Replying to nejucomo:

> On directory loops: Some formats, such as tar, allow symlinks. Would it be possible to translate directory loops into symlinks appropriately?

Yes, for those formats.

Python has built-in zipfile and tarfile modules to create .zip and .tar[.gz,.bz2] archives. The tarfile module appears to support writing an archive with symlinks (using a TarInfo object with .type = SYMTYPE and .linkname set).
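
A small sketch of the tarfile approach, written for Python 3. The member names and the helper `tar_with_symlink` are made up for illustration; the point is only that a loop back to a parent directory can be written as a symlink entry instead of being followed.

```python
import io
import tarfile


def tar_with_symlink():
    """Return the bytes of a .tar.gz whose child directory links back to its parent."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # An ordinary directory entry.
        child = tarfile.TarInfo("root/child")
        child.type = tarfile.DIRTYPE
        child.mode = 0o755
        tar.addfile(child)
        # The loop edge, stored as a symlink rather than recursed into.
        link = tarfile.TarInfo("root/child/parent")
        link.type = tarfile.SYMTYPE
        link.linkname = ".."  # relative to root/child, i.e. points at root
        tar.addfile(link)
    return buf.getvalue()
```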

Another issue is the character encoding of file paths. For .zip files there is a bit in the local file header of each file that indicates the encoding is UTF-8 (see Appendix D of the zip format spec), although only a few recently updated zip extractors will recognize this; others will misinterpret the path as Cp437. For .tar files, the PAX format always stores paths as UTF-8. PAX might not be supported by as many extractors as the GNU tar format, although it should be fairly widely supported now.
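
A short sketch of both cases, assuming Python 3, where zipfile sets the UTF-8 flag (bit 11, mask `0x800`) whenever a member name is not pure ASCII. The sample name and the helper names are illustrative only.

```python
import io
import tarfile
import zipfile

NAME = "caf\u00e9/men\u00fc.txt"  # a path with non-ASCII characters


def zip_utf8_flag(name=NAME):
    """Write one member to a zip, read it back, and report its UTF-8 flag."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"data")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        # Bit 11 of the general-purpose flags marks the name as UTF-8.
        return info.filename, bool(info.flag_bits & 0x800)


def pax_roundtrip(name=NAME):
    """Write one member to a PAX-format tar and return the name read back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = 4
        tar.addfile(info, io.BytesIO(b"data"))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getnames()[0]
```

This only shows what Python writes; whether a given extractor honours the zip flag or the PAX headers is the compatibility question raised above.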

Reference: tahoe-lafs/trac-2024-07-25#1029