make SFTP frontend handle updates to MDMFs without downloading and uploading the entire file #1496

Open
opened 2011-08-22 05:35:34 +00:00 by zooko · 3 comments

It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file.

However, I think this should probably be considered a blocker for 1.9 final. Users could legitimately expect that the performance benefits of MDMFs -- namely spending only approximately A network usage to overwrite A bytes out of a B-byte MDMF -- will apply when the edit the file through SFTPD as well as when they [edit it through the WAPI]source:ticket393-MDMF-2/docs/frontends/webapi.rst?rev=5138#writing-uploading-a-file.

We should update [performance.rst]source:ticket393-MDMF-2/docs/performance.rst to state what the performance of MDMFs is in addition to the performance of SDMFs and immutables. If we were going to ship Tahoe-LAFS v1.9 with the current behavior (which seems like a bad idea to me at the moment), then we would need to add another section of MDMF as edited through SFTPD in addition to MDMF as edited through the WAPI.

It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file. However, I think this should probably be considered a blocker for 1.9 final. Users could legitimately expect that the performance benefits of MDMFs -- namely spending only approximately *A* network usage to overwrite *A* bytes out of a *B*-byte MDMF -- will apply when the edit the file through SFTPD as well as when they [edit it through the WAPI]source:ticket393-MDMF-2/docs/frontends/webapi.rst?rev=5138#writing-uploading-a-file. We should update [performance.rst]source:ticket393-MDMF-2/docs/performance.rst to state what the performance of MDMFs is in addition to the performance of SDMFs and immutables. If we were going to ship Tahoe-LAFS v1.9 with the current behavior (which seems like a bad idea to me at the moment), then we would need to add another section of MDMF as edited through SFTPD in addition to MDMF as edited through the WAPI.
zooko added the
code-mutable
critical
defect
1.8.2
labels 2011-08-22 05:35:34 +00:00
zooko added this to the 1.9.0 milestone 2011-08-22 05:35:34 +00:00
davidsarah commented 2011-08-23 16:06:02 +00:00
Owner

Replying to zooko:

It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file.

However, I think this should probably be considered a blocker for 1.9 final.

There are two applicable optimizations.

a) for immutable and MDMF files: download segments out-of-order, i.e. if the client tries to read from a segment beyond the last downloaded segment so far, schedule that segment to be downloaded next.

b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed.

I think you're talking about b). An [OverwriteableFileConsumer]source:src/allmydata/frontends/sftpd.py@5179#L294 instance already keeps track of regions that have been overwritten, but it currently discards information about regions that have also been fully downloaded, and it's slightly inconvenient to change that (because we use a heap to provide efficient access to the first remaining region that has not yet been downloaded). It's feasible to implement b) within the 1.9 schedule, but it does require some non-trivial code changes, so we'd probably want to do it before the beta.

I don't think this should be considered a blocker, though. Remember that the SFTP frontend never creates mutable files, even though it can read and write existing ones. So someone using SFTP as their main interface would rarely, if ever, be affected by the performance of MDMF as seen through SFTP.

Also, OverwriteableFileConsumer already has a fairly complicated implementation. I had planned to improve its test coverage before making any further optimizations. Currently it is not as well-tested as the rest of sftpd.py, partly because its behaviour depends nondeterministically on the timing of the download relative to the timing of requests from the SFTP client, which is more difficult to test (although it's possible to make the test deterministic by mocking the downloader).

We should update [performance.rst]source:ticket393-MDMF-2/docs/performance.rst to state what the performance of MDMFs is in addition to the performance of SDMFs and immutables. If we were going to ship Tahoe-LAFS v1.9 with the current behavior (which seems like a bad idea to me at the moment), then we would need to add another section of MDMF as edited through SFTPD in addition to MDMF as edited through the WAPI.

If we don't implement this optimization for 1.9, we would just need to add a note that the SFTP frontend does not have any MDMF-specific optimizations, so its performance for MDMF is the same as for SDMF.

Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/6558): > It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file. > > However, I think this should probably be considered a blocker for 1.9 final. There are two applicable optimizations. a) for immutable and MDMF files: download segments out-of-order, i.e. if the client tries to read from a segment beyond the last downloaded segment so far, schedule that segment to be downloaded next. b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed. I think you're talking about b). An [OverwriteableFileConsumer]source:src/allmydata/frontends/sftpd.py@5179#L294 instance already keeps track of regions that have been overwritten, but it currently discards information about regions that have also been fully downloaded, and it's slightly inconvenient to change that (because we use a heap to provide efficient access to the first remaining region that has not yet been downloaded). It's feasible to implement b) within the 1.9 schedule, but it does require some non-trivial code changes, so we'd probably want to do it before the beta. I don't think this should be considered a blocker, though. Remember that the SFTP frontend never creates mutable files, even though it can read and write existing ones. So someone using SFTP as their main interface would rarely, if ever, be affected by the performance of MDMF as seen through SFTP. Also, OverwriteableFileConsumer already has a fairly complicated implementation. I had planned to improve its test coverage before making any further optimizations. Currently it is not as well-tested as the rest of sftpd.py, partly because its behaviour depends nondeterministically on the timing of the download relative to the timing of requests from the SFTP client, which is more difficult to test (although it's possible to make the test deterministic by mocking the downloader). > We should update [performance.rst]source:ticket393-MDMF-2/docs/performance.rst to state what the performance of MDMFs is in addition to the performance of SDMFs and immutables. If we were going to ship Tahoe-LAFS v1.9 with the current behavior (which seems like a bad idea to me at the moment), then we would need to add another section of MDMF as edited through SFTPD in addition to MDMF as edited through the WAPI. If we don't implement this optimization for 1.9, we would just need to add a note that the SFTP frontend does not have any MDMF-specific optimizations, so its performance for MDMF is the same as for SDMF.
tahoe-lafs added
major
and removed
critical
labels 2011-08-23 16:06:02 +00:00
davidsarah commented 2011-08-23 19:27:31 +00:00
Owner

Replying to [davidsarah]comment:2:

Replying to zooko:

It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file.

However, I think this should probably be considered a blocker for 1.9 final.

b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed.

As explained in /tahoe-lafs/trac-2024-07-25/issues/5455#comment:38 and /tahoe-lafs/trac-2024-07-25/issues/5455#comment:135, the MDMF uploader always uploads whole shares, even if its client tells it the regions that have changed. So making the SFTP frontend tell it which regions have changed should not be a blocker for 1.9.

(I'm not sure I agree with the reasoning in /tahoe-lafs/trac-2024-07-25/issues/5455#comment:38, but that's a separate issue.)

If we don't implement this optimization for 1.9, we would just need to add a note that the SFTP frontend does not have any MDMF-specific optimizations, so its performance for MDMF is the same as for SDMF.

Actually, the memory usage for downloads should be better than for SDMF.

Replying to [davidsarah]comment:2: > Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/6558): > > It appears that the current version of the #393 branch, in the SFTPD frontend, [downloads the entire MDMF file and then uploads the entire new version of it]source:ticket393-MDMF-2/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5151#L815, even if the SFTP client has overwritten only a portion of it. This isn't a regression—Tahoe-LAFS v1.8 didn't have MDMF's at all, but did [the same download-entire-file-and-upload-entire-new-version]source:trunk/src/allmydata/frontends/sftpd.py?annotate=blame&rev=5127#L828 in order to let an SFTP client appear to "overwrite" a portion of an immutable file. > > > > However, I think this should probably be considered a blocker for 1.9 final. > > b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed. As explained in [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:38 and [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:135, the MDMF uploader always uploads whole shares, even if its client tells it the regions that have changed. So making the SFTP frontend tell it which regions have changed should not be a blocker for 1.9. (I'm not sure I agree with the reasoning in [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:38, but that's a separate issue.) > If we don't implement this optimization for 1.9, we would just need to add a note that the SFTP frontend does not have any MDMF-specific optimizations, so its performance for MDMF is the same as for SDMF. Actually, the memory usage for downloads should be better than for SDMF.
tahoe-lafs modified the milestone from 1.9.0 to 1.10.0 2011-08-23 19:27:31 +00:00
davidsarah commented 2011-08-26 20:54:37 +00:00
Owner

Replying to [davidsarah]comment:3:

Replying to [davidsarah]comment:2:

Replying to zooko:

However, I think this should probably be considered a blocker for 1.9 final.

b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed.

As explained in /tahoe-lafs/trac-2024-07-25/issues/5455#comment:38 and /tahoe-lafs/trac-2024-07-25/issues/5455#comment:135, the MDMF uploader always uploads whole shares, even if its client tells it the regions that have changed.

I was mistaken about that; Kevan clarified in /tahoe-lafs/trac-2024-07-25/issues/5455#comment:152 :

As an aside, we still do partial updates -- we just do them in such a
way as to ensure that all of the changes are written at once, in that we
retain all of the updated segments in memory as write vectors, then send
those all at once to the storage server.

I still think this should not be a blocker for 1.9, though, since it's too big a change.

Replying to [davidsarah]comment:3: > Replying to [davidsarah]comment:2: > > Replying to [zooko](/tahoe-lafs/trac-2024-07-25/issues/6558): > > > However, I think this should probably be considered a blocker for 1.9 final. > > > > b) for MDMF files: when the SFTP file handle is closed, overwrite only segments that have changed. > > As explained in [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:38 and [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:135, the MDMF uploader always uploads whole shares, even if its client tells it the regions that have changed. I was mistaken about that; Kevan clarified in [/tahoe-lafs/trac-2024-07-25/issues/5455](/tahoe-lafs/trac-2024-07-25/issues/5455)#comment:152 : > As an aside, we still do partial updates -- we just do them in such a > way as to ensure that all of the changes are written at once, in that we > retain all of the updated segments in memory as write vectors, then send > those all at once to the storage server. I still think this should not be a blocker for 1.9, though, since it's too big a change.
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1496
No description provided.