warn users about the performance issues of mutable files #878

Closed
opened 2010-01-04 22:35:40 +00:00 by zooko · 17 comments

Performance issues:

  • mutable files are stored in their entirety in RAM briefly during upload
  • creating a new mutable file requires creating a new RSA public/private key-pair, which can take as many as a billion CPU cycles

Currently, new users can carefully read the Tahoe-LAFS docs and then go on and decide to use mutable files without being aware of these issues. To close this ticket, fix that.

zooko added the unknown, major, defect, 1.5.0 labels 2010-01-04 22:35:40 +00:00
zooko added this to the undecided milestone 2010-01-04 22:35:40 +00:00
Author

Here's the thread where new user Jody Harris made it clear that a new user who does read the docs still doesn't learn about these issues: http://allmydata.org/pipermail/tahoe-dev/2010-January/003478.html

kevan commented 2010-01-05 17:54:36 +00:00
Owner

I'll take care of this.

kevan commented 2010-01-05 19:07:34 +00:00
Owner

I added the documentation to known_issues.txt (http://allmydata.org/trac/tahoe/browser/docs/known_issues.txt?rev=2a63fc9159f80b08), since there are open proposals and tickets that aim to fix this (which would seem to imply that it is a known issue).

Thoughts? Things that should be there but aren't?

kevan commented 2010-01-06 17:20:05 +00:00
Owner

After reading a message (http://allmydata.org/pipermail/tahoe-dev/2010-January/003488.html) on tahoe-dev, I realized that I had misunderstood mutable file modification when writing my first patch. While the process I described was accurate for certain operations (specifically directory modification), it didn't apply to file creation using the CLI or the WUI, the places where users would be creating mutable files, and the places where the warning would be relevant. I'm attaching a reworded patch that fixes this issue.

Author

This ticket is a subset of #757 (there isn't a doc that says "which operations are efficient").

Author

FWIW here are measurements of how many CPU cycles are needed to generate a 2048-bit RSA key: http://bench.cr.yp.to/results-sign.html (the ones labelled "ronald2048"). That is not measuring the same implementation of RSA as the one we use, but it is a good benchmark to show that generating RSA keys is expensive.

davidsarah commented 2010-01-14 21:49:55 +00:00
Owner

(http://allmydata.org/trac/tahoe/attachment/ticket/878/mutable_docs.txt#L21) :
"will be invalidated if the file is modified" -> "would be invalidated if the file were modified".

davidsarah commented 2010-01-14 21:51:59 +00:00
Owner

"tahoe-lafs" -> "Tahoe-LAFS" (three times)


While "billions of CPU cycles" is technically accurate, it would be more meaningful to users to say "perhaps an entire second on a desktop PC" (and maybe a parenthetical remark about small ARM boxes). We don't want to scare them away from using directories altogether, just help them understand why a loop that creates a million directories might take a million seconds.

Also, I believe the motivation for this ticket was specifically about large mutable files, so I'd emphasize the unfortunate-and-we-haven't-fixed-with-MDMF performance aspects (i.e. the cost=O(filesize) parts) rather than the unfortunate-and-we-haven't-fixed-with-ECDSA aspects (like the constant cost of creating new mutable files).
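
A rough sketch of the arithmetic behind the two units discussed here, for readers who want to translate between them. The 2 GHz clock rate is an assumption chosen only for illustration; the billion-cycle and one-second figures are the estimates quoted in this ticket, not measurements of Tahoe-LAFS itself, and the function names are hypothetical.

```python
def cycles_to_seconds(cycles: float, clock_hz: float) -> float:
    """Convert a CPU-cycle count to wall-clock seconds at a given clock rate."""
    return cycles / clock_hz


def loop_seconds(n_items: int, seconds_per_item: float) -> float:
    """Total time for a loop that pays a fixed cost per created item."""
    return n_items * seconds_per_item


# Figures taken from the ticket discussion (estimates, not measurements).
CYCLES_PER_KEYGEN = 1e9      # "as many as a billion CPU cycles"
SECONDS_PER_MKDIR = 1.0      # "perhaps an entire second on a desktop PC"

# Assumption for illustration only: a 2 GHz CPU.
ASSUMED_CLOCK_HZ = 2e9

per_key = cycles_to_seconds(CYCLES_PER_KEYGEN, ASSUMED_CLOCK_HZ)
total = loop_seconds(1_000_000, SECONDS_PER_MKDIR)

print(f"key generation: about {per_key:.2f} s at the assumed clock rate")
print(f"a million directories: about {total / 86400:.1f} days")
```

Under these assumptions a single key generation costs about half a second, and a million-directory loop at one second each runs for roughly eleven and a half days, which is the scale of cost the warning is meant to convey.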

Author

For Jody Harris, seconds elapsed on today's average PC might be more useful (or maybe not -- perhaps he prefers CPU cycles), but for Jonathan Ellis (the bug reporter of #757) CPU cycles are probably more useful. Also I wonder about people who are running their Tahoe-LAFS gateway on a virtual machine. Would seconds-on-an-average-modern-CPU significantly underestimate the cost to them?


Like I said, "billions of CPU cycles" is more accurate (and more universal), but I think the most likely audience for this document will be well-served by having at least one human-meaningful unit of measure in there somewhere, even if only anecdotally. For example, I tell people that the unit tests currently take about 240s on my 2008-era laptop, and I tell them that "tahoe mkdir" takes about 800ms on the same machine. And I expect that people will know how their own hardware compares to a reference point like that. Let's not refuse to offer them a translation hint just because we can't give them an exact number of seconds for their particular hardware.

kevan commented 2010-01-15 03:35:24 +00:00
Owner

I'm updating the patch to include David-Sarah's suggestions. Thanks for the feedback. :)

kevan commented 2010-01-15 04:57:08 +00:00
Owner

zooko and I were talking in IRC, and concluded that the explanation of why RSA is used with mutable files is inappropriate for known_issues.txt. I'll remove it when I work on the cycles versus seconds issue.

kevan commented 2010-01-15 20:54:11 +00:00
Owner

I think I agree with Brian.

Without a meaningful human figure to put "billions of CPU cycles" into perspective, that paragraph is a tad scarier than it needs to be. My first instinct when reading this exchange was to try to work both figures in there, but the point of that paragraph seems a lot clearer with only seconds than with both cycles and seconds.

I moved the explanation of mutable file performance issues to docs/performance.txt, because that seemed like a more appropriate place for it.

kevan commented 2010-01-15 20:54:49 +00:00
Owner

Attachment mutable_docs.txt (35048 bytes) added

mutable file documentation

davidsarah commented 2010-01-18 02:53:14 +00:00
Owner

Looks good to me.

Author

Applied as changeset:26c6b806d7922da1. Thank you!

zooko added the fixed label 2010-01-26 14:34:50 +00:00
zooko closed this issue 2010-01-26 14:34:50 +00:00
zooko modified the milestone from undecided to 1.6.0 2010-01-26 15:04:01 +00:00
Reference: tahoe-lafs/trac-2024-07-25#878