improve alacrity by downloading only the part of the Merkle Tree that you need #800
The downloader currently reads the entire Merkle Tree over the blocks when it needs only a log-N-sized subset of it. Here is the function ReadBucketProxy._get_block_hashes(), which is told by its caller which hashes are needed, but which then dumbly goes ahead and downloads the entire set: source:src/allmydata/immutable/layout.py@4048#L415. Here is the caller, which figures out exactly which subset of the Merkle Tree it needs, asks the ReadBucketProxy for it, and then stores all the hashes it got but didn't need: source:src/allmydata/immutable/download.py@4054#L334. (Good thing it stores them, because it usually does need them later, when it proceeds to download the rest of the file.)

This issue was mentioned in this mailing list thread:
http://allmydata.org/pipermail/tahoe-dev/2009-September/002779.html
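To illustrate the "log-N-sized subset" mentioned above, here is a minimal, self-contained sketch (not the actual Tahoe-LAFS hashtree module or on-disk layout, and the names are hypothetical): for a complete binary Merkle tree stored as a flat breadth-first array, verifying one block needs only that block's sibling/uncle hashes up to the root.

```python
# Standalone sketch (assumed flat breadth-first layout, not Tahoe's actual
# hashtree code): the hashes needed to verify one leaf are its uncle chain
# plus the root -- O(log N) nodes instead of the whole 2N-1 node tree.

def needed_hash_indices(leaf_number, num_leaves):
    """Flat-array indices of the hashes needed to verify a single leaf.

    Assumes num_leaves is a power of two; leaves occupy indices
    [num_leaves-1 .. 2*num_leaves-2] and node i has parent (i-1)//2.
    """
    needed = set()
    i = (num_leaves - 1) + leaf_number          # flat index of the leaf node
    while i > 0:
        needed.add(i + 1 if i % 2 else i - 1)   # sibling of the current node
        i = (i - 1) // 2                        # climb to the parent
    needed.add(0)                               # the root anchors the chain
    return sorted(needed)

# For a 256-block file, verifying block 37 needs 9 of the 511 tree nodes:
#   needed_hash_indices(37, 256)
```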
Ticket #670 mentioned this issue.
I'm marking this as "easy". :-)
In playing around with this, I noticed that if I change the logic in http://allmydata.org/trac/tahoe/browser/src/allmydata/immutable/layout.py?rev=4a4a4f95202ec976#L415 to download only the requested hashes, a number of unit tests that enforce upper limits on the number of reads start to fail. This makes sense -- instead of downloading the entire hash tree at once with one read operation when we need any part of it, we download (with my modifications) each chunk of the hash tree in a separate read operation, so there will of course be more reads.
I probably need to look more deeply into how block hashes are used before coming up with an opinion on this; so I guess this is an FYI note for the moment.
I think it is okay for it to use more reads, so the test should be loosened to allow it to pass even if it does. The existence of that test of the number of reads does serve to remind me, however, that multiple small reads of the hash tree would actually be a performance loss for small files. We should do some more measurements of performance. Perhaps it would be a win to heuristically over-read by fetching a few more than the required number of hashes.
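A rough sketch of that over-read heuristic (hypothetical; the function name and padding value are assumptions, not anything in the tree): expand each strictly-required hash index into a small window, so one read can also satisfy nearby future requests.

```python
# Hypothetical over-read heuristic: fetch a few hashes on either side of each
# strictly-required index, on the theory that they will be needed soon anyway.
# OVERREAD is an assumed tunable, not a measured value.

OVERREAD = 4

def hashes_to_fetch(needed_indices, num_nodes, overread=OVERREAD):
    """Expand each needed hash index into a small window of indices to fetch."""
    fetch = set()
    for idx in needed_indices:
        fetch.update(range(max(0, idx - overread),
                           min(num_nodes, idx + overread + 1)))
    return sorted(fetch)
```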
#442 was a duplicate of this.
This may fit into Brian's New Downloader so I'm assigning this ticket to him to get his attention. If it is much later than February 9, and Brian hasn't clicked "accept" on this ticket, then you can safely assume he isn't currently working on it and you can click "accept" on it yourself to indicate that you are working on it.
If you like this ticket, you might also like the "Brian's New Downloader" bundle of tickets: #605 (two-hour delay to connect to a grid from Win32, if there are many storage servers unreachable), #798 (improve random-access download to retrieve/decrypt less data), #809 (Measure how segment size affects upload/download speed.), #287 (download: tolerate lost or missing servers), and #448 (download: speak to as few servers as possible).
Brian's New Downloader is now planned for v1.8.0.
#798 (which has landed) includes this feature. Specifically, source:src/allmydata/immutable/downloader/share.py@4688#L681 (Share._desire_block_hashes) asks the partially-filled hash tree which nodes it needs, and only sends read requests for those.

The small reads are coalesced only if they are adjacent: except for the first few kB of the share, the downloader does not read extra, not-always-needed data just to reduce the number of remote read() messages. That might be a nice feature to have post-1.8.0, but we need to measure the performance tradeoffs first: each read() message probably carries about 30-40 bytes of overhead, so I'd expect coalescing across gaps larger than that to be a net loss. Adding a multi-segment readv() message to the remote-read protocol might help, but would be more disruptive.

So I'm closing this ticket.
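For reference, a rough sketch of the gap-coalescing idea (not the shipped downloader code; the 35-byte threshold is only an assumption drawn from the 30-40 byte estimate above): merge read spans whose separation costs less than another remote read() message, since re-reading the gap is cheaper than an extra round of protocol overhead.

```python
# Sketch of gap coalescing under the assumptions above: merge (offset, length)
# spans whenever the gap between them costs less than another read() message.

MESSAGE_OVERHEAD = 35   # assumed per-read() overhead in bytes (30-40 estimate)

def coalesce_spans(spans, max_gap=MESSAGE_OVERHEAD):
    """Merge byte spans separated by gaps of at most max_gap bytes."""
    merged = []
    for offset, length in sorted(spans):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            new_end = max(prev_off + prev_len, offset + length)
            merged[-1] = (prev_off, new_end - prev_off)
        else:
            merged.append((offset, length))
    return merged

# coalesce_spans([(0, 32), (40, 32), (500, 32)]) -> [(0, 72), (500, 32)]
```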