new-downloader performs badly when downloading a lot of data from a file #1170
Reference: tahoe-lafs/trac-2024-07-25#1170
Some measurements:
Oh, all those 32-byte reads must have been all the hashes in the Merkle Trees. I assume that those are indeed coalesced using the clever spans structure source:src/allmydata/util/spans.py@4666. Nevertheless we should investigate the very poor performance shown in this download status file.
yeah, the 32/64-byte reads are hashtree nodes. The spans structure only coalesces adjacent/overlapping reads (the 64-byte reads are the result of two neighboring 32-byte hashtree nodes being fetched), but all requests are pipelined (note the "txtime" column in the "Requests" table, which tracks remote-bucket-read requests), and the overhead of each message is fairly small (also note the close proximity of the "rxtime" for those batches of requests). So I'm not particularly worried about merging these requests further.
My longer-term goal is to extend the Spans data structure with some sort of "close enough" merging feature: given a Spans bitmap, return a new bitmap with all the small holes filled in, so e.g. a 32-byte gap between two hashtree nodes (which might not be strictly needed until a later segment is read) would be retrieved early. The max-hole-size would need to be tuned to match the overhead of each remote-read message (probably on the order of 30-40 bytes): there's a breakeven point somewhere in there.
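A minimal sketch of the "close enough" merging idea, assuming spans are kept as a sorted list of non-overlapping (start, length) tuples. The function name `fill_small_holes` and the `max_hole` parameter are hypothetical and are not part of the real Spans class:

```python
def fill_small_holes(spans, max_hole):
    """Merge (start, length) ranges whose gap is at most max_hole bytes.

    `spans` must be sorted by start and non-overlapping. This is an
    illustration only, not the API of allmydata.util.spans.
    """
    merged = []
    for start, length in spans:
        if merged:
            prev_start, prev_length = merged[-1]
            gap = start - (prev_start + prev_length)
            if gap <= max_hole:
                # Extend the previous range across the hole.
                merged[-1] = (prev_start, start + length - prev_start)
                continue
        merged.append((start, length))
    return merged


# Two 32-byte hash nodes separated by a 32-byte hole, plus a distant read.
wanted = [(0, 32), (64, 32), (4096, 32)]
filled = fill_small_holes(wanted, max_hole=40)
```

With `max_hole=40` the two neighbouring hash nodes become one 96-byte read, while the distant range is left alone. The right threshold would have to be measured against the real per-message overhead.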
Another longer-term goal is to add a readv()-type API to the remote share-read protocol, so we could fetch multiple ranges in a single call. This doesn't shave much overhead off of just doing multiple pipelined read() requests, so again it's low-priority.
And yes, a cleverer which-share-should-I-use-now algorithm might reduce stalls like that. I'm working on visualization tools to show the raw download-status events in a Gantt-chart-like form, which should make it easier to develop such an algorithm. For now, you want to look at the Request table for correlations between reads that occur at the same time. For example, at the +1.65s point, I see several requests that take 1.81s/2.16s/2.37s. One clear improvement would be to fetch shares 0 and 5 from different servers: whatever slowed down the reads of sh0 also slowed down sh5. But note that sh8 (from the other server) took even longer: this suggests that the congestion was on your end of the line, not theirs, especially since the next segment arrived in less than half a second.
Replying to warner:
I don't understand what those columns mean (see #1169 (documentation for the new download status page)).
Replying to warner:
I'm having trouble interpreting it (re: #1169).
I tried to watch the same movie from my office network and got similarly unwatchable results, download status page attached. Could it be a problem with the way my client, VLC.app, is reading?
Attachment down-1.html (110075 bytes) added
Attachment down-2.html (3058756 bytes) added
Well, it wasn't the VLC.app client. I did another download of the same file using wget. The performance was bad--38 KB/s:
Here is the download status page for this download (attached). Note that one server had a DYHB RTT of 3 minutes and another had a DYHB RTT of 8 minutes! There were no incident report files or twistd.log entries.
The two servers with dramatically higher DYHB RTTs introduced themselves as:
and
I pinged their IP addresses:
Attachment down-0.html (5761 bytes) added
Okay, I've finally realized that this is a regression of the feature that we [added in v1.6.0]source:trunk/NEWS?rev=4698#L267 to start fetching blocks as soon as you've learned about enough shares and to use the lowest-latency servers. Attached is the download status page from v1.7.1 of trying to download this same file from the same test grid. It performs much better:
We can't release Tahoe-LAFS v1.8.0 with this behavior because it is a significant regression: people who use grids with slow or occasionally slow servers such as the public Test Grid would be ill-advised to upgrade from v1.7.1 to v1.8.0 and we don't like to release new versions that some users are ill-advised to upgrade to.
I've noticed that when tickets get more than one attachment it becomes confusing for the reader to understand what is what, so here's a quick recap:
The feature that we released in v1.6.0 was ticket #928, and we did add some sort of unit tests for it, by making some servers not respond to DYHB at all: http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/test/test_hung_server.py?rev=37a242e01af6cf76
(In the pre-1.6.0 version, that situation would cause download to stall indefinitely, so that was our primary goal at that time and that is what the tests ensure no longer happens.)
Note: the wget speed indicator reports the "current" speed, so it varies a lot during a download. To get reliable speed measurements I guess I should let wget finish, which means, I suppose, I should download a smaller file! I would use the download status page timings as an indicator of performance instead of the wget speed indicator.
Replying to warner:
Yes, I was going to point that out. Given that the DYHB responses were:
Yeah, I'd like to see some more quantifiable data. It's a pity that the
old-downloader doesn't provide as much information as the new one (a flog
might help), but obviously I learned from experience with old-downloader
while building the instrumentation on the new-downloader :).
The status data you show from both downloaders show a server in common, and
the other server responded to the DYHB very quickly, so for at least the
beginning of the download, I don't think the downloader has enough
information to do any better.
Many of the new-downloader block-requests (I'm looking at the +179s to +181s
mark) show correlated stalls of both the "fast" server (sp26) and the other
"slow" server (nszi). If the problem were a single slow server, I'd expect to
see big differences between the response times.
Interesting. So, the main known-problem with the new-downloader (at least the
one on the top of my personal list) is its willingness to pull multiple
shares from the same server (a "diversity failure"), which obviously has the
potential to be slower than getting each share from a different server.
This is plausibly acceptable for the first segment, because the moment we
receive the DYHB response that takes us above "k" shares, we're faced with a
choice: start downloading now, or wait a while (how long??) in the hopes that
new responses will increase diversity and result in a faster download.
But after the first segment, specifically after we've received the other DYHB
responses, the downloader really ought to get as much diversity as it can, so
pulling multiple shares from the same server (when there's an alternative)
isn't excusable after that point.
The fix for this is to implement the next stage of the new-downloader
project, which is to rank servers (and which-share-from-which-server
mappings) according to some criteria (possibly speed, possibly cost,
fairness, etc), and reevaluate that list after each segment is fetched. This
is closely tied into the "OVERDUE" work, which is tied into the notion of
cross-file longer-term server quality/reputation tracking, which is loosely
tied into the notion of alternative backend server classes.
And I can't get that stage finished and tested in the next week, nor is a
change that big a very stable thing to land this close to a release. So I'm
hoping that further investigation will reveal something convenient, like
maybe that 1.7.1 is actually just as variable as new-downloader on this grid,
or that the two-shares-from-one-server problem isn't as bad as it first
appears.
I do have a quick-and-dirty patch that might improve matters, which is
worth experimenting with. I'll have to dig it out of a dark corner of my
laptop, but IIRC it added an artificial half-second delay after receiving >=k
shares from fewer than k servers. If new shares were found before that timer
expired, the download would proceed with good diversity. If not, the download
would begin with bad diversity after a small delay.
It fixed the basic problem, but I don't like arbitrary delays, and didn't
address the deeper issue (you could still wind up pulling shares from slow
servers even after you have evidence that there are faster ones available),
so I didn't include it in #798.
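The grace-period rule described above could be sketched as a small decision function. This is only an illustration of the rule as stated in this comment, not the actual patch; the function name, its parameters, and the assumption that `waited` counts seconds since k shares were first known are all hypothetical:

```python
def should_start_download(num_shares, num_servers, k, waited, grace=0.5):
    """Decide whether to begin fetching the first segment.

    num_shares:  distinct shares located so far
    num_servers: distinct servers holding those shares
    waited:      seconds elapsed since we first knew about k shares
    grace:       the artificial delay (half a second in the description)
    """
    if num_shares < k:
        return False        # not enough shares yet
    if num_servers >= k:
        return True         # one share per server is possible: go now
    return waited >= grace  # poor diversity: start only once the timer expires
```

The objection raised in the comment still applies to this sketch: the value of `grace` is arbitrary, and nothing here re-evaluates the choice of servers after the first segment.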
RE davidsarah's comment:
Yeah, that's the sort of heuristic that I didn't want to guess at. It'll be
easier to see this stuff when I land the visualization code. The arrival
order of positive responses is:
At +117ms, we don't have enough shares to download. At +204ms, we have enough
shares but we'd like more diversity: we can't know that we could achieve our
ideal diversity by waiting another 8 milliseconds, so we start downloading
the first segment immediately.
By the time the second segment is started (at +977ms), we have a clearer
picture of the available shares. We also have about 40kB of experience with
each server (or 80kB for sp26, since we happened to fetch two shares from
it), which we might use to make some guesses about speeds. When the second
segment is started, at the very least we should prefer an arrangement that
gives us one share from each server. We might also like to prefer shares that
we've already been using (since we'll have to fetch fewer hash-tree nodes to
validate them); note that these two goals are already in conflict. We should
prefer servers which gave us faster responses, if we believe that they're
more likely to give fast responses in the future. But if we only hit up the
really fast servers, they'll be using more bandwidth, which might cost them
money, so they might prefer that we spread some of the load onto the slower
servers, whatever we mutually think is fair.
And we need serendipity too: we should occasionally download a share from a
random server, because it might be faster than any of the ones we're
currently using, although maybe it won't be, so a random server may slow us
down. All five of these goals conflict with each other, so there are weights
and heuristics involved, which will change over time.
And we should remember some of this information beyond the end of a single
download, rather than starting with an open mind each time, to improve
overall efficiency.
So yeah, it's a small thread that, when tugged, pulls a giant elephant into
the room. "No no, don't tug on that, you never know what it might be attached
to".
So I'm hoping to find a quicker smaller solution for the short term.
Brian asked for better measurements, and I ran quite a few (appended below). I think these results are of little use as they are very noisy and as far as I can tell I was just wrong when I thought, earlier today, that 1.8.0c2 was downloading this file slower than 1.7.1 did.
On the other hand I think these numbers are trying to tell us that something is wrong. Why does it occasionally take 40s to download 100K?
After I post this comment I will attach some status reports and flogs.
With v1.7.1 and no flogtool tail:
With 1.7.1 and flogtool tail:
v1.7.1 without tail:
v1.7.1 and flogtool tail:
Now switched from office to home.
v1.7.1 and no flogtool tail, 1M:
v1.7.1 and flogtool tail, 1M:
v1.7.1 and no flogtool tail, 100K:
v1.7.1 and flogtool tail, 100K:
v1.8.0c2 and no flogtool tail, 100K:
v1.8.0c2 and no flogtool tail, 1M:
v1.8.0c2 and flogtool tail, 1M:
Attachment flog-1.7.1.bz2 (20172 bytes) added
Attachment flog-1.8.0c2.bz2 (36494 bytes) added
Attachment 1.8.0c2-dl100M-didntusethepals-down-2.html (24501 bytes) added
I just did one more download with 1.7.1 of 1M in order to get both the status page and the flog. I named this download "run 99" so that I could keep its status page, flog, and stdout separate from all the others on this ticket.
Here is run 99, at my home, with Tahoe-LAFS v1.7.1, the first 1M of (@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@) :
I will now attach the status output and flog of run 99.
Attachment 1.7.1-run-number-99-down-0.html (2867 bytes) added
Attachment flog-1.7.1-from-run-number-99.bz2 (31625 bytes) added
I just did one more download with 1.8.0c2 of 1M in order to get both the status page and the flog. I named this download "run 100" so that I could keep its status page, flog, and stdout separate from all the others on this ticket.
Here is run 100, at my home, with Tahoe-LAFS v1.8.0c2, the first 1M of (@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@) :
Attachment flog-1.8.0c2-r4698-from-run-100.bz2 (49267 bytes) added
Attachment 1.8.0c2-r4698-run-100-down-0.html (24532 bytes) added
Okay there's no solid evidence that there is a regression from 1.7.1. I think Brian should use this ticket to analyze my flogs and status pages if he wants and then change it to be a ticket about download server selection. :-) Removing "regression".
I just did one more download with 1.8.0c2 of 100M in order to get both the status page and the flog. I named this download "run 101" so that I could keep its status page, flog, and stdout separate from all the others on this ticket.
Here is run 101, at my home, with Tahoe-LAFS v1.8.0c2, the first 100M of (@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@) :
Attachment flog-run-101-100M-1.8.0c2-r4698.bz2 (1872831 bytes) added
Attachment 1.8.0c2-r4698-run-101-down-1.html (1231326 bytes) added
I just did one more download with 1.7.1 of 100M in order to get both the status page and the flog. I named this download "run 102" so that I could keep its status page, flog, and stdout separate from all the others on this ticket.
Here is run 102, at my home, with Tahoe-LAFS v1.7.1, the first 100M of (@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@) :
Interesting that 1.7.1 was twice as fast as 1.8.0c2 this time.
Attachment flog-run-102-100M-1.7.1.bz2 (173265 bytes) added
Annoyingly, 1.7.1 has a bug where it doesn't show downloads in the status page sometimes, and that happened this time, so I can't show you the status page for run 102.
run 103
1.7.1
the first 100M
(@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@)
Attachment flog-run-103-100M-1.7.1.bz2 (2358125 bytes) added
run 104
1.8.0rc2-4698
the first 100M
(@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@)
Attachment flog-run-104-100M-1.8.0c2-r4698.bz2 (1877994 bytes) added
Hm, okay it really looks like there is a substantial (2X) slowdown for using Tahoe-LAFS v1.8.0c2 instead of v1.7.1 on today's (and yesterday's) Test Grid. I'm re-adding the regression tag, which means I think this issue should block the 1.8.0 release until we at least understand it better.
Attachment 1.8.0c2-run-104-down-0.html (1231853 bytes) added
run 105
1.7.1
the first 100M
(@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@)
Attachment flog-run-105-100M-1.7.1.bz2 (336776 bytes) added
run 106
1.8.0c2-r4698
the first 100M
(@@http://localhost:3456/file/URI%3ACHK%3A4klgnafrwsm2nx3bqy24ygac5a%3Acrz7nhthi4bevzrug6xwgif2vhiacp7wk2cfmjutoz2ns3w45qza%3A3%3A10%3A1490710513/@@named=/bbb-360p24.i420.lossless.drc.ogg.fixed.ogg%2Bbbb-24fps.flac.via-ffmpeg.ogg@@)
Attachment flog-run-106-100M-1.8.0c2-r4698.bz2 (1765981 bytes) added
Attachment 1.8.0c2-r4698-run-106-down-0.html (1230114 bytes) added
I had an idea for a not-too-complex share-selection algorithm this morning:
ShareFinder report all shares as soon as it learns
SegmentFetcher needs to start using a new share,
The general idea is to cycle through all the shares we know about, but first
try to build a sharemap that only uses one share per server (i.e. perfect
diversity). That might fail because the shares are not diverse enough, so we
can walk through the loop a second time and be willing to accept two
shares per server. If that fails, we raise our willingness to three shares
per server, etc. If we ever finish a loop without adding at least one share
to our sharemap, we declare failure: this indicates that there are not enough
distinct shares (that we know about so far) to succeed.
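The loop described above can be sketched as follows. This is a reconstruction from the prose only, under stated assumptions: `Share` is a hypothetical stand-in for the downloader's share objects, `shares` is assumed to arrive already in preference order, and the name `build_sharemap` and its return shape are invented for illustration:

```python
import itertools
from collections import namedtuple

# Hypothetical stand-in for the downloader's share objects.
Share = namedtuple("Share", ["shnum", "server"])

SUCCESS, FAIL = "SUCCESS", "FAIL"


def build_sharemap(shares, k):
    """Pick k distinct shares, preferring one share per server.

    Returns (status, sharemap, max_per_server), where sharemap maps
    shnum -> Share and max_per_server is the limit in force when we stopped.
    """
    sharemap = {}
    shares_per_server = {}
    for max_per_server in itertools.count(1):
        progress = False
        for sh in shares:
            if sh.shnum in sharemap:
                continue  # already have this share number
            if shares_per_server.get(sh.server, 0) >= max_per_server:
                continue  # this server is at its current limit
            sharemap[sh.shnum] = sh
            shares_per_server[sh.server] = shares_per_server.get(sh.server, 0) + 1
            progress = True
            if len(sharemap) >= k:
                return SUCCESS, sharemap, max_per_server
        if not progress:
            # A full pass added nothing: not enough distinct shares known.
            return FAIL, sharemap, max_per_server


known = [Share(0, "A"), Share(1, "A"), Share(2, "A"),
         Share(1, "B"), Share(2, "C")]
status, chosen, limit = build_sharemap(known, k=3)
```

In this example server A holds all three shares, but the first pass still ends with one share each from A, B and C, so `limit` stays at 1.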
If this returns FAIL, that really means we should declare "hunger" and ask
the ShareFinder to look for more shares. If we return SUCCESS but
max_shares_per_server > 1, then we should ask for more shares too (but
start the segment anyways: new shares may help the next segment do better).
This is still vulnerable to certain pathological situations, like if
everybody has a copy of sh0 but only the first server has a copy of sh1: this
will use sh0 from the first server then circle around and have to use sh1
from that server as well. A smarter algorithm would peek ahead, realize the
scarcity of sh1, and add sh1 from the first server so it could get sh0 from
one of the other servers instead.
But I think this might improve the diversity of downloads without going down
the full itertools.combinations-enumerating route that represents the
"complete" way to approach this problem.
This seems promising. It sounds like you might think that the slowdown of 1.8.0c2 vs. 1.7.1 on the current Test Grid might be due to one server being used to serve two shares in 1.8.0c2 when two different servers would be used—one for each share—in 1.7.1. Is that what you think? Have you had a chance to look at my flogs attached to this ticket to confirm that this is what is happening?
Replying to warner:
(Parenthetical historical observation which is pleasurable to me: Your heuristic algorithm for server selection (for download) in comment:79552, and your observation that it is susceptible to failure in certain cases, is similar to my proposed heuristic algorithm for server selection for upload in #778 (comment:72570, for the benefit of future cyborg archaeologist historians). David-Sarah then observed that finding the optimal solution was a standard graph theory problem named "maximum matching of a bipartite graph". Kevan then implemented it and thus we were able to finish #778.)
My copy of Cormen, Leiserson, Rivest 1st Ed. says (chapter 27.3) that the Ford-Fulkerson solution requires computation O(V * E) where V is the number of vertices (num servers plus num shares) and E is the number of edges (number of (server, share) tuples).
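For illustration, here is a minimal augmenting-path sketch that returns an actual server-to-share matching rather than only its size. It is a generic textbook-style routine written for this note, not the code in happinessutil.py; the name `max_matching` and the input shape (server id mapped to the set of share numbers it holds) are assumptions:

```python
def max_matching(shares_by_server):
    """Return {shnum: server} for a maximum bipartite matching.

    Each server serves at most one share and each share comes from at most
    one server. Runs in O(V * E) in the worst case.
    """
    owner = {}  # shnum -> server currently matched to it

    def try_assign(server, seen):
        for shnum in sorted(shares_by_server[server]):
            if shnum in seen:
                continue
            seen.add(shnum)
            # Take a free share, or re-route its current owner elsewhere.
            if shnum not in owner or try_assign(owner[shnum], seen):
                owner[shnum] = server
                return True
        return False

    for server in shares_by_server:
        try_assign(server, set())
    return owner


# Every server has sh0, but only server A has sh1.
matching = max_matching({"A": {0, 1}, "B": {0}, "C": {0}})
```

Here server A first claims sh0, and is then re-routed to sh1 when B needs sh0, which is the look-ahead behaviour that the simple greedy loop lacks.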
Now what Kevan actually implemented in [happinessutil.py]source:src/allmydata/util/happinessutil.py@4593#L80 just returns the size of the maximum matching, and what we want here is an actual matching. I'm not 100% sure but I think if you save all the paths that are returned from augmenting_path_for() in servers_of_happiness() and return the resulting set of paths then you'll have your set of server->share mappings.
Replying to zooko:
Okay this does appear to be happening in at least one of the slow v1.8.0c2 downloads attached to this ticket. I looked at 1.8.0c2-r4698-run-106-down-0.html and every request-block in it (for three different shares) went to the same server -- nszizgf5 -- which was the first server to respond to the DYHB (barely) and which happened to be the only server that had three shares. So at least for that run, Brian's idea that fetching blocks of different shares from the same server is a significant slowdown seems to be true.
Title changed from "does new-downloader perform badly for certain situations (such as today's Test Grid)?" to "new-downloader performs badly when the first server to reply to DYHB has K shares"
In http://tahoe-lafs.org/pipermail/tahoe-dev/2010-August/004998.html I wrote:
Hey waitasecond. As far as I understand, Tahoe-LAFS v1.7.1 should also—just like v1.8.0c2—start downloading all three shares from Greg's server as soon as that server is the first responder to the DYHB:
http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/immutable/download.py?rev=9f3995feb9b2769c#L923
Am I misunderstanding? So the question of why 1.7.1 seems to download 2 or 3 times as fast as 1.8.0c2 on this grid remains open.
run 107
1.7.1
the first 10MB
with cProfile profiling running but no flog running
Attachment prof-run-107.dump.txt (113473 bytes) added
run 108
1.8.0c2
the first 10MB
with cProfile profiling running but no flog running
Attachment prof-run-108-dump.txt (111816 bytes) added
run 109
1.7.1
the first 100MB
with cProfile profiling running but no flog running
Attachment prof-run-109-dump.txt (94530 bytes) added
run 110
1.8.0c2
the first 100MB
with cProfile profiling running but no flog running
Attachment prof-run-110-dump.txt (114356 bytes) added
Attachment run-110-download-status.html (1228847 bytes) added
Okay, the problem with the current downloader in 1.8.0c2 is that it goes slower and slower as it downloads more and more data from a file.
It consistently wins (or at least ties) 1.7.1 in downloads <= 10MB but consistently loses badly for 100 MB. Also the profiling result in prof-run-110-dump.txt shows major CPU usage in spans:
Title changed from "new-downloader performs badly when the first server to reply to DYHB has K shares" to "new-downloader performs badly when downloading a lot of data from a file"
Looking at [immutable/downloader/share.py]source:src/allmydata/immutable/downloader/share.py@4688, I have the following review comments:
Are _wanted, _requested, and _received old names for _pending, _received, and _unavailable? Or perhaps from a different design entirely? And that's six states, not four.
A span is add'ed to _pending in [_send_requests()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L698 and removed from _pending in [_got_data()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L833 but is not removed if the request errbacks instead of callbacks. That would be a bug for it still to be marked as "pending" after the request errbacked, wouldn't it?
We shouldn't give the author of a file the ability to raise AssertionError from [immutable/downloader/share.py line 416 _satisfy_share_hash_tree()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L416 but instead give him the ability to cause [_satisfy_offsets()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L331 to raise a LayoutInvalid exception (see related ticket #1085 (we shouldn't use "assert" to validate incoming data in introducer client))
This looks like a bug: [share.py _got_data()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L741:
That could explain the slowdown -- the items added to _received here are never removed, because the removal code in [_satisfy_block_data()]source:trunk/src/allmydata/immutable/downloader/share.py@4688#L517 is:
I added the following assertions to source:trunk/src/allmydata/util/spans.py@4666:
And indeed these assertions fail because data is not an integer.
However, then when I add this patch:
This causes a bunch of tests to fail in ways that I don't understand.
more review notes:
with assert_invariants() which iterates over all the spans. Also DataSpans.add() itself searches for where to make modifications by iterating from the beginning, which seems unnecessary. Couldn't it do a binary search to find the place it needs to modify and then modify only a local neighborhood there?
In [immutable/downloader/share.py]source:trunk/src/allmydata/immutable/downloader/share.py@4688 _unavailable can have things added to it (in case of intentional over-read or in case of failing/corrupted server) but never has things removed from it. Does that matter? I suspect that it is intentional and doesn't hurt but I'm not sure.
Also Brian discovered another bug in new-downloader last night. Here are some excerpts from IRC.
Responses to earlier comments:
prof-run-110-dump.txt suggest that in a 100MB file download, we
spent half of the total time in Spans.__init__?
_wanted, _requested, _received is stale.
_pending upon errback is a bug
LayoutInvalid is better than assert, yeah
self._received.add(start,data) is correct: _received is a DataSpans
instance, not Spans, and it holds strings, not booleans. _received holds
the data that comes back from the server until the "satisfy" code
consumes it. It has methods like get and pop, whereas the simpler Spans
class merely has methods for is-range-in-span.
ever fire. If those same assertions were added to spans.py#L295, I'd get
it. What types were start/length in your observed assertion failures? And
what was the stack trace?
self._received.add(start, length) is wrong; it must be called with
(int,str).
correct
_unavailable should be benign: the amount of unavailable data is small
and constant (if the share is intact, we should only add to _unavailable
during the first few reads if we've guessed the segsize wrong).
Now some new ideas. I've found a couple of likely issues.
definitely is growing over the course of the download. It's noisy,
but it goes from about 0.8s at the start (seg0), to about 1.5s-2.0s at the
end (seg762). I haven't looked at smaller deltas (i.e. only inside the
"desire" code) to rule out network variations, but it certainly points to
a leak or complexity increase of some sort that gets worse as the download
progresses.
100MB would rule out anything that's influenced by the absolute segment
number.
Spans.dump() strings in the flog, I see that two of the three shares
(sh7+sh8) have an ever-growing .received DataSpans structure. A cursory
check suggests they are growing by 64 bytes and one range per segment. By
the end of the download (seg762), sh7 is holding 37170 bytes in 378
ranges (whereas sh3 only has 1636 bytes in 22 ranges, and remains mostly
constant)
We keep it around in _received because it might be useful later: maybe we
ask for the wrong data because our guess of the segsize (and thus
numsegs, and thus the size/placement of the hashtrees) was wrong. But
later we might take advantage of whatever we fetched by mistake.
IncompleteHashTree.needed_hashes() call, when asked what hashes we need
to validate leaf 0, might tell us we need the hash for
leaf 0 too. However, that hash can be computed from the block of data
that makes up the leaf, so we don't really need to fetch it. (whereas we
do need the hash for leaf 1, since it sits on the "uncle chain" for
leaf0). If the desire-side code is conservatively/incorrectly asking for
the leaf0 hash, but the satisfy-side code doesn't use it, then we'll add
a single 32-byte hash node per segment.
desire-side code will ask for ciphertext hash tree nodes from each
segment we're using. However, the satisfy-side code will only use the
hashes from the first response: by the time the second response arrives,
the ciphertext hash tree is satisfied, so that clause isn't reached.
This means that we'll leave that data in ._received forever. This seems
most likely: it would explain why the first share (sh3) doesn't grow,
whereas the later two shares do, and why I saw a 64-byte increment (the
actual growth would depend upon the segment number, and how many new
uncle-chain nodes are needed, but 2-nodes is a fairly common value).
.received leftover-data issue shouldn't be such a big deal, N=378 is not
a huge number, but the measured increase in inter-segment
time suggests that whatever the O() complexity is, N=378 is enough to
cause problems.
So I think the next directions to pursue are:
using real remote_read calls, ideally 100MB or 1GB in a few seconds. This
would use a Share subclass that returns data immediately (well, after an
eventual-send) rather than ever touching a server. It might also need to
stub out some of the hashtree checks, but does need real needed_hashes
computations. Then we fix the code until this test finishes in a
reasonable amount of time. While I wouldn't have the test case assert
anything about runtime, I would have it assert things like ._received
doesn't grow over the course of the test.
old-downloader and new-downloader: if we're looking at an O(n^3^) problem,
it will manifest as a much heavier CPU load. (if we were merely looking at
a pipelining failure, the CPU time would be the same, but wallclock time
would be higher).
Spans (and specifically DataSpans) for computational-complexity problems.
Build some tests of these with N=400ish and see how efficient they are.
They're supposed to be linear wrt number-of-ranges, but either they
aren't, or they're being called in a way which makes it worse
assert_invariants calls, to see if we're hitting that old problem where
the data structure is efficient unless we leave in the self-checks or
debugging messages
the second and later shares
IncompleteHashTree.needed_hashes and see if we're actually requesting the
leaf node that we don't really need.
DataSpans structure. The perhaps-too-clever overlap/merge behavior is
mostly just exercised during
the fetch of the first segment, before we're sure about the correct number
of segments (we fetch some data speculatively to reduce roundtrips; if we
guess wrong, we'll get the wrong data, but
DataSpans lets us easily use that data later if it turns out to be what
we needed for some other purpose). Perhaps a data structure which was
less tuned for merging adjacent ranges would be better, maybe one which
has an explicit merge() method that's only called just before the
requests are sent out. Or maybe the value of holding on to that data
isn't enough to justify the complexity.
purposes, I'd like to be able to label the bits in a Spans with their
purpose: if we send parallel requests for both seg2 and seg3, I'd like
the seg2 data to arrive first, so e.g. the hashes needed to validate seg2
should arrive before the bulk block data for seg3. A label on the bits
like "this is for seg2" would let us order the requests in such a way to
reduce our memory footprint. A label like this might also be useful for
handling the unused-ciphertext-hash-tree-nodes problem, if we could
remove data from a DataSpans that's labelled with an already-complete
segnum.
Finally, the bug zooko mentioned in comment:79568 is real. I'm still working on
it, but basically it prevents us from using shares that arrive after the
initial batch of requests: they are not initialized properly and don't get a
correct block hash tree. I'm working on a fix. The symptom is that we fall back to the initial shares, but if those have died, the download will fail, which is wrong.
And I'm still working on the new share-selection algorithm. The code works,
and my basic unit tests work, but certain ones require the comment:79568 bug to
be fixed before it is safe to use (the bug will hurt current downloads, but
occurs less frequently).
Attachment spans.py.diff (677 bytes) added
Short-term hack to test for asymptotic inefficiency of DataSpans.get_spans
Replying to zooko:
This is the smoking gun. The code of `DataSpans.get_spans` is:
and the `Spans` constructor has the loop:
`Spans.add` does a linear search (plus a sort, if there is no overlap, but Timsort takes linear time for an already-sorted array), so the overall complexity of `DataSpans.get_spans` is Θ(n^2^) where n is the number of spans.
Since `Spans` uses essentially the same invariant as `DataSpans` for its array of spans (they are sorted with no overlaps or adjacency), it is possible to implement `get_spans` in Θ(1) time. However I suspect that the important difference here is between Θ(n^2^) and Θ(n).
The diff's implementation of `get_spans` includes a call to `s._check`. It may also be worth doing another profile run without that call.
(Some of my comments in ticket:798#comment:18 would reduce the number of calls to `overlap` and eliminate calls to `adjacent`, but I don't think that's the critical issue by itself.)
Replying to [davidsarah]comment:44:
Note that, given this problem and Brian's observations in comment:79569, the overall time for a download will be Θ(n^3^). So maybe we do need a better data structure (some sort of balanced tree or heap, maybe) if we want to get to Θ(n log n) rather than Θ(n^2^) for the whole download. But maybe that can wait until after releasing 1.8.
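To illustrate the complexity argument, here is a toy model. It is not the code from spans.py (which is not reproduced in this ticket). `ToySpans` and `from_sorted` are hypothetical names. The constructor rebuilds the structure with one `add` per range, and each `add` scans every existing span, so building from n ranges costs Θ(n^2^). The `from_sorted` path relies on the shared invariant (sorted, no overlaps, no adjacency) and copies the ranges in a single pass.

```python
class ToySpans:
    """Sorted list of non-overlapping, non-adjacent (start, length) ranges."""

    def __init__(self, ranges=()):
        self._spans = []
        for start, length in ranges:   # n calls to add() ...
            self.add(start, length)    # ... each of which scans all spans

    def add(self, start, length):
        end = start + length
        merged = []
        for s, l in self._spans:       # linear scan over every existing span
            if s + l < start or end < s:
                merged.append((s, l))  # disjoint and not adjacent: keep as-is
            else:                      # overlapping or adjacent: absorb it
                start, end = min(start, s), max(end, s + l)
        merged.append((start, end - start))
        merged.sort()
        self._spans = merged

    @classmethod
    def from_sorted(cls, ranges):
        """Trust the caller's invariant and copy the ranges in one pass."""
        self = cls()
        self._spans = list(ranges)
        return self

    def spans(self):
        return list(self._spans)
```

Both construction paths produce the same result when the input already satisfies the invariant; only the cost differs.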
(Actually, just logging the output of `Spans.dump` calls will by itself cause Θ(n^2^) behaviour for the whole download, although with a fairly small constant.)
Replying to warner:
This was my mistake. I must have confused it with a different test run. Those assertions never fire.
run 111: 1.8.0c2, requesting all of the file, with flog running
Oh, and in run 111 (comment:79573) I had added log messages for all events which touched the Share._received Spans object so the resulting flogfile is a trace of everything that affects that object.
The following run has patch attachment:spans.py.diff.
run 112: 1.8.0c2, requesting all of the file, with flog running
The patch helped a lot (compare run 112 to 111), but not enough to make trunk as fast as 1.7.1 on large downloads (compare run 112 to runs 102, 103, 105, and 109).
I intend to write a tool which reads the traces of what was done to the `Share._received` Spans object and does those operations to a Spans object so that we can benchmark it and profile it in isolation.
Attachment run-111-above28-flog.pickle.bz2 (2352820 bytes) added
Attachment run-112-above28-flog.pickle.bz2 (2553365 bytes) added
Attachment debuggery-trace-spans.dpatch.txt (10870 bytes) added
debuggery-trace-spans.dpatch.txt adds logging of all events that touched `Share._received` at loglevel `CURIOUS`. run-111-above28-flog.pickle.bz2 and run-112-above28-flog.pickle.bz2 are the flogs from run 111 and run 112 with only events logged at level `CURIOUS` or above.
BTW, be sure to pay attention to the `DataSpans` too, specifically `Share._received`. That's the one that I observed growing linearly with
number-of-segments-read.
I'm close to finishing my rework of the way Shares are handled. If we can
make new-downloader fast enough by fixing complexity issues in spans.py, we
should stick with that for 1.8.0, because those are probably smaller and less
intrusive changes. If not, here are the properties of my Share-handling
changes:
use a new diversity-seeking Share selection algorithm, as described in
comment:27 . This should distribute the download load evenly among all
known servers when they have equal number of shares, and as evenly as
possible (while still getting k shares) when not. If more shares are
discovered later, the algorithm will recalculate the sharemap and take
advantage of the new shares, and we'll keep looking for new shares as long
as we don't have the diversity that we want (one share per server).
fix the problem in which late shares (not used for the first segment, but
located and added later) were not given the right sized hashtree and threw
errors, causing them to be dropped. I think this completely broke the
"tolerate loss of servers" feature, but the problem might have been caused
by the diversity-seeking algorithm change, rather than something that was
in new-downloader originally.
deliver all shares to the `SegmentFetcher` as soon as we learn about
them, instead of waiting for the fetcher to tell us it's hungry. This
gives the fetcher more information to work with.
I might be able to attach a patch tomorrow.. there are still some bugs in it,
and I haven't finished implementing the last point (push shares on discovery,
not pull on hunger).
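The actual algorithm is the one described in comment:27 and is not reproduced here. As a rough illustration of what "diversity-seeking" means, the following greedy sketch takes at most one share from each server per round, so load is spread as evenly as the share layout allows. `select_shares` and the `sharemap` shape (server id mapped to a list of share numbers) are assumptions made for this example, and a greedy pass like this is not guaranteed to find a valid selection in every layout.

```python
def select_shares(sharemap, k):
    """Pick k distinct share numbers, spreading them across servers.

    Returns a list of (server, shnum) pairs, or None if this greedy pass
    cannot find k distinct shares.
    """
    remaining = {server: sorted(shnums) for server, shnums in sharemap.items()}
    chosen = []
    used_shnums = set()
    while len(chosen) < k:
        progress = False
        # One round: take at most one unused share from every server,
        # visiting the servers with the fewest shares first.
        for server in sorted(remaining, key=lambda s: (len(remaining[s]), s)):
            candidates = [n for n in remaining[server] if n not in used_shnums]
            if not candidates:
                continue
            shnum = candidates[0]
            remaining[server].remove(shnum)
            chosen.append((server, shnum))
            used_shnums.add(shnum)
            progress = True
            if len(chosen) == k:
                break
        if not progress:
            return None  # ran out of distinct shares before reaching k
    return chosen
```

With one server holding many shares and two holding one each, the first round already yields one share per server, which is the "one share per server" diversity goal stated above. Re-running the function when new shares are discovered corresponds to recalculating the sharemap.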
Oh, hey, here's a simple patch to try out:
Since `self._received` is supposed to be empty after each segment is complete (unless we guess the segsize wrong), this patch simply manually empties it at that point. No data is retained from one segment to the next: any mistakes will just cause us to ask for more data next time.
If the problem in this bug is a computational complexity in `DataSpans`, this should bypass it, by making sure we never add more than 3 or 4 ranges to one, since even `O(n^3)` is small when n is only 3 or 4. (We should still fix the problem, but maybe the fix can wait for 1.8.1.) If the problem is in `Spans`, or elsewhere, then this won't help.
Attachment run-112-above28-flog-dump-sh8-on-nsziz.txt (7002679 bytes) added
run-112-above28-flog-dump-sh8-on-nsziz.txt is a flogtool dump of run-112-above28-flog.pickle.bz2 grepped for just one particular share (sh8 on nsziz). It is suitable as the input file for [misc/simulators/bench_spans.py]source:trunk/misc/simulators/bench_spans.py@4700.
The output that I get on my Macbook Pro is:
This is even though I have spans.py.diff applied.
Okay, the patch from comment:79580 seems to have improved performance significantly. I just performed run 114:
Here is the full table:
I'm not sure if v1.8.0c2 is now good enough to be considered "not a significant regression" vs. v1.7.1 for downloading large files. I'll go download a large file with v1.7.1 now on my home network for comparison...
Hm, it seems like v1.7.1 is still substantially faster than v1.8.0c2+comment:79580:
Well the good news is that comment:79580 fixes the problem that downloads go slower the bigger they are (as expected). The bad news is that even with comment:79580 Tahoe-LAFS v1.8.0c2 is substantially slower than v1.7.1 for large files:
I'm going to start another run with v1.8.0c2, this time with the cProfile tool running, and go to sleep.
Attachment run-117-prof-cumtime.dump.txt (110161 bytes) added
I ran 1.8.0c2 under the profiler for a few minutes and then stopped it in order to get the profiling stats (attached). Unfortunately, they do not show any more smoking gun of CPU usage, so the remaining slowdown from v1.7.1 to v1.8.0c2 is likely to be one of the network-scheduling issues that Brian has been thinking about (server selection, pipelining), or else some other sort of subtle timing issue...
Here are the profiling stats for a brief (~4 minute) run of 1.8.0c2:
The functions with the most "cumtime" (time spent in the function or in any of the functions that it called) are:
I'll go ahead and leave a download running under the profiler overnight just in case something turns up.
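For anyone who wants to reproduce this kind of measurement, here is a small sketch of collecting cumtime-sorted stats with the standard-library `cProfile` and `pstats` modules. `profile_by_cumtime` and the `busy` workload are hypothetical stand-ins; a real run would drive the downloader instead.

```python
import cProfile
import io
import pstats


def profile_by_cumtime(fn, *args, top=10):
    """Run fn(*args) under cProfile and return a report sorted by cumtime."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


def busy(n):
    # Stand-in workload for illustration only.
    return sum(i * i for i in range(n))
```

Calling `print(profile_by_cumtime(busy, 100000))` prints the top entries ordered by cumulative time, which is the same ordering used in the attached dump.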
Attachment run-115-flog.pickle.bz2 (496367 bytes) added
Attachment run-116-flog.pickle.bz2 (3149925 bytes) added
If you wanted to investigate why 1.8.0c2 is so much slower than 1.7.1 at downloading a large file even after applying the comment:79580 patch, then you could use run-115-flog.pickle.bz2 and run-116-flog.pickle.bz2 as evidence. Hm, hey waitasecond, in my earlier testing (recorded in this ticket), 1.8.0c2 was faster than 1.7.1 for small files (<= 10 MB). This was also the case for Nathan Eisenberg's benchmarks (posted to tahoe-dev). But currently it looks to me like the average download speed (as reported by curl during its operation) is the same at the beginning of the download as at the end, i.e. even during the first 10 MB or so 1.8.0c2 is only getting about 150 KBps where 1.7.1 is getting more than 200 KBps. Did something change?
I guess I (or someone) should run 1.7.1 vs. 1.8.0c2+comment:79580 on 10 MB files. But I'm way too tired to start that again right now.
Man, I'm really worn out from staying up night after night poking at this and then having to get up early the next morning to help my children get ready for school and myself ready for work. I could use more help!
Perhaps the remaining issue is server selection. Let's try Brian's comment:79552 diversity-seeking algorithm, combined with the comment:79580 fix.
Replying to davidsarah:
I'm willing to try the comment:79552 diversity-seeking algorithm, but I also would like to verify whether or not server-selection is one of the factors by inspecting the flogs...
Yes, the overnight run yielded no smoking gun (smoking CPU?) that I can see. I'll attach the full profiling results as an attachment.
Attachment run-118-prof-cumtime.dump.txt (116153 bytes) added
Attachment 1170-combo.diff (56434 bytes) added
patch to prefer share diversity, forget leftover data after each segment, and fix handling of numsegs
the "1170-combo.diff" patch combines the approaches as suggested in comment:79588 . Please give it a try and see if it helps. I'll try to look at the flogs to see what servers were used, to see if that run has a diversity issue or not.
Okay, I investigated server selection on the bus to work this morning. run-115-flog.pickle.bz2 shows:
The 1.7.1 flog doesn't show which servers are actually being used for Request Blocks, but we know that 1.7.1 will always choose to get all three shares from nszizgf5 in a case like this.
Therefore I don't think that 1.8's share-selection can be part of the explanation for why 1.8 is slower than 1.7.
(This doesn't mean that improved share selection wouldn't make 1.9 faster than 1.8 is now.)
Replying to zooko:
There's a sizeable startup time in 1.7.1 (lots of roundtrips), which
went away in 1.8.0c2. I think we're all in agreement about the
small-file speedups that provides (i.e. we've not seen any evidence to
the contrary). The change is on the order of a few seconds, though, so I
think a 10MB file (or portion of a file) that takes 10MB/150kBps ≈ 67s to
complete won't be affected very much. I don't think you'll be able to
see its effects in the curl output.
Nathan's tests were on hundreds or thousands of small files.
From my tests, the new-downloader sees about 500ms more taken to
complete the first segment than the second and following ones. I believe
that's the time spent doing server selection, UEB fetches, and the large
hash chain fetches.
Attachment runs-119,120,121-curl-stdout.txt (2152 bytes) added
I ran three more measurements today at the office -- runs 119, 120, and 121 . These are the curl stdout from those. I will update a table with these results and put it into the original opening comment of this ticket.
Attachment runs-122,123,124,125,126,127-curl-stdout.4.txt (5676 bytes) added
I ran several more measurements from home, intended to test whether the logging in new-downloader is partially responsible for new-downloader's slowness. These are the curl stdout from those runs. I will update the table in the opening comment of this ticket to include these runs.
Attachment run123-down-status.html.bz2 (1537800 bytes) added
status page results for run 123
Attachment run127-down-status.html.bz2 (105278 bytes) added
Brian: I updated the table in the initial comment. Please let me know what other sorts of measurements you would like from me. It looks to me like there is still a significant regression in 1.8.0c2+comment:79580+spans.py.diff even if I comment-out almost all calls to log.msg() in `immutable/download/*.py`. I will attach the patch that I used to comment out all those logging calls. I'll probably go ahead and apply your 1170-combo.diff and run 100 MB downloads from the office during work today.
Attachment comment-out-logging-in-immutable-download.dpatch.txt (36282 bytes) added
I did a quick test at home with a `def msg(*args,**kwargs):pass` in `src/allmydata/util/log.py`, and didn't see a noticeable change (the noise level was pretty high, so even if there were a 10% difference, I probably wouldn't have been able to spot it).
In some other testing at work, I was unable to see a consistent performance difference between 1.7.1 and my comment:79591 combo-patch, but the speed was warbling all over the place, so I don't feel that it was a very conclusive run. I'd patched both to only use a single server (nszi?), to reduce the variables.
What I'd like to do is to run a series of tests from my home network (no other traffic) using my personal backupgrid server (no other traffic), to see how consistent the results are. Maybe tomorrow I'll get a chance to try that.
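The no-op `msg` experiment can be sketched in isolation like this. The `Log` class is a hypothetical stand-in for a logging module with a `msg()` entry point; it is not the real `allmydata.util.log`, and the timings it produces say nothing about the real downloader.

```python
import time


class Log:
    """Stand-in for a logging module with a msg() entry point."""

    def __init__(self):
        self.lines = []

    def msg(self, *args, **kwargs):
        self.lines.append((args, kwargs))


def noop_msg(*args, **kwargs):
    # Same shape as the stub described above: accept anything, do nothing.
    pass


def timed_run(log, n_events):
    """Emit n_events log calls and return the elapsed wallclock seconds."""
    start = time.perf_counter()
    for i in range(n_events):
        log.msg("event", index=i)
    return time.perf_counter() - start
```

Timing `timed_run(log, N)` once with the real `msg` and once after `log.msg = noop_msg` gives an upper bound on what logging alone costs. As noted above, the result is only meaningful if the noise level of the surrounding measurement is lower than the difference being looked for.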
Attachment perf-measure-office.txt (7307 bytes) added
I ran several more measurements from the office, intended to test whether Brian's 1170-combo.diff made 1.8.0c2 competitive with 1.7.1. Sadly it appears not. :-( I'll update the table in the initial comment with these results.
Attachment runs-129,130,131,132,133,134,135,136,137,138,139-curl-stdout.txt (7307 bytes) added
I ran several more measurements from the office, intended to test whether Brian's 1170-combo.diff made 1.8.0c2 competitive with 1.7.1. Sadly it appears not. :-( I'll update the table in the initial comment with these results. This attachment is a better-named copy of perf-measure-office.txt .
Attachment run139-down-status.html.bz2 (910388 bytes) added
Brian: please inspect the table in the ticket initial comment. It seems like there is a bimodal distribution with 1170-combo.diff: half of the time it runs at about 179 or 180 KBps (it ran at 169 KBps for the large download) and the other half of the time it runs at 262–291 KBps. The latter range is slightly faster than 1.7.1! I attached the down-status.html for the long download that ran at 169 KBps: run139-down-status.html.bz2.
Attachment runs-140,141-curl-stdout.txt (1283 bytes) added
and run 142
Attachment run-zooko1000-status.html (1235376 bytes) added
Attachment run-zooko1000-curl-stdout.txt (313 bytes) added
run-zooko1000 was from my local coffeeshop (Caffe Sole in South Boulder), and the status.html shows this interesting pattern: the downloader immediately issued 10 DYHB queries (as expected), and then it took 9.6 seconds for the first DYHB response to arrive. Then, the really weird part: it took 8.4s more for the next seven DYHB responses to arrive (totalling 18s from request to response)! Then, still weird, it took a total request-to-response time of 6 minutes for the ninth response and a total of 8 minutes for the tenth. Also, as soon as the first response arrived the downloader issued a new DYHB request, and that one, the eleventh one, took 8.92s for the response to arrive.
So, I suppose there is something very messed up about the network at my local coffeeshop. Perhaps it blocks a flow that starts on an idle TCP connection while it is trying to figure out how to insert ads into any HTTP responses, or something. Note that these TCP connections were all already established long before the download began.
Take-aways?
I guess it is that we should not make assumptions about what is "reasonable" for IP traffic, at least not if we want to support people who use Tahoe-LAFS from coffeeshops, over tethered cell phones, at Burning Man, on satellite uplinks, on the International Space Station, etc. (which I do).
Another take-away is that 1.8.0c2+combo.diff did pretty well in this situation! (I think 1.7.1 probably would have done well too but I didn't get a chance to try it.)
Attachment 141-status.html.bz2 (109438 bytes) added
status page for run 141
Attachment 142-status.html.bz2 (110361 bytes) added
status page for run 142
Attachment runs-143to162-alternating-stdout.txt (25821 bytes) added
capture from stdout - alternating between 1.7.1 and trunk+combo - run from cable modem at home on pubgrid
Attachment status-143.html.bz2 (109804 bytes) added
Attachment status-145.html.bz2 (110838 bytes) added
Attachment status-147.html.bz2 (109473 bytes) added
Attachment status-149.html.bz2 (109156 bytes) added
Attachment status-151.html.bz2 (111256 bytes) added
Attachment status-153.html.bz2 (109625 bytes) added
Attachment status-155.html.bz2 (110178 bytes) added
Attachment status-157.html.bz2 (111354 bytes) added
Attachment status-159.html.bz2 (109261 bytes) added
Attachment status-161.html.bz2 (109509 bytes) added
runs 143-162 generated with the following bash script to alternate the clients and grab the status file:
feel free to adapt and reuse.
I looked at the status.html files for some of the new-downloader runs. It looks like there's a reasonable correlation between download speed and server selection. The 240kBps-ish downloads tend to use sp26/nszi/4rk5, while the 130-140ish downloads tend to use fp3x or sroo instead of 4rk5.
Without more info from the 1.7.1 downloads (data which would be in the download-status, but for the old-downloader it isn't displayed until after the whole download is complete), we can't guess what servers were used for those runs. Zooko, how consistent do you think the speed-difference results would be if you used a 100MB file, instead of using the first 100MB of a multi-GB file? That might let us use Terrell's script and also collect download-status from the 1.7.1 runs.
It'd be awfully convenient if the speed difference that Zooko observed could be attributable to server selection, and if the combo patch made that selection work well enough to ship 1.8.0. A 1.8.1-era improvement could be to try out new servers over the course of the download, so that we'd land in the three-good-servers (sp26/nszi/4rk5) mode more often than the two-good-one-slow-servers (sp26/nszi/fp3x) mode.
Attachment 171-log.diff (3785 bytes) added
patch to add server-selection data to logs/twistd.log for 1.7.1
Attachment run-zooko1001-curl-stdout.txt (677 bytes) added
Attachment run-zooko1001-flog.pickle.bz2 (181113 bytes) added
Attachment run-zooko1002-curl-stdout.txt (704 bytes) added
Attachment run-zooko1002-flog.pickle.bz2 (1023504 bytes) added
Attachment run-zooko1002-status.html (1234385 bytes) added
Attachment Screen shot 2010-08-23 at 01.07.41-0600.png (363648 bytes) added
I added run1001 and run1002 to the big table. These two runs are notable for having complete packet traces and a screenshot of their wireshark summaries, as well as flogs and (for the 1.8.0c2 one) status.html. It looks to me as if 1.8.0c2+1170-combo.diff was slower than 1.7.1 for those runs because it chose slower servers.
Attachment runs-zooko2000-2020-curl-stdout.txt (5640 bytes) added
Attachment runs-zooko2000-2020-twistd.logs.tar.bz2 (124340 bytes) added
Attachment status-2001.html.bz2 (107465 bytes) added
Attachment status-2003.html.bz2 (107421 bytes) added
Attachment status-2005.html.bz2 (108182 bytes) added
Attachment status-2007.html.bz2 (108194 bytes) added
Attachment status-2009.html.bz2 (107704 bytes) added
Attachment status-2011.html.bz2 (108080 bytes) added
Attachment status-2013.html.bz2 (107948 bytes) added
Attachment status-2015.html.bz2 (107741 bytes) added
Attachment status-2017.html.bz2 (107820 bytes) added
Attachment status-2019.html.bz2 (107209 bytes) added
Added runs zooko2000 through zooko2019. Thanks a lot to Terrell for the script in comment:79599 which I used to do these runs!
Comments: it looks like there really is a substantial slowdown for switching from v1.7.1 to v1.8.0+1170-combo.diff for this file on this grid. I started examining the status.html files in order to annotate which servers were used by 1.8.0c2+combo.diff, but I got tired and stopped doing it after run 2007. I think Brian's current hypothesis is that server selection is the most important factor, and that seems quite plausible to me. v1.7.1 used the same set of servers in every one of its runs, and its performance was more consistent than 1.8.0c2+combo.diff's was.
It has taken a lot of effort to generate this data and to attach it and format it, so I hope it helps! Thanks again to Terrell.
Now I'm starting a new experiment, downloading
http://localhost:3456/file/URI%3ACHK%3Avpk5d6pl5qelhnwfwtjj2v7tmq%3Adkt453pu5le7qmtix55hiibrzqqq3euchcjguio6vbetobxw5ola%3A3%3A10%3A334400401/@@named=/Negativeland_on_Radio1190.org.ogg in its 300 MB entirety. This file currently has the following share layout:
Finished annotating the big table with what servers were used for each download.
Attachment run-zooko3000-curl-stdout.txt (913 bytes) added
Attachment run-zooko3001-curl-stdout.txt (917 bytes) added
Attachment run-zooko3002-curl-stdout.txt (859 bytes) added
Attachment status-3001.html.bz2 (362267 bytes) added
Attachment runs-zooko3000,3002.twistd.log (1000025 bytes) added
Added runs zooko3000, 3001, and 3002 to the table. These are, as mentioned, downloads of a 333 MB negativland.ogg file of which there are only 4 surviving shares, 2 shares each on fp3x and tavr. Run 3001 with v1.8.0c2+combo.diff went about half as fast as run 3000 with v1.7.1 even though they chose the same servers (for the long haul -- v1.8.0c2 uses different servers for the first segment or two I think). Then run 3002 started, with v1.7.1, and it went less than half as fast as run 3001 had! I had to stop it before it completed so I could go to work. I suspect that my DSL service was misbehaving at that time, but I haven't tried to confirm that, e.g. by examining the attached logs to see if there is some other explanation for why run 3002 went so slowly.
Attachment 180c2-viz-dyhb.png (75560 bytes) added
timeline of 1.8.0c2 (no patches) download, local testgrid (one computer), shows share-selection misbehavior
Attachment 180c2-viz-delays.png (66549 bytes) added
I made a lot of progress with my javascript-based download-status
visualization tools last night, after switching to the
Protovis library (which rocks!). Here are
two diagrams of a 12MB download performed on my laptop (using a local
testgrid entirely contained on one computer: lots of bandwidth, but only one
CPU to share among them all, and only one disk). The downloader code is from
current trunk, which means 1.8.0c2 (it was not using any of the patches
from this ticket, so it exhibits all the misbehaviors of 1.8.0c2).
I'm still working on the graphics. Time proceeds from left to right. The live
display is pan+zoomable. Currently DYHB and block-reads are filled with a
color that indicates which server they used, and block-reads get an outline
color that indicates which share number was being accessed. Overlapping
block-reads are shown stacked up. Most block reads are tiny (32 or 64 bytes)
but of course each segment requires 'k' large reads (each of about 41kB,
segsize/k).
180c2-viz-dyhb.png : this shows the startup phase. Note how all
block reads are coming from a single server (w5gi, in purple), even though
we heard from other servers by the time the second segment started. Also
note that, for some reason, the requests made for the first segment were
all serialized by shnum: we waited for all requests from the first share
to return before sending any requests for the second share.
180c2-viz-delays.png : this shows the midpoint of the download
(specifically the segments that cross the middle of the Merkle tree,
requiring the most hash nodes to retrieve). By this point, I'd added a
thicker outline around block reads that fetched more than 1kB of data, so
the actual data blocks can be distinguished from the hash tree nodes. The
reads are properly pipelined. But note the large gap (about 7.5ms) between
the receipt of the last block and the delivery of the segment. Also note
how the segments that require fewer hash nodes are delivered much faster.
I haven't yet ported these tools to the combo-patch-fixed downloader, nor
have I applied them to a download from the testgrid (which would behave very
differently: longer latencies, but less contention for disk or CPU). I'm
partially inclined to disregard the idiosyncrasies displayed by these charts
until I do that, but they still represent interesting problems to understand
further.
The large delay on the lots-of-hash-nodes segments raises suspicions of bad
performance in `IncompleteHashTree` when you add nodes, or about the
behavior of `DataSpans` when you add/remove data in it. The `DataSpans.add`
time occurs immediately after the response comes back, so
is clearly minimal (it lives in the space between one response and the next,
along the steep downwards slope), but the `DataSpans.pop` occurs during
the mysterious gap. The Foolscap receive-processing time occurs inside the
request block rectangle. The Foolscap transmit-serialization time occurs
during the previous mysterious gap, so it must be fairly small (after the
previous segment was delivered, we sent a bazillion hash requests, and the
gap was small, whereas after the big segment was delivered, we didn't send
any hash requests, and the gap was big).
The next set of information that will be useful to add here will be a
generalized event list: in particular I want to see the start/finish times of
all hashtree-manipulation calls, zfec-decode calls, and AES decrypt calls.
That should take about 15 minutes to add, and should illuminate some of that
gap.
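A generalized event list of this kind could be as simple as the following sketch. `EventLog` and `record` are hypothetical names chosen for illustration and are not the ones used in the visualization patch. The injectable `clock` argument is an assumption that makes the recorder easy to test.

```python
import time
from contextlib import contextmanager


class EventLog:
    """Collects (name, start, finish) tuples for later display on a timeline."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.events = []

    @contextmanager
    def record(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the event even if the wrapped call raises.
            self.events.append((name, start, self._clock()))
```

Wrapping a call as `with events.record("aes_decrypt"): ...` adds one row to the list, and the gaps between consecutive rows are then the unexplained time.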
in case anyone wants to play with it, viz-with-combo.diff.bz2 contains both the "combo patch" and my current Protovis-based visualization tool. From the download-status page, follow the "Timeline" link. Still kinda rough, but hopefully useful.
(wow, for reference, don't upload a 900kB diff file and then let Trac try to colorize it. Compress the diff first so that Trac doesn't get clever and time out.)
Attachment viz-with-combo.diff.bz2 (182990 bytes) added
patch with visualization tools and share-selection fix and Spans performance mitigation fix
I did some more testing with those visualization tools (adding some misc
events like entry/exit of internal functions). I've found one place where the
downloader makes excessive eventual-send calls which appears to cost 250us
per `remote_read` call. I've also measured hash-tree operations as
consuming a surprising amount of overhead.
`Share._got_response` call queues an eventual-send to `Share.loop`, which checks the satisfy/desire processes. Since a single
I'd like to change the `_got_response` code to set a flag and queue a
single call to `loop` instead of queueing multiple calls. That would save
a little time (and probably remove the severe jitter that I've seen on local
downloads), but I don't think it can explain the 50% slowdown that Zooko's
observed.
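The set-a-flag-and-queue-once idea looks roughly like this. `TinyReactor` is a hypothetical stand-in for an eventual-send queue and is not Foolscap's API. `CoalescingShare` is likewise only a model of the proposed change, not the real `Share` class.

```python
from collections import deque


class TinyReactor:
    """Minimal stand-in for an eventual-send queue."""

    def __init__(self):
        self._queue = deque()

    def eventually(self, fn):
        self._queue.append(fn)

    def flush(self):
        while self._queue:
            self._queue.popleft()()


class CoalescingShare:
    def __init__(self, reactor):
        self._reactor = reactor
        self._loop_scheduled = False
        self.responses = []
        self.loop_runs = 0

    def got_response(self, data):
        self.responses.append(data)
        if not self._loop_scheduled:      # only the first response queues loop()
            self._loop_scheduled = True
            self._reactor.eventually(self.loop)

    def loop(self):
        self._loop_scheduled = False      # later responses may schedule again
        self.loop_runs += 1
```

Five responses arriving in the same turn result in one `loop` call instead of five. A response that arrives after `loop` has run schedules a fresh call, so no response is left unprocessed.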
These visualization tools are a lot of fun. One direction to explore is to
record some packet timings (with tcpdump) and add it as an extra row: that
would show us how much latency/load Foolscap is spending before it delivers a
message response to the application.
I'll attach two samples of the viz output as viz-3.png and
viz-4.png. The two captures are of different parts of the
download, but in both cases the horizontal ticks are 500us apart. The
candlestick-diagram-like shapes are the satisfy/desire sections of
`Share.loop`, and the lines (actually very narrow boxes) between them are
the "disappointment" calculation at the end of `Share.loop`, so the gap
before it must be the `send_requests` routine.
Attachment viz-3.png (21918 bytes) added
timeline sample showing satisfy/desire calls and process_block/FEC/hashtree operations
Attachment viz-4.png (14966 bytes) added
another timeline, showing AES in the inter-segment gap
created #1186 to track the redundant `Share.loop` calls
Attachment status-4001.html.bz2 (363128 bytes) added
Attachment status-4003.html.bz2 (362878 bytes) added
Attachment status-4005.html.bz2 (363470 bytes) added
Attachment runs-zooko4000-4007-curl-stdout.txt (2296 bytes) added
Attachment runs-zooko4000-4007-twistd.log.tar.bz2 (289417 bytes) added
Attachment runs-zooko4000-4007-serverselection-twistd.log (1825 bytes) added
Added a new batch of runs -- runs zooko 4000 through 4006. There is a very clear pattern here! There are only two server-selections represented: 1fp3x,2tavr and 2fp3x,1tavr. 1.8.0c2+combo.diff always chose the latter. v1.7.1 always chose the former except for one time when it chose the latter. Whenever you choose the latter you go at ~90 Kbps, whenever you choose the former you go at ~190 Kbps.
fixed up the alternating line colorization - adding 3100-3109 in a minute...
Attachment status-3100.html.bz2 (10847 bytes) added
Attachment status-3104.html.bz2 (11066 bytes) added
Attachment status-3105.html.bz2 (375968 bytes) added
Attachment status-3106.html.bz2 (11065 bytes) added
Attachment status-3107.html.bz2 (376983 bytes) added
Attachment status-3108.html.bz2 (10982 bytes) added
run 116 used 3*nszi
Added rows for 3100-3109... will attach the curl output when I get back to that terminal window. All runs were ~90Kbps, and they all selected the same shares as Zooko's runs 4000-4006.
These were run with Brian's patch for 1.8.0c2+combo+viz vs. 1.7.1.
Attachment status-3109.html.bz2 (376323 bytes) added
Attachment runs-terrell3100-3109-curl-stdout.txt (13623 bytes) added
I tested a well-distributed 25 MB file (http://pubgrid.tahoe-lafs.org/uri/URI:CHK:knvcmfkmzejsg2pfueygjpkygq:3qjcqnjzsccmwk5f4rtsbusln66mgel6esiclahz7hbcsqgqf3ga:3:10:24879985?t=info) and 1.8.0c2+1170-combo.diff was much better than v1.7.1 every time. I don't have the energy to edit all of this into the table, upload all the data files, hyperlink to them, etc., so here is a big ugly dump of the information. Sorry. Goodnight!
#1187 describes an approach that could mitigate the effect of choosing some slow servers. (I think that is complementary to trying to make better server choices.)
Brian and David-Sarah and I have discussed this off and on over the last few days, mostly on IRC, and we agree that the way forward to 1.8.0 final is to review and commit to trunk attachment:viz-with-combo.diff.bz2, then make a 1.8.0c3 release and invite everyone to test out the new, even better downloader and the new visualizations.
Attachment 1170-p1.diff (2172 bytes) added
for review: drop received data after each block finishes, to avoid spans.py complexity bug
Attachment 1170-p2.diff (67595 bytes) added
for review: use diversity-seeking share-selection algorithm, improve logging
Attachment 1170-p3.diff (303958 bytes) added
for review: add Protovis-based download-status timeline visualization page
ok, those three patches are ready for review, and are meant to be applied to current trunk in that order.
If you want me to land these, please let me know by say thursday, since I'm travelling this weekend and will have only limited network access next week.
I would rather apply them to trunk myself after reviewing them.
By the way, if you are already on your travels by the time I review these patches then if there is any very small, obvious fix that is needed I might do it myself rather than wait for you to get back on-line. Hopefully none will be needed and I can apply these as-is for 1.8.0c3. One thing that I'm suspecting I'm going to want changed is the visualization -- last time I looked it lacked labels indicating the meanings of the axes, the units, and the meanings of the objects, and I really hate graphs without complete labels. I think maybe it is something that my high school chemistry teacher crammed into my head? Never never never report data without units and labels.
Doesn't 1170-p3.diff mean we need to update source:docs/frontends/download-status.txt?
Ugh, 1170-p3.diff adds in jquery.js, 120 KB and more than 4000 lines of code, and protovis-r3.2.js, 116 KB and a minified (therefore obscured) version of more than 15,000 lines of code. This means we are storing 3rd party source code in our revision control history, and in the case of protovis-r3.2.js it isn't even the real source code, but a minified (compressed) version of it. If only there were a principled, manageable way to declare our dependencies on those JavaScript codebases! :-(
But I'm not aware of one that we can use. At least let us not store computer-produced stuff -- the minified version of protovis -- but instead store the original "preferred form for making modifications" and minify it as a part of the build process or as part of the start up of the web gateway.
Does anyone know of a better way to manage our dependencies on JavaScript code?
yeah, that's tricky. I'm not sure what to suggest. For reference, the non-minified protovis code (protovis-d3.2.js) is 510KB (versus the minified protovis-r3.2.js at 117KB). The minified jquery.js would be 57KB, versus the non-minified at 121KB.
Attachment 1170-p123.darcspatch (376523 bytes) added
patchbundle with all three patches, ready to land upon passing review
I reviewed the first one and applied it as changeset:c89a464510394089. Thanks! Now trunk no longer has superlinear CPU usage when downloading large files!
I reviewed the second one and applied it as changeset:00e9e4e6760021a1. Whoo-hoo! Now trunk has all of Brian's New Downloader patches which Terrell and I benchmarked as being way better than the old 1.7.1 downloader!
Everyone should test and benchmark the heck out of trunk! This might become 1.8.0c3, or else we might figure out how to package up the third of Brian's three patches, the one that gives the beautiful JavaScript download visualization. However, I'm not sure if I want to put that into 1.8.0c3, mostly because of JavaScript packaging issues. Either way, you should test the heck out of the current trunk. :-)
See Kyle's latest benchmarking reports:
short version: unless Kyle's measurements are wrong, there is still a huge performance regression which blocks 1.8.0 release. Waaah! :-(
I ran a large download under cProfile and the results clearly show that there is no CPU hotspot. Filtering out all the rows that had less than 10 seconds of total CPU time during the 6 hours that I left it running (about 2.5 of which it was doing a download), I get:
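For anyone wanting to repeat this kind of filtering, here is a minimal sketch using the standard-library cProfile and pstats modules. It is not the script used for the run above: the helper name rows_over_threshold, the busy_work stand-in workload, and the choice of cumulative time (rather than per-function "tottime") as the meaning of "total CPU time" are all assumptions made for illustration.

```python
import cProfile
import pstats


def rows_over_threshold(stats, min_seconds):
    """Return (cumulative_seconds, label) pairs at or above min_seconds.

    Hypothetical helper: `stats` is a pstats.Stats object. Its `stats`
    dict maps (filename, lineno, funcname) to
    (primitive_calls, total_calls, tottime, cumtime, callers).
    """
    rows = []
    for (filename, lineno, funcname), entry in stats.stats.items():
        cumtime = entry[3]
        if cumtime >= min_seconds:
            rows.append((cumtime, "%s:%d(%s)" % (filename, lineno, funcname)))
    rows.sort(reverse=True)  # most expensive rows first
    return rows


def busy_work(n):
    # Stand-in workload; a real run would profile the download instead.
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
busy_work(200000)
profiler.disable()

stats = pstats.Stats(profiler)
# The threshold is 0.0 only so this toy run prints something; the run
# described in the ticket used a 10-second cutoff.
for seconds, label in rows_over_threshold(stats, 0.0)[:5]:
    print("%8.3f  %s" % (seconds, label))
```

With a long-running process the same helper could be applied to a stats file written by the profiler, passing the 10-second cutoff as min_seconds.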
I am now going to attach the status.html files from the download that was run under the profiler and whose profiling results were posted in comment:79625. Why this single download (with Firefox 4 beta) of a single file spawned 8 download status pages I don't know.
Attachment down.html.tar.bz2 (1306532 bytes) added
status.html of downloads described in comment:-1
Recap of this ticket: there were two major performance regressions in 1.8.0c1 or c2 vs. 1.7.1. One was the superlinear computation in spans and the other was a server-selection algorithm that would in some cases choose to get multiple shares from one server unnecessarily. Both of those are fixed in 1.8.0c4 (upcoming) and benchmarks by various people indicate that 1.8.0c4 immutable download is only a little slower (~10% slower) than 1.7.1 in the worst case and much faster (e.g. ~400% faster) in other common cases.
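To make the second regression concrete, here is a minimal sketch of a diversity-seeking selection, assuming we already know which servers hold which shares. It is not the algorithm from 1170-p2.diff. The function name pick_diverse_shares, the dict-based input, and the greedy tie-breaking rule are all hypothetical, and a greedy pass like this is not guaranteed to find the most diverse assignment in every case.

```python
def pick_diverse_shares(share_locations, k):
    """Choose k distinct shares, spreading them over as many servers as possible.

    share_locations: dict mapping share number -> list of server ids
    holding that share (hypothetical input format).
    Returns a dict mapping share number -> chosen server id.
    """
    chosen = {}
    load = {}  # server id -> number of shares already assigned to it
    while len(chosen) < k:
        # Every (share, server) pair still available, ranked by how busy
        # the server already is, then by how few holders the share has.
        candidates = [
            (load.get(server, 0), len(servers), shnum, server)
            for shnum, servers in share_locations.items()
            if shnum not in chosen
            for server in servers
        ]
        if not candidates:
            raise ValueError("fewer than k distinct shares are available")
        _load, _holders, shnum, server = min(candidates)
        chosen[shnum] = server
        load[server] = load.get(server, 0) + 1
    return chosen


# Server "A" holds every share, but two shares are also held elsewhere.
locations = {0: ["A"], 1: ["A", "B"], 2: ["A", "C"], 3: ["A"]}
selection = pick_diverse_shares(locations, 3)
print(selection)  # {0: 'A', 1: 'B', 2: 'C'}
```

A selection that ignored server load could take all three shares from "A" in this example, which is the "multiple shares from one server unnecessarily" case described above.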
So, this ticket is done well enough for v1.8.0 final. I strongly suspect that the remaining ~10% slowdown has to do with more computation after receiving a block and before sending the next get_block request, which would probably be best addressed by implementing #1110 or #1187.
Thanks very much to Brian, Terrell, David-Sarah, Kyle, and anyone else who helped slay this damned tenacious issue. :-) Hooray! It is dead! Good-bye ticket #1170!
P.S. The gentle reader, before looking away from this ticket forever and ever, might want to look at the following comments and perhaps transcribe some of their important bits out to a fresh new ticket: comment:79569, comment:79605, comment:79607. And by "the gentle reader", I guess I mean Brian.
The part of this ticket that was about integrating, deploying, and supporting the new visualizer has been moved to #1200 (package up Brian's New Visualization of immutable download).
See also #1182 (clean up and improve asymptotic complexity of Spans and DataSpans).
#1268 has been opened to cover the "coalesce Share.loop() calls" fix mentioned in comment:79607. I think that's all the action items left over from this ticket.