gather information about historical server performance #905
While patiently uploading some relatively small (3MB) files to the testgrid,
I found myself wishing for more specific information about how long each
server was taking to respond. Some part of me wanted to blame specific
servers for being slow, but I realize that my upstream bandwidth is limited,
and I'd like to compare the upload time of each segment against the minimum
possible upload time given my available bandwidth. So I want more information
about server performance.
The download status web page (via "recent uploads and downloads") reports
per-server per-segment fetch response times as a big list of numbers. It's
possible to eyeball this and look for trends, for example in the following
download:

hk475awa: 1.56s, 876ms, 714ms, 705ms, 545ms, 545ms, 539ms, 579ms, 392ms, 386ms, 377ms, 371ms, 372ms, 387ms, 555ms, 388ms, 373ms, 373ms, 372ms, 372ms, 370ms, 373ms, 370ms, 370ms
lwkv6cji: 160ms, 103ms, 105ms, 106ms, 100ms, 119ms, 117ms, 105ms, 106ms, 109ms, 102ms, 107ms, 103ms, 1.02s, 91ms, 93ms, 90ms, 93ms, 90ms, 92ms, 89ms, 93ms, 89ms, 80ms
2gn6njsm: 763ms, 536ms, 520ms, 537ms, 537ms, 548ms, 537ms, 534ms, 527ms, 535ms, 528ms, 525ms, 628ms, 532ms, 541ms, 528ms, 530ms, 528ms, 534ms, 527ms, 528ms, 527ms, 529ms, 448ms

You can tell that lwkv6cji was pretty quick, and that 2gn6njsm was 5-6x slower.
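As a rough sketch of the kind of per-server summary that would be easier to read than the raw lists, something like the following (the server names and the first few timings are taken from the example above; nothing here is existing Tahoe code):

```python
# Hypothetical sketch: summarize per-server fetch response times (in seconds).
# Only the first few samples from the example above are included.
from statistics import mean, median

fetch_times = {
    "hk475awa": [1.56, 0.876, 0.714, 0.705, 0.545, 0.545, 0.539, 0.579],
    "lwkv6cji": [0.160, 0.103, 0.105, 0.106, 0.100, 0.119, 0.117, 0.105],
    "2gn6njsm": [0.763, 0.536, 0.520, 0.537, 0.537, 0.548, 0.537, 0.534],
}

for serverid, times in fetch_times.items():
    print("%s: min=%.3fs median=%.3fs mean=%.3fs n=%d"
          % (serverid, min(times), median(times), mean(times), len(times)))
```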
I'd like three improvements on this:

* trends: maybe segnum should be the X axis, response time the Y axis,
serverid the color, and each sample drawn as a dot (a plotting sketch follows
this list). Servers which gave consistent service would show up as horizontal
stripes. Common-mode delays would appear as spikes.
* historical data, which would involve storing some history about each
server, maybe in a sqlite database or something.
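A minimal sketch of the plot described in the first item, assuming we already have the per-segment times in hand (the data structure and the use of matplotlib are assumptions for illustration, not part of the current status page):

```python
# Hypothetical sketch of the proposed plot: segnum on the X axis, response
# time on the Y axis, one color per serverid, one dot per sample.
# `samples` maps serverid -> list of per-segment response times in seconds.
import matplotlib.pyplot as plt

def plot_fetch_times(samples):
    for serverid, times in samples.items():
        segnums = range(len(times))
        plt.scatter(segnums, times, label=serverid, s=10)
    plt.xlabel("segment number")
    plt.ylabel("response time (s)")
    plt.legend()
    plt.show()
```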
I've also been thinking about this in the context of a new Downloader (#287),
in which I want to sort potential servers according to their likely download
speeds (favoring fast servers, but experimenting with slower ones to spread
the load and learn about alternatives). Mainly I want the Downloader to be
able to tell the difference between a server that's running at normal speed,
one that's running unusually slowly, and one that's disconnected, and I think
historical data will help here.
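As a rough sketch of the kind of ranking I have in mind (the helper name, the shape of the history data, and the exploration knob are all made up for illustration; this is not existing Downloader code):

```python
# Hypothetical sketch: order candidate servers by historical mean response
# time, fastest first, but occasionally promote another server so we keep
# learning about alternatives and spread the load.
import random

def rank_servers(serverids, history, explore_prob=0.1):
    """`history` maps serverid -> mean historical response time in seconds.
    Servers with no recorded history sort last here; the exploration step
    below may still move one of them to the front."""
    ranked = sorted(serverids, key=lambda s: history.get(s, float("inf")))
    if len(ranked) > 1 and random.random() < explore_prob:
        probe = random.choice(ranked[1:])
        ranked.remove(probe)
        ranked.insert(0, probe)
    return ranked
```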
Also, with historical data, we might be able to deduce each server's local
upload/download speed limits by looking for a minimum response time for
given-sized messages. We might use this to build up a list of download
servers with an aggregate bandwidth that matches our own download bandwidth
(exactly filling all the pipes), or to influence share placement during
upload.
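For illustration, a minimal sketch of that kind of estimate, assuming we record (message size, response time) pairs per server (none of these names exist in the codebase):

```python
# Hypothetical sketch: estimate a per-server bandwidth ceiling by assuming
# the fastest observed response for a given message size was limited mostly
# by the wire, i.e. bandwidth ~= bytes_transferred / min(response_time).
# `observations` is a list of (response_bytes, response_seconds) pairs for
# one server.

def estimate_bandwidth(observations):
    rates = [nbytes / seconds for (nbytes, seconds) in observations if seconds > 0]
    return max(rates) if rates else None  # bytes per second, or None if no data
```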
Another challenge is that currently the only data we get is response time for
each request, which combines outbound request size (and contention for the
wire), server response turnaround time (i.e. CPU load, disk IO latency, etc),
and response size (and wire contention). In some ways, we want to be able to
distinguish between those components. In other ways, we really only care
about the sum.
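Roughly, the measured time per request decomposes something like this (a back-of-the-envelope model for illustration, not anything the code currently measures):

```python
# Hypothetical model of what a single measured response time contains; all of
# the component terms are assumptions for illustration, not measured values.
def modeled_response_time(request_bytes, response_bytes,
                          upstream_bps, downstream_bps, server_turnaround):
    return (request_bytes / upstream_bps        # sending the request (plus wire contention)
            + server_turnaround                 # CPU load, disk IO latency, etc.
            + response_bytes / downstream_bps)  # receiving the response (plus wire contention)
```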
But mainly, distinguishing between "slow" and "disconnected" may work better
if we had finer-grained information, like how many bytes have arrived at the
socket since the last complete message was received, and how long ago the
last byte arrived. Likewise, when sending data, it would be useful to know
that the transmit buffer still has pending data, and how long it's been since
the last socket write was allowed. This would help tell the difference
between a connection that is alive but only trickling data through slowly,
versus one that has been stopped for the last 10 seconds. With only coarser
per-message data (which, for 3-of-10 and 128KiB segments, means about 40kB
per message), it might be hard to confidently declare a disconnect within a
reasonable multiple of the expected response time.
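As a sketch of the kind of check that finer-grained information would enable, assuming the transport could tell us whenever bytes arrive (which, as noted above, it currently does not):

```python
# Hypothetical sketch: distinguish "alive but trickling" from "stalled",
# given a hook that reports every chunk of bytes arriving on the connection.
import time

class StallDetector:
    def __init__(self, stall_threshold=10.0):
        self.stall_threshold = stall_threshold  # seconds with no bytes at all
        self.last_byte_time = time.time()
        self.bytes_since_last_message = 0

    def data_received(self, nbytes):
        # call for every chunk that arrives, even partial messages
        self.last_byte_time = time.time()
        self.bytes_since_last_message += nbytes

    def message_completed(self):
        self.bytes_since_last_message = 0

    def status(self):
        idle = time.time() - self.last_byte_time
        if idle > self.stall_threshold:
            return "stalled"    # nothing at all has arrived recently
        if self.bytes_since_last_message > 0:
            return "trickling"  # partial data is still dribbling in
        return "ok"
```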
I don't know exactly what sorts of data I'd want or how much of it to keep.
This ticket is to collect ideas on what to collect and what to do with it.
The first concrete improvement would be to record per-request response times
for upload just like we do for download. The second would be to graph them.
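To make that first improvement concrete, here is a minimal sketch of recording per-request response times into a sqlite history file, along the lines of the storage idea above (the schema and helper names are illustrative only):

```python
# Hypothetical sketch: persist per-request response times so later runs can
# aggregate or graph them. Schema and function names are made up.
import sqlite3
import time

def open_history(path="server-history.sqlite"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS request_times"
               " (serverid TEXT, kind TEXT, nbytes INTEGER,"
               "  elapsed REAL, when_ts REAL)")
    return db

def record_request(db, serverid, kind, nbytes, elapsed):
    db.execute("INSERT INTO request_times VALUES (?,?,?,?,?)",
               (serverid, kind, nbytes, elapsed, time.time()))
    db.commit()

# usage sketch:
#   started = time.time()
#   ... send one block to the server and wait for the ack ...
#   record_request(db, serverid, "upload", len(block), time.time() - started)
```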