client feedback channel #484

Open
opened 2008-07-02 00:21:49 +00:00 by warner · 1 comment

It would be nice if clients had a way to report errors and performance results to a central gatherer. This would be configured by dropping a "client-feedback.furl" file into the client's basedir. The client would then use this to send the following information to a gatherer at that FURL:

  • foolscap log "Incidents": severe errors, along with the log events that immediately preceded them
  • speeds/latencies of each network operation: upload/download performance
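A minimal sketch of how the client side might be wired up, assuming the gatherer exposes remote methods named "report_incident" and "report_stats" (those names are placeholders; the actual remote interface is not specified by this ticket):

```python
# Sketch only: reads client-feedback.furl from the basedir and, if present,
# connects to the gatherer over foolscap and sends queued data.
import os
from foolscap.api import Tub

def start_feedback_channel(basedir, incidents, stats):
    furlfile = os.path.join(basedir, "client-feedback.furl")
    if not os.path.exists(furlfile):
        return None  # no furl file: the feature stays disabled

    with open(furlfile) as f:
        furl = f.read().strip()

    tub = Tub()
    tub.startService()  # in the real client this would hang off the node's Tub
    d = tub.getReference(furl)

    def _send(gatherer):
        for incident in incidents:
            gatherer.callRemote("report_incident", incident)   # hypothetical method
        gatherer.callRemote("report_stats", stats)             # hypothetical method
    d.addCallback(_send)
    return d
```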

The issue is that, since Tahoe is such a resilient system, there are a lot of failure modes that aren't visible to users. If a client reads a share and sees the wrong hash, it just uses a different share, and users don't see any problems unless there are many simultaneous failures. However, from the grid-admin/server point of view, a bad hash is a massively unlikely event and indicates serious problems: a disk is failing, a server has bad RAM, a file is corrupted, etc. The server/grid admin wants to know about these, even though the user does not.

Similarly, there are a number of grid tuning issues that are best addressed by learning about the client experience and watching it change over time. When you add new servers to a grid, clients who ask all servers for their shares will take longer to do peer selection. How much longer? The best way to find out is to have clients report their peer-selection time to a gatherer on each operation, so the gatherer can graph it over time. The grid admins might want to make their servers faster if it would improve user experience, but they need to find out what the user experience actually is before they can make that decision.
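A rough illustration of the client-side measurement. The `select_peers` and `record_sample` callables below are hypothetical stand-ins for the existing peer-selection routine and for whatever recorder queues samples for later delivery to the gatherer:

```python
# Illustrative only: wrap the peer-selection step with a timer and hand the
# elapsed time to a recorder that queues it; nothing is sent per-operation.
import time

def timed_peer_selection(select_peers, record_sample):
    start = time.monotonic()
    peers = select_peers()                 # the existing peer-selection routine
    elapsed = time.monotonic() - start
    record_sample("peer_selection_time", elapsed)
    return peers
```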

We might want to make these two separate reporting channels, with separate FURLs. Also, we can batch the reporting of the performance numbers: we don't have to report every single operation. We could cut down on active network connections by only trying to connect once a day and dumping incidents if and when we establish a connection. We need to keep several issues in mind: thundering herd, overloading the gatherer, and bounding the queue size.
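A sketch of the batching idea under those constraints. The queue bound, daily interval, and jitter values are arbitrary placeholders, and `report_batch` is a hypothetical gatherer method:

```python
# Hold samples in a bounded queue, attempt one gatherer connection per day,
# and add random jitter so a restarted grid of clients does not reconnect in
# lockstep (thundering herd).
import random
from collections import deque

MAX_QUEUE = 1000            # bound memory: drop oldest samples when full
BASE_INTERVAL = 24 * 3600   # one connection attempt per day
JITTER = 3600               # spread attempts over an hour

class FeedbackQueue:
    def __init__(self):
        self.samples = deque(maxlen=MAX_QUEUE)

    def add(self, name, value):
        self.samples.append((name, value))

    def next_attempt_delay(self):
        return BASE_INTERVAL + random.uniform(0, JITTER)

    def flush(self, gatherer):
        # send everything we have in one batch, then clear the queue
        batch = list(self.samples)
        self.samples.clear()
        gatherer.callRemote("report_batch", batch)  # hypothetical method
```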

warner added the major, enhancement, 1.1.0 labels 2008-07-02 00:21:49 +00:00
warner added this to the undecided milestone 2008-07-02 00:21:49 +00:00
tahoe-lafs added the code-frontend-cli label 2009-12-04 05:00:04 +00:00
warner added operational and removed code-frontend-cli labels 2010-01-27 22:11:04 +00:00

Does the log-gatherer (source:trunk/docs/logging.rst?rev=861892983369c0e96dc1e73420c1d9609724d752#log-gatherer) satisfy this ticket?


Reference: tahoe-lafs/trac-2024-07-25#484