Implement Halt and Catch Fire #529

Open
opened 2008-10-15 20:46:38 +00:00 by zandr · 2 comments
zandr commented 2008-10-15 20:46:38 +00:00
Owner

There have been a few cases lately where nodes have been in some impaired state, but still responding (if badly) to network requests. This caused other components of the system to block.

If, under these conditions, we instead stopped responding to network requests entirely, the rest of the system would simply ignore the wounded node and move on.

In particular, the recent webapi3 issue would have been invisible to users if the webapi node had stopped responding to HTTP: the balancer would have marked it as failed and moved on.

Same with the prodtahoe7 meltdown.

I acknowledge that deciding when to catch fire is non-trivial, so I'm filing this more to provoke conversation than to request any specific behavior.

tahoe-lafs added the unknown, major, defect, 1.2.0 labels 2008-10-15 20:46:38 +00:00
tahoe-lafs added this to the undecided milestone 2008-10-15 20:46:38 +00:00

Zandr and I were just talking about this one. The basic idea is that it would
be nice if an HTTP load-balancer (which is sitting in front of a farm of
webapi nodes) could cheaply detect that a given webapi node was not in a good
state, and switch traffic to other nodes instead.

To begin with, we could define what it means to be in good state. We could
put a bit of code inside the node, maybe `client.is_fully_functional()`,
with some configurable criteria, maybe one or more of the following:

  • connected to Introducer
  • connected to at least N storage servers
  • connected to all blessed (#466) storage servers
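As a sketch of how those configurable criteria might combine (the names `HealthCriteria` and `is_fully_functional` are hypothetical, not an existing Tahoe-LAFS API):

```python
# Hypothetical sketch of the proposed health check; none of these
# names exist in Tahoe-LAFS today.
from dataclasses import dataclass

@dataclass
class HealthCriteria:
    require_introducer: bool = True          # must be connected to the Introducer
    min_storage_servers: int = 0             # "connected to at least N storage servers"
    blessed_servers: frozenset = frozenset() # "connected to all blessed (#466) servers"

def is_fully_functional(introducer_connected, connected_servers, criteria):
    """Return True only if every configured criterion is satisfied."""
    if criteria.require_introducer and not introducer_connected:
        return False
    if len(connected_servers) < criteria.min_storage_servers:
        return False
    if not criteria.blessed_servers <= frozenset(connected_servers):
        return False
    return True
```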

Then, we could define how we want the webapi interface to behave when these
criteria are not met, one of:

  • webapi port stops listening completely
  • webapi port returns errors on all /uri requests (both reads and writes)
  • webapi port returns errors on all writes (POST or PUT to /uri)
  • webapi port returns some special value for GETs of one specific status URL
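The options other than closing the port amount to a status-code policy. A minimal illustrative dispatch, assuming a hypothetical /status/health probe URL alongside the standard /uri namespace:

```python
def webapi_status(method, path, healthy, refuse_reads=True):
    """Illustrative only: map a request to the HTTP status the webapi
    might return. /status/health is a hypothetical probe URL. With
    refuse_reads=True, all /uri traffic fails while the node is
    impaired; with refuse_reads=False, only writes (POST or PUT) do."""
    if path == "/status/health":
        return 200 if healthy else 503
    if path.startswith("/uri") and not healthy:
        if method in ("POST", "PUT") or refuse_reads:
            return 500
    return 200
```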

The first (stop listening entirely) is most useful for the load balancer,
because these devices typically assume that if a server responds at all, then
it will be able to respond correctly. It would, however, make it difficult
for us to solve the problem, since many of the diagnostic tools we would use
are themselves pages in the webapi. Any of the other options would improve
diagnosability, but would obligate the load-balancer either to look more
carefully at the response (start diverting traffic when it sees 500 Internal
Server Error responses coming back) or to use special probe requests that hit
the status URL on a periodic basis.
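On the balancer side, the periodic-probe approach boils down to a failure-counting policy. A sketch of one such policy (the threshold and the reset-on-success behavior are assumptions, not a description of any particular balancer):

```python
def node_still_in_rotation(probe_statuses, fail_threshold=3):
    """Hypothetical balancer policy: take the node out of rotation once
    it answers `fail_threshold` consecutive probes with a non-200
    status; any successful probe resets the counter."""
    consecutive_failures = 0
    for status in probe_statuses:
        if status == 200:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= fail_threshold:
                return False
    return True
```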

We also kicked around the idea of having two webapi ports: one which turns
itself off if the node is not fully functional, and a second which stays on
all the time. With this sort of scheme, the load-balancer could point at the
first port, and we'd use the second port for diagnostics.
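A minimal sketch of that two-port scheme with raw sockets (port 0 asks the OS for any free port here; a real node would use its configured ports):

```python
import socket

def open_listeners(healthy, balancer_port=0, diag_port=0):
    """Two-port sketch: the diagnostic port always listens, while the
    balancer-facing port only listens when the node is healthy, so an
    impaired node simply vanishes from the balancer's point of view."""
    diag = socket.socket()
    diag.bind(("127.0.0.1", diag_port))
    diag.listen()
    balancer = None
    if healthy:
        balancer = socket.socket()
        balancer.bind(("127.0.0.1", balancer_port))
        balancer.listen()
    return balancer, diag
```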

A tangentially-related issue is that sometimes the node can appear to start
('tahoe start' returns with success) but the node is in fact impaired in some
fatal way. I believe that a node which is unable to open the webapi listening
port will exhibit this behavior. I think there was a change to node startup
recently (the implementation of 'tahoe start') which makes this troublesome:
the bind() call now takes place after the fork(), whereas it used to happen
before the fork(). #602 and #71 probably relate to this one, as well as
#371.
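The fix implied here is to claim the listening port before forking, so that 'tahoe start' can fail loudly instead of reporting success for a node that cannot actually listen. A standalone sketch of that early-bind check (the function name is illustrative):

```python
import socket

def claim_webapi_port(port):
    """Try to bind the webapi port up front, i.e. before any fork().
    Returns the bound socket on success, or None if the port is already
    taken, letting startup code exit non-zero instead of reporting a
    successful start for a node that cannot listen."""
    s = socket.socket()
    try:
        s.bind(("127.0.0.1", port))
    except OSError:
        s.close()
        return None
    return s
```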

warner added code-frontend-web and removed unknown labels 2009-03-08 21:59:55 +00:00

See #912 for a request that when the node knows that it is failing it sends out an alert.

Reference: tahoe-lafs/trac-2024-07-25#529