Incomplete ServerMap triggers UncoordinatedWriteError upon mutable Publish #1795


tahoe-lafs · 2012-07-25T03:04:43Z

jean commented

2012-07-25 03:04:43 +00:00

This error has been seen in the wild while working on the Tamias system, which uses tahoe-lafs as a storage layer. It seems to show up much more often in our testing environment, where many clients (client nodes, not storage nodes) connect to and leave the network at high frequency.

Before overwriting a mutable file, the client builds a servermap using MODE_WRITE. This mode does not query all servers: it stops querying once 'epsilon' consecutive servers have stated that they do not have a share. When this happens ("hit boundary" in the log), the servermap is considered done if all servers to the left of the boundary have answered.
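The stopping rule described above can be sketched as follows. This is a simplified model and not the tahoe-lafs code: the function names `find_boundary` and `map_is_done`, and the encoding of a server's state as `True` / `False` / `None`, are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hypothetical per-server states, listed in query order:
#   True  -> the server answered and holds a share
#   False -> the server answered and holds no share
#   None  -> the query is still outstanding


def find_boundary(states: Sequence[Optional[bool]], epsilon: int) -> Optional[int]:
    """Return the index where the first run of `epsilon` consecutive
    'no share' answers begins, or None if there is no such run yet."""
    run = 0
    for i, state in enumerate(states):
        if state is False:
            run += 1
            if run == epsilon:
                return i - epsilon + 1
        else:
            # A share, or an unanswered query, breaks the run.
            run = 0
    return None


def map_is_done(states: Sequence[Optional[bool]], epsilon: int) -> bool:
    """The map counts as done once a boundary exists and every server
    to the left of it has answered."""
    boundary = find_boundary(states, epsilon)
    if boundary is None:
        return False
    return all(state is not None for state in states[:boundary])
```

With `epsilon = 2`, the states `[True, True, False, False, None]` count as done (boundary at index 2, both servers to its left have answered), while `[True, None, False, False]` do not, because one server to the left of the boundary is still outstanding.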

In some corner cases, all those servers have answered, but the timing is such that a server is already marked as having a share while its share information has not been processed yet. Because there are several concurrent calls to check_for_done, one of them might decide that the servermap update can stop running, which prevents the last share from being processed.
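The ordering problem can be reproduced in a deterministic toy model. The class below is not the real updater: `ToyMapUpdate`, `mark_answered`, `process_share` and `simulate_race` are hypothetical names, and the "concurrent" call to `check_for_done` is simply placed by hand between the two steps.

```python
class ToyMapUpdate:
    """Toy stand-in for a servermap update; all names are hypothetical."""

    def __init__(self, expected_servers):
        self.expected = set(expected_servers)
        self.good_servers = set()   # servers counted as having answered
        self.servermap = {}         # server -> share number actually recorded
        self.running = True

    def mark_answered(self, server):
        # Step 1: the server is counted as soon as its answer arrives.
        self.good_servers.add(server)

    def process_share(self, server, shnum):
        # Step 2: the share is recorded later, and dropped if the update
        # has been stopped in the meantime.
        if not self.running:
            return False
        self.servermap[server] = shnum
        return True

    def check_for_done(self):
        # Stops the update once every expected server has answered.
        if self.expected <= self.good_servers:
            self.running = False
        return not self.running


def simulate_race():
    update = ToyMapUpdate(["A", "B"])
    update.mark_answered("A")
    update.process_share("A", 0)
    update.mark_answered("B")
    update.check_for_done()                    # lands between step 1 and step 2 for B
    processed = update.process_share("B", 1)   # too late, the share is filtered out
    return update.servermap, processed
```

In this model `simulate_race()` returns `({"A": 0}, False)`: server B answered, but its share never reaches the map.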

This results in a partial servermap. When the Publish operation starts, it might select the last server - the one missing from the servermap - as a candidate for the missing share. It will then issue a testv that checks for the absence of a share. This testv fails because the server does hold a share, and an UncoordinatedWriteError (UCW) is triggered.
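A minimal sketch of why the test vector fails, assuming a much simplified storage model. The helpers `conditional_write` and `publish_to` are hypothetical, the exception class is a local stand-in for the error named in this ticket, and the case where the share is already known is reduced to an unconditional write.

```python
class UncoordinatedWriteError(Exception):
    """Local stand-in for the error named in this ticket."""


def conditional_write(stored, shnum, must_be_absent, data):
    """Toy test-and-set: refuse the write if the share was expected to be
    absent but is already stored on the server."""
    if must_be_absent and shnum in stored:
        return False
    stored[shnum] = data
    return True


def publish_to(server, shnum, data, servermap, storage):
    # The publisher only knows what the (possibly partial) servermap says.
    must_be_absent = (server, shnum) not in servermap
    if not conditional_write(storage[server], shnum, must_be_absent, data):
        raise UncoordinatedWriteError(
            f"test vector failed on {server} for share {shnum}"
        )
```

With `storage = {"B": {1: b"old"}}` and an empty servermap, `publish_to("B", 1, b"new", set(), storage)` raises, even though no other writer was involved. The only cause is that the map did not record B's share.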

This can be seen in the attached log starting from event 6750: the boundary is found at 6898, and 6899 stops the servermap update. Event 6900 has the partial servermap, and events 6903 and 6904 show the processing of the last share, which is filtered out because the servermap update has already been stopped. Events 6918 and 6920 show the servermap before (partial) and after (unfortunately choosing the 'hidden' server whose answer was discarded). This leads to the eventual UCW at event 6955, triggered by the failed testv at 6953.

In our testing environment, we use the attached workaround, which moves the addition to the good_servers list to the very bottom of the DeferredList that is built per server. This is expected to cause problems when servers have multiple shares, but it is just a temporary fix anyway.
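The idea behind the workaround, as opposed to the attached patch itself, can be illustrated in a toy model. The class and method names are hypothetical, and the multiple-shares-per-server case mentioned above is not modelled.

```python
class ToyMapUpdateWorkaround:
    """Toy model of the reordering: a server is counted as answered only
    after its share has been recorded. All names are hypothetical."""

    def __init__(self, expected_servers):
        self.expected = set(expected_servers)
        self.good_servers = set()
        self.servermap = {}
        self.running = True

    def handle_response(self, server, shnum):
        # Record the share first ...
        if self.running:
            self.servermap[server] = shnum
        # ... and only then count the server as having answered.
        self.good_servers.add(server)

    def check_for_done(self):
        if self.expected <= self.good_servers:
            self.running = False
        return not self.running
```

Because `good_servers` is updated last, a `check_for_done` call that runs before B's share is recorded still sees B as outstanding and leaves the update running.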
