inappropriate "uncoordinated write error" after handling a server failure #540

warner · 2008-11-25T20:04:28Z

warner commented

2008-11-25 20:04:28 +00:00

I noticed the automated "speedtest" failing with an unexpected Uncoordinated Write Error for the past few days. There were several issues involved, but the one for this ticket is as follows:

* mutable publish assigns shares to servers, sends out requests. Let's say that share 1 goes to server A, and share 2 goes to server B.
* for whatever reason, server A returns an error
* the publish process must find a new server for share 1, say it picks B
* the publish process sends a readv-and-testv-and-writev for share 1 to server B
* **but**, it uses the same test vector that it used for the first request (the one that wrote share 2), which includes a clause that says "the server should not have any unknown shares". This probably only hits when we're first creating the mutable file.
* server B receives the request for share 2, and accepts it, and responds with success
* server B then receives the request for share 1, looks at the test vector, says "hey, but I already have a share (i.e. share 2)", so the test vector does not match, so the write is rejected
* the publish process sees the rejected write and concludes that someone else must have written a share at the same time, so it throws Uncoordinated Write Error

So really the sole publisher is colliding with themselves.
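
To make the sequence above concrete, here is a toy reproduction. It is only a sketch: `ToyServer`, `test_and_write`, and `naive_publish` are hypothetical names, and the "test vector" is simplified to "the exact set of shares I expect the server to hold", which is not how the real readv-and-testv-and-writev request is encoded.

```python
class ToyServer:
    """Hypothetical stand-in for a storage server (not the real storage API)."""

    def __init__(self):
        self.shares = {}  # share number -> version string

    def test_and_write(self, expected, shnum, version):
        # Apply the write only if the server's current shares match
        # exactly what the writer expected to find.
        if self.shares != expected:
            return False
        self.shares[shnum] = version
        return True


def naive_publish():
    """Replay the scenario: server A fails, so share 1 is re-sent to server B."""
    server_b = ToyServer()
    testv = {}  # "the server should not have any unknown shares"
    results = []
    # Share 2 goes to server B and is accepted.
    results.append(server_b.test_and_write(testv, 2, "ver1"))
    # Server A returned an error, so share 1 is reassigned to B,
    # reusing the same (now stale) test vector.
    results.append(server_b.test_and_write(testv, 1, "ver1"))
    return results


print(naive_publish())
```

In this model the second write is rejected even though no other writer exists, which is the condition the publisher then misreads as an uncoordinated write.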

I think the fix would be to have the publisher keep track of which share requests it has sent, perhaps in the servermap (as "pending writes", or "proposed writes"). When the second writev request is generated, it should build a test vector based upon the pending write (so it includes share 2).
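
A minimal sketch of that idea follows, using a simplified model rather than the real servermap. The `Publisher` class, its `known` and `pending` dictionaries, and the `ToyServer` it talks to are all hypothetical. The sketch also assumes that requests to one server are processed in the order they were sent and that earlier pending writes succeed; a real fix would have to handle a pending write that fails.

```python
class ToyServer:
    """Hypothetical stand-in for a storage server (not the real storage API)."""

    def __init__(self):
        self.shares = {}  # share number -> version string

    def test_and_write(self, expected, shnum, version):
        # Apply the write only if current shares match the expectation.
        if self.shares != expected:
            return False
        self.shares[shnum] = version
        return True


class Publisher:
    """Hypothetical publisher that remembers its own in-flight writes."""

    def __init__(self):
        self.known = {}    # server id -> {shnum: version} seen during mapupdate
        self.pending = {}  # server id -> {shnum: version} we have already sent

    def build_testv(self, server_id):
        # Start from what we last saw on the server, then overlay the
        # writes we have already sent to it ("pending writes").
        expected = dict(self.known.get(server_id, {}))
        expected.update(self.pending.get(server_id, {}))
        return expected

    def send_write(self, server_id, server, shnum, version):
        testv = self.build_testv(server_id)
        # Record the write as pending before sending the next request.
        self.pending.setdefault(server_id, {})[shnum] = version
        return server.test_and_write(testv, shnum, version)


publisher = Publisher()
server_b = ToyServer()
first = publisher.send_write("B", server_b, 2, "ver1")   # share 2 -> B
second = publisher.send_write("B", server_b, 1, "ver1")  # share 1 reassigned to B
print(first, second, server_b.shares)
```

Because the second test vector is built from the pending write, it already expects share 2 on server B, so the publisher no longer collides with itself in this model.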

warner added the

labels 2008-11-25 20:04:28 +00:00

warner added this to the undecided milestone 2008-11-25 20:04:28 +00:00

warner commented

2009-12-29 18:50:27 +00:00

I think the publisher can hit this for already-existing files too, where the first message says "I think you have sh1=ver1, here is sh1=ver2", and then (because of some other server having an error) it wants to add a second share to that same server, so it sends "I think you have sh1=ver1, here is sh2=ver2", and is surprised when the server says "actually I have sh1=ver2 you numbskull".

I think zooko's `incident-2009-07-29-104230-vyc6byy.flog.bz2` in ticket #786 is related, but I haven't been able to figure it out exactly. It reports a surprise, but the log event says that their report matches our expectations, which makes me think that the code which logs the event is showing a different "expectation" than the one that was bundled in the testv portion of the share-write request. It feels like two messages being sent at the same time to the same server.

zooko commented

2010-01-14 00:02:48 +00:00

This might be related to #899, newly reported by Kyle Markley and Andrej Falout.

tahoe-lafs added critical and removed major labels 2010-03-24 22:43:35 +00:00

zooko commented

2010-05-26 14:42:23 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

zooko modified the milestone from undecided to 1.8.0

2010-05-26 14:42:23 +00:00

kevan commented

2010-05-28 02:37:18 +00:00

I'm almost certain that I'll end up squashing this with MDMF, so I'll assign it to myself.

tahoe-lafs modified the milestone from 1.8.0 to 1.9.0

2010-08-10 03:37:42 +00:00

zooko commented

2010-08-10 04:24:38 +00:00

If you like this ticket, you might like #546 (mutable-file surprise shares raise inappropriate UCWE).