inappropriate "uncoordinated write error" after handling a server failure #540

Open
opened 2008-11-25 20:04:28 +00:00 by warner · 7 comments

I noticed the automated "speedtest" failing with an unexpected Uncoordinated Write Error for the past few days. There were several issues involved, but the one for this ticket is as follows:

  • mutable publish assigns shares to servers, sends out requests. Let's say that share 1 goes to server A, and share 2 goes to server B.
  • for whatever reason, server A returns an error
  • the publish process must find a new server for share 1, say it picks B
  • the publish process sends a readv-and-testv-and-writev for share 1 to server B
  • but, it uses the same test vector that it used for the first request (the one that wrote share 2), which includes a clause that says "the server should not have any unknown shares". This probably only hits when we're first creating the mutable file.
  • server B receives the request for share 2, and accepts it, and responds with success
  • server B then receives the request for share 1, looks at the test vector, says "hey, but I already have a share (i.e. share 2)", so the test vector does not match, so the write is rejected
  • the publish process sees the rejected write and concludes that someone else must have written a share at the same time, so it throws Uncoordinated Write Error

So really the sole publisher is colliding with themselves.
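
A toy model of the sequence above, just to make the self-collision concrete. `ToyServer`, `test_and_write`, and `publish_share` are made up for illustration and are not the real storage-server or publisher API; the "test vector" is simplified to "the server must hold exactly the shares I expect".

```python
class UncoordinatedWriteError(Exception):
    """Toy stand-in for the real Uncoordinated Write Error."""


class ToyServer:
    """Hypothetical in-memory server holding shares for one mutable file."""

    def __init__(self):
        self.shares = {}  # share number -> version

    def test_and_write(self, expected, writes):
        # The test passes only if the server holds exactly the shares the
        # client expects: no unknown shares, no unexpected versions.
        if self.shares != expected:
            return False
        self.shares.update(writes)
        return True


def publish_share(server, expected, shnum, version):
    # Hypothetical helper: a rejected write is reported as UCWE.
    if not server.test_and_write(expected, {shnum: version}):
        raise UncoordinatedWriteError(f"sh{shnum} rejected")


server_b = ToyServer()
initial_expectation = {}  # creating the file: the server should hold nothing

# Share 2 was assigned to B from the start, so this write is accepted.
publish_share(server_b, initial_expectation, 2, "ver1")

# Server A failed, so share 1 is reassigned to B. The publisher reuses the
# original expectation and trips over its own earlier write.
try:
    publish_share(server_b, initial_expectation, 1, "ver1")
    collided = False
except UncoordinatedWriteError:
    collided = True
```

Nobody else ever wrote to `server_b` here, yet the second call still ends in the UCWE path.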

I think the fix would be to have the publisher keep track of which share requests it has sent, perhaps in the servermap (as "pending writes", or "proposed writes"). When the second writev request is generated, it should build a test vector based upon the pending write (so it includes share2).
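
A minimal sketch of that idea, assuming a toy synchronous model. `ToyServer` and `ToyPublisher` are hypothetical names, not the real servermap or publisher classes; the only point is that the expectation for each request is built from what this publisher has already sent to that server.

```python
class ToyServer:
    """Hypothetical in-memory server holding shares for one mutable file."""

    def __init__(self):
        self.shares = {}  # share number -> version

    def test_and_write(self, expected, writes):
        # Accept the write only if the server holds exactly what is expected.
        if self.shares != expected:
            return False
        self.shares.update(writes)
        return True


class ToyPublisher:
    """Hypothetical publisher that tracks its own pending writes per server."""

    def __init__(self):
        self.pending = {}  # server -> {share number: version} already sent

    def send_write(self, server, shnum, version):
        # Build the expectation from writes we have already sent to this
        # server, instead of reusing the view from before publishing began.
        expected = dict(self.pending.get(server, {}))
        # Record the new write as pending at send time.
        self.pending[server] = {**expected, shnum: version}
        accepted = server.test_and_write(expected, {shnum: version})
        if not accepted:
            # Roll back: the server's state is not what we thought it was.
            self.pending[server] = expected
        return accepted


publisher = ToyPublisher()
server_b = ToyServer()

first = publisher.send_write(server_b, 2, "ver1")   # original assignment
second = publisher.send_write(server_b, 1, "ver1")  # reassigned from server A
```

In this toy model the second write is accepted because its expectation already includes share 2. A real implementation would also have to decide what to expect while the first request is still in flight or has failed, which this sketch does not address.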

warner added the code-mutable, major, defect, 1.2.0 labels 2008-11-25 20:04:28 +00:00
warner added this to the undecided milestone 2008-11-25 20:04:28 +00:00
Author

I think the publisher can also hit this for already-existing files, where the first message says "I think you have sh1=ver1, here is sh1=ver2", and then (because of some other server having an error) it wants to add a second share to that same server, so it sends "I think you have sh1=ver1, here is sh2=ver2", and is surprised when the server says "actually I have sh1=ver2 you numbskull".
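
The same thing as a toy exchange, under the assumption that the server compares its current shares against the client's expectation. `ToyServer` and `test_and_write` are hypothetical, not the real storage API.

```python
class ToyServer:
    """Hypothetical in-memory server holding shares for one mutable file."""

    def __init__(self, shares):
        self.shares = dict(shares)  # share number -> version

    def test_and_write(self, expected, writes):
        # Accept the write only if the server holds exactly what is expected.
        if self.shares != expected:
            return False
        self.shares.update(writes)
        return True


server = ToyServer({1: "ver1"})
stale_view = {1: "ver1"}  # what the publisher believed before it started

# "I think you have sh1=ver1, here is sh1=ver2"
first = server.test_and_write(stale_view, {1: "ver2"})

# "I think you have sh1=ver1, here is sh2=ver2" (server now has sh1=ver2)
second = server.test_and_write(stale_view, {2: "ver2"})
```

Here `first` is accepted and `second` is rejected, even though both requests came from the same publisher.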

I think zooko's `incident-2009-07-29-104230-vyc6byy.flog.bz2` in ticket #786 is related, but I haven't been able to figure it out exactly. It reports a surprise, but the log event says that their report matches our expectations, which makes me think that the code which logs the event is showing a different "expectation" than the one that was bundled in the testv portion of the share-write request. It feels like two messages being sent at the same time to the same server.

This might be related to #899, newly reported by Kyle Markley and Andrej Falout.

tahoe-lafs added critical and removed major labels 2010-03-24 22:43:35 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

zooko modified the milestone from undecided to 1.8.0 2010-05-26 14:42:23 +00:00
kevan commented 2010-05-28 02:37:18 +00:00
Owner

I'm almost certain that I'll end up squashing this with MDMF, so I'll assign it to myself.

tahoe-lafs modified the milestone from 1.8.0 to 1.9.0 2010-08-10 03:37:42 +00:00

If you like this ticket, you might like #546 (mutable-file surprise shares raise inappropriate UCWE).

If you like this ticket, you might like #547 (mapupdate(MODE_WRITE) triggers on a false boundary).

davidsarah commented 2011-07-16 20:34:38 +00:00
Owner

Kevan will look at whether his MDMF patches squash this.

tahoe-lafs modified the milestone from 1.9.0 to soon 2011-07-16 20:44:36 +00:00
zooko added normal and removed critical labels 2012-11-13 23:27:23 +00:00
Reference: tahoe-lafs/trac-2024-07-25#540