Exception: <class 'allmydata.mutable.common.UncoordinatedWriteError'> when trying to create directory on testgrid #748

New Issue

tahoe-lafs · 2009-07-04T21:44:07Z

toby.murray commented

2009-07-04 21:44:07 +00:00

In a random attempt to see what the new CSS wui looks like, I connected to testgrid.allmydata.com:3567 just now.

I clicked the "Create a directory" button and was presented with the exception listed above. I'm afraid I see no further information to provide but the error is reproducable. The actual HTML source returned is the following:

<html><head><title>Exception</title></head><body><style type="text/css">
p.error {
  color: black;
  font-family: Verdana, Arial, helvetica, sans-serif;
  font-weight: bold;
  font-size: large;
  margin: 0.25em;
}

div {
  font-family: Verdana, Arial, helvetica, sans-serif;
}

strong.variableClass {
  font-size: small;
}

div.stackTrace {
}

div.frame {
  padding: 0.25em;
  background: white;
  border-bottom: thin black dotted;
}

div.firstFrame {
  padding: 0.25em;
  background: white;
  border-top: thin black dotted;
  border-bottom: thin black dotted;
}

div.location {
    font-size: small;
}

div.snippet {
  background: #FFFFDD;
  padding: 0.25em;
}

div.snippetHighlightLine {
  color: red;
}

span.lineno {
    font-size: small;
}

pre.code {
  margin: 0px;
  padding: 0px;
  display: inline;
  font-size: small;
  font-family: "Courier New", courier, monotype;
}

span.function {
  font-weight: bold;
  font-family: "Courier New", courier, monotype;
}

table.variables {
  border-collapse: collapse;
  width: 100%;
}

td.varName {
  width: 1in;
  vertical-align: top;
  font-style: italic;
  font-size: small;
  padding-right: 0.25em;
}

td.varValue {
  padding-left: 0.25em;
  padding-right: 0.25em;
  font-size: small;
}

div.variables {
  margin-top: 0.5em;
}

div.dict {
  background: #cccc99;
  padding: 2px;
  float: left;
}

td.dictKey {
  background: #ffff99;
  font-weight: bold;
}

td.dictValue {
  background: #ffff99;
}

div.list {
  background: #7777cc;
  padding: 2px;
  float: left;
}

div.listItem {
  background: #9999ff;
}

div.instance {
  width: 100%;
  background: #efefef;
  padding: 2px;
  float: left;
}

span.instanceName {
  font-size: small;
  display: block;
}

span.instanceRepr {
  font-family: "Courier New", courier, monotype;
}

div.function {
  background: orange;
  font-weight: bold;
  float: left;
}
</style><a href="#tracebackEnd"><p class="error">&lt;class 'allmydata.mutable.common.UncoordinatedWriteError'&gt;: </p></a><div class="stackTrace"></div><a name="tracebackEnd"><p class="error">&lt;class 'allmydata.mutable.common.UncoordinatedWriteError'&gt;: </p></a></body></html>

In a random attempt to see what the new CSS wui looks like, I connected to testgrid.allmydata.com:3567 just now. I clicked the "Create a directory" button and was presented with the exception listed above. I'm afraid I see no further information to provide but the error is reproducable. The actual HTML source returned is the following: ``` <html><head><title>Exception</title></head><body><style type="text/css"> p.error { color: black; font-family: Verdana, Arial, helvetica, sans-serif; font-weight: bold; font-size: large; margin: 0.25em; } div { font-family: Verdana, Arial, helvetica, sans-serif; } strong.variableClass { font-size: small; } div.stackTrace { } div.frame { padding: 0.25em; background: white; border-bottom: thin black dotted; } div.firstFrame { padding: 0.25em; background: white; border-top: thin black dotted; border-bottom: thin black dotted; } div.location { font-size: small; } div.snippet { background: #FFFFDD; padding: 0.25em; } div.snippetHighlightLine { color: red; } span.lineno { font-size: small; } pre.code { margin: 0px; padding: 0px; display: inline; font-size: small; font-family: "Courier New", courier, monotype; } span.function { font-weight: bold; font-family: "Courier New", courier, monotype; } table.variables { border-collapse: collapse; width: 100%; } td.varName { width: 1in; vertical-align: top; font-style: italic; font-size: small; padding-right: 0.25em; } td.varValue { padding-left: 0.25em; padding-right: 0.25em; font-size: small; } div.variables { margin-top: 0.5em; } div.dict { background: #cccc99; padding: 2px; float: left; } td.dictKey { background: #ffff99; font-weight: bold; } td.dictValue { background: #ffff99; } div.list { background: #7777cc; padding: 2px; float: left; } div.listItem { background: #9999ff; } div.instance { width: 100%; background: #efefef; padding: 2px; float: left; } span.instanceName { font-size: small; display: block; } span.instanceRepr { font-family: "Courier New", courier, monotype; } div.function { background: orange; font-weight: bold; float: left; } </style><a href="#tracebackEnd"><p class="error"><class 'allmydata.mutable.common.UncoordinatedWriteError'>: </p></a><div class="stackTrace"></div><a name="tracebackEnd"><p class="error"><class 'allmydata.mutable.common.UncoordinatedWriteError'>: </p></a></body></html> ```

tahoe-lafs added the

labels 2009-07-04 21:44:07 +00:00

tahoe-lafs added this to the undecided milestone 2009-07-04 21:44:07 +00:00

zooko commented

2009-07-04 21:54:20 +00:00

Please look in your ~/.tahoe/logs/incidents directory and attach any incident report files that occurred around this time.

Please look in your `~/.tahoe/logs/incidents` directory and attach any incident report files that occurred around this time.

toby.murray commented

2009-07-04 22:03:58 +00:00

I don't understand. This is not on a Tahoe instance running on my machine. It is on the testgrid at testgrid.allmydata.com. Could you visit http://testgrid.allmydata.com:3567 and see if you can reproduce this?

I don't understand. This is not on a Tahoe instance running on my machine. It is on the testgrid at testgrid.allmydata.com. Could you visit <http://testgrid.allmydata.com:3567> and see if you can reproduce this?

zooko commented

2009-07-04 22:48:40 +00:00

Yes, I can. Thanks for the bug report!

warner added

code-mutable

and removed

unknown

labels 2009-07-11 11:34:58 +00:00

zooko commented

2009-07-14 02:01:06 +00:00

Hm, I guess we should look into this.

zooko modified the milestone from undecided to 1.5.0

2009-07-14 02:01:06 +00:00

zooko commented

2009-07-14 02:02:13 +00:00

Soultcer just reported the same thing on IRC.

zooko commented

2009-07-14 04:36:25 +00:00

Okay, so this could be caused by one or more of the following known issues: #651 (handle MemoryError by failing quickly and loudly) (I added a note about this issue on that issue just now, #548 (mutable publish sends queries to servers that have already been asked), #547 (mapupdate(MODE_WRITE) triggers on a false boundary), #546 (mutable-file surprise shares raise inappropriate UCWE), #540 (inappropriate "uncoordinated write error" after handling a server failure). I will investigate more in the evening after work and next weekend.

zooko commented

2009-07-14 14:11:09 +00:00

Attached is the foolscap logtool incident report. The critical excerpt of that report looks like this:

# 05:32:47.692 [22866716]: entering loop
# 05:32:47.693 [22866717]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.693 [22866718]: we are planning to push new seqnum=#1
# 05:32:47.693 [22866719]: 10 queries still outstanding
# 05:32:47.697 [22866720]: 10 shares sent
# 05:32:47.744 [22866724]: _got_write_answer from qvqv7jmm
# 05:32:47.744 [22866725]: entering loop
# 05:32:47.745 [22866726]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.745 [22866727]: we are planning to push new seqnum=#1
# 05:32:47.745 [22866728]: 9 queries still outstanding
# 05:32:47.746 [22866729]: _got_write_answer from 6fyx5u4z
# 05:32:47.746 [22866730]: entering loop
# 05:32:47.747 [22866731]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.747 [22866732]: we are planning to push new seqnum=#1
# 05:32:47.747 [22866733]: 8 queries still outstanding
# 05:32:47.748 [22866734]: _got_write_answer from trjdor3o
# 05:32:47.748 [22866735]: entering loop
# 05:32:47.749 [22866736]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.749 [22866737]: we are planning to push new seqnum=#1
# 05:32:47.749 [22866738]: 7 queries still outstanding
# 05:32:47.750 [22866739]: _got_write_answer from fcmlx6em
# 05:32:47.751 [22866740]: entering loop
# 05:32:47.751 [22866741]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji]
# 05:32:47.751 [22866742]: we are planning to push new seqnum=#1
# 05:32:47.752 [22866743]: 6 queries still outstanding
# 05:32:47.753 [22866744]: _got_write_answer from fcmlx6em

    * 05:32:47.753 [22866745]: WEIRD our testv failed, so the write did not happen [INCIDENT-TRIGGER]
    * 05:32:51.776 [22866746]: somebody modified the share on us: shnum=5: I thought they had #1:R=7ahx, but testv reported #1:R=7ahx

Hm, so far I don't understand how the client was surprised by this chain of events.

But, now it is time to go to work so I'll wonder about it later.

Attached is the foolscap logtool incident report. The critical excerpt of that report looks like this: ``` # 05:32:47.692 [22866716]: entering loop # 05:32:47.693 [22866717]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji] # 05:32:47.693 [22866718]: we are planning to push new seqnum=#1 # 05:32:47.693 [22866719]: 10 queries still outstanding # 05:32:47.697 [22866720]: 10 shares sent # 05:32:47.744 [22866724]: _got_write_answer from qvqv7jmm # 05:32:47.744 [22866725]: entering loop # 05:32:47.745 [22866726]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji] # 05:32:47.745 [22866727]: we are planning to push new seqnum=#1 # 05:32:47.745 [22866728]: 9 queries still outstanding # 05:32:47.746 [22866729]: _got_write_answer from 6fyx5u4z # 05:32:47.746 [22866730]: entering loop # 05:32:47.747 [22866731]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji] # 05:32:47.747 [22866732]: we are planning to push new seqnum=#1 # 05:32:47.747 [22866733]: 8 queries still outstanding # 05:32:47.748 [22866734]: _got_write_answer from trjdor3o # 05:32:47.748 [22866735]: entering loop # 05:32:47.749 [22866736]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji] # 05:32:47.749 [22866737]: we are planning to push new seqnum=#1 # 05:32:47.749 [22866738]: 7 queries still outstanding # 05:32:47.750 [22866739]: _got_write_answer from fcmlx6em # 05:32:47.751 [22866740]: entering loop # 05:32:47.751 [22866741]: current goal: before update: , sh0 to [trjdor3o], sh1 to [xiktf6ok], sh2 to [jfdpabh3], sh3 to [z2unqjbp], sh4 to [qvqv7jmm], sh5 to [fcmlx6em], sh6 to [6fyx5u4z], sh7 to [ivjakubr], sh8 to [wfninubk], sh9 to [lwkv6cji] # 05:32:47.751 [22866742]: we are planning to push new seqnum=#1 # 05:32:47.752 [22866743]: 6 queries still outstanding # 05:32:47.753 [22866744]: _got_write_answer from fcmlx6em * 05:32:47.753 [22866745]: WEIRD our testv failed, so the write did not happen [INCIDENT-TRIGGER] * 05:32:51.776 [22866746]: somebody modified the share on us: shnum=5: I thought they had #1:R=7ahx, but testv reported #1:R=7ahx ``` Hm, so far I don't understand how the client was surprised by this chain of events. But, now it is time to go to work so I'll wonder about it later.

zooko commented

2009-07-14 14:11:26 +00:00

Attachment incident-2009-07-14-053247-3x6tsnq.flog.bz2 (76120 bytes) added

**Attachment** incident-2009-07-14-053247-3x6tsnq.flog.bz2 (76120 bytes) added

incident-2009-07-14-053247-3x6tsnq.flog.bz2

74 KiB

warner commented

2009-07-15 07:09:37 +00:00

since zooko found a MemoryError on one of the testgrid servers, this is most likely another instance of #540.

zooko commented

2009-07-15 13:38:40 +00:00

I upgraded and restarted the tahoebs5 storage servers, but I still get an UCWE from testgrid.allmydata.org:3456.

Attached are two incident reports:

I upgraded and restarted the tahoebs5 storage servers, but I still get an UCWE from testgrid.allmydata.org:3456. Attached are two incident reports:

zooko commented

2009-07-15 13:39:01 +00:00

Attachment incident-2009-07-15-063507-jljozha.flog.bz2 (74310 bytes) added

**Attachment** incident-2009-07-15-063507-jljozha.flog.bz2 (74310 bytes) added

incident-2009-07-15-063507-jljozha.flog.bz2

73 KiB

zooko commented

2009-07-15 13:39:11 +00:00

Attachment incident-2009-07-15-063511-xyeh2hy.flog.bz2 (73453 bytes) added

**Attachment** incident-2009-07-15-063511-xyeh2hy.flog.bz2 (73453 bytes) added

incident-2009-07-15-063511-xyeh2hy.flog.bz2

72 KiB

zooko commented

2009-07-16 03:38:06 +00:00

Terrell Russell pointed out that http://testgrid.allmydata.org:3567 currently says:

Connected to 350  of 61 known storage servers

Perhaps this issue (ticket #748) is caused by #653.

Terrell Russell pointed out that <http://testgrid.allmydata.org:3567> currently says: ``` Connected to 350 of 61 known storage servers ``` Perhaps this issue (ticket #748) is caused by #653.

zooko commented

2009-07-16 16:25:29 +00:00

Grr. My blog broke. My blog is hosted on the Test Grid. That was really the last straw! I don't mind if everyone else who uses the Test Grid is inconvenienced by it being impossible to create directories, but I'm not going to leave my blog broken! So I upgraded the web gateway from:

allmydata-tahoe: 1.4.1-r3897, foolscap: 0.4.1, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev

allmydata-tahoe: 1.4.1-r3995, foolscap: 0.4.1, pycryptopp: 0.5.15, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, sqlite: 3.4.2, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev, pysqlite: 2.3.2

Now instead of Connected to 350 of 61 known storage servers it says Connected to 14 of 24 known storage servers and my blog works nicely. Also, people can create directories again.

This doesn't mean that this issue is resolved for v1.5 release though. We still need to investigate.

Grr. My blog broke. [My blog](http://testgrid.allmydata.org:3567/uri/URI:DIR2-RO:j74uhg25nwdpjpacl6rkat2yhm:kav7ijeft5h7r7rxdp5bgtlt3viv32yabqajkrdykozia5544jqa/wiki.html) is hosted on the Test Grid. That was really the last straw! I don't mind if everyone else who uses the Test Grid is inconvenienced by it being impossible to create directories, but I'm not going to leave my blog broken! So I upgraded the web gateway from: ``` allmydata-tahoe: 1.4.1-r3897, foolscap: 0.4.1, pycryptopp: 0.5.2-1, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev ``` ``` allmydata-tahoe: 1.4.1-r3995, foolscap: 0.4.1, pycryptopp: 0.5.15, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.18, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, sqlite: 3.4.2, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.16-12, setuptools: 0.6c12dev, pysqlite: 2.3.2 ``` Now instead of `Connected to 350 of 61 known storage servers` it says `Connected to 14 of 24 known storage servers` and my blog works nicely. Also, people can create directories again. This doesn't mean that this issue is resolved for v1.5 release though. We still need to investigate.

zooko commented

2009-07-18 16:51:48 +00:00

Okay, we investigated (details posted on #653), and I'm pretty that this was caused by #653, and I'm pretty sure that #653 has been fixed. Closing this as "fixed".