UncoordinatedWriteError on prod grid #899

Closed
opened 2010-01-13 18:44:55 +00:00 by zooko · 9 comments

Kyle Markley reported this on the tahoe-dev list:

http://allmydata.org/pipermail/tahoe-dev/2010-January/003554.html

It could be related to #540, #877, or #893.

I'll ask Kyle to supply more diagnostic info on this ticket.

zooko added the code-mutable, major, defect, 1.5.0 labels 2010-01-13 18:44:55 +00:00
zooko added this to the undecided milestone 2010-01-13 18:44:55 +00:00
kmarkley86 commented 2010-01-14 06:32:41 +00:00
Owner

Attachment logs.tgz (1300514 bytes) added

UncoordinatedWriteError log

kmarkley86 commented 2010-01-14 06:34:19 +00:00
Owner

allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

Mutable File Publish Status

* Started: 00:04:12 13-Jan-2010
* Storage Index: mcw73tlgpejftxf55c5bjmiczi
* Helper?: No
* Current Size: 470
* Progress: 20.0%
* Status: [UncoordinatedWriteError](wiki/UncoordinatedWriteError)

Retrieve Results

* Encoding: 3 of 10
* Sharemap:
      o 0 -> Placed on ehnfmjtc
      o 4 -> Placed on [5q4fx2pb]
      o 5 -> Placed on ctchgzgn
* Timings:
      o Total: 1.24s (380Bps)
            + Setup: 581us
            + Encrypting: 37us (12.40MBps)
            + Encoding: 55us (8.53MBps)
            + Packing Shares: 9.0ms (52.1kBps)
                  # RSA Signature: 8.0ms
            + Pushing: 1.23s (383Bps)
      o Per-Server Response Times:
            + ctchgzgn: 77ms
            + ehnfmjtc: 67ms
            + fjsasmll: 1.18s
            + gi3daw4h: 1.12s
            + xc3w2uzy: 1.19s
            + [5q4fx2pb]: 1.18s
            + [6m245fmk]: 103ms
Author

Andrej Falout couldn't attach his incident reports to this ticket because trac doesn't let you upload attachments larger than 1,000,000 bytes. I bunzip2'ed them and 7z'ed them and they came out half as big, so here they are.

Author

Attachment tahoeIncident.7z (477142 bytes) added

Author

Oh, and I reconfigured trac to allow attachments of up to 10 MB.

kmarkley86 commented 2010-01-17 01:25:39 +00:00
Owner

I'm continuing to hit this UncoordinatedWriteError very frequently on the production grid. I think it happens most often when creating directories. I can provide lots of additional incident reports if that would be useful.

This has made it almost impossible for me to run a 'tahoe backup' command to the production grid; should the priority of this ticket be raised?

Author

allmydata.com is continuing to repair servers and configuration issues on the allmydata.com prod grid, so that might be the way that your problem gets solved. However, at the very least your Tahoe-LAFS client is reporting something with a wrong error message. It may also be buggy in some way that leads to this problem.

One thing that you could do that would help is to try the same thing with a newer version of Tahoe-LAFS. Could you try installing the latest version http://allmydata.org/source/tahoe/tarballs/?C=M;O=D , per these install instructions: http://allmydata.org/source/tahoe/trunk/docs/install.html ?

kmarkley86 commented 2010-01-17 17:23:08 +00:00
Owner

I haven't seen one of these errors since upgrading from tahoe 1.5.0 to 1.5.0-r4160. Between that and general repair of the grid, the problem has gone away for me.


I glanced through a couple of these Incidents, and all the ones I looked at were the artifact we already fixed, in which DeadReferenceError is accidentally logged too severely (the one where a ServerFailure wrapped the DeadReferenceError, preventing the errback code from identifying it as a DeadReferenceError). This got fixed with the overhaul of the add-lease code.
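
For anyone reading this later, here is a rough Python sketch of how a wrapper exception can hide the underlying error from a type check. It is only an illustration, not the Tahoe-LAFS or foolscap code: both exception classes below are local stand-ins, and the `original` attribute, the two `severity_*` functions, and the "benign"/"severe" labels are hypothetical names made up for this example. The unwrapping shown is just one possible way to deal with it and is not necessarily what the add-lease overhaul did.

```python
class DeadReferenceError(Exception):
    """Local stand-in for a lost-connection error (not the real foolscap class)."""


class ServerFailure(Exception):
    """Hypothetical wrapper that keeps the original error in an attribute."""

    def __init__(self, original):
        Exception.__init__(self, "server failure: %r" % (original,))
        self.original = original


def severity_naive(err):
    # Looks only at the outermost type, so a wrapped
    # DeadReferenceError is not recognised and is treated as severe.
    if isinstance(err, DeadReferenceError):
        return "benign"
    return "severe"


def severity_unwrapping(err):
    # Peel off any wrapper layers first, then classify the inner error.
    while isinstance(err, ServerFailure):
        err = err.original
    return severity_naive(err)


wrapped = ServerFailure(DeadReferenceError("connection lost"))
print(severity_naive(wrapped))
print(severity_unwrapping(wrapped))
```

The naive check classifies the wrapped error by its outer type only, which is the kind of mismatch that would cause an expected disconnect to be logged as an incident.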

tahoe-lafs added the fixed label 2010-02-15 19:38:43 +00:00
davidsarah closed this issue 2010-02-15 19:38:43 +00:00
Reference: tahoe-lafs/trac-2024-07-25#899