upload: tolerate lost or unacceptably slow servers #873

warner · 2009-12-27T04:50:22Z

warner commented

2009-12-27 04:50:22 +00:00

As with download in #287, we'd like upload to gracefully handle the event of servers silently disconnecting during the upload process. This is more difficult than for download, because we don't have the option of switching to a different server. Giving up on a server during upload means giving up on the whole share, which reduces reliability. "shares of happiness" is the current threshold used to decide how important this abandon-the-share event is.

To implement this, the upload code needs to use a timeout (to distinguish between slow-server and silently-lost-server) and we need some way to decide what that timeout should be.
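The idea above can be sketched as follows. This is a minimal illustration, not the Tahoe-LAFS upload code: it uses `asyncio` rather than the Twisted/foolscap stack Tahoe actually runs on, and the names `send_block`, `upload_segment`, `writers`, `timeout_s`, and `happy` are all hypothetical. Each per-server write gets a timeout; a server that misses it is treated as lost, its share is abandoned, and the upload only continues if the surviving share count still meets the happiness threshold.

```python
import asyncio


async def send_block(write_coro, timeout_s):
    """Return True if the write finished in time, False if the server
    is treated as lost (timed out or connection dropped)."""
    try:
        await asyncio.wait_for(write_coro, timeout=timeout_s)
        return True
    except (asyncio.TimeoutError, ConnectionError):
        return False


async def upload_segment(writers, timeout_s, happy):
    """Send one segment to every server in parallel.

    writers: dict mapping share number -> zero-argument callable that
             returns a coroutine performing the remote write (hypothetical).
    happy:   minimum number of surviving shares ("shares of happiness").
    Returns the set of share numbers whose servers are still usable.
    """
    sharenums = list(writers)
    results = await asyncio.gather(
        *(send_block(writers[n](), timeout_s) for n in sharenums)
    )
    alive = {n for n, ok in zip(sharenums, results) if ok}
    # Abandoning a share is tolerable only while enough shares survive.
    if len(alive) < happy:
        raise RuntimeError(
            "only %d shares still placeable, need %d" % (len(alive), happy)
        )
    return alive
```

The sketch leaves open the question the ticket raises: how to choose `timeout_s` so that a merely slow server is not mistaken for a lost one.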

warner added the

labels 2009-12-27 04:50:22 +00:00

warner added this to the undecided milestone 2009-12-27 04:50:22 +00:00

kmarkley86 commented

2009-12-29 21:36:27 +00:00

Attachment logs.tgz (24156 bytes) added

Contents of Kyle's .tahoe/logs directory after noticing two hung tahoe backup operations.

kmarkley86 commented

2009-12-29 21:44:25 +00:00

I noticed two 'tahoe backup' operations hang on my node, and attached my .tahoe/logs directory as logs.tgz. Here are my versions:

allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1

davidsarah commented

2009-12-30 00:03:22 +00:00

[Kyle wrote](http://allmydata.org/pipermail/tahoe-dev/2009-December/003437.html):

> My welcome page says "Connected to 89 of 105 known storage servers" but I don't know how to figure out which servers the hung operations are trying to contact. Here are the Storage Index values from the status pages, if they're worth anything:

* twfhdmkbsoidlnf3zijrcut7jm (hung incremental backup)
* dt5jrwb3ck2yt3tp7etuw6aply (hung backup of a large file; I can see sharemap 8 is missing)

> (I'm on the allmydata.com production grid.)

zooko modified the milestone from undecided to 1.8.0

2010-05-16 05:21:27 +00:00

zooko self-assigned this 2010-05-16 05:21:27 +00:00

zooko commented

2010-07-24 05:38:14 +00:00

It was impulsive of me to put this ticket into the 1.8 Milestone. This ticket will probably get fixed in a complete rewrite of the upload code at some point.

zooko modified the milestone from 1.8.0 to eventually

2010-07-24 05:38:14 +00:00

zooko changed title from ~~upload: tolerate lost or missing servers~~ to upload: tolerate lost or unacceptably slow servers

2010-07-29 04:53:25 +00:00

davidsarah commented

2011-04-21 14:52:28 +00:00

#1394 is a near-duplicate for the server selection stage of upload. There's a tension between this ticket and #362 ('enhance upload to search longer and more completely for shares'), which I'm not sure how to resolve.

zooko commented

2011-07-22 13:34:03 +00:00

Kevan: does #1382 affect this ticket? Also, if you know how to close tickets or clarify the relationships mentioned in comment:74190, that might be good.
