upload: tolerate lost or unacceptably slow servers #873

Open
opened 2009-12-27 04:50:22 +00:00 by warner · 6 comments

As with download in #287, we'd like upload to gracefully handle the event of servers silently disconnecting during the upload process. This is more difficult than for download, because we don't have the option of switching to a different server. Giving up on a server during upload means giving up on the whole share, which reduces reliability. "shares of happiness" is the current threshold used to decide how important this abandon-the-share event is.

To implement this, the upload code needs to use a timeout (to distinguish between slow-server and silently-lost-server) and we need some way to decide what that timeout should be.
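
A minimal sketch of the idea, not Tahoe's actual upload code: each share records when its server last responded, shares whose server has been silent longer than a timeout are abandoned, and the upload fails once fewer live shares remain than the happiness threshold. All names here (`ShareUpload`, `pick_timeout`, `drop_stalled_shares`, `UploadUnhappyError`) are hypothetical. The timeout heuristic (a multiple of the median observed response time, with a floor) is just one assumed way of choosing the value, and the numbers are illustrative.

```python
import statistics
from dataclasses import dataclass


class UploadUnhappyError(Exception):
    """Raised when too few shares remain to meet the happiness threshold."""


@dataclass
class ShareUpload:
    """Hypothetical record of one share being pushed to one server."""
    sharenum: int
    last_response: float  # timestamp of the server's most recent reply
    abandoned: bool = False


def pick_timeout(response_times, factor=10.0, floor=30.0):
    """Assumed heuristic: a multiple of the median response time, never below floor."""
    if not response_times:
        return floor
    return max(floor, factor * statistics.median(response_times))


def drop_stalled_shares(uploads, now, timeout, happy):
    """Abandon shares whose server has been silent too long, then check happiness."""
    for upload in uploads:
        if not upload.abandoned and now - upload.last_response > timeout:
            # A slow server and a silently lost server look the same from here;
            # the timeout is the only thing that tells them apart.
            upload.abandoned = True
    live = [u for u in uploads if not u.abandoned]
    if len(live) < happy:
        raise UploadUnhappyError(
            "only %d shares still in progress, need %d" % (len(live), happy)
        )
    return live


# Ten shares, one of which has not heard from its server for 90 seconds.
uploads = [ShareUpload(n, last_response=100.0) for n in range(10)]
uploads[8].last_response = 20.0
timeout = pick_timeout([0.5, 1.0, 2.0])
live = drop_stalled_shares(uploads, now=110.0, timeout=timeout, happy=7)
```

In this example the stalled share is abandoned and the upload continues with the remaining nine, which is still above the illustrative threshold of 7.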
warner added the code-encoding, major, defect, 1.5.0 labels 2009-12-27 04:50:22 +00:00
warner added this to the undecided milestone 2009-12-27 04:50:22 +00:00
kmarkley86 commented 2009-12-29 21:36:27 +00:00
Owner

Attachment logs.tgz (24156 bytes) added

Contents of Kyle's .tahoe/logs directory after noticing two hung tahoe backup operations.
kmarkley86 commented 2009-12-29 21:44:25 +00:00
Owner

I noticed two 'tahoe backup' operations hang on my node, and attached my .tahoe/logs directory as logs.tgz. Here are my versions:

allmydata-tahoe: 1.5.0, foolscap: 0.4.2, pycryptopp: 0.5.17, zfec: 1.4.5, Twisted: 8.2.0, Nevow: 0.9.33-r17222, zope.interface: 3.5.2, python: 2.6.2, platform: OpenBSD-4.6-amd64-Genuine_Intel-R-_CPU_000_@_2.93GHz-64bit-ELF, sqlite: 3.6.13, simplejson: 2.0.9, argparse: 0.9.1, pyOpenSSL: 0.9, pyutil: 1.3.34, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.4.1
davidsarah commented 2009-12-30 00:03:22 +00:00
Owner

Kyle wrote (http://allmydata.org/pipermail/tahoe-dev/2009-December/003437.html):

My welcome page says "Connected to 89 of 105 known storage servers" but I don't know how to figure out which servers the hung operations are trying to contact. Here are the Storage Index values from the status pages, if they're worth anything:

  • twfhdmkbsoidlnf3zijrcut7jm (hung incremental backup)
  • dt5jrwb3ck2yt3tp7etuw6aply (hung backup of a large file; I can see sharemap 8 is missing)

(I'm on the allmydata.com production grid.)
zooko modified the milestone from undecided to 1.8.0 2010-05-16 05:21:27 +00:00
zooko self-assigned this 2010-05-16 05:21:27 +00:00

It was impulsive of me to put this ticket into the 1.8 Milestone. This ticket will probably get fixed in a complete rewrite of the upload code at some point.
zooko modified the milestone from 1.8.0 to eventually 2010-07-24 05:38:14 +00:00
zooko changed title from upload: tolerate lost or missing servers to upload: tolerate lost or unacceptably slow servers 2010-07-29 04:53:25 +00:00
davidsarah commented 2011-04-21 14:52:28 +00:00
Owner

#1394 is a near-duplicate for the server selection stage of upload. There's a tension between this ticket and #362 ('enhance upload to search longer and more completely for shares'), which I'm not sure how to resolve.

Kevan: does #1382 affect this ticket? Also, if you know how to close tickets or clarify the relationships mentioned in comment:74190, that might be good.
Reference: tahoe-lafs/trac-2024-07-25#873