remove foolscap 'reference'-token bug workaround in mutable publish #541

Open
opened 2008-11-25 20:25:11 +00:00 by warner · 13 comments

To close this ticket, remove the workaround which was needed only to avoid a Foolscap bug which is long since fixed.

------- old description:

Foolscap 0.2.5 had a bug in which an inbound constraint (say aTupleConstraint) wouldn't accept a 'reference' token that pointed at a tuple. This would occur when the same tuple was sent multiple times in a single request.

The mutable-file publish code, when it is creating a brand-new mutable file, and when there are fewer than N servers (so multiple shares are sent to the same server), can trigger this case, because the same assert-that-there-are-no-shares test vector (a static tuple of numbers and strings) is used once per share. Python seems to compile the tuple once, so foolscap sees the same object appear in the outbound arguments multiple times, and uses the 'reference' token to preserve this (unintentional) relationship.

The workaround is to construct this tuple at runtime, basically changing:

 for x in something:
   testv = (0, 1, 'eq', '')
   stuff.append(testv)

into:

 for x in something:
   testv = tuple([0, 1, 'eq', ''])
   stuff.append(testv)

The constraint bug was fixed in foolscap-0.2.6, but since it's an inbound-side bug, what really matters is the version of foolscap on the other end of the wire, so to interoperate with older versions of tahoe (which may be using older versions of foolscap), we must avoid sending the 'reference' token at all. So this workaround (which was removed in changeset:f7b4c45d4651b25b because tahoe now requires a newer version of foolscap) needs to be reinstated.

In addition, I've seen another problem that relates to this code. A few days ago I started noticing failures in the buildbot 'speed test', which runs trunk code against a statically configured 'perfnet' (with a client node on prodcs2 and four storage nodes plus introducer on tahoebs1). Four nodes and ten shares means multiple shares per node, so the testv-and-readv-and-writev call gets multiple shares, which means a 'reference' token.

The storage server appears to still have problems with the 'reference' token, even though it's running foolscap-0.3.1 . Sometimes I see a Violation that claims getObject(reftoken) returned a non-tuple. Sometimes I see a "dangling reference" BananaError, which indicates that getObject() failed to find the reference. The BananaError causes the connection to be dropped, so any remaining writev calls fail with a DeadReferenceError.

I can provoke this behavior with a local trunk tree against the storage nodes on tahoebs1. I haven't been able to reproduce this with local storage nodes (running foolscap-0.3.1). If I reintroduce the tuple([0,1,'eq','']) workaround, the problem seems to go away.

If I revert the recent versioning changes (which tries to call 'get_version' on the storage server, and falls back to a default value if that fails), then the problem seems to go away. The tahoebs1 perfnet nodes are a few months old (they are running r2901), so they don't have get_version(). So I'm suspecting that the Violation that gets raised somehow messes things up, such that getObject() remembers some bit of bogus state, so later getObject() calls either get a bad value or no value at all.

I'm going to commit the tuple([0,1,'eq','']) workaround, since we need it anyways. But at some point we need to come back to this and examine it further, because I still don't understand the new problem.

The first step will be to add instrumentation to the storage servers, which means first bouncing one of the tahoebs1 nodes and see if it still has the problem. (running the same code locally didn't experience the problem; the only difference I can think of is that the tahoebs1 nodes have been running for two months, and might have all experienced something in that time to mess them up now, whereas my local nodes are younger).

To close this ticket, remove the workaround which was needed only to avoid a Foolscap bug which is long since fixed. ------- old description: Foolscap 0.2.5 had a bug in which an inbound constraint (say a`TupleConstraint`) wouldn't accept a 'reference' token that pointed at a tuple. This would occur when the same tuple was sent multiple times in a single request. The mutable-file publish code, when it is creating a brand-new mutable file, and when there are fewer than N servers (so multiple shares are sent to the same server), can trigger this case, because the same assert-that-there-are-no-shares test vector (a static tuple of numbers and strings) is used once per share. Python seems to compile the tuple once, so foolscap sees the same object appear in the outbound arguments multiple times, and uses the 'reference' token to preserve this (unintentional) relationship. The workaround is to construct this tuple at runtime, basically changing: ``` for x in something: testv = (0, 1, 'eq', '') stuff.append(testv) ``` into: ``` for x in something: testv = tuple([0, 1, 'eq', '']) stuff.append(testv) ``` The constraint bug was fixed in foolscap-0.2.6, but since it's an inbound-side bug, what really matters is the version of foolscap on the other end of the wire, so to interoperate with older versions of tahoe (which may be using older versions of foolscap), we must avoid sending the 'reference' token at all. So this workaround (which was removed in changeset:f7b4c45d4651b25b because tahoe now requires a newer version of foolscap) needs to be reinstated. In addition, I've seen another problem that relates to this code. A few days ago I started noticing failures in the buildbot 'speed test', which runs trunk code against a statically configured 'perfnet' (with a client node on prodcs2 and four storage nodes plus introducer on tahoebs1). Four nodes and ten shares means multiple shares per node, so the testv-and-readv-and-writev call gets multiple shares, which means a 'reference' token. The storage server appears to still have problems with the 'reference' token, even though it's running foolscap-0.3.1 . Sometimes I see a Violation that claims getObject(reftoken) returned a non-tuple. Sometimes I see a "dangling reference" BananaError, which indicates that getObject() failed to find the reference. The BananaError causes the connection to be dropped, so any remaining writev calls fail with a DeadReferenceError. I can provoke this behavior with a local trunk tree against the storage nodes on tahoebs1. I haven't been able to reproduce this with local storage nodes (running foolscap-0.3.1). If I reintroduce the `tuple([0,1,'eq',''])` workaround, the problem seems to go away. If I revert the recent versioning changes (which tries to call 'get_version' on the storage server, and falls back to a default value if that fails), then the problem seems to go away. The tahoebs1 perfnet nodes are a few months old (they are running r2901), so they don't have `get_version()`. So I'm suspecting that the Violation that gets raised somehow messes things up, such that getObject() remembers some bit of bogus state, so later getObject() calls either get a bad value or no value at all. I'm going to commit the `tuple([0,1,'eq',''])` workaround, since we need it anyways. But at some point we need to come back to this and examine it further, because I still don't understand the new problem. The first step will be to add instrumentation to the storage servers, which means first bouncing one of the tahoebs1 nodes and see if it still has the problem. (running the same code locally didn't experience the problem; the only difference I can think of is that the tahoebs1 nodes have been running for two months, and might have all experienced something in that time to mess them up now, whereas my local nodes are younger).
warner added the
code-mutable
major
defect
1.2.0
labels 2008-11-25 20:25:11 +00:00
warner added this to the undecided milestone 2008-11-25 20:25:11 +00:00
Author

More notes on the second problem:

  • bouncing the storage server does not make the problem go away
  • adding instrumentation to TupleConstraint reveals that the "not a tuple" complaint is triggered by a list. Instead of getting a tuple, it got a list of [(0, sharedata)] (a list with one element, which is a tuple with two elements). That looks like a write vector.
  • in one test, that write vector was sent as object 79. A subsequent 'reference' token cited object 78. Maybe there's sometimes an off-by-one error?
  • further testing suggests that the setObject() counters are completely broken
More notes on the second problem: * bouncing the storage server does *not* make the problem go away * adding instrumentation to `TupleConstraint` reveals that the "not a tuple" complaint is triggered by a list. Instead of getting a tuple, it got a list of `[(0, sharedata)]` (a list with one element, which is a tuple with two elements). That looks like a write vector. * in one test, that write vector was sent as object 79. A subsequent 'reference' token cited object 78. Maybe there's sometimes an off-by-one error? * further testing suggests that the setObject() counters are completely broken
Author

Foolscap#104 created to track this one.

[Foolscap#104](http://foolscap.lothar.com/trac/ticket/104) created to track this one.
Author

Ok, I think I have Foolscap !#104 figured out: shared references that follow a Violation on the wire will be broken. I won't be likely to fix it until next week, since GC is more important, but by avoiding 'reference' tokens (i.e. avoiding shared references), Tahoe should be ok for now.

Ok, I think I have Foolscap !#104 figured out: shared references that follow a Violation on the wire will be broken. I won't be likely to fix it until next week, since GC is more important, but by avoiding 'reference' tokens (i.e. avoiding shared references), Tahoe should be ok for now.
davidsarah commented 2010-03-24 22:47:18 +00:00
Owner

Which versions of Tahoe used foolscap 0.2.5, 0.2.6 and 0.3.1?

Which versions of Tahoe used foolscap 0.2.5, 0.2.6 and 0.3.1?

Versions of Tahoe-LAFS don't require specific versions of foolscap, only "at least this version" versions of foolscap, so older versions of Tahoe-LAFS could have been paired with newer versions of foolscap. Here are the details:

  • current darcs head of Tahoe-LAFS requires foolscap >= 0.4.1, since: [r3874]source:_auto_deps.py@20090523011103-4233b-ced27385066c6567bbe14a27d6c8eaf3606f5c23 (2009-05-23)
  • before that it required foolscap >= 0.4.0, since [r3870]source:_auto_deps.py@20090522002100-4233b-fc521428d04b73205d0e17ea1c79a396b2515f89 (2009-05-22)
  • before that it required foolscap >= 0.3.1, since [r2968]source:_auto_deps.py@20080920183853-4233b-9250a0068558bb7bb949e09ef038dd87edac140b (2008-09-20)
  • before that it required foolscap >= 0.3.0, since [r2825]source:_auto_deps.py@20080805235828-4233b-631431b16f6bd6bd6a9aec1a4d57b35579067fe3 (2008-08-05)
  • before that it required foolscap >= 0.2.9, since [r2729]source:_auto_deps.py@20080703004029-e01fd-e0e2206aab2554d38eea96bbecd636c6dac9853a (2008-07-30)
  • before that it required foolscap >= 0.2.8, since [r2666]source:_auto_deps.py@20080609235504-e01fd-773ce8dd085d812a2aa200fa5b7c481693c57783 (2008-06-09)
  • before that it required foolscap >= 0.2.5, since [r2381]source:_auto_deps.py@20080403230646-80e44-c6fa8f63f5f2b5b599269af31942a1cefb81a498 (2008-04-03)
  • before that it required foolscap >= 0.2.4, since [r2062]source:_auto_deps.py@20080205212714-e01fd-396c1fd45ed80959d487e119f9aea7adac5ec33b (2008-02-05)
Versions of Tahoe-LAFS don't require specific versions of foolscap, only "at least this version" versions of foolscap, so older versions of Tahoe-LAFS could have been paired with newer versions of foolscap. Here are the details: * current darcs head of Tahoe-LAFS requires `foolscap >= 0.4.1`, since: [r3874]source:_auto_deps.py@20090523011103-4233b-ced27385066c6567bbe14a27d6c8eaf3606f5c23 (2009-05-23) * before that it required `foolscap >= 0.4.0`, since [r3870]source:_auto_deps.py@20090522002100-4233b-fc521428d04b73205d0e17ea1c79a396b2515f89 (2009-05-22) * before that it required `foolscap >= 0.3.1`, since [r2968]source:_auto_deps.py@20080920183853-4233b-9250a0068558bb7bb949e09ef038dd87edac140b (2008-09-20) * before that it required `foolscap >= 0.3.0`, since [r2825]source:_auto_deps.py@20080805235828-4233b-631431b16f6bd6bd6a9aec1a4d57b35579067fe3 (2008-08-05) * before that it required `foolscap >= 0.2.9`, since [r2729]source:_auto_deps.py@20080703004029-e01fd-e0e2206aab2554d38eea96bbecd636c6dac9853a (2008-07-30) * before that it required `foolscap >= 0.2.8`, since [r2666]source:_auto_deps.py@20080609235504-e01fd-773ce8dd085d812a2aa200fa5b7c481693c57783 (2008-06-09) * before that it required `foolscap >= 0.2.5`, since [r2381]source:_auto_deps.py@20080403230646-80e44-c6fa8f63f5f2b5b599269af31942a1cefb81a498 (2008-04-03) * before that it required `foolscap >= 0.2.4`, since [r2062]source:_auto_deps.py@20080205212714-e01fd-396c1fd45ed80959d487e119f9aea7adac5ec33b (2008-02-05)

Oh, but I didn't answer which versions of Tahoe-LAFS were involved. I get that information by looking at The Parade of Release Notes:

  • Tahoe-LAFS v1.6.1 [r4244]changeset:20100224065755-92b7f-31e18fd9d8572bd26fa42e71e2c55e82800b78d6 (2010-02-24)
  • Tahoe-LAFS v1.6.0 [r4220]changeset:20100202054832-92b7f-87f159e3dd9fd59737fae97e2821fcd65600564d (2010-02-02)
  • Tahoe-LAFS v1.5.0 [r4037]changeset:20090802031003-92b7f-289b017e81538b436d2aff2586888a91a0e67598 (2009-08-02)
  • Tahoe-LAFS v1.4.1 [r3853]changeset:20090414025430-92b7f-6e06ebbd16f80e68a6141d44fc25cc1d49726b22 (2009-04-14)
  • Tahoe-LAFS v1.3.0 [r3620]changeset:20090214000500-92b7f-26d07f519c69a3e086ee4599620f6d2c0c9b2fcd (2009-02-14)
  • Tahoe-LAFS v1.2.0 [r2789]changeset:20080722010403-92b7f-a74ec17a8a9cff0834ecc122ffa275280f563cea (2008-07-22)
  • Tahoe-LAFS v1.1.0 [r2693]changeset:20080611213202-92b7f-645b08f7b74916c35446b2298acb8e77e92ec3e0 (2008-06-11)
  • Tahoe-LAFS v1.0.0 [r2348]changeset:20080326012800-92b7f-521469f2bb6b2b3006718f9d387a1ee75183b161 (2008-03-26)
Oh, but I didn't answer which versions of Tahoe-LAFS were involved. I get that information by looking at [The Parade of Release Notes](wiki/Doc): * Tahoe-LAFS v1.6.1 [r4244]changeset:20100224065755-92b7f-31e18fd9d8572bd26fa42e71e2c55e82800b78d6 (2010-02-24) * Tahoe-LAFS v1.6.0 [r4220]changeset:20100202054832-92b7f-87f159e3dd9fd59737fae97e2821fcd65600564d (2010-02-02) * Tahoe-LAFS v1.5.0 [r4037]changeset:20090802031003-92b7f-289b017e81538b436d2aff2586888a91a0e67598 (2009-08-02) * Tahoe-LAFS v1.4.1 [r3853]changeset:20090414025430-92b7f-6e06ebbd16f80e68a6141d44fc25cc1d49726b22 (2009-04-14) * Tahoe-LAFS v1.3.0 [r3620]changeset:20090214000500-92b7f-26d07f519c69a3e086ee4599620f6d2c0c9b2fcd (2009-02-14) * Tahoe-LAFS v1.2.0 [r2789]changeset:20080722010403-92b7f-a74ec17a8a9cff0834ecc122ffa275280f563cea (2008-07-22) * Tahoe-LAFS v1.1.0 [r2693]changeset:20080611213202-92b7f-645b08f7b74916c35446b2298acb8e77e92ec3e0 (2008-06-11) * Tahoe-LAFS v1.0.0 [r2348]changeset:20080326012800-92b7f-521469f2bb6b2b3006718f9d387a1ee75183b161 (2008-03-26)

I guess we can safely remove the work-around now.

I guess we can safely remove the work-around now.

Well, let's leave it in for one more major release... :-)

Well, let's leave it in for *one* more major release... :-)
zooko modified the milestone from undecided to 1.10.0 2011-04-27 17:03:39 +00:00
davidsarah commented 2012-03-29 21:23:25 +00:00
Owner

(http://foolscap.lothar.com/trac/ticket/104) was fixed in foolscap 0.4, which was first required by Tahoe-LAFS r3870, released as Tahoe-LAFS v1.5.0.

source:relnotes.txt says:

Servers from this release can serve clients of all versions back to v1.0 and clients from this release can use servers of all versions back to v1.0.

which I believe is already incorrect, although I can't remember why not. Should we be making this claim when we don't test it?

(http://foolscap.lothar.com/trac/ticket/104) was fixed in foolscap 0.4, which was first required by Tahoe-LAFS r3870, released as Tahoe-LAFS v1.5.0. source:relnotes.txt says: > Servers from this release can serve clients of all versions back to v1.0 and clients from this release can use servers of all versions back to v1.0. which I believe is already incorrect, although I can't remember why not. Should we be making this claim when we don't test it?
tahoe-lafs modified the milestone from 1.11.0 to 1.12.0 2013-08-13 22:50:29 +00:00
zooko changed title from foolscap 'reference'-token bug workaround in mutable publish to remove foolscap 'reference'-token bug workaround in mutable publish 2015-08-25 17:41:42 +00:00
tahoe-lafs modified the milestone from 1.12.0 to 1.11.0 2015-08-25 20:20:06 +00:00
Author

Milestone renamed

Milestone renamed
warner modified the milestone from 1.11.0 to 1.12.0 2016-03-22 05:02:52 +00:00
Author

moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders

moving most tickets from 1.12 to 1.13 so we can release 1.12 with magic-folders
warner modified the milestone from 1.12.0 to 1.13.0 2016-06-28 18:20:37 +00:00

Moving open issues out of closed milestones.

Moving open issues out of closed milestones.
exarkun modified the milestone from 1.13.0 to 1.15.0 2020-06-30 14:45:13 +00:00
Owner

Ticket retargeted after milestone closed

Ticket retargeted after milestone closed
meejah modified the milestone from 1.15.0 to soon 2021-03-30 18:40:19 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
5 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#541
No description provided.