mutable publish sends queries to servers that have already been asked #548


warner · 2008-12-05T07:13:36Z

warner commented

2008-12-05 07:13:36 +00:00

Another problem that appeared in #546 is in the mapupdate(MODE_WRITE) code,
when run on a servermap that has already been updated once. This occurs when
the mutable file's `modify` method is used and the first attempt fails
because of an UncoordinatedWriteError. This triggers a retry, in which the
servermap is updated again, the (new) current version is retrieved, the
modifier function is applied again, and (if anything changed) a new publish is
performed.
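
The retry path described above can be sketched as a small read-modify-write loop. This is a hypothetical model, not Tahoe's actual `modify` implementation: `update_servermap`, `retrieve` and `publish` are stand-in callables, the servermap is a plain dict, and `UncoordinatedWriteError` is defined locally. What it is meant to show is that one servermap object survives from the failed attempt into the retry.

```python
class UncoordinatedWriteError(Exception):
    """Local stand-in for the error named in this ticket."""


def modify(modifier, update_servermap, retrieve, publish, max_attempts=3):
    """Hypothetical read-modify-write loop, not Tahoe's actual implementation.

    One servermap dict is created up front and reused on every attempt, so
    a retry starts from whatever the earlier mapupdate and publish left in it.
    """
    servermap = {}
    for _ in range(max_attempts):
        update_servermap(servermap)           # mapupdate(MODE_WRITE)
        old_contents = retrieve(servermap)    # fetch the current version
        new_contents = modifier(old_contents)
        if new_contents == old_contents:
            return old_contents               # nothing changed: skip publish
        try:
            publish(servermap, new_contents)
            return new_contents
        except UncoordinatedWriteError:
            continue                          # retry with the same servermap
    raise UncoordinatedWriteError("gave up after %d attempts" % max_attempts)
```

A stub `update_servermap` that records `len(servermap)` sees 0 on the first attempt and a non-zero size on the retry, which is the situation this ticket is about.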

When this happens, the servermap is not empty: it already holds a bunch of
shares, learned either from the previous mapupdate or from the responses to
the publish write requests.

The mapupdate code starts by sending out N queries to the "must query"
servers: those which we already know have a share of some sort, or which
we've queried in the past. These come back, and we get a boundary map of
"1111111111" (every queried server holds a share, so no edge is visible
yet). To find the real edge, we must send out more queries (hoping to get a
map of "1111111111000").
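
A boundary map like these can be modeled as a string of "1"/"0" flags, one per answered server in permuted order. The sketch below uses one simple rule, assuming a hypothetical threshold `epsilon=3` chosen to match the trailing "000" in the hoped-for map: the edge is found at the first "1" that is immediately followed by `epsilon` consecutive "0"s. Whether Tahoe's servermap code uses exactly this rule is an assumption, and `find_boundary` is an illustrative name, not a Tahoe function.

```python
def find_boundary(bits, epsilon=3):
    """Return the index of the first "1" that is immediately followed by
    `epsilon` consecutive "0"s, or None if no such edge has been seen yet.

    `bits` lists answered servers in permuted order: "1" means the server
    reported a share, "0" means it did not. Hypothetical rule for illustration.
    """
    seen_share = False
    zeros = 0
    for i, bit in enumerate(bits):
        if bit == "1":
            seen_share = True
            zeros = 0          # a share resets the run of empty servers
        elif seen_share:
            zeros += 1
            if zeros >= epsilon:
                return i - epsilon
    return None


print(find_boundary("1111111111"))     # None: no edge visible yet
print(find_boundary("1111111111000"))  # 9: edge after the tenth server
print(find_boundary("001000"))         # 2: a premature "1000" edge
```

With this rule, the "001000" map recorded in the notes (cfb7 with sh0, then kuzy, 6y7v and ehnf without shares) already counts as a boundary, even though 5xry and 6j2m further along the permuted list hold sh1 and sh3.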

The bug is that the code sends the next batch of queries to the same servers
it has already asked. It looks like the new queries are chosen without
consulting the list of servers to which queries have already been sent. I
think this is because those first queries were sent to the servers on the
must_query list.
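
One way to avoid this is to record every server that has been sent a query, including the must_query servers, and skip those when choosing the next batch. A minimal sketch, assuming a hypothetical `next_batch` helper and a plain set of already-queried server IDs (Tahoe's real bookkeeping may differ). It uses the server-ID prefixes from the notes, assuming the order on the second rx line is the permuted order; its first five entries match the "first 5 in permuted order" in the log.

```python
def next_batch(permuted_servers, already_queried, batch_size):
    """Return up to `batch_size` servers, in permuted order, that have not yet
    been sent a query. Hypothetical helper, not Tahoe's API."""
    batch = []
    for server in permuted_servers:
        if server in already_queried:
            continue  # never re-ask a server we already queried
        batch.append(server)
        if len(batch) == batch_size:
            break
    return batch


# Server-ID prefixes from the notes, assuming the rx order is the permuted order.
permuted = ["pfav", "mgq3", "cfb7", "kuzy", "6y7v", "ehnf",
            "5xry", "b3yc", "qau2", "6j2m", "7vi2", "bc5x"]
# The 10 publish answerers, queried first as "must query" servers.
must_query = ["cfb7", "5xry", "pfav", "6j2m", "mgq3",
              "kuzy", "6y7v", "ehnf", "b3yc", "qau2"]

# Buggy behavior: the must-query sends were never recorded.
print(next_batch(permuted, set(), 5))            # ['pfav', 'mgq3', 'cfb7', 'kuzy', '6y7v']
# Intended behavior: every server that was sent a query is recorded.
print(next_batch(permuted, set(must_query), 5))  # ['7vi2', 'bc5x']
```

With nothing recorded, the batch is exactly the "pfav mgq3 cfb7 kuzy 6y7v" list from the log; with the ten must-query servers recorded, only 7vi2 and bc5x remain among the servers the notes mention.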

My cryptic notes:

```
si=njpk4lit4ns3yj7xmgszheh62q
tahoe rm testgrid:recentdir/recent.cd5bb67746f0c3538175c768456c37f3   -> 3   [si=njpk], incident
 parent list [njpk4]
  mapupdate(MODE_READ)  e4515
  retrieve  seq2 e4593  sh0@cfb7, sh1@5xry, sh3@6j2m
 parent modify -> read, write
  mapupdate(MODE_WRITE) e4634  false-boundary
   tx: ehnf mgq3 kuzy cfb7 pfav 6y7v,  5xry b3yc qau2 6j2m 7vi2 bc5x
   rx: kuzy 6y7v mgq3 ehnf cfb7(sh0) pfav b3yc 5xry(sh1) 6j2m(sh3), 001000, boundary??
    pfav mgq3 cfb7(sh0) kuzy 6y7v ehnf 5xry(sh1) b3yc qau2(?) 6j2m(sh3) | 7vi2(?) bc5x(?)
    BROKEN: why did this count as a boundary?
    oh, [cfb7(sh0) kuzy 6y7v ehnf] = 1000
  retrieve seq2 e4720   sh0@cfb7, sh1@5xry, sh3@6j2m
  publish seq3 e4763
    sh0 to [cfb7f3lh], sh1 to [5xryfgeq], sh2 to [pfavfmv3], sh3 to [6j2mb464],
    sh4 to [mgq3xx3t], sh5 to [kuzya6zx], sh6 to [6y7vpksf], sh7 to [ehnfmjtc],
    sh8 to [b3yclx4f], sh9 to [qau2ui2a]
   qua2 has surprising sh2
  retry
   mapupdate(MODE_WRITE) e4832
    sends 10 queries to the publish answerers
    sends 5 queries, to servers already asked: pfav mgq3 cfb7 kuzy 6y7v (first 5 in permuted order)
    log ends
```


warner added the

labels 2008-12-05 07:13:36 +00:00

warner added this to the undecided milestone 2008-12-05 07:13:36 +00:00

tahoe-lafs modified the milestone from undecided to 1.7.0

2010-03-25 01:16:47 +00:00

zooko commented

2010-05-27 22:06:10 +00:00

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

zooko modified the milestone from 1.7.0 to 1.8.0

2010-05-27 22:06:10 +00:00

tahoe-lafs modified the milestone from 1.8.0 to 1.9.0

2010-08-10 04:15:33 +00:00

tahoe-lafs modified the milestone from 1.9.0 to soon

2011-07-16 21:01:29 +00:00

zooko commented

2011-07-16 21:01:58 +00:00

This appears to be an efficiency improvement and not a correctness issue.

zooko modified the milestone from soon to 1.9.0

2011-07-16 21:01:58 +00:00

tahoe-lafs modified the milestone from 1.9.0 to soon