clean up and optimize spans #1196

Closed
opened 2010-09-10 19:36:25 +00:00 by zooko · 1 comment

Several improvements were noted by David-Sarah (and maybe a couple by me) on ticket #798. Here is a copy of comment:72911:

Reviewing changeset:cbcb728e7ea0031d:

  • the (start, length) form of the SimpleSpans constructor is not used outside test code (and the test code can be changed to pass a [(start, length)] array). Removing this would slightly simplify the constructor and avoid a possibly error-prone overloading.

  • in the Spans class comment, ", frequently used to represent .newsrc contents" is out-of-context and not needed.

  • in the _check method of Spans, if assertions are switched on then the self._spans array is re-sorted in order to check whether it is ordered. This is unnecessary: if you add an assert length > 0, length in the loop, then the loop will be checking a condition that is stronger than the array being ordered, given that the starts and lengths are numbers. (The sorting actually takes O(n) time rather than O(n log n) on each call to _check, because Timsort will detect the ordered input, but it's still unnecessary overhead.)

  • the assert statements should include the variables they use in the exception message, e.g. assert start > prev_end, (start, prev_end).

  • "overlap(s_start, s_length, start, length) or adjacent(s_start, s_length, start, length)" is equivalent to "overlap(s_start, s_length, start-1, length+2)".

  • in the only other use of adjacent (in DataSpans.add), only the start0 < start1 case should be able to occur. Inline and simplify.

  • in the loop over enumerate(self._spans), you could exit early when s_start > start+length. At that point you know where to insert the (start, length) span without having to re-sort.

  • a Spans object behaves somewhat like a set of the elements in all of its spans, but the *contains* and *iter* methods are not consistent with that (instead viewing it as a set of (start, length) pairs). I realize this may allow for more convenient use of in and iteration, but it should at least be documented.

  • _check and assert_invariants do similar things; give them the same name.

  • DataSpans._dump is poorly named.

  • DataSpans.assert_invariants should check that none of the data strings are zero-length.

  • is it intentional that DataSpans.add calls self.assert_invariants() but remove (and pop, although that's much simpler) don't?

  • if s_start <= start < s_end: I find this Python construct too clever by half. Anyway, at this point s_start <= start is always true (otherwise we won't get past the first continue), so I would write this as

assert s_start <= start, (s_start, start)
if start < s_end:
  • Perhaps rename s_* to old_*.

  • DataSpans.add: if I understand correctly, case A also covers:

    OLD      OLD    OLD    OLD
NEW       NEW      NEWW   NEEWW

This isn't immediately clear from the comment. Depicting it as

    OLD
NEW.....

might help. Also, use uppercase consistently for the cases.

  • suffix_len in case D has a different meaning to suffix_len in case E. Maybe rename it to replace_len for case D.

Tests:

  • The Spans class is tested by ByteSpans, but it just stores spans of integers, not necessarily byte offsets. I would suggest s/ByteSpans/TestSpans/ and s/StringSpans/TestDataSpans/.

  • I could be wrong, but I don't think the deterministic tests are covering all cases in add and remove. I'd prefer to see those tests have full coverage rather than relying on the randomized tests to make up the gap.

Several improvements were noted by David-Sarah (and maybe a couple by me) on ticket #798. Here is a copy of [comment:72911](/tahoe-lafs/trac-2024-07-25/issues/798#issuecomment-72911): Reviewing changeset:cbcb728e7ea0031d: * the `(start, length)` form of the `SimpleSpans` constructor is not used outside test code (and the test code can be changed to pass a `[(start, length)]` array). Removing this would slightly simplify the constructor and avoid a possibly error-prone overloading. * in the `Spans` class comment, ", frequently used to represent .newsrc contents" is out-of-context and not needed. * in the `_check` method of `Spans`, if assertions are switched on then the `self._spans` array is re-sorted in order to check whether it is ordered. This is unnecessary: if you add an `assert length > 0, length` in the loop, then the loop will be checking a condition that is stronger than the array being ordered, given that the starts and lengths are numbers. (The sorting actually takes O(n) time rather than O(n log n) on each call to `_check`, because [Timsort](http://bugs.python.org/file4451/timsort.txt) will detect the ordered input, but it's still unnecessary overhead.) * the assert statements should include the variables they use in the exception message, e.g. `assert start > prev_end, (start, prev_end)`. * "`overlap(s_start, s_length, start, length) or adjacent(s_start, s_length, start, length)`" is equivalent to "`overlap(s_start, s_length, start-1, length+2)`". * in the only other use of `adjacent` (in `DataSpans.add`), only the `start0 < start1` case should be able to occur. Inline and simplify. * in the loop over `enumerate(self._spans)`, you could exit early when `s_start > start+length`. At that point you know where to insert the `(start, length)` span without having to re-sort. * a `Spans` object behaves somewhat like a set of the elements in all of its spans, but the `*contains*` and `*iter*` methods are not consistent with that (instead viewing it as a set of `(start, length)` pairs). I realize this may allow for more convenient use of `in` and iteration, but it should at least be documented. * `_check` and `assert_invariants` do similar things; give them the same name. * `DataSpans._dump` is poorly named. * `DataSpans.assert_invariants` should check that none of the data strings are zero-length. * is it intentional that `DataSpans.add` calls `self.assert_invariants()` but `remove` (and `pop`, although that's much simpler) don't? * `if s_start <= start < s_end:` I find this Python construct too clever by half. Anyway, at this point `s_start <= start` is always true (otherwise we won't get past the first `continue`), so I would write this as ``` assert s_start <= start, (s_start, start) if start < s_end: ``` * Perhaps rename `s_*` to `old_*`. * `DataSpans.add`: if I understand correctly, case A also covers: ``` OLD OLD OLD OLD NEW NEW NEWW NEEWW ``` This isn't immediately clear from the comment. Depicting it as ``` OLD NEW..... ``` might help. Also, use uppercase consistently for the cases. * `suffix_len` in case D has a different meaning to `suffix_len` in case E. Maybe rename it to `replace_len` for case D. Tests: * The `Spans` class is tested by `ByteSpans`, but it just stores spans of integers, not necessarily byte offsets. I would suggest `s/ByteSpans/TestSpans/` and `s/StringSpans/TestDataSpans/`. * I could be wrong, but I don't think the deterministic tests are covering all cases in `add` and `remove`. I'd prefer to see those tests have full coverage rather than relying on the randomized tests to make up the gap.
zooko added the
unknown
major
enhancement
1.8β
labels 2010-09-10 19:36:25 +00:00
zooko added this to the 1.9.0 milestone 2010-09-10 19:36:25 +00:00
davidsarah commented 2010-09-11 00:18:24 +00:00
Owner

Duplicate of #1182.

Duplicate of #1182.
tahoe-lafs added the
duplicate
label 2010-09-11 00:18:24 +00:00
davidsarah closed this issue 2010-09-11 00:18:24 +00:00
zooko modified the milestone from 1.9.0 to eventually 2011-07-27 18:20:56 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#1196
No description provided.