--verify option for tahoe backup #1331

tahoe-lafs · 2011-01-18T15:59:59Z

chrysn commented

2011-01-18 15:59:59 +00:00

tahoe backup will happily finish its operation even if the files that are to be backed up are not present on any node.

there are two parts to this problem:

* the backupdb seems not to track introducer urls (e.g. when one backs up the same directory to different clouds)
* caps the new version relies on are not verified

while the first could be un-fixable for all i know (that is, in case tahoe has no concept of "different clouds"), for the second one i suggest the following:

* have a --verify option that takes four values:
  * none -- rely on caps remembered in backupdb to be present
  * shallow -- check for the existence of every cap remembered from backupdb
  * deep -- do a deep check on all caps used in the backupdb
  * checksum -- calculate the data checksums of all files involved in re-using a cap, and compare to the reference cap (this requires equal convergence secrets)

the current implementation (i'm using 1.7.1, but the changelog doesn't mention anything relevant) does the equivalent of none, which is especially a problem together with the first problem mentioned above.

i'd suggest at least --verify=shallow to be default for backups; it has the advantage of keeping the O(1) network traffic advantage of the backupdb.

another switch should be created to configure whether verify misses are to be treated critical or should just be reported to stderr. (--verify-fatal or similar)
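
For illustration only, the proposed switches could be sketched as below. Both `--verify` and `--verify-fatal` are the hypothetical flags suggested in this ticket, not options of the real `tahoe backup` command, and `argparse` is used here just to keep the sketch self-contained (the name `report_misses` is likewise made up).

```python
import argparse
import sys

# The four levels proposed in this ticket (hypothetical, not a real tahoe option).
VERIFY_LEVELS = ("none", "shallow", "deep", "checksum")


def build_parser():
    """Build a parser for the two proposed switches."""
    parser = argparse.ArgumentParser(prog="backup-sketch")
    parser.add_argument("--verify", choices=VERIFY_LEVELS, default="shallow",
                        help="how thoroughly to check caps reused from the backupdb")
    parser.add_argument("--verify-fatal", action="store_true",
                        help="treat verify misses as errors instead of warnings")
    return parser


def report_misses(missing_caps, fatal, stream=sys.stderr):
    """Report every missing cap; return a non-zero exit code only when fatal."""
    for cap in missing_caps:
        print("verify miss: %s" % cap, file=stream)
    return 1 if (fatal and missing_caps) else 0
```

With these assumptions, `--verify=deep --verify-fatal` would parse to `verify == "deep"` and `verify_fatal == True`, and a single miss would turn into exit code 1.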


tahoe-lafs added the

labels 2011-01-18 15:59:59 +00:00

tahoe-lafs added this to the undecided milestone 2011-01-18 15:59:59 +00:00

tahoe-lafs added

code-frontend-cli

and removed

unknown

labels 2011-01-18 20:27:53 +00:00

warner commented

2011-01-29 21:50:07 +00:00

Yeah, those are good points.

We don't have a strong notion of "different clouds" yet. We've talked
about putting a "grid id" into each filecap (see #403), but that's a
deep problem, and touches on how we want people to deploy and join
grids, so it's not going to be solved right away. It might help to put a
copy of the introducer.furl (or maybe just its !TubID) into the
backupdb, and then do extra checking if it changes. We don't currently
have a good way to extract the introducer.furl from the webapi, so we
might need to add that.

It's not obvious from the docs, but the "tahoe backup" command does do
lightweight checking of the files it touches on a probabilistic basis:
source:docs/backupdb.rst and source:src/allmydata/scripts/backupdb.py
have some details. In short, each filecap will be checked at least once
every two months, and possibly once every month, on a randomized basis
to spread the load smoothly over multiple "tahoe backup" runs. If you do
a daily backup, about 3% of the files will be checked each time.
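
A minimal sketch of one way such a randomized schedule can work, assuming a check probability that ramps linearly from 0 at four weeks to 1 at eight weeks since the last check. The constant and function names are illustrative; the real logic lives in source:src/allmydata/scripts/backupdb.py and may differ.

```python
import random

DAY = 24 * 60 * 60
NO_CHECK_BEFORE = 4 * 7 * DAY     # assumed: never check in the first four weeks
ALWAYS_CHECK_AFTER = 8 * 7 * DAY  # assumed: always check after eight weeks


def check_probability(age_seconds):
    """Probability of checking a filecap whose last check is age_seconds old."""
    span = ALWAYS_CHECK_AFTER - NO_CHECK_BEFORE
    p = (age_seconds - NO_CHECK_BEFORE) / span
    return min(max(p, 0.0), 1.0)  # clamp to [0, 1]


def should_check(last_checked, now, rng=random.random):
    """Decide at random whether this backup run should check the filecap."""
    return rng() < check_probability(now - last_checked)
```

Under this assumption a filecap last checked six weeks ago has a 50% chance of being checked on the current run, which is how the load gets spread over many runs.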

This filecheck is the same as what you'd get with "tahoe check" or
"tahoe deep-check": it asks the connected storage servers whether they
have a share or not, and is satisfied if at least N distinct
shares are then found. If not, it re-uploads the file.

That said, it might be a good idea to improve this process, or add some
knobs to make for more stringent checking, like your various
--verify options. I'm not sure how to translate from the levels of
verification you describe to the facilities currently present in tahoe:

* none: "rely on caps remembered in backupdb to be present": this
  is equal to what "tahoe backup" does now for the first four weeks,
  before the maybe-check-a-file timer kicks in
* shallow: "check for the existence of every cap remembered from
  backupdb": the filecheck that "tahoe backup" does at least once every
  eight weeks will cover this. Each filecheck sends a message to every
  connected storage server (in parallel), so one round-trip-time each.
  We don't have anything lighter-weight than a filecheck right now
* deep: "do a deep check on all caps used in the backup db". I
  think this is equal to the regular file-check, since the backupdb
  stores an entry for every tahoe object referenced by the backup (so
  the "deep" aspect is redundant). So I think this is the same as
  --verify=shallow
* checksum: "calculate the data checksums of all files involved
  in re-using a cap, and compare to the reference cap". Hmm. A tahoe
  "file verify" starts with the filecap and makes sure all the shares
  match that (and requires downloading every share, so is N/k
  times as expensive as a normal download). A re-upload will recompute
  the storage index, believe any shares which exist for it, and upload
  new shares when they don't. I suspect that a combination of
  --ignore-timestamps (which will force a re-upload of each file)
  and a file-verify operation would cover this.

Hm, here's an easy idea: when doing a backup, the very first time we
encounter a file that is already in the backupdb (but not on later files
in that backup run), do an immediate full verify on it (download all
shares and check them against the filecap). If that fails, turn on "do a
filecheck for every file" mode: if we're connected to the wrong grid or
using the wrong client node or something, we'll always hit this. And
filechecks, while not free, are much cheaper than a full fileverify or
re-upload.
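
The idea above can be sketched roughly as follows. This is a simplified illustration, not tahoe code: `backupdb` is a plain dict mapping paths to filecaps, `full_verify` is a hypothetical callable, and the choice to re-upload the failed canary file is an assumption. The normal probabilistic filecheck is left out for brevity.

```python
def plan_checks(paths, backupdb, full_verify):
    """Yield (path, action) pairs for one backup run.

    backupdb    -- dict: path -> filecap (stand-in for the real backupdb)
    full_verify -- callable(filecap) -> bool, True if all shares verify
    """
    canary_done = False       # have we verified the first known file yet?
    check_everything = False  # set when the canary verify fails
    for path in paths:
        cap = backupdb.get(path)
        if cap is None:
            yield path, "upload"  # not in the backupdb: plain upload
            continue
        if not canary_done:
            canary_done = True
            if full_verify(cap):
                yield path, "reuse"
            else:
                check_everything = True  # probably wrong grid or client node
                yield path, "upload"     # assumption: re-upload the canary
            continue
        yield path, ("filecheck" if check_everything else "reuse")
```

Only the first already-known file pays for a full verify; every later file costs at most a filecheck.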

If we add the introducer.furl field to the backupdb, then the rule
should simply be that we ignore any backupdb entries that are associated
with the wrong introducer. Alternatively, we could force a file-check on
any entry that had the wrong introducer, which would save time in cases
when e.g. the introducer had merely moved to a new IP address, or when
the introducer changed but all the storage servers remained. However,
that would slow down the case where the client was now on a completely
different grid, since it would do a pointless filecheck for each one
before uploading.
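
The two alternatives can be written down as a small policy function. This is a sketch under the assumption that each backupdb entry stores the introducer furl it was created with, which the backupdb does not currently do; the function name and the policy names `ignore` and `check` are made up for illustration.

```python
def entry_action(entry_furl, current_furl, policy="ignore"):
    """Decide what to do with a backupdb entry, given its recorded introducer.

    policy "ignore": treat entries from another introducer as absent.
    policy "check":  force a filecheck on entries from another introducer.
    """
    if entry_furl == current_furl:
        return "reuse"      # same introducer: trust the entry as usual
    if policy == "ignore":
        return "upload"     # wrong introducer: behave as if never uploaded
    if policy == "check":
        return "filecheck"  # wrong introducer: ask the servers first
    raise ValueError("unknown policy: %r" % (policy,))
```

The `check` policy is cheaper when only the introducer moved, while `ignore` avoids one wasted filecheck per file when the client really is on a different grid.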

> i'd suggest at least --verify=shallow to be default for backups; it
> has the advantage of keeping the O(1) network traffic advantage of the
> backupdb.

To "check for the existence" of a cap, we have to talk to a bunch of
storage servers (there's no local memory of the cap having been
uploaded, except for the backupdb). So this sort of checking actually
costs O(N) in the number of files (actually
O(numfiles*numservers)).
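
To put rough numbers on the two cost figures in this comment (one message per file per server for a filecheck, and N/k times a normal download for a full verify), here is a small worked sketch. The function names are made up, and the figures in the note below are arbitrary examples rather than measurements.

```python
def filecheck_messages(numfiles, numservers):
    """Messages sent when every file is checked against every server."""
    return numfiles * numservers


def verify_download_factor(k, n):
    """Data fetched by a full verify, relative to a normal k-share download."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n / k
```

For example, checking 10,000 files against 10 connected servers would take 100,000 messages, and with a 3-of-10 encoding a full verify fetches about 3.3 times the data of a normal download.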

> another switch should be created to configure whether verify misses
> are to be treated critical or should just be reported to stderr.
> (--verify-fatal or similar)

In the current code, filecheck failures trigger a new upload, so backup
always succeeds if the files can be uploaded to the current grid. But it
might be interesting to have a flag that means "I expect that most of my
data should already be in this grid: please tell me (by failing) if I'm
wrong".


warner added

unknown

and removed

code-frontend-cli

labels 2011-01-29 21:50:07 +00:00

warner commented

2011-02-03 18:55:18 +00:00

argh, I did not touch the Component button, I don't know why my comment caused the component to get cleared.


warner added

code-frontend-cli

and removed