don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes? #150
As mentioned on the Performance page, ext3 can store no more than 32,000 entries in a single directory. This ticket is to work around such limitations by using a few bits of the storage index to choose a subdirectory -- something like using the z-base-32 encoding of the first 14 bits of the storage index as the name of an intermediate directory under storage/shares/, with the bucket directories nested inside it.
That would give 2^14 = 16,384 intermediate directories, and would allow us to store 16,384 × 32,000 = 524,288,000 shares.
(Yay -- there's a use for z-base-32's bit-length!)
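As a rough sketch of the proposed layout (illustrative names only, not the actual allmydata.storage code; padding the 14-bit prefix with a zero bit to fill three z-base-32 characters is an assumption):
{{{
#!python
# Illustrative sketch only -- not the real storage.py. Take the first 14 bits of
# the (uniformly random) storage index, encode them as a 3-character z-base-32
# name, and nest the bucket directory under that intermediate directory.
import os

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def intermediate_dir_name(storage_index):
    """3-char name for the first 14 bits of the SI (padded with a zero bit)."""
    fourteen = (storage_index[0] << 6) | (storage_index[1] >> 2)
    fifteen = fourteen << 1          # pad to three 5-bit characters
    return "".join(ZBASE32[(fifteen >> shift) & 0x1f] for shift in (10, 5, 0))

def bucket_dir(basedir, storage_index, si_string):
    # BASEDIR/storage/shares/<3-char prefix>/<full z-base-32 SI>/<share files>
    return os.path.join(basedir, "storage", "shares",
                        intermediate_dir_name(storage_index), si_string)
}}}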
For small-to-medium-sized stores, it also adds an extra 4 KiB per share (for the extra directory), but I'm ok with that. This brings the per-share overhead (assuming one share per bucket, i.e. one share per server) from 5486 B to 9582 B: 1390 bytes of hashes and lease info, 4096 bytes for the bucket directory, and an extra 4096 bytes for the three-character intermediate directory that this change would add.
oh, and we should make sure that a call to allocate_buckets() that would fail because of one of these limits fails gracefully, by telling the caller that we won't accept their lease request. We might already do this; I'm not sure. Basically it's just a matter of getting the 'except EnvironmentError' catches right.
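A hedged sketch of what "failing gracefully" could look like (hypothetical helper, not the real allocate_buckets() code):
{{{
#!python
# Hypothetical helper, not the real allocate_buckets(): the point is only that a
# filesystem limit (e.g. ext3's 32,000-entry cap, surfacing as EnvironmentError
# from makedirs) becomes a polite "we decline this lease" instead of a crash.
import errno
import os

def try_to_allocate_bucket(bucket_dir):
    try:
        os.makedirs(bucket_dir)
        return True                  # accepted: we will hold this share
    except EnvironmentError as e:
        if e.errno == errno.EEXIST:
            return True              # we already have this bucket directory
        return False                 # e.g. EMLINK (dir full) or ENOSPC: decline
}}}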
We're focusing on an imminent v0.7.0 (see the roadmap), which will hopefully include [#197 Small Distributed Mutable Files] and also a fix for [#199 bad SHA-256]. So I'm bumping less urgent tickets to v0.7.1.
We need to choose a manageable subset of desired improvements for [http://allmydata.org/trac/tahoe/milestone/0.7.1 v0.7.1], scheduled for two weeks hence, so I'm bumping this one into v0.7.2, scheduled for mid-December.
This is important for scaling up, and it is easy, and I'll do it. Bringing it forward to Milestone 0.7.1.
cool. If it goes into 0.7.1, how about we make it check the old location too, so that 0.7.0-generated shares are readable by a 0.7.1 node? We could get rid of this extra check when 0.8.0 is released, or maybe give people a little tool to migrate their shares to the new locations.
oh, better yet, if we put the shares in a slightly different place, then we could test for the existence of the old directory at startup, and if it's there, set a flag, which will cause subsequent lookups to check both the old (flat) place and the new (nested) place.
Or do automatic migration at startup. I'm not super-fond of this, both because for a very large store it could take quite a while (although we don't have any very large stores yet), and because for some reason I'm uneasy about automatically moving things around like this.
Or put the nested directories in the same place as the flat shares went (BASEDIR/storage/shares/), but do a quick walk at startup to see if there are any actual shares there (bucket directories with the full-storage-index-sized name). If so, set the flag to look for the flat place in addition to the nested place. With this approach, the "migration tool" would just exist to speed up share lookup slightly (one os.stat instead of two), but otherwise everything would Just Work.
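A minimal sketch of that lookup order (hypothetical names, not the real storage.py code): prefer the nested location, and consult the flat location only when the startup walk found flat-layout shares and set the flag.
{{{
#!python
# Hypothetical lookup helper, not the real storage.py code.
import os

def find_bucket_dir(shares_dir, intermediate_name, si_string, have_flat_shares):
    nested = os.path.join(shares_dir, intermediate_name, si_string)  # new layout
    if os.path.isdir(nested):
        return nested
    if have_flat_shares:                          # flag set by the startup walk
        flat = os.path.join(shares_dir, si_string)          # old flat layout
        if os.path.isdir(flat):
            return flat
    return None
}}}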
OTOH, the code would be simpler with just a single approach, and if we write up a migration tool at the same time, we can fix all of the 12 storage servers we've currently got in half an hour, and be done with it.
Attachment convertshares.py (860 bytes) added
Yes, let's manage the 0.8.0 storage code (source:src/allmydata/storage.py) and the upgrader tool separately. Here's the upgrader tool attached to this ticket. It assumes that the current working directory is the storage/shares directory. It should work even if there is an incomplete upgrade (such as if an earlier run of the upgrade tool died), and even if there is a tahoe node currently running and using this directory.
Fixed by changeset:b80cfeb186860ab6.
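The attached convertshares.py is the authoritative tool; purely as an illustration of the general shape of such a migration (assumed names, and a simplified prefix derivation), something like:
{{{
#!python
# Illustrative only -- NOT the attached convertshares.py. Walk the flat
# storage/shares/ directory and move each full-length bucket directory into its
# intermediate prefix directory; already-moved buckets are skipped, so it is
# safe to re-run after an interrupted upgrade.
import os

SHARES_DIR = "."      # assumes cwd is BASEDIR/storage/shares, as described above
FULL_SI_LEN = 26      # a 128-bit storage index is 26 z-base-32 characters

def prefix_for(si_name):
    # stand-in for however the node derives the intermediate directory name
    return si_name[:3]

for name in os.listdir(SHARES_DIR):
    if len(name) != FULL_SI_LEN:
        continue                              # not a flat-layout bucket directory
    prefix_dir = os.path.join(SHARES_DIR, prefix_for(name))
    if not os.path.isdir(prefix_dir):
        os.makedirs(prefix_dir)
    target = os.path.join(prefix_dir, name)
    if not os.path.exists(target):
        os.rename(os.path.join(SHARES_DIR, name), target)
}}}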
Don't forget to snarf the convertshares.py script attached to this ticket if you want to upgrade shares from the old storage layout.
There is one small problem with the current solution -- the names of the intermediate directories (which are the z-base-32 encodings of the first 14 bits of the storage index) are not prefixes of the z-base-32 encodings of the storage indexes themselves. This is because the z-base-32 encoding of the storage index encodes the first 15 bits into the first 3 chars. So, for example:
One potential improvement would be to use [http://allmydata.org/source/z-base-62/ base-62] encoding instead. Two chars of base-62 offer 62 × 62 = 3844 possibilities, and the names of the intermediate directories could be just the first two chars of the base-62 encoding of the whole storage index.
oops, in that example the leading 3 chars happened to be the same.
(There's a 50% chance per storage index.)
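A standalone sketch of why the mismatch happens (this is not the project's zbase32 module; it assumes the usual convention of padding the final partial quintet with zero bits): the third character of the full encoding covers bits 10-14, while the third character of the 14-bit encoding covers bits 10-13 plus a pad bit, so they agree exactly when bit 14 is 0.
{{{
#!python
# Standalone illustration, not the project's zbase32 module.
import os

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def zbase32_encode_bits(data, num_bits):
    """Encode the first num_bits bits of data, 5 bits per character,
    padding the final partial quintet with zero bits."""
    bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)][:num_bits]
    while len(bits) % 5:
        bits.append(0)
    return "".join(ZBASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

si = os.urandom(16)                      # a random 128-bit "storage index"
full = zbase32_encode_bits(si, 128)      # 26-char bucket directory name
prefix14 = zbase32_encode_bits(si, 14)   # 3-char intermediate directory name
print(full, prefix14, full.startswith(prefix14))
# prefix14 equals full[:3] only when bit 14 of the storage index is 0 (~50%).
}}}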
Changed title from "don't overfill your filesystem's directories" to "don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes?".
It would be nice if we could predict the location of a given share, either by changing the encoding to optimally fill an ext3 32,000-entry-limited directory with an integral number of base-N characters, or by using multiple levels of directories. The latter would expose this unfortunate ext3 wart to less code, but would consume more overhead (the 4 kB block per directory, even for mostly empty ones).
Could we do some math to make a guess as to how many storage indices get
created and placed before we expect to reject a lease request because of the
32000-entry limit? With the current scheme that will occur when we hit 32000
entries in one of the ABC/ top-level directories. Using multiple directory
levels would increase this number, of course, but it would be nice to have a
rough idea of how many levels we'd need to store, say, 1M buckets, 1G
buckets, 1T buckets.
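A rough back-of-the-envelope sketch of that math, under stated assumptions (uniformly random storage indexes, 14 bits of the index consumed per directory level, and the fullest directory estimated as mean plus a few standard deviations via an extreme-value heuristic):
{{{
#!python
# Rough estimate only, under the assumptions stated above.
import math

LIMIT = 32000    # ext3 per-directory entry limit

def approx_capacity(levels, bits_per_level=14):
    """Approximate bucket count at which the fullest leaf directory hits LIMIT,
    with `levels` intermediate levels of 2**bits_per_level subdirectories each."""
    D = 2 ** (bits_per_level * levels)   # number of leaf directories
    c = math.sqrt(2 * math.log(D))       # expected max deviation, in std devs
    # fullest directory ~= m + c*sqrt(m); solve m + c*sqrt(m) = LIMIT for mean m
    m = ((math.sqrt(c * c + 4 * LIMIT) - c) / 2) ** 2
    return D * m

for levels in (1, 2, 3):
    print("%d level(s): ~%.2g buckets" % (levels, approx_capacity(levels)))
}}}
Under those assumptions, a single 14-bit level tops out around 5 × 10^8 buckets (plenty for 1M, a bit short of 1G), and a second level is already far beyond 1T.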
I believe that Rob mentioned that the Yahoo storage folks determined
empirically that 7 bits per directory level gave good performance with common
filesystems (in terms of caching vs lookup speed). Maybe he could chime in
with some more details.
Fixed by changeset:e400c5a8fb30ca57.