don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes? #150
As mentioned on the Performance page, ext3 can store no more than 32,000 entries in a single directory. This ticket is to work around such limitations by using a few bits of the storage index to choose a subdirectory -- something like using the z-base-32 encoding of the first 14 bits of the storage index as the name of an intermediate directory under storage/shares/, with the bucket directories nested inside it.
That would give 2^14 = 16,384 intermediate directories, and would allow us to store 16,384 × 32,000 = 524,288,000 shares.
(Yay -- there's a use for z-base-32's bit-length!)
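As a rough sketch of the proposed layout (illustrative names only, not the actual allmydata.storage code; padding the 14-bit prefix with a zero bit to fill three z-base-32 characters is an assumption):
{{{
#!python
# Illustrative sketch only -- not the real storage.py. Take the first 14 bits of
# the (uniformly random) storage index, encode them as a 3-character z-base-32
# name, and nest the bucket directory under that intermediate directory.
import os

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def intermediate_dir_name(storage_index):
    """3-char name for the first 14 bits of the SI (padded with a zero bit)."""
    fourteen = (storage_index[0] << 6) | (storage_index[1] >> 2)
    fifteen = fourteen << 1          # pad to three 5-bit characters
    return "".join(ZBASE32[(fifteen >> shift) & 0x1f] for shift in (10, 5, 0))

def bucket_dir(basedir, storage_index, si_string):
    # BASEDIR/storage/shares/<3-char prefix>/<full z-base-32 SI>/<share files>
    return os.path.join(basedir, "storage", "shares",
                        intermediate_dir_name(storage_index), si_string)
}}}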
For small-to-medium-sized stores, it also adds an extra 4 KiB per share (for the extra directory), but I'm ok with that. This brings the per-share overhead (assuming one share per bucket, i.e. one share per server) from 5486 B to 9582 B: 1390 bytes of hashes and lease info, 4096 bytes for the bucket directory, and an extra 4096 bytes for the three-character intermediate directory that this change would add.
oh, and we should make sure that a call to allocate_buckets() that would fail because of one of these limits fails gracefully, by telling the caller that we won't accept their lease request. We might already do this; I'm not sure. Basically it's just a matter of getting the 'except EnvironmentError' catches right.
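A hedged sketch of what "failing gracefully" could look like (hypothetical helper, not the real allocate_buckets() code):
{{{
#!python
# Hypothetical helper, not the real allocate_buckets(): the point is only that a
# filesystem limit (e.g. ext3's 32,000-entry cap, surfacing as EnvironmentError
# from makedirs) becomes a polite "we decline this lease" instead of a crash.
import errno
import os

def try_to_allocate_bucket(bucket_dir):
    try:
        os.makedirs(bucket_dir)
        return True                  # accepted: we will hold this share
    except EnvironmentError as e:
        if e.errno == errno.EEXIST:
            return True              # we already have this bucket directory
        return False                 # e.g. EMLINK (dir full) or ENOSPC: decline
}}}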
We're focusing on an imminent v0.7.0 (see the roadmap), which will hopefully include [#197 Small Distributed Mutable Files] and also a fix for [#199 bad SHA-256]. So I'm bumping less urgent tickets to v0.7.1.
We need to choose a manageable subset of desired improvements for [http://allmydata.org/trac/tahoe/milestone/0.7.1 v0.7.1], scheduled for two weeks hence, so I'm bumping this one into v0.7.2, scheduled for mid-December.
This is important for scaling up, and it is easy, and I'll do it. Bringing it forward to Milestone 0.7.1.
cool. If it goes into 0.7.1, how about we make it check the old location too, so that 0.7.0-generated shares are readable by a 0.7.1 node? We could get rid of this extra check when 0.8.0 is released, or maybe give people a little tool to migrate their shares to the new locations.
oh, better yet, if we put the shares in a slightly different place, then we could test for the existence of the old directory at startup, and if it's there, set a flag, which will cause subsequent lookups to check both the old (flat) place and the new (nested) place.
Or do automatic migration at startup. I'm not super-fond of this, both because for a very large store it could take quite a while (although we don't have any very large stores yet), and because for some reason I'm uneasy about automatically moving things around like this.
Or put the nested directories in the same place as the flat shares went (BASEDIR/storage/shares/), but do a quick walk at startup to see if there are any actual shares there (bucket directories with the full-storage-index-sized name). If so, set the flag to look for the flat place in addition to the nested place. With this approach, the "migration tool" would just exist to speed up share lookup slightly (one os.stat instead of two), but otherwise everything would Just Work.
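A minimal sketch of that lookup order (hypothetical names, not the real storage.py code): prefer the nested location, and consult the flat location only when the startup walk found flat-layout shares and set the flag.
{{{
#!python
# Hypothetical lookup helper, not the real storage.py code.
import os

def find_bucket_dir(shares_dir, intermediate_name, si_string, have_flat_shares):
    nested = os.path.join(shares_dir, intermediate_name, si_string)  # new layout
    if os.path.isdir(nested):
        return nested
    if have_flat_shares:                          # flag set by the startup walk
        flat = os.path.join(shares_dir, si_string)          # old flat layout
        if os.path.isdir(flat):
            return flat
    return None
}}}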
OTOH, the code would be simpler with just a single approach, and if we write up a migration tool at the same time, we can fix all of the 12 storage servers we've currently got in half an hour, and be done with it.
Attachment convertshares.py (860 bytes) added
Yes, let's manage the 0.8.0 storage code (source:src/allmydata/storage.py) and the upgrader tool separately. Here's the upgrader tool attached to this ticket. It assumes that the current working directory is the storage/shares directory. It should work even if there is an incomplete upgrade (such as if an earlier run of the upgrade tool died), and even if there is a tahoe node currently running and using this directory.
Fixed by changeset:b80cfeb186860ab6.
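The attached convertshares.py is the authoritative tool; purely as an illustration of the general shape of such a migration (assumed names, and a simplified prefix derivation), something like:
{{{
#!python
# Illustrative only -- NOT the attached convertshares.py. Walk the flat
# storage/shares/ directory and move each full-length bucket directory into its
# intermediate prefix directory; already-moved buckets are skipped, so it is
# safe to re-run after an interrupted upgrade.
import os

SHARES_DIR = "."      # assumes cwd is BASEDIR/storage/shares, as described above
FULL_SI_LEN = 26      # a 128-bit storage index is 26 z-base-32 characters

def prefix_for(si_name):
    # stand-in for however the node derives the intermediate directory name
    return si_name[:3]

for name in os.listdir(SHARES_DIR):
    if len(name) != FULL_SI_LEN:
        continue                              # not a flat-layout bucket directory
    prefix_dir = os.path.join(SHARES_DIR, prefix_for(name))
    if not os.path.isdir(prefix_dir):
        os.makedirs(prefix_dir)
    target = os.path.join(prefix_dir, name)
    if not os.path.exists(target):
        os.rename(os.path.join(SHARES_DIR, name), target)
}}}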
Don't forget to snarf the convertshares.py script attached to this ticket if you want to upgrade shares from the old storage layout.
There is one small problem with the current solution -- the names of the intermediate directories (which are the z-base-32 encodings of the first 14 bits of the storage index) are not prefixes of the z-base-32 encodings of the storage indexes themselves. This is because the z-base-32 encoding of the storage index encodes the first 15 bits into the first 3 chars. So, for example:
One potential improvement would be to use [http://allmydata.org/source/z-base-62/ base-62] encoding instead. Two chars of base-62 offer 62 × 62 = 3844 possibilities, and the names of the intermediate directories could be just the first two chars of the base-62 encoding of the whole storage index.
oops, in that example the leading 3 chars happened to be the same.
(There's a 50% chance per storage index.)
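A standalone sketch of why the mismatch happens (this is not the project's zbase32 module; it assumes the usual convention of padding the final partial quintet with zero bits): the third character of the full encoding covers bits 10-14, while the third character of the 14-bit encoding covers bits 10-13 plus a pad bit, so they agree exactly when bit 14 is 0.
{{{
#!python
# Standalone illustration, not the project's zbase32 module.
import os

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def zbase32_encode_bits(data, num_bits):
    """Encode the first num_bits bits of data, 5 bits per character,
    padding the final partial quintet with zero bits."""
    bits = [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)][:num_bits]
    while len(bits) % 5:
        bits.append(0)
    return "".join(ZBASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

si = os.urandom(16)                      # a random 128-bit "storage index"
full = zbase32_encode_bits(si, 128)      # 26-char bucket directory name
prefix14 = zbase32_encode_bits(si, 14)   # 3-char intermediate directory name
print(full, prefix14, full.startswith(prefix14))
# prefix14 equals full[:3] only when bit 14 of the storage index is 0 (~50%).
}}}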
Changed title from "don't overfill your filesystem's directories" to "don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes?".
It would be nice if we could predict the location of a given share, either by changing the encoding to optimally fill an ext3 32,000-entry-limited directory with an integral number of base-N characters, or by using multiple levels of directories. The latter would expose this unfortunate ext3 wart to less code, but would consume more overhead (the 4 kB block per directory, even for mostly empty ones).
Could we do some math to make a guess as to how many storage indices get
created and placed before we expect to reject a lease request because of the
32000-entry limit? With the current scheme that will occur when we hit 32000
entries in one of the ABC/ top-level directories. Using multiple directory
levels would increase this number, of course, but it would be nice to have a
rough idea of how many levels we'd need to store, say, 1M buckets, 1G
buckets, 1T buckets.
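A rough back-of-the-envelope sketch of that math, under stated assumptions (uniformly random storage indexes, 14 bits of the index consumed per directory level, and the fullest directory estimated as mean plus a few standard deviations via an extreme-value heuristic):
{{{
#!python
# Rough estimate only, under the assumptions stated above.
import math

LIMIT = 32000    # ext3 per-directory entry limit

def approx_capacity(levels, bits_per_level=14):
    """Approximate bucket count at which the fullest leaf directory hits LIMIT,
    with `levels` intermediate levels of 2**bits_per_level subdirectories each."""
    D = 2 ** (bits_per_level * levels)   # number of leaf directories
    c = math.sqrt(2 * math.log(D))       # expected max deviation, in std devs
    # fullest directory ~= m + c*sqrt(m); solve m + c*sqrt(m) = LIMIT for mean m
    m = ((math.sqrt(c * c + 4 * LIMIT) - c) / 2) ** 2
    return D * m

for levels in (1, 2, 3):
    print("%d level(s): ~%.2g buckets" % (levels, approx_capacity(levels)))
}}}
Under those assumptions, a single 14-bit level tops out around 5 × 10^8 buckets (plenty for 1M, a bit short of 1G), and a second level is already far beyond 1T.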
I believe that Rob mentioned that the Yahoo storage folks determined
empirically that 7 bits per directory level gave good performance with common
filesystems (in terms of caching vs lookup speed). Maybe he could chime in
with some more details.
Fixed by changeset:e400c5a8fb30ca57.