Peer selection doesn't rebalance shares on overwrite of mutable file. #232
When you upload a new version of a mutable file, it currently uploads the new shares to peers which already have old shares, then checks that enough shares have been uploaded, then is happy. However, this means it never "rebalances", so if there were few peers (or just yourself!) the first time, and many peers the second time, the file is still stored on only those few peers.
This is an instance of the general principle that shares are not the right units for robustness measurements -- servers are.
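For illustration, a minimal Python sketch of that distinction (not Tahoe code; `sharemap` and `robustness` are made-up names): a file whose ten shares live on only two servers looks healthy by share count, but it is lost if just those two machines go away.

```python
# Hypothetical illustration (not Tahoe code): count shares vs. count servers.
def robustness(sharemap):
    """sharemap maps server_id -> set of share numbers held on that server."""
    shares = set()
    for held in sharemap.values():
        shares.update(held)
    servers = [s for s, held in sharemap.items() if held]
    return len(shares), len(servers)

# Ten shares, but only two servers: the share count looks fine while the
# file actually dies with two server failures.
before = {"serverA": {0, 1, 2, 3, 4}, "serverB": {5, 6, 7, 8, 9}}
print(robustness(before))   # -> (10, 2)
```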
Actually I'm going to bump this out of the v0.7.0 Milestone and instead document that you have to have as many servers as your "total shares" parameter if you want robust storage. As mentioned in http://allmydata.org/trac/tahoe/ticket/115#comment:-1 , the WUI should be enhanced to indicate the status of the creation of the private directory to the user.
This is related to ticket #213 -- "good handling of small numbers of servers, or strange choice of servers".
Oh, I think it's actually more complicated than that. When we decide to take
the plunge, our peer selection algorithm should be aware of the chassis,
rack, and colo of each storage server. It should start by putting shares in
different colos. If it is forced to put two shares in the same colo, it
should try to put them in different racks. If they must share a rack, get
them in different chassis. If they must share a chassis, put them on
different disks. Only when all other options are exhausted, then two shares
can be put on the same disk (but we shouldn't be happy about it).
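As a rough sketch of that preference order (hypothetical names; current Tahoe peer selection has no notion of topology), a candidate server could be scored by how much infrastructure it shares with already-placed shares, counting colo before rack, rack before chassis, and chassis before disk:

```python
# Sketch only: score candidates by shared topology with already-placed shares.
def placement_cost(candidate, placed):
    """Return a cost tuple: lower is better.  `candidate` and each entry of
    `placed` are dicts with 'colo', 'rack', 'chassis', and 'disk' keys."""
    cost = [0, 0, 0, 0]
    for other in placed:
        if candidate["colo"] == other["colo"]:
            cost[0] += 1
            if candidate["rack"] == other["rack"]:
                cost[1] += 1
                if candidate["chassis"] == other["chassis"]:
                    cost[2] += 1
                    if candidate["disk"] == other["disk"]:
                        cost[3] += 1
    return tuple(cost)

def choose_server(candidates, placed):
    # Prefer the candidate that shares the least infrastructure with the
    # shares already placed; ties fall through to the finer-grained levels.
    return min(candidates, key=lambda c: placement_cost(c, placed))
```

Comparing the cost tuples lexicographically gives the "only fall back to a shared rack/chassis/disk when forced to" behavior described above.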
For now, in small grids, getting the shares onto different nodes is a good
start.
When a mutable file is modified, it's fairly easy to detect an improvement
that could be made and move shares to new servers. Another desirable feature
would be for the addition of a new server to automatically kick off a wave of
rebalancing. We have to decide upon how we want to trigger that, though: the
most naive approach (sweep through all files and check/repair/rebalance each
one every month) will have a certain bandwidth/diskio cost that might be
excessive and/or starve normal traffic.
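A sketch of the "monthly sweep" trigger under a crude bandwidth budget (all names here -- `walk_all_files`, `check_and_repair`, `MAX_BYTES_PER_HOUR` -- are hypothetical, and real accounting would need to cover disk I/O as well as bandwidth):

```python
import time

MAX_BYTES_PER_HOUR = 10 * 1024 * 1024  # assumed budget, tune per grid

def rebalance_sweep(walk_all_files, check_and_repair):
    spent = 0
    window_start = time.time()
    for filecap in walk_all_files():
        spent += check_and_repair(filecap)  # returns bytes moved
        if spent >= MAX_BYTES_PER_HOUR:
            # sleep out the rest of the hour so the sweep cannot starve
            # normal traffic
            time.sleep(max(0, 3600 - (time.time() - window_start)))
            spent, window_start = 0, time.time()
```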
I'm moving this to the 0.8.0 milestone since it matches the 0.8.0 goals.
There are a couple of different levels of support we might provide, so once
we come up with a plan, we might want to make a couple of new tickets and
schedule them differently.
One more wrinkle: if N/(K+1) is large enough (>= 2, perhaps), it may make sense to put K+1 shares into the same colo in order to enable regeneration of shares using only in-colo bandwidth.
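For concreteness, a small worked example (the numbers are illustrative, not from the ticket):

```python
N, K = 10, 3                 # total shares, shares needed to reconstruct
print(N / (K + 1))           # 2.5 >= 2, so one colo can hold K+1 = 4 shares
# If any one of those 4 shares is lost, the K = 3 survivors in that colo are
# enough to regenerate it without any cross-colo traffic, while the other
# N - (K + 1) = 6 shares still live in other colos.
```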
Brian: did you leave this behavior unchanged in the recent mutable-file upload/download refactoring?
Yes, this behavior is unchanged, and this ticket remains open. The publish process will seek to update the shares in-place, and will only look for new homes for shares that cannot be found.
To get automatic rebalancing, the publish process (specifically Publish.update_goal) needs to count how many shares are present on each server, and gently try to find a new home for them if there is more than one. ("gentle" in the sense that it should leave the share where it is if there are not extra empty servers to be found). In addition, we need to consider deleting the old share rather than merely creating a new copy of it.
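A sketch of that counting step (this is not the real Publish.update_goal; here `goal` is assumed to be a set of (server, shnum) pairs and `empty_servers` a list of servers holding no shares of this file):

```python
from collections import defaultdict

def rebalance_goal(goal, empty_servers):
    per_server = defaultdict(list)
    for server, shnum in goal:
        per_server[server].append(shnum)
    empty = list(empty_servers)
    new_goal = set(goal)
    for server, shnums in per_server.items():
        # "gently": only move the extra shares, and only if an empty server
        # is actually available; otherwise leave the share where it is.
        for shnum in shnums[1:]:
            if not empty:
                return new_goal
            new_goal.discard((server, shnum))
            new_goal.add((empty.pop(0), shnum))
    return new_goal
```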
One additional thing to consider when working on this: if the mutable share lives on a server which is now full, the client should have the option of removing the share from that server (so it can go to a not-yet-full one). This can get tricky.
The first thing we need is a storage-server API to cancel leases on mutable shares, then code to delete the share when the lease count goes to zero. A mutable file that has multiple leases on it will be particularly tricky to consider.
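A hypothetical shape for such an API (none of these names exist in today's storage server; this is only a toy model of "delete the share when the last lease is cancelled"):

```python
import os

class MutableShare:
    """Toy model: a share file plus the set of lease-holders for it."""
    def __init__(self, path, lease_owners):
        self.path = path
        self.leases = set(lease_owners)

    def cancel_lease(self, owner):
        self.leases.discard(owner)
        if not self.leases:
            os.remove(self.path)   # last lease gone: reclaim the space
            return True            # share deleted
        # other lease-holders remain -- the tricky multiple-lease case
        # mentioned above
        return False
```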
The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).
Also related: #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better).
Sorry, not integrity, only reliability.
moving this to category=mutable, since it's more of an issue with the mutable publish code than with the general category of peer selection
It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)
It was a mistake to put this ticket into the 1.8 Milestone. :-)
Related to #1057 (Alter mutable files to use servers of happiness). Ideally the server selection for mutable and immutable files would use the same code, as far as possible.
See also #1816: ideally, only the shares that are still needed for the new version should have their leases renewed.