Dynamic share migration to maintain file health #661

Closed
opened 2009-03-11 17:15:56 +00:00 by mmore · 4 comments
mmore commented 2009-03-11 17:15:56 +00:00
Owner

Dynamic share repair to maintain file health. Based on the following features
that already exist in Allmydata-Tahoe 1.3, we can improve automatic repair:

  1. Foolscap provides knowledge of which nodes are alive.

  2. Verification of file availability can be delegated to another node through
    a read-cap or a verify-cap without security risk (see the example after this list).
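
For illustration only (not part of the original ticket), here is a minimal sketch of how a helper node that has been handed only a verify-cap might ask its local gateway to verify a file, assuming the webapi check operation (`POST /uri/$CAP?t=check`). The gateway URL, cap value, and field names are placeholders and may differ between Tahoe versions:

```python
import json
import urllib.parse
import urllib.request

# Placeholders: a local gateway and a verify-cap handed to us by the file owner.
GATEWAY = "http://127.0.0.1:3456"
VERIFY_CAP = "URI:CHK-Verifier:..."

def delegated_check(cap, verify=True):
    """Ask the gateway to check (and optionally fully verify) the file's shares.

    Uses the webapi check operation; exact parameters and result fields may
    differ between Tahoe versions, so treat this as a sketch, not a reference.
    """
    query = "t=check&output=JSON" + ("&verify=true" if verify else "")
    url = "%s/uri/%s?%s" % (GATEWAY, urllib.parse.quote(cap, safe=""), query)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    report = delegated_check(VERIFY_CAP)
    print(report.get("summary"), report.get("results"))
```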

The proposed auto-repair process:

  1. Memory-based algorithm: the client knows where the file's shares were
    placed, so we can keep track of which shares are still alive; for
    simplicity we infer a share's availability from the availability of the
    node that holds it (see the sketch after this list).

  2. Automatic triggering: the repair process is triggered automatically by
    the repairer. Assigning repair responsibility involves several techniques
    that trade off repair cost, network bandwidth, and fault tolerance.

  3. Timeout: we can use a lazy-repair technique to avoid reacting to
    temporary node failures, i.e. waiting for a certain time before the
    repair process starts.

  4. Reintegration: a memory-based repair technique that remembers failed
    storage servers which later come back to life will help reduce Tahoe grid
    resources such as network bandwidth and storage space.

  5. Repairer: selecting who is responsible for repair takes several issues
    into consideration: security, repairer location, and repairer resources.
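
To make items 1, 3, and 4 concrete, here is a minimal sketch (all names are hypothetical, not existing Tahoe code) of the memory-based bookkeeping: remember which servers hold shares, derive share availability from node availability, apply a lazy-repair grace period, and forget a server's downtime once it reintegrates:

```python
import time

GRACE_PERIOD = 6 * 60 * 60   # lazy repair: tolerate outages shorter than 6 hours

class ShareTracker:
    """Remember where one file's shares live and decide when repair is due."""

    def __init__(self, share_map, repair_threshold):
        # share_map: {server_id: set_of_share_numbers}, remembered from upload
        self.share_map = share_map
        self.repair_threshold = repair_threshold
        self.down_since = {}                    # server_id -> first time seen dead

    def update_liveness(self, alive_servers, now=None):
        """Feed in liveness information, e.g. from Foolscap connection status."""
        now = time.time() if now is None else now
        for server in self.share_map:
            if server in alive_servers:
                self.down_since.pop(server, None)   # reintegration: forget downtime
            else:
                self.down_since.setdefault(server, now)

    def presumed_available_shares(self, now=None):
        now = time.time() if now is None else now
        available = set()
        for server, shares in self.share_map.items():
            first_down = self.down_since.get(server)
            if first_down is None or now - first_down < GRACE_PERIOD:
                available |= shares             # still within the grace period
        return available

    def needs_repair(self, now=None):
        return len(self.presumed_available_shares(now)) < self.repair_threshold
```

A repair loop would call `update_liveness()` on each connection event and, once `needs_repair()` returns True, hand the file's verify-cap to whichever node is selected as the repairer (item 5).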

tahoe-lafs added the
dev-infrastructure
major
enhancement
1.3.0
labels 2009-03-11 17:15:56 +00:00
tahoe-lafs added this to the undecided milestone 2009-03-11 17:15:56 +00:00

I reformatted the original description so that trac will represent the numbered items as a list.

re-reformatted it: I think trac requires the leading space to trigger the "display as list" formatter

warner added
code-encoding
and removed
dev-infrastructure
labels 2009-06-12 00:56:32 +00:00
davidsarah commented 2010-03-25 03:27:24 +00:00
Author
Owner

The following clump of tickets are closely related:

  • #450 Checker/repair agent
  • #483 Repairer service
  • #543 Rebalancing manager
  • #643 Automatically schedule repair service
  • #661 Dynamic share migration to maintain file health
  • #864 Automated migration of shares between storage servers

Actually there are probably too many overlapping tickets here.

Part of the redundancy is due to distinguishing repair from rebalancing. But when #614 and #778 are fixed, a healthy file will by definition be balanced across servers, so there's no need to make that distinction. Perhaps there will also be a "super-healthy" status that means shares are balanced across the maximum number of servers, i.e. N. (When we support geographic dispersal / rack-awareness, the definitions of "healthy" and "super-healthy" will presumably change again so that they also imply that shares have the desired distribution.)
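
As a rough illustration (not part of the comment), assuming the servers-of-happiness measure from #614/#778, the healthy / super-healthy distinction might reduce to counting distinct servers that hold at least one share; the names and thresholds below are hypothetical:

```python
def health_status(share_map, happy, total_shares_N):
    """share_map: {server_id: set_of_share_numbers} for one file.

    Illustrative only: "healthy" = shares spread across at least `happy`
    distinct servers; "super-healthy" = spread across the maximum possible
    number of servers, i.e. N (one share per server).
    """
    servers_with_shares = sum(1 for shares in share_map.values() if shares)
    if servers_with_shares >= total_shares_N:
        return "super-healthy"
    if servers_with_shares >= happy:
        return "healthy"
    return "needs-repair"
```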

There are basically four options for how repair/rebalancing could be triggered:

  • a webapi operation performed by a gateway, and triggered by CLI commands. We already have this. Scheduling this operation automatically is #643.
  • triggered by write operations on a particular file. This is #232 and #699.
  • moving a server's shares elsewhere when it is about to be decommissioned or is running out of space. This is #864.
  • a more autonomous repair/rebalancing service that would run continuously.

The last option does not justify 4 tickets! (#450, #483, #543, #661) Unless anyone objects, I'm going to merge these all into #483 [edit: actually #543].

daira commented 2014-12-29 20:21:04 +00:00
Author
Owner

Duplicate of #543.

tahoe-lafs added the
duplicate
label 2014-12-29 20:21:04 +00:00
daira closed this issue 2014-12-29 20:21:04 +00:00
Reference: tahoe-lafs/trac-2024-07-25#661