Dynamic share migration to maintain file health #661

Closed
opened 2009-03-11 17:15:56 +00:00 by mmore · 4 comments
mmore commented 2009-03-11 17:15:56 +00:00
Owner

Dynamic share repair to maintain file health. Based on the following features
that already exist in Allmydata-Tahoe 1.3, we can improve automatic repair:

  1. Foolscap provides knowledge of which nodes are alive.

  2. Verification of file availability can be delegated to another node through
    a read-cap or a verify-cap without security risk (see the example after this list).
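
For illustration only (not part of the original ticket), here is a minimal sketch of how a helper node that has been handed only a verify-cap might ask its local gateway to verify a file, assuming the webapi check operation (`POST /uri/$CAP?t=check`). The gateway URL, cap value, and field names are placeholders and may differ between Tahoe versions:

```python
import json
import urllib.parse
import urllib.request

# Placeholders: a local gateway and a verify-cap handed to us by the file owner.
GATEWAY = "http://127.0.0.1:3456"
VERIFY_CAP = "URI:CHK-Verifier:..."

def delegated_check(cap, verify=True):
    """Ask the gateway to check (and optionally fully verify) the file's shares.

    Uses the webapi check operation; exact parameters and result fields may
    differ between Tahoe versions, so treat this as a sketch, not a reference.
    """
    query = "t=check&output=JSON" + ("&verify=true" if verify else "")
    url = "%s/uri/%s?%s" % (GATEWAY, urllib.parse.quote(cap, safe=""), query)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    report = delegated_check(VERIFY_CAP)
    print(report.get("summary"), report.get("results"))
```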

The proposed auto-repair process:

  1. Memory-based algorithm: the client knows where the file's shares were
    placed, so we can keep track of which shares are still alive; for
    simplicity we infer a share's availability from the availability of the
    node that holds it (see the sketch after this list).

  2. Automatic triggering: the repair process is triggered automatically by
    the repairer. Assigning repair responsibility involves several techniques
    that trade off repair cost, network bandwidth, and fault tolerance.

  3. Timeout: we can use a lazy-repair technique to avoid reacting to
    temporary node failures, i.e. waiting for a certain time before the
    repair process starts.

  4. Reintegration: a memory-based repair technique that remembers failed
    storage servers which later come back to life will help reduce Tahoe grid
    resources such as network bandwidth and storage space.

  5. Repairer: selecting who is responsible for repair takes several issues
    into consideration: security, repairer location, and repairer resources.
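
To make items 1, 3, and 4 concrete, here is a minimal sketch (all names are hypothetical, not existing Tahoe code) of the memory-based bookkeeping: remember which servers hold shares, derive share availability from node availability, apply a lazy-repair grace period, and forget a server's downtime once it reintegrates:

```python
import time

GRACE_PERIOD = 6 * 60 * 60   # lazy repair: tolerate outages shorter than 6 hours

class ShareTracker:
    """Remember where one file's shares live and decide when repair is due."""

    def __init__(self, share_map, repair_threshold):
        # share_map: {server_id: set_of_share_numbers}, remembered from upload
        self.share_map = share_map
        self.repair_threshold = repair_threshold
        self.down_since = {}                    # server_id -> first time seen dead

    def update_liveness(self, alive_servers, now=None):
        """Feed in liveness information, e.g. from Foolscap connection status."""
        now = time.time() if now is None else now
        for server in self.share_map:
            if server in alive_servers:
                self.down_since.pop(server, None)   # reintegration: forget downtime
            else:
                self.down_since.setdefault(server, now)

    def presumed_available_shares(self, now=None):
        now = time.time() if now is None else now
        available = set()
        for server, shares in self.share_map.items():
            first_down = self.down_since.get(server)
            if first_down is None or now - first_down < GRACE_PERIOD:
                available |= shares             # still within the grace period
        return available

    def needs_repair(self, now=None):
        return len(self.presumed_available_shares(now)) < self.repair_threshold
```

A repair loop would call `update_liveness()` on each connection event and, once `needs_repair()` returns True, hand the file's verify-cap to whichever node is selected as the repairer (item 5).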

tahoe-lafs added the
dev-infrastructure
major
enhancement
1.3.0
labels 2009-03-11 17:15:56 +00:00
tahoe-lafs added this to the undecided milestone 2009-03-11 17:15:56 +00:00

I reformatted the original description so that trac will represent the numbered items as a list.

re-reformatted it: I think trac requires the leading space to trigger the "display as list" formatter

warner added
code-encoding
and removed
dev-infrastructure
labels 2009-06-12 00:56:32 +00:00
davidsarah commented 2010-03-25 03:27:24 +00:00
Author
Owner

The following clump of tickets are closely related:

  • #450 Checker/repair agent
  • #483 Repairer service
  • #543 Rebalancing manager
  • #643 Automatically schedule repair service
  • #661 Dynamic share migration to maintain file health
  • #864 Automated migration of shares between storage servers

Actually there are probably too many overlapping tickets here.

Part of the redundancy is due to distinguishing repair from rebalancing. But when #614 and #778 are fixed, a healthy file will by definition be balanced across servers, so there's no need to make that distinction. Perhaps there will also be a "super-healthy" status that means shares are balanced across the maximum number of servers, i.e. N. (When we support geographic dispersal / rack-awareness, the definitions of "healthy" and "super-healthy" will presumably change again so that they also imply that shares have the desired distribution.)
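
As a rough illustration (not part of the comment), assuming the servers-of-happiness measure from #614/#778, the healthy / super-healthy distinction might reduce to counting distinct servers that hold at least one share; the names and thresholds below are hypothetical:

```python
def health_status(share_map, happy, total_shares_N):
    """share_map: {server_id: set_of_share_numbers} for one file.

    Illustrative only: "healthy" = shares spread across at least `happy`
    distinct servers; "super-healthy" = spread across the maximum possible
    number of servers, i.e. N (one share per server).
    """
    servers_with_shares = sum(1 for shares in share_map.values() if shares)
    if servers_with_shares >= total_shares_N:
        return "super-healthy"
    if servers_with_shares >= happy:
        return "healthy"
    return "needs-repair"
```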

There are basically four options for how repair/rebalancing could be triggered:

  • a webapi operation performed by a gateway, and triggered by CLI commands. We already have this. Scheduling this operation automatically is #643.
  • triggered by write operations on a particular file. This is #232 and #699.
  • moving a server's shares elsewhere when it is about to be decommissioned or is running out of space. This is #864.
  • a more autonomous repair/rebalancing service that would run continuously.

The last option does not justify 4 tickets! (#450, #483, #543, #661) Unless anyone objects, I'm going to merge these all into #483 [edit: actually #543].

daira commented 2014-12-29 20:21:04 +00:00
Author
Owner

Duplicate of #543.

tahoe-lafs added the
duplicate
label 2014-12-29 20:21:04 +00:00
daira closed this issue 2014-12-29 20:21:04 +00:00
Reference: tahoe-lafs/trac-2024-07-25#661