peers failure tolerance #66
Linux ubuntu4lu 2.6.20-15-server #2 SMP Sun Apr 15 07:41:34 UTC 2007 i686 GNU/Linux
# allmydata-tahoe --version
Twisted version: 2.5.0
In my testing, I have 1 introducer and 3 clients.
I uploaded a large file (450MB) via peer #3. (It only works via peer #3 because that machine has 1GB of RAM; the others have 512MB and could not handle the large file.)
If I take down both peer #1 and peer #2, I can still download the file.
If I take down peer #3, the download starts but cannot be completed.
Is there any predictable way to know what the failure tolerance is?
It would also be good for the client to know ahead of time, based on the information polled from the other peers, that it cannot provide a complete file.
Thanks.
Lu
Dear lvo:
This is a very good question.
If I (or Brian Warner) answer this question, will you submit a patch to the relevant docs that makes the answer apparent to the next person who comes along wondering the same thing?
Simplistically, the current failure tolerance is 3-out-of-4. If fewer than 3/4 of your servers fail, then you'll almost certainly be able to get your data back. If more than 3/4 of your servers fail, then you'll almost certainly not be able to. If exactly 3/4 of your servers fail, then it depends. :-)
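As a back-of-the-envelope illustration of that rule of thumb (assuming the default 25-of-100 encoding described later in this thread, and a perfectly even spread of shares across the servers, which small grids do not guarantee):

```python
# Idealized recoverability check: with 25-of-100 encoding and shares spread
# perfectly evenly, you need the surviving servers to hold >= 25 shares.
def recoverable(num_servers, servers_alive, shares_needed=25, total_shares=100):
    shares_per_server = total_shares / num_servers
    return servers_alive * shares_per_server >= shares_needed

print(recoverable(num_servers=4, servers_alive=1))   # True, but only in this idealized even case
print(recoverable(num_servers=4, servers_alive=0))   # False
```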
Thanks,
Zooko
Thanks Zooko. I will.
Is that your answer? Or is it just a simplified answer, and you or Brian will supply a more detailed one? :-)
I have since added a 4th peer to my setup, and you are correct: if 3/4 of the servers fail and the remaining 1/4 is NOT the peer that originally served as the upload peer, then the file is lost. Otherwise the file is intact.
Lu
So to be precise, we're using 25-out-of-100 encoding by default, so what matters is whether you are able to retrieve at least 25 shares. The shares are assigned to peers according to the "tahoe three"/TahoeThree algorithm, which will distribute them evenly only in the limit where the number of peers is much larger than 100.
Imagine a clock face, with 100 marks evenly spaced around the edge: these represent the shares. Now choose a random location for each peer (these represent the permuted peerlist, in which each peerid is hashed together with the per-file storage index). Each share travels clockwise until it hits a peer. That's it. You can see that for some files, two peers will wind up very close to each other, in which case one of them will get a lot more shares than the other. If there are lots of peers, this tends to be a bit more uniform, but if you only have 3 peers, then the distribution will be a lot less uniform.
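A minimal Python sketch of that clock-face picture may make it concrete. This is only an illustration of the idea as described above, not the actual TahoeThree code; the hash choice and helper names (ring_position, assign_shares) are made up:

```python
import hashlib

RING = 100  # one mark per share, as in the clock-face picture

def ring_position(peerid, storage_index):
    """Pseudo-random position of a peer on the ring, for one particular file."""
    digest = hashlib.sha256((peerid + storage_index).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 * RING

def assign_shares(peerids, storage_index):
    """Send each of the 100 shares clockwise to the first peer it meets."""
    positions = sorted((ring_position(p, storage_index), p) for p in peerids)
    allocation = {p: [] for p in peerids}
    for share in range(RING):
        # first peer at or past this mark, wrapping around the ring
        owner = next((p for pos, p in positions if pos >= share), positions[0][1])
        allocation[owner].append(share)
    return allocation
```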
Also note that each file you upload gets a different mapping, so if you upload a few hundred equally-sized files and then compare all the peers, you should see them all hosting about the same amount of space. But if you only upload one file, you'll see very non-uniform distribution of shares.
So in very small networks, it is not easy to predict how many (or which) peers need to be alive to provide for any individual file.
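Running the sketch above on a small grid shows both effects: heavy skew for any single file, and roughly even totals averaged over many files (the peer ids and storage indexes here are hypothetical):

```python
import statistics

peers = ["peer1", "peer2", "peer3"]

# One file: the split is usually quite lopsided.
single = assign_shares(peers, "storage-index-0")
print({p: len(s) for p, s in single.items()})

# Many files: per-peer averages converge toward 100/3 shares.
per_peer = {p: [] for p in peers}
for i in range(500):
    alloc = assign_shares(peers, f"storage-index-{i}")
    for p in peers:
        per_peer[p].append(len(alloc[p]))
print({p: round(statistics.mean(c), 1) for p, c in per_peer.items()})
```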
It is probably the case that the "tahoe two"/TahoeTwo algorithm provides more uniform allocation of shares, even in small networks. Ticket #16 is a request to explain and justify our choice of tahoe three over tahoe two: I suspect that this non-uniform allocation of shares is an argument to move back to tahoe two.
When a file is downloaded, the very first thing source:src/allmydata/download.py does is to ask around and find out who has which shares. If it cannot find enough, you get an immediate NotEnoughPeersError.
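The gist of that early check, sketched here for illustration only (check_recoverability and the shape of its argument are invented; the real download.py logic differs):

```python
# Count the distinct shares the reachable peers claim to hold, and fail
# fast if fewer than the 25 needed shares can be located.
class NotEnoughPeersError(Exception):
    pass

def check_recoverability(share_reports, shares_needed=25):
    """share_reports maps peerid -> set of share numbers that peer holds."""
    available = set()
    for shares in share_reports.values():
        available |= shares
    if len(available) < shares_needed:
        raise NotEnoughPeersError(
            f"found only {len(available)} of the {shares_needed} shares needed")
    return available
```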
Oh, also note that when counting peers, your own host is just as valuable a peer as all the others. So if you join a mesh that already has three clients, your own machine is a fourth, and on average each client (including your own) will wind up holding 25% of the total shares for anything you upload. That means that your own machine, all by itself, should be sufficient (!!!on average!!!) to recover any files you've uploaded. But of course the non-uniformity of share distribution probably gives you only about a 50/50 chance of success.
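To see why "sufficient on average" still means only roughly even odds in practice: with 4 peers and 100 shares, the expected count on any one peer is 100/4 = 25, exactly the 25 needed, but for any given file your node may hold more or fewer. A quick Monte Carlo reusing the assign_shares() sketch above (hypothetical peer ids) illustrates this:

```python
# Estimate how often the uploader's own node, all by itself, ends up
# holding at least the 25 shares needed, on a 4-peer grid.
peers = ["my-node", "peerA", "peerB", "peerC"]
trials = 2000
wins = sum(len(assign_shares(peers, f"file-{i}")["my-node"]) >= 25
           for i in range(trials))
print(f"self-recoverable in about {wins / trials:.0%} of uploads")
```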
I really appreciate the detailed explanation Warner.
However, I am still unclear on this point: "That means that your own machine, all by itself, should be sufficient (!!!on average!!!) to recover any files you've uploaded"
If my machine is only 1 out of 1000 peers, the file I upload is divided into ~100 shares (which I understand to be what you refer to as segments), and the shares are distributed starting at a random location and going around the rim, how would my machine manage to hold most or all of those shares?
Thanks.
Lu
Dear Lu:
In the imminent release of Tahoe v0.6, we have fixed share placement so that shares are distributed more evenly even across a small number of peers.
So with v0.6 the behavior is better, but not until ticket #92 is done will the user be able to see what the behavior is. Merging this ticket into ticket #92.
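For contrast with the clock-face sketch earlier, one way to get the more even spread described above is round-robin assignment over a per-file permuted peer list, in the spirit of the "tahoe two" idea mentioned in this thread. This is only an illustration of that idea, not the actual v0.6 code:

```python
import hashlib

def assign_shares_evenly(peerids, storage_index, total_shares=100):
    """Deal shares round-robin to peers in a per-file permuted order."""
    permuted = sorted(
        peerids,
        key=lambda p: hashlib.sha256((p + storage_index).encode()).digest())
    allocation = {p: [] for p in permuted}
    for share in range(total_shares):
        allocation[permuted[share % len(permuted)]].append(share)
    return allocation

# On a 3-peer grid every peer now holds 33 or 34 shares, for every file.
```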