UploadUnhappinessError with available storage nodes > shares.happy #1791
The error happened with 1.9.1 too. I just upgraded to 1.9.2 and fixed some files/dirs that 1.9.1 couldn't repair reliably, hoping the following problem would go away too (it didn't).
There are some peculiarities in my setup: I use USB disks connected to a single server, so all storage nodes run on the same server, although each lives on a disk that can easily be sent elsewhere to increase the durability of the whole store. At the time of failure there were 7 such storage nodes in my setup, and my whole store was fully repaired onto these 7 nodes; all the content is/was uploaded with
shares.needed = 4
shares.happy = 6
shares.total = 6
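As I understand these parameters, any 4 of the 6 shares can rebuild a file, and an upload only counts as successful if the 6 shares end up on at least shares.happy = 6 distinct servers. A minimal sketch of that expectation (illustrative only, not Tahoe-LAFS code):

    # Illustrative sketch: with needed=4, happy=6, total=6, an upload must spread
    # its 6 shares over at least 6 distinct servers. With 7 connected servers and
    # one share per server, such a placement should always exist.
    needed, happy, total = 4, 6, 6
    connected_servers = 7
    print(connected_servers >= happy and total >= happy)  # True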
Although 7 >= 6, I get this error when trying to tahoe cp a new file:
I recently found out about flogtool, so I ran it on the client node (which is one of the 7 storage nodes, by the way). I have only pasted the last part, from CHKUploader (I can attach the whole log if need be):
Thanks for the log, that's very useful.
The uploader only tried to contact 5 servers, which is the problem. Are you absolutely sure that more than the 5 servers mentioned (i.e. zp6jpfeu, pa2myijh, omkzwfx5, wo6akhxt, ughwvrtu) are connected?
Maybe we should include the set of connected servers in the log or the error message.
Yes, I'm quite sure all 7 were active. I start all nodes with a single script that, in turn:
If one of the expected nodes can't be started, I see it right away in the script output when starting the grid.
At the time of failure, I even checked the web interface of both the node I use as a client and the introducer and they both listed all 7 storage nodes.
I even checked that there was plenty of free space on each storage node, and that no reserved space was configured that could explain a node refusing to store data.
I just rechecked and noticed something. The server has 3 IP addresses: the loopback, a private IP on a local network and a private IP on a VPN (managed by OpenVPN). Apparently each node advertises its services on all 3 IPs (I assume it's by design).
But the listing of storage nodes given by my "client" node isn't exactly consistent with the one given by the introducer.
Here are the current outputs (there shouldn't be any security problem publishing this so I didn't obfuscate anything):
Introducer's Service Announcements:
Introducer's Subscribed Clients:
"The storage node I use as a client"'s status:
Connected to 7 of 7 known storage servers:
I'm not sure how the service announcement and IP selection work, but there seems to be at least some amount of chance involved in the IP selection. All nodes should behave in the same way, so AFAIK the same IP should be selected on each of them.
Hmm. d3fycapp and lehmccp7 were the servers that were not contacted in the failed upload, and they have IPs 127.0.0.1 and 192.168.0.1 respectively. But wo6akhxt also has IP 127.0.0.1 and that was contacted in the failed upload, so that might be a red herring.
I don't know why the determination of which IP to use is nondeterministic. warner?
BTW, I have seen nondeterministic choice of IP on the Least Authority Enterprises servers (which are EC2 instances running Ubuntu) as well.
Replying to davidsarah:
Note that the assigned IPs are not always the same: a restart of all storage nodes reshuffles them. My last failed attempt was with d3fycapp and lehmccp7 both seen on 10.8.0.10 (the VPN IP) by the client node (5 nodes were seen on 10.8.0.10, with ughwvrtu on 192.168.0.1 and the client (omkzwfx5) on loopback). It seems the IP addresses used don't change the error: I've always seen the same thing (only 5 servers queried) since the problem appeared.
The problem may not have anything to do with IP address choices, but it seems to me these inconsistencies are odd enough to keep in mind.
Please add the following just after line 225 of src/allmydata/immutable/upload.py in 1.9.2 (source:1.9.2/src/allmydata/immutable/upload.py), i.e. after the "readonly_servers = ..." assignment and before the "# decide upon the renewal/cancel secrets" comment, and then show the log for a failing upload.
(You need to restart the gateway after changing the code, but it's not necessary to rebuild it.)
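A plausible diagnostic of the kind being asked for might look like the following. This is a hypothetical sketch, not the statement originally posted; it assumes the local names all_servers, writeable_servers and readonly_servers that exist around that point of Tahoe2ServerSelector.get_shareholders in 1.9.2, a get_name() method on the server objects, and the allmydata.util.log helper:

    # Hypothetical sketch (not the statement from the ticket): log which servers
    # the selector is considering, right after readonly_servers is computed.
    from allmydata.util import log
    log.msg("server selection candidates: all=%r writeable=%r readonly=%r"
            % ([s.get_name() for s in all_servers],
               [s.get_name() for s in writeable_servers],
               [s.get_name() for s in readonly_servers]))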
Replying to davidsarah:
I may not have done it right: I got the same output, with this at the end:
BUT... I may have a lead from looking at the last error message in my original log dump.
server selection unsuccessful for : shares could be placed on only 5 server(s) [...], merged=sh0: zp6jpfeu, sh1: pa2myijh, sh2: pa2myijh, sh3: omkzwfx5, sh4: wo6akhxt, sh5: ughwvrtu
I assume the shN entries are the shares to be placed. sh1 and sh2 were assigned to pa2myijh. I'm not sure whether this distribution is the result of share detection (my guess) or of a share placement algorithm that could produce an invalid placement and needs a check before upload (late error detection isn't good practice, so I bet that's not the case).
What if these shares were already stored on pa2myijh before the upload attempt (due to past uploads with a buggy version, or whatever else happened in the store directory outside of Tahoe's control)? Is the code able to detect such a case and re-upload one of the two shares to a free server (one without any of the 6 shares)? If not, it might be the cause of my problem (the file was part of a long list of files I tried to upload with only partial success weeks ago...), and my storage nodes are most probably polluted by "dangling" shares.
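Here is how I read the "merged" placement from the error above; a small sketch, assuming happiness is counted as the number of distinct servers holding at least one share (more generally, as I understand it, the size of a maximum matching between shares and servers):

    # Sketch, not Tahoe-LAFS code: the "merged" share -> server placement
    # reported in the error message above.
    merged = {
        "sh0": "zp6jpfeu", "sh1": "pa2myijh", "sh2": "pa2myijh",
        "sh3": "omkzwfx5", "sh4": "wo6akhxt", "sh5": "ughwvrtu",
    }
    # Each share sits on exactly one server, so happiness reduces to counting
    # the distinct servers involved.
    happiness = len(set(merged.values()))
    print(happiness)  # 5, which is below shares.happy = 6, hence the error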
There was a bug in the statement I asked you to add; please replace it entirely with this one:
In answer to your last question, the fact that there are existing shares should not cause an UploadUnhappinessError. However, bugs #1124 and #1130 describe cases where we don't achieve that. I don't think that your problem is due to that, though, because it seems from the logs that the gateway is not contacting enough servers to make it possible to achieve happiness, regardless of the existing share distribution. [More precisely, it isn't receiving responses from enough servers. At this point we're not sure whether it is contacting them, although the "sent 5 queries to 5 servers" in the UploadUnhappinessError message suggests that it isn't.]
Edit: I'm hoping that this bug is the same one that has been occasionally reported on VolunteerGrid2 with uploads where shares.happy is close to shares.total (and to the number of servers). It has very similar symptoms, but gyver seems to be able to reproduce it more easily.
Replying to davidsarah:
Here's the log with a bit of context:
Replying to gyver (comment:12):
OK, that proves that the problem occurs after deciding which servers are writeable. We seem to be logging only responses to remote allocate_buckets requests at the gateway, so the next steps are:
a) Log when the gateway sends an allocate_buckets request.
b) Look at the logs of the storage servers to see how many of them receive an allocate_buckets request (which is logged at source:1.9.2/src/allmydata/storage/server.py#L248 as "storage: allocate_buckets <SI>") and what they do about it.
To do a), add this at line 105 of src/allmydata/immutable/upload.py (in the query method of ServerTracker, after rref = self._server.get_rref()):
Same bug as #2016?
From #2016 which has now been marked as a duplicate:
daira wrote (//trac/tahoe-lafs/ticket/2016#comment:89311):
daira wrote (//trac/tahoe-lafs/ticket/2016#comment:89312):
zooko replied (//trac/tahoe-lafs/ticket/2016#comment:89313):
kapiteined wrote (//trac/tahoe-lafs/ticket/2016#comment:89315):
daira wrote (//trac/tahoe-lafs/ticket/2016#comment:89316):
daira wrote (//trac/tahoe-lafs/ticket/2016#comment:9):
Replying to myself in comment:89320:
I was wrong here; it is quite similar to #1130, which also has h = N. (#1130 has some additional oddities in the share distribution that was chosen, but I don't think they're relevant.) The fact that we terminate the distribution algorithm as soon as all shares are placed is the underlying problem in all these cases.
So, the branch from #1382 will fix this bug.
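To make the failure mode described above concrete, here is a toy sketch, not Tahoe's actual placement code, of a placer that keeps already-found shares where they are and then stops as soon as every share has some home, without checking that the homes are distinct (server names from this ticket are reused purely as labels):

    # Toy illustration of stopping "as soon as all shares are placed": the
    # result can fall below shares.happy even though enough servers exist for a
    # one-share-per-server placement.
    def greedy_place(num_shares, existing, servers):
        placement = dict(existing)          # shares already found on servers
        free = iter(servers)
        for sh in range(num_shares):
            if sh not in placement:
                placement[sh] = next(free)  # no check for distinct servers
        return placement

    existing = {1: "pa2myijh", 2: "pa2myijh"}   # two shares already on one server
    servers = ["zp6jpfeu", "pa2myijh", "omkzwfx5", "wo6akhxt",
               "ughwvrtu", "d3fycapp", "lehmccp7"]
    placement = greedy_place(6, existing, servers)
    print(len(set(placement.values())))  # 4 distinct servers, below happy = 6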