1.19.0 node connection issues. #4097

Open
opened 2024-04-09 17:35:43 +00:00 by tlhonmey · 1 comment
tlhonmey commented 2024-04-09 17:35:43 +00:00
Owner

I recently decided to update my grid. It was running a mix of 1.14, 1.15, and 1.17. I had upgraded one of the nodes to 1.19 and it started complaining about SSL bad certificate issues when trying to communicate with other nodes.

After some discussion with meejah on IRC, it seemed like the best way to deal with the certificate mismatches was to just rebuild the grid, and then copy in the old storage folder.

After rebuilding the grid, things are... strange.

The introducer node can talk to everyone. That's good.
Node No. 1, which is running on the same machine as the introducer, with a different port, can talk to everyone as well. That's good.

All the other nodes in the grid can only talk to one or maybe two different nodes, and that doesn't necessarily include themselves for some reason.

What's more, the helpful connection error report on the web status page has been replaced with opaque stack traces -- without even any line breaks -- like:

failure: [Failure instance: Traceback: <class 'allmydata.util.deferredutil.MultiFailure'>: /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:916:errback /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:984:_startRunCallbacks /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:1078:_runCallbacks /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:1949:_gotResultInlineCallbacks --- <exception caught here> --- /home/annie/.local/lib/python3.12/site-packages/twisted/internet/defer.py:1078:_runCallbacks /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:809:convertCancelled /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:292:_cancelledToTimedOutError /home/user/.local/lib/python3.12/site-packages/twisted/python/failure.py:481:trap /home/user/.local/lib/python3.12/site-packages/twisted/python/failure.py:505:raiseException /home/user/.local/lib/python3.12/site-packages/twisted/internet/defer.py:1999:_inlineCallbacks /home/user/.local/lib/python3.12/site-packages/twisted/python/failure.py:519:throwExceptionIntoGenerator /home/user/.local/lib/python3.12/site-packages/allmydata/storage_client.py:1348:_pick_server_and_get_version /home/user/.local/lib/python3.12/site-packages/allmydata/storage_client.py:1338:get_istorage_server ]
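For anyone trying to read a trace like the one above, a small helper can split the flattened text back into one frame per line. This is only a sketch: it assumes each frame has the form `path.py:line:function` separated by whitespace, as in the paste above, and the names `split_flattened_traceback` and `format_frames` are made up for illustration, not part of Tahoe-LAFS or Twisted.

```python
import re

# Matches one flattened frame of the form "<path>.py:<line>:<function>".
# Assumption: frames are whitespace-separated, as in the pasted failure text.
FRAME_RE = re.compile(r"(\S+\.py):(\d+):(\w+)")


def split_flattened_traceback(text):
    """Return (path, line_number, function) tuples in the order they appear."""
    return [(path, int(line), func) for path, line, func in FRAME_RE.findall(text)]


def format_frames(frames, strip_prefix="site-packages/"):
    """Render one frame per line, dropping everything up to site-packages/."""
    lines = []
    for path, line, func in frames:
        # Keep only the part of the path after the prefix, if it is present.
        _, sep, tail = path.partition(strip_prefix)
        short = tail if sep else path
        lines.append(f"{short}:{line} in {func}")
    return "\n".join(lines)


# The last three frames of the trace above, used as sample input.
sample = (
    "/home/user/.local/lib/python3.12/site-packages/twisted/python/failure.py:519:throwExceptionIntoGenerator "
    "/home/user/.local/lib/python3.12/site-packages/allmydata/storage_client.py:1348:_pick_server_and_get_version "
    "/home/user/.local/lib/python3.12/site-packages/allmydata/storage_client.py:1338:get_istorage_server ]"
)

frames = split_flattened_traceback(sample)
print(format_frames(frames))
```

Pasting the whole `failure:` text into `sample` should work the same way; anything that is not a frame (the `<class ...>` header, the `--- <exception caught here> ---` marker) is simply skipped.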

The stdout of the half-connected nodes contains nothing but messages about factories being started and stopped, with no real indication about why.

Meejah seemed to think this may have something to do with GBS (Great Black Swamp, the HTTP-based storage protocol). I'd be happy to run some diagnostics if there's some way we can coax something useful out of the system.

tahoe-lafs added the unknown, normal, defect, n/a labels 2024-04-09 17:35:43 +00:00
tahoe-lafs added this to the undecided milestone 2024-04-09 17:35:43 +00:00
Owner

One thing to try, in case it's GBS-related or otherwise HTTP-related, would be to turn off GBS. In `tahoe.cfg` you can do this by adding the line `force_foolscap = true` to both the `storage` and `client` sections.
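If there are several nodes to change, the edit could be scripted. The following is a rough sketch, not an official tool: it assumes `tahoe.cfg` parses with Python's standard `configparser`, the function name `force_foolscap` is made up here, and the sample nickname and furl are placeholders. Note that `configparser` drops comments when it writes the file back out, so keep a backup of the original.

```python
import configparser
import io


def force_foolscap(cfg_text):
    """Return config text with force_foolscap = true in [storage] and [client]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    for section in ("storage", "client"):
        # Create the section if the node's config does not have it yet.
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, "force_foolscap", "true")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Placeholder config for illustration; the values are not from a real node.
sample_cfg = """\
[node]
nickname = node2

[client]
introducer.furl = pb://placeholder

[storage]
enabled = true
"""

updated = force_foolscap(sample_cfg)
print(updated)
```

To apply it to a real node, read the node's `tahoe.cfg` into a string, pass it through the function, write the result back, and restart the node.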

Reference: tahoe-lafs/trac-2024-07-25#4097