scale up to many nodes #235
Reference: tahoe-lafs/trac-2024-07-25#235
I updated [the UseCases page](wiki/UseCases) to reflect that someone might want to run a managed Tahoe grid comprising one thousand nodes. (If each node has a single 1 TB hard drive, that's a 1 PB grid. There are lots of other options, such as each node having six 1 TB hard drives in a RAID-6 configuration, which gives 4 TB usable per node and therefore a 4 PB grid.)
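The capacity figures above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, decimal units (1 PB = 1000 TB) are assumed, and the result is raw disk capacity that ignores any encoding or filesystem overhead.

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of one RAID-6 array; two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID-6 needs at least 4 drives")
    return (drives - 2) * drive_tb


def grid_capacity_pb(nodes: int, usable_tb_per_node: float) -> float:
    """Raw grid capacity in PB, assuming decimal units (1 PB = 1000 TB)."""
    return nodes * usable_tb_per_node / 1000


# One 1 TB drive per node, 1000 nodes.
single_drive_pb = grid_capacity_pb(1000, 1.0)

# Six 1 TB drives per node in RAID-6, 1000 nodes.
raid6_pb = grid_capacity_pb(1000, raid6_usable_tb(6, 1.0))

print(single_drive_pb, raid6_pb)
```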
Anyway, we expect that the current Tahoe grid would have problems handling many more simultaneously connected nodes. One known problem is that pyOpenSSL uses almost 1 MB of RAM per SSL connection. (See also #11.)
This ticket can be closed when Tahoe is demonstrated to handle one thousand simultaneously connected nodes smoothly.
From ticket:872#comment:16:
Note that this shouldn't be needed in order to scale to 1000 nodes -- the size of the location and public key info for 1000 nodes should easily be small enough to fit into memory. Do we need another ticket for scaling to grids with hundreds of thousands of nodes, or am I being too prematurely ambitious? :-)
There are a number of hurdles to scaling up to lots of nodes. This ticket is sort of a reminder to enumerate some of them, and to record known limitations and potential solutions.
The limitation alluded to in the summary is probably the first hurdle: Foolscap, at least, has been observed (as of the creation date of this ticket, some 2 years ago) to consume an unreasonable amount of RAM, approximately 1 MB per open connection. I seem to remember doing some analysis and deciding that pyOpenSSL was to blame, rather than Foolscap, but that was a long time ago and the tests should be run again before putting too much energy into it. There's no good reason for it to use this much memory: the connection state and buffers should really fit into a couple of kilobytes.
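To put the per-connection figure in perspective, here is a rough estimate of the RAM one node would spend on connections in a 1000-node grid. It is a sketch under stated assumptions, not a measurement: it assumes every node holds one open connection to every other node, takes the roughly 1 MB figure at face value, and reads "a couple of kilobytes" as 2 KiB. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def connection_ram_per_node(nodes: int, bytes_per_connection: int) -> int:
    """RAM one node spends on connections, assuming a connection to every other node."""
    return (nodes - 1) * bytes_per_connection


# Observed cost: roughly 1 MB per open connection.
observed = connection_ram_per_node(1000, 1 * MIB)

# Hoped-for cost: assume 2 KiB of state and buffers per connection.
target = connection_ram_per_node(1000, 2 * KIB)

print(f"observed: {observed / MIB:.0f} MiB per node")
print(f"target:   {target / MIB:.2f} MiB per node")
```

Under these assumptions the observed cost is close to 1 GiB per node, while the couple-of-kilobytes target would come to about 2 MiB.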
The next hurdle will be the current practice of maintaining open connections to all known storage servers. If we left the protocols alone, we could change this to open connections on demand, but that would incur a significant per-file latency (for both upload and download), and of course things like file-check and mutable-file publish would become very slow because both want to query lots of servers. So changing the peer-selection protocols would probably be necessary to effectively remove this limitation.
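The connect-on-demand idea can be illustrated with a small bounded connection cache. This is a hypothetical sketch, not Tahoe or Foolscap code: `OnDemandConnections`, `connect`, `close`, and `max_open` are invented names, and the "connections" in the demo are plain strings. It shows where the latency cost comes from: every cache miss pays for a fresh connection setup.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

C = TypeVar("C")


class OnDemandConnections(Generic[C]):
    """Keep at most max_open connections, opening others only when requested."""

    def __init__(self, connect: Callable[[str], C],
                 close: Callable[[C], None], max_open: int) -> None:
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self._connect = connect      # hypothetical: opens a connection to a server
        self._close = close          # hypothetical: tears a connection down
        self._max_open = max_open
        self._open: "OrderedDict[str, C]" = OrderedDict()
        self.connects = 0            # number of connection setups paid so far

    def get(self, server_id: str) -> C:
        if server_id in self._open:
            # Cache hit: mark as most recently used, no setup latency.
            self._open.move_to_end(server_id)
            return self._open[server_id]
        # Cache miss: this is where the per-file latency would be incurred.
        conn = self._connect(server_id)
        self.connects += 1
        self._open[server_id] = conn
        if len(self._open) > self._max_open:
            # Evict the least recently used connection.
            _, oldest = self._open.popitem(last=False)
            self._close(oldest)
        return conn


closed = []
pool = OnDemandConnections(connect=lambda sid: f"conn-to-{sid}",
                           close=closed.append, max_open=2)
for sid in ["a", "b", "a", "c", "b"]:
    pool.get(sid)
print(pool.connects, closed)
```

An operation that touches many servers in quick succession, like file-check, would miss the cache on most of them, which is why a bounded pool alone does not remove the limitation.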
If you like this ticket, you might like #444 (reduce number of active connections: connect on demand).