'rm' via sftp+sshfs may hang if previous operations on the file are "stuck" #1201
I was trying to figure out which part of the bonnie++ benchmark was failing on tahoe (via sshfs), so I mounted the sshfs interface and ran:
It hung for a while after "start 'em". The flog showed no activity, so I hit Ctrl-C, which gave me those two errors, and then the interface hung.
I had to 'kill -9' sshfs to get everything to free up. I then re-mounted the sshfs interface and did a:
And it hung on the 'rm'.
I had flogtool running for the 'rm'; its output is below.
After restarting the tahoe client process and reconnecting via sshfs, I was able to delete the file.
The most likely explanation is that the .sync on
<GeneralSFTPFile #7>(/Bonnie.3520)
is hanging because some previous operation on that file object failed to complete. By design, operations that would be synchronous in a POSIX filesystem (such as unlink/delete) wait for previous operations on the same file, which is necessary to avoid race conditions. (Otherwise, create-followed-by-delete could be reordered to delete-followed-by-create, which would leave the file in place.)

When an SFTP client connection is closed, we may still have operations that were performed on that connection that need to complete, so we don't just drop open handles to a file at that point. However, there is no timeout on how long an operation may take, so an operation that never completes can leave a file "stuck" and undeletable until the gateway is restarted. This seems much more likely to happen when the sshfs process is killed (even though the server should be able to tolerate that). I'll have to have a good think about how to resolve this; it isn't obvious how to do it without risking integrity.
(The calls to .abandon mark the file handle as being immediately closeable, but in this case it isn't closed before the .sync, so that doesn't help.)
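To make the serialization described above concrete, here is a minimal sketch, using hypothetical names (SerializedFile, _do_write) rather than the actual sftpd.py code: every operation on a file object is queued behind its predecessors, which preserves ordering but lets one stuck operation wedge everything after it, including the sync that a delete waits on.

```python
# Minimal sketch (hypothetical names, not the actual sftpd.py code) of the
# per-file serialization described above. DeferredLock.run() queues callers
# in FIFO order, so a create can never be reordered after a delete -- but if
# one queued operation's Deferred never fires, everything behind it hangs.

from twisted.internet import defer

def _do_write(data):
    # Stand-in for the real asynchronous write; here it fires immediately.
    return defer.succeed(len(data))

class SerializedFile(object):
    def __init__(self):
        self._lock = defer.DeferredLock()  # FIFO queue of pending operations

    def write(self, data):
        # Runs only after every earlier operation on this file completes.
        return self._lock.run(_do_write, data)

    def sync(self):
        # A no-op once it runs, but its Deferred fires only after all
        # earlier operations have fired; a "stuck" write blocks it forever,
        # and the unlink that waits on this sync hangs with it.
        return self._lock.run(lambda: None)
```

With this structure, the only ways to unwedge a stuck file are to time out the offending operation or to drop the queue, and the second risks exactly the reordering the queue exists to prevent.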
It's a bit strange that there are so many copies of the same file object in the heisenfile dicts (shown after 'files =' in line 6 of the log). There should normally be no more than two per open handle to a file.

(Title changed from 'rm' via sftp+sshfs to 'rm' via sftp+sshfs may hang if previous operations on the file are "stuck".)

Ah, it's all coming back to me now. I'd originally intended to close any open files on an SFTP connection when the connection was dropped (as it is when the sshfs process is killed), but then realized that this is incorrect and might lead to data loss. So that explains why the comment at source:src/allmydata/frontends/sftpd.py@4545#L837 (which claims that this bug can't happen!) is incorrect.
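One direction the "no timeout" observation above suggests, offered only as a hedged sketch and not as what sftpd.py actually does: bound how long a delete will wait for earlier operations, so a stuck file yields an error instead of a hang. This uses Deferred.addTimeout from modern Twisted (16.5+), and the timeout value is an assumption, not something from this ticket.

```python
# Hedged sketch of one possible mitigation (not what sftpd.py does): cap how
# long the pre-unlink sync may wait. If earlier operations are stuck, the
# delete fails with a TimeoutError instead of hanging the client forever;
# the file is left in place, so integrity is preserved at the cost of an
# error the client must handle.

from twisted.internet import reactor

SYNC_TIMEOUT = 60  # seconds -- an assumed value, not taken from this ticket

def sync_then_unlink(file_obj, do_unlink):
    d = file_obj.sync()                   # waits for all earlier operations
    d.addTimeout(SYNC_TIMEOUT, reactor)   # errback with TimeoutError if stuck
    d.addCallback(lambda _: do_unlink())  # only unlink after a clean sync
    return d
```

The trade-off is visible in the comments: integrity is kept (the file is never half-deleted), but a client that expected POSIX-like blocking semantics now has to cope with a timeout error.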
Attachment stra2 (5402993 bytes) added
strace of the bonnie hang
Attachment stra4 (7787 bytes) added
strace of 'rm' hang
OK, I posted the straces. After I 'kill -9' sshfs, the rest of the traces come out:
bonnie:
and rm: