logging: compress/truncate/encode/decode/format for human readability and privacy #57
Labels
No Label
0.2.0
0.3.0
0.4.0
0.5.0
0.5.1
0.6.0
0.6.1
0.7.0
0.8.0
0.9.0
1.0.0
1.1.0
1.10.0
1.10.1
1.10.2
1.10a2
1.11.0
1.12.0
1.12.1
1.13.0
1.14.0
1.15.0
1.15.1
1.2.0
1.3.0
1.4.1
1.5.0
1.6.0
1.6.1
1.7.0
1.7.1
1.7β
1.8.0
1.8.1
1.8.2
1.8.3
1.8β
1.9.0
1.9.0-s3branch
1.9.0a1
1.9.0a2
1.9.0b1
1.9.1
1.9.2
1.9.2a1
LeastAuthority.com automation
blocker
cannot reproduce
cloud-branch
code
code-dirnodes
code-encoding
code-frontend
code-frontend-cli
code-frontend-ftp-sftp
code-frontend-magic-folder
code-frontend-web
code-mutable
code-network
code-nodeadmin
code-peerselection
code-storage
contrib
critical
defect
dev-infrastructure
documentation
duplicate
enhancement
fixed
invalid
major
minor
n/a
normal
operational
packaging
somebody else's problem
supercritical
task
trivial
unknown
was already fixed
website
wontfix
worksforme
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Reference: tahoe-lafs/trac-2024-07-25#57
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Mojo Nation's and Mnet's logging system had a
humanreadable
function that would be used to compress, truncate, encode, decode and format data before it was interpolated into the format string. For example, 20-byte binary identifiers can be represented as "<abcde>
", and 64,000 byte strings can be represented as "abcdefg ...63,986 bytes omitted... 4567890
".This humanreadability scheme derived from the Python Standard Library's
repr
module.One might ask: why should humanreadability be integrated into the logging system? Why shouldn't people write:
The answer is found in ticket #56 -- laziness. Also in the fact that you sometimes want to log an object, and you don't know if that object should be turned into a humanreadable form by calling .str() on it or by passing it to humanreadable().
Although perhaps this could be both lazy and humanreadable by dint of some sort of lazy Adaptor pattern?
Another feature that this could implement is hiding private information from logs.
Capabilities, for example, could be securely hashed, or even just truncated, so that the logs don't contain the authorities to a bunch of stuff. (Actually the truncation for human-readability might well accomplish this already.)
the foolscap logging work I've been doing this week is leading me to log calls that look like the following:
or maybe
(which is easier to implement but harder to use).
I want to keep the machine-readable data in the log events, so e.g. I want to record the num_peers integer rather than mostly-irrevocably squashing it into a string right away (which is useful for a human but more of a hassle for a log viewer that wants to apply a filter to show all begin-upload messages for which num_peers<5).
OTOH, I don't want to have the log events retain references to large or mutable objects. What I'm planning is to pickle a dictionary that includes the format string and the arguments.
Anyways, this gives us the opportunity to ask those objects how they want to be represented, and we could apply a humanreadable()-type transformation on them at that time. Note that this could occur well after the log message was first recorded: for example, a post-mortem log display tool might apply humanreadable() only if the message is actually being displayed to a human. Many uses of the logfiles will display only a tiny fraction of the log events to a human, so doing lazy conversion seems to be a good idea.
I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (
log.msg("callback fired: %s %s", args, kwargs)
, in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them.Hmm. If I use foolscap serialization when recording these arguments, I get more control over what Slicers get used for each object. We could set up a context for that serialization that says "ask each object how it wants to be serialized, but remind them that we don't want to record any private information".
I still can't really see how to use that easily or safely, though.
Replying to warner:
I'm also skeptical, but adding the privacy keyword anyway.
There are three potentially private types of data in logs: caps, filenames, and IP addresses. I've had one user decline, at first, to send me his logs, and when I persuaded him that we weren't logging any filenames or caps then he went and ran some sort of sed script on the
twistd.log
to censor his IP address, and then when he realized that he couldn't use sed to do that to the foolscap logs (they are in a binary pickle format) then he went ahead and sent the foolscap logs even though they had his IP address in them.I'm not sure what the moral of the story is, other than that there are only three specific types that (as far as I know) anybody cares to withhold when sharing their logs.
logging: compress/truncate/encode/decode/format for human readabilityto logging: compress/truncate/encode/decode/format for human readability and privacy