logging: compress/truncate/encode/decode/format for human readability and privacy #57

Open
opened 2007-05-25 20:02:53 +00:00 by zooko · 4 comments

Mojo Nation's and Mnet's logging system had a humanreadable function that would be used to compress, truncate, encode, decode and format data before it was interpolated into the format string. For example, 20-byte binary identifiers can be represented as "<abcde>", and 64,000 byte strings can be represented as "abcdefg ...63,986 bytes omitted... 4567890".

This humanreadability scheme derived from the Python Standard Library's repr module.

One might ask: why should humanreadability be integrated into the logging system? Why shouldn't people write:

log("We received this string: %s, but it wasn't a substring of %s as we expected.", humanreadable(thisstring), humanreadable(thatstring))

The answer is found in ticket #56 -- laziness. Also in the fact that you sometimes want to log an object, and you don't know if that object should be turned into a humanreadable form by calling .str() on it or by passing it to humanreadable().

Although perhaps this could be both lazy and humanreadable by dint of some sort of lazy Adaptor pattern?

Mojo Nation's and Mnet's logging system had a `humanreadable` function that would be used to compress, truncate, encode, decode and format data before it was interpolated into the format string. For example, 20-byte binary identifiers can be represented as "`<abcde>`", and 64,000 byte strings can be represented as "`abcdefg ...63,986 bytes omitted... 4567890`". This humanreadability scheme derived from the Python Standard Library's `repr` module. One might ask: why should humanreadability be integrated into the logging system? Why shouldn't people write: ``` log("We received this string: %s, but it wasn't a substring of %s as we expected.", humanreadable(thisstring), humanreadable(thatstring)) ``` The answer is found in ticket #56 -- laziness. Also in the fact that you sometimes want to log an object, and you don't know if that object should be turned into a humanreadable form by calling .*str*() on it or by passing it to humanreadable(). Although perhaps this could be both lazy and humanreadable by dint of some sort of lazy Adaptor pattern?
zooko added the
code
minor
enhancement
0.2.0
labels 2007-05-25 20:02:53 +00:00
warner was assigned by zooko 2007-05-25 20:02:53 +00:00
warner added
code-nodeadmin
and removed
code
labels 2007-08-14 18:57:48 +00:00
warner added this to the undecided milestone 2007-10-10 23:17:19 +00:00
Author

Another feature that this could implement is hiding private information from logs.

Capabilities, for example, could be securely hashed, or even just truncated, so that the logs don't contain the authorities to a bunch of stuff. (Actually the truncation for human-readability might well accomplish this already.)

Another feature that this could implement is hiding private information from logs. Capabilities, for example, could be securely hashed, or even just truncated, so that the logs don't contain the authorities to a bunch of stuff. (Actually the truncation for human-readability might well accomplish this already.)

the foolscap logging work I've been doing this week is leading me to log calls that look like the following:

log.msg("beginning upload of SI=%s: %d peers available", storage_index, num_peers)

or maybe

log.msg("beginning upload of SI=%(si)s: %(numpeers)d peers available", si=storage_index, numpeers=num_peers)

(which is easier to implement but harder to use).

I want to keep the machine-readable data in the log events, so e.g. I want to record the num_peers integer rather than mostly-irrevocably squashing it into a string right away (which is useful for a human but more of a hassle for a log viewer that wants to apply a filter to show all begin-upload messages for which num_peers<5).

OTOH, I don't want to have the log events retain references to large or mutable objects. What I'm planning is to pickle a dictionary that includes the format string and the arguments.

Anyways, this gives us the opportunity to ask those objects how they want to be represented, and we could apply a humanreadable()-type transformation on them at that time. Note that this could occur well after the log message was first recorded: for example, a post-mortem log display tool might apply humanreadable() only if the message is actually being displayed to a human. Many uses of the logfiles will display only a tiny fraction of the log events to a human, so doing lazy conversion seems to be a good idea.

I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (log.msg("callback fired: %s %s", args, kwargs), in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them.

Hmm. If I use foolscap serialization when recording these arguments, I get more control over what Slicers get used for each object. We could set up a context for that serialization that says "ask each object how it wants to be serialized, but remind them that we don't want to record any private information".

I still can't really see how to use that easily or safely, though.

the foolscap logging work I've been doing this week is leading me to log calls that look like the following: ``` log.msg("beginning upload of SI=%s: %d peers available", storage_index, num_peers) ``` or maybe ``` log.msg("beginning upload of SI=%(si)s: %(numpeers)d peers available", si=storage_index, numpeers=num_peers) ``` (which is easier to implement but harder to use). I want to keep the machine-readable data in the log events, so e.g. I want to record the num_peers integer rather than mostly-irrevocably squashing it into a string right away (which is useful for a human but more of a hassle for a log viewer that wants to apply a filter to show all begin-upload messages for which num_peers<5). OTOH, I don't want to have the log events retain references to large or mutable objects. What I'm planning is to pickle a dictionary that includes the format string and the arguments. Anyways, this gives us the opportunity to ask those objects how they want to be represented, and we could apply a humanreadable()-type transformation on them at that time. Note that this could occur well after the log message was first recorded: for example, a post-mortem log display tool might apply humanreadable() only if the message is actually being displayed to a human. Many uses of the logfiles will display only a tiny fraction of the log events to a human, so doing lazy conversion seems to be a good idea. I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (`log.msg("callback fired: %s %s", args, kwargs)`, in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them. Hmm. If I use foolscap serialization when recording these arguments, I get more control over what Slicers get used for each object. We could set up a context for that serialization that says "ask each object how it wants to be serialized, but remind them that we don't want to record any private information". I still can't really see how to use that easily or safely, though.
warner modified the milestone from eventually to undecided 2008-06-01 20:53:24 +00:00
davidsarah commented 2009-12-13 03:26:47 +00:00
Owner

Replying to warner:

I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (log.msg("callback fired: %s %s", args, kwargs), in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them.

I'm also skeptical, but adding the privacy keyword anyway.

Replying to [warner](/tahoe-lafs/trac-2024-07-25/issues/57#issuecomment-60300): > I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (`log.msg("callback fired: %s %s", args, kwargs)`, in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them. I'm also skeptical, but adding the privacy keyword anyway.
Author

There are three potentially private types of data in logs: caps, filenames, and IP addresses. I've had one user decline, at first, to send me his logs, and when I persuaded him that we weren't logging any filenames or caps then he went and ran some sort of sed script on the twistd.log to censor his IP address, and then when he realized that he couldn't use sed to do that to the foolscap logs (they are in a binary pickle format) then he went ahead and sent the foolscap logs even though they had his IP address in them.

I'm not sure what the moral of the story is, other than that there are only three specific types that (as far as I know) anybody cares to withhold when sharing their logs.

There are three potentially private types of data in logs: caps, filenames, and IP addresses. I've had one user decline, at first, to send me his logs, and when I persuaded him that we weren't logging any filenames or caps then he went and ran some sort of sed script on the `twistd.log` to censor his IP address, and then when he realized that he couldn't use sed to do that to the foolscap logs (they are in a binary pickle format) then he went ahead and sent the foolscap logs even though they had his IP address in them. I'm not sure what the moral of the story is, other than that there are only three specific types that (as far as I know) anybody cares to withhold when sharing their logs.
tahoe-lafs changed title from logging: compress/truncate/encode/decode/format for human readability to logging: compress/truncate/encode/decode/format for human readability and privacy 2012-02-16 16:30:01 +00:00
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Reference: tahoe-lafs/trac-2024-07-25#57
No description provided.