WUI server should have a disallow-all robots.txt #823

Open
opened 2009-11-01 01:13:48 +00:00 by davidsarah · 6 comments
davidsarah commented 2009-11-01 01:13:48 +00:00
Owner

Currently, if a web crawler gains access to a Tahoe WUI gateway server, it will crawl all reachable links. This is probably undesirable, or at least not a sensible default (even though it is understood that `robots.txt` is not meant as a security mechanism).

WUI servers should serve a disallow-all `robots.txt`:

```
User-agent: *
Disallow: /
```

The robots.txt specification is at <http://www.robotstxt.org/orig.html>.
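Python's standard library includes a parser for this format, so the proposed policy can be checked mechanically. The sketch below is illustrative only: the handler class and the example cap URL are made up for the sketch and are not Tahoe's actual web machinery; it shows a disallow-all `robots.txt` being served from the gateway root and verifies with `urllib.robotparser` that the policy excludes every path, including cap URLs.

```python
# Sketch (hypothetical handler, not Tahoe's real WUI code): serve a
# disallow-all robots.txt from the gateway root, then verify the policy.
from http.server import BaseHTTPRequestHandler
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"

class GatewayHandler(BaseHTTPRequestHandler):
    """Illustrative handler: answer /robots.txt before any other route."""

    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(ROBOTS_TXT)))
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            self.send_error(404)

# Check with the stdlib parser that "Disallow: /" excludes everything,
# including a (made-up) directory cap URL.
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.decode().splitlines())
assert not rp.can_fetch("*", "http://127.0.0.1:3456/uri/URI:DIR2:example")
assert not rp.can_fetch("GoogleBot", "http://127.0.0.1:3456/status")
```

Since compliant crawlers fetch `/robots.txt` before anything else, serving this one resource is enough to cover every page the gateway exposes, without per-page changes.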

tahoe-lafs added the code-frontend-web, major, defect, 1.5.0 labels 2009-11-01 01:13:48 +00:00
tahoe-lafs added this to the undecided milestone 2009-11-01 01:13:48 +00:00
davidsarah commented 2009-11-01 01:21:08 +00:00
Author
Owner

On closer examination, the Welcome (root) page only links to statistics pages. OTOH, a directory page might be linked from elsewhere on the web, in which case everything reachable from that directory would be crawled. Anyway, it seems easy to fix.

davidsarah commented 2009-11-01 02:04:17 +00:00
Author
Owner

The Welcome page does include the introducer FURL, which some users might want to keep private as per #562.


I think it is kind of cool that I occasionally find files on a Tahoe-LAFS grid in Google search results.

davidsarah commented 2009-12-20 23:44:08 +00:00
Author
Owner

If you like this bug, you might also like #860.

davidsarah commented 2010-12-26 03:25:41 +00:00
Author
Owner

warner in [/tahoe-lafs/trac-2024-07-25/issues/5189](/tahoe-lafs/trac-2024-07-25/issues/5189)#comment:29 gives another reason to fix this ticket:

> Incidentally, someone told me the other day that any URLs sent through various google products (Google Talk the IM system, Gmail, anything you browse while the Google Toolbar is in your browser) gets spidered and added to the public index. The person couldn't think of any conventions (beyond robots.txt) to convince them to *not* follow those links, but they could think of lots of things to encourage their spider even more.
>
> I plan to do some tests of this (or just ask google's spider to tell me about tests which somebody else has undoubtedly performed already).
>
> I know, I know, it's one of those boiling the ocean things, it's really unfortunate that so many tools are so hostile to the really-convenient idea of secret URLs.


I disagree with "WUI server should have a disallow-all robots.txt". I think if a web crawler gets access to a cap then it *should* crawl and index all the files and directories reachable from that cap. I suppose you can put a `robots.txt` file into a directory in Tahoe-LAFS if you want crawlers to ignore that directory.
Reference: tahoe-lafs/trac-2024-07-25#823