WUI server should have a disallow-all robots.txt #823

Open
opened 2009-11-01 01:13:48 +00:00 by davidsarah · 6 comments
davidsarah commented 2009-11-01 01:13:48 +00:00
Owner

Currently, if a web crawler gains access to a Tahoe WUI gateway server, it will crawl all reachable links. This is probably undesirable, or at least not a sensible default (even though it is understood that `robots.txt` is not meant as a security mechanism).

WUI servers should serve a disallow-all `robots.txt`:

```
User-agent: *
Disallow: /
```

The robots.txt specification is at <http://www.robotstxt.org/orig.html>.
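Python's standard library includes a parser for this format, so the proposed policy can be checked mechanically. The sketch below is illustrative only: the handler class and the example cap URL are made up for the sketch and are not Tahoe's actual web machinery; it shows a disallow-all `robots.txt` being served from the gateway root and verifies with `urllib.robotparser` that the policy excludes every path, including cap URLs.

```python
# Sketch (hypothetical handler, not Tahoe's real WUI code): serve a
# disallow-all robots.txt from the gateway root, then verify the policy.
from http.server import BaseHTTPRequestHandler
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"

class GatewayHandler(BaseHTTPRequestHandler):
    """Illustrative handler: answer /robots.txt before any other route."""

    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(ROBOTS_TXT)))
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            self.send_error(404)

# Check with the stdlib parser that "Disallow: /" excludes everything,
# including a (made-up) directory cap URL.
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.decode().splitlines())
assert not rp.can_fetch("*", "http://127.0.0.1:3456/uri/URI:DIR2:example")
assert not rp.can_fetch("GoogleBot", "http://127.0.0.1:3456/status")
```

Since compliant crawlers fetch `/robots.txt` before anything else, serving this one resource is enough to cover every page the gateway exposes, without per-page changes.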

tahoe-lafs added the code-frontend-web, major, defect, 1.5.0 labels 2009-11-01 01:13:48 +00:00
tahoe-lafs added this to the undecided milestone 2009-11-01 01:13:48 +00:00
davidsarah commented 2009-11-01 01:21:08 +00:00
Author
Owner

On closer examination, the Welcome (root) page only links to statistics pages. OTOH, a directory page might be linked from elsewhere on the web, in which case everything reachable from that directory would be crawled. Anyway, it seems easy to fix.

davidsarah commented 2009-11-01 02:04:17 +00:00
Author
Owner

The Welcome page does include the introducer FURL, which some users might want to keep private as per #562.


I think it is kind of cool that I occasionally find files on a Tahoe-LAFS grid in Google search results.

davidsarah commented 2009-12-20 23:44:08 +00:00
Author
Owner

If you like this bug, you might also like #860.

davidsarah commented 2010-12-26 03:25:41 +00:00
Author
Owner

warner in [/tahoe-lafs/trac-2024-07-25/issues/5189](/tahoe-lafs/trac-2024-07-25/issues/5189)#comment:29 gives another reason to fix this ticket:

> Incidentally, someone told me the other day that any URLs sent through various google products (Google Talk the IM system, Gmail, anything you browse while the Google Toolbar is in your browser) gets spidered and added to the public index. The person couldn't think of any conventions (beyond robots.txt) to convince them to *not* follow those links, but they could think of lots of things to encourage their spider even more.
>
> I plan to do some tests of this (or just ask google's spider to tell me about tests which somebody else has undoubtedly performed already).
>
> I know, I know, it's one of those boiling the ocean things, it's really unfortunate that so many tools are so hostile to the really-convenient idea of secret URLs.


I disagree with "WUI server should have a disallow-all robots.txt". I think if a web crawler gets access to a cap then it *should* crawl and index all the files and directories reachable from that cap. I suppose you can put a `robots.txt` file into a directory in Tahoe-LAFS if you want crawlers to ignore that directory.
Reference: tahoe-lafs/trac-2024-07-25#823