Improve google search results for phrases like "tahoe file storage" #1719

Open
opened 2012-04-13 07:10:58 +00:00 by amiller · 4 comments

Tahoe-LAFS could benefit from some SEO.

If you search for "tahoe lafs", the first result is tahoe-lafs.org - straight to where you'd expect. However, if you search for ["tahoe secure file storage"](https://encrypted.google.com/search?hl=en&q=tahoe%20secure%20file%20storage), ["tahoe secure"](https://encrypted.google.com/search?hl=en&q=tahoe%20secure), or other reasonable phrases (omitting "lafs"), the results are much less useful. The [pycon talk notes](https://tahoe-lafs.org/~warner/pycon-tahoe.html) tend to show up as the first result; at least they're filled with allmydata.org links that correctly redirect to <https://tahoe-lafs.org>.

```
<zooko> I think we may be telling google not to index any of https://tahoe-lafs.org
        with our robots.txt, which would be the first thing to change for that.
<zooko> There might be a ticket about the terrible anti-SEO.
```

Beyond that, perhaps by helping web crawlers access the site, we can benefit from the external search engines when searching for tickets, code, etc. (See #1691 for trac search delays)

amiller added the website, normal, defect, n/a labels 2012-04-13 07:10:58 +00:00
amiller added this to the undecided milestone 2012-04-13 07:10:58 +00:00

I was wrong about robots.txt. https://tahoe-lafs.org/robots.txt currently says:

```
User-agent: *
Disallow: /trac/
Allow: /trac/tahoe-lafs/wiki/
Disallow: /source/
Disallow: /darcs.cgi/
Disallow: /buildbot
Crawl-Delay: 30
```

That ought to allow search engines to index the wiki, I think. I don't know what else is needed to get search engines to give useful results to people making those sorts of searches.
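As a sanity check, here is a minimal sketch (Python 3, stdlib `urllib.robotparser`) testing what those rules permit. One caveat: the stdlib parser applies rules in file order (first match wins), whereas Google matches the longest path, so the two can disagree about the `Allow` line here.

```python
# Sketch: test the quoted robots.txt with the stdlib parser.
# Caveat: urllib.robotparser applies rules in file order, so the
# earlier "Disallow: /trac/" shadows "Allow: /trac/tahoe-lafs/wiki/";
# Google's longest-match semantics would allow the wiki instead.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /trac/
Allow: /trac/tahoe-lafs/wiki/
Disallow: /source/
Disallow: /darcs.cgi/
Disallow: /buildbot
Crawl-Delay: 30
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for path in ("/", "/trac/tahoe-lafs/wiki/WikiStart",
             "/trac/tahoe-lafs/ticket/1719"):
    url = "https://tahoe-lafs.org" + path
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{path:40s} {verdict}")
```

Under Google's longest-match rule the wiki `Allow` should win; under the stdlib's first-match rule the earlier `Disallow: /trac/` shadows it, which is worth knowing before trusting any one checker.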


Some of our content, such as https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/about.rst, is served up directly from the trac source browser. To let that stuff be indexed, at Tony Arcieri's suggestion, I removed the exclusion of trac from robots.txt. It now looks like this:

```
User-agent: *
Disallow: /source/
Disallow: /buildbot-tahoe-lafs
Disallow: /buildbot-zfec
Disallow: /buildbot-pycryptopp
Crawl-Delay: 60
```

This might impose too much CPU and disk-IO load on our server. We'll see.


Brian pointed out that this might also clobber the trac.db, which contains cached information from darcs. Specifically, it caches the "annotate" results (a.k.a. "blame") from darcs. I don't know if it caches anything else.

It currently looks like this:

```
-rw-rw-r--  1 trac source 408165376 2012-05-09 19:13 trac.db
```

But "annotate"/"blame" has been broken ever since I upgraded the darcs executable from v2.5 to v2.8, so maybe nothing will get cached.


Looking at the HTTP logs, I'm seeing hits with the Googlebot UA happening a lot faster than every 60 seconds, e.g. 18 hits in a 4-minute period. The "Crawl-Delay" line wasn't changed, though, so I'm wondering if maybe that's the wrong field name.
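For what it's worth, Googlebot is generally understood to ignore the non-standard Crawl-delay directive altogether (Google's crawl rate was adjusted through Webmaster Tools at the time), which would explain the spacing better than a misspelled field name. Here is a minimal sketch (Python 3) for measuring the actual Googlebot request spacing from an Apache-style access log; the log path and combined-log timestamp format are assumptions.

```python
# Sketch: compute gaps between consecutive Googlebot requests from an
# Apache-style access log. Log path and timestamp format (combined
# log format, e.g. [31/May/2012:10:15:32 +0000]) are assumptions.
import re
from datetime import datetime

TS = re.compile(r'\[([^\]]+)\]')

times = []
with open("/var/log/apache2/access.log") as log:  # hypothetical path
    for line in log:
        if "Googlebot" not in line:
            continue
        m = TS.search(line)
        if m:
            times.append(datetime.strptime(
                m.group(1), "%d/%b/%Y:%H:%M:%S %z"))

gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
if gaps:
    print(f"{len(times)} Googlebot hits; "
          f"min gap {min(gaps):.0f}s, "
          f"median gap {sorted(gaps)[len(gaps) // 2]:.0f}s")
```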

The site **feels** slower than it did a few months ago, but I don't have any measurements to support that impression.

The trac.db file is at 567MB today (2012-05-31), up from 408MB three weeks ago.

Reference: tahoe-lafs/trac-2024-07-25#1719