Posted to dev@uima.apache.org by Richard Eckart de Castilho <re...@apache.org> on 2016/04/06 22:43:13 UTC

Avoid indexing of old UIMA documentation

Hi all,

I believe some time back we were talking about a strategy to avoid search engines pointing to ancient versions of the UIMA documentation.

I have read a bit on rel="canonical" and robots.txt.

1) per web page - Apparently, one can place a `<link rel="canonical">` element in the head of any HTML page. Search engines that see this tag treat the page as a duplicate of whatever page the link points to and index that canonical URL instead.
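For example, a page in an old documentation tree could declare its current counterpart roughly like this (just a sketch, the paths are invented):

  <!-- in the <head> of e.g. /d/uimaj-2.4.0/somepage.html (hypothetical) -->
  <link rel="canonical" href="http://uima.apache.org/d/uimaj-current/somepage.html"/>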

2) via HTTP header/htaccess - Since we probably don't want to patch up all our JavaDoc files, the canonical URL can also be announced in an HTTP Link header, e.g. via a suitable .htaccess file.
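A minimal sketch of what that could look like with Apache's mod_headers, dropped into an old documentation folder (untested; the target URL is invented):

  # .htaccess in an old documentation folder (needs mod_headers and
  # AllowOverride FileInfo). Naive: points every page at one current URL;
  # a proper setup would map each old page to its current counterpart.
  <IfModule mod_headers.c>
    Header add Link "<http://uima.apache.org/d/uimaj-current/index.html>; rel=canonical"
  </IfModule>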

I guess the idea would be that every old documentation page (not only the index page) points to its latest version as its canonical source. That seems a bit tedious to set up.

My suggestion would be an alternative that exploits the website folder structure and uses robots.txt.

We disallow indexing of the "d" folder on the UIMA website.
We place all the "*-current" folders (svn copies of the latest documentation versions) under a dedicated folder (e.g. "d/current") and allow indexing of that folder.

In that way, the outdated versions of the documentation should be hidden from the search engines and the respective latest versions should be indexed.
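In robots.txt terms, that would be roughly (sketch; the exact path prefix would need checking):

  # sketch for http://uima.apache.org/robots.txt
  User-agent: *
  Disallow: /d/
  Allow: /d/current/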

Opinions? Does anybody have experience with SEO?

Cheers,

-- Richard


Re: Avoid indexing of old UIMA documentation

Posted by Marshall Schor <ms...@schor.com>.
+1 -Marshall



Re: Avoid indexing of old UIMA documentation

Posted by Richard Eckart de Castilho <re...@apache.org>.
We could try this:

--- 

# robots.txt for http://uima.apache.org

User-agent: *
Disallow: /docs/d/
Allow: /docs/d/ruta-current/
Allow: /docs/d/uima-addons-current/
Allow: /docs/d/uima-as-current/
Allow: /docs/d/uima-ducc-current/
Allow: /docs/d/uimacpp-current/
Allow: /docs/d/uimafit-current/
Allow: /docs/d/uimaj-current/

---

Sources on the net say that "Allow" was not part of the original robots.txt
specification, so with the above, some search engines might stop indexing the
docs altogether. We might want to restrict the user-agent to "googlebot",
which does understand "Allow".
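If that is a concern, one option might be to give the Allow rules only to
googlebot and keep a simple blanket rule for everyone else, e.g. (untested):

  User-agent: googlebot
  Disallow: /docs/d/
  Allow: /docs/d/uimaj-current/
  # ...one Allow line per *-current folder, as in the list above

  User-agent: *
  Disallow: /docs/d/

The downside would be that other engines then skip even the current docs
under /docs/d/.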

Also, not all of the documentation sets use the "*-current" trick yet, but
that is easy to fix.

Cheers,

-- Richard



Re: Avoid indexing of old UIMA documentation

Posted by Richard Eckart de Castilho <re...@apache.org>.
We can just disallow /d and then allow all the *-current folders
under it explicitly. The only difference I see is that we'd have
a couple more entries in the robots.txt.

-- Richard



Re: Avoid indexing of old UIMA documentation

Posted by Marshall Schor <ms...@schor.com>.
Hi,

This sounds like a good idea to me :-)

There's possibly one small issue with changing the folder structure.  The DocBook
setup has some fancy way to link between docbooks; these links require that the
books be kept relative to one another in a fixed file tree structure.  As long as
that's not changed, I think there will be no problem.

If anyone's curious, the relevant bits of config info are in the
uima-docbook-olink project, in the various "site.xml" files.  You can see
references to the famous "d" folder there.  There may be a dependency on the
"books" being just one directory layer under d/, so adding an extra layer
might break things (but I'm not sure...).
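For illustration, an olink target database of that kind mirrors the directory
layout in nested <dir> elements, very roughly like this (generic shape from the
DocBook XSL docs, not our actual site.xml; the names are made up):

  <targetset>
    <sitemap>
      <dir name="d">
        <dir name="uimaj-current">
          <document targetdoc="tools" baseuri="tools.html">
            <!-- generated olink target data for that book goes here -->
          </document>
        </dir>
      </dir>
    </sitemap>
  </targetset>

The <dir> nesting is what encodes where each book sits relative to d/, which is
presumably why an extra directory layer could matter.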

Maybe there's a way to do this without introducing a new level in the directory structure?

-Marshall
