Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2011/01/27 16:04:21 UTC

Web crawler does not follow the robots meta tag rules

I just figured out that the web crawler does not follow the rules 
defined by the robots meta tag. I created a document with the following tag:
<meta name="robots" content="noindex, nofollow">

This document also has a link to another document, in order to test the 
"nofollow" rule, but both documents were fetched and indexed by Solr.

Should I open a Jira issue about this? I hope it's easy to rewrite the 
crawler in order to add this functionality since this is a blocker for us.

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
OK, both CONNECTORS-157 and CONNECTORS-153 should now be fixed.
Karl

Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
I created a ticket: CONNECTORS-157 to cover the path-resolution issue.
Karl


Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
I was using seed: http://ridder.uio.no
Perhaps that accounts for the difference.  Nevertheless, since
http://ridder.uio.no is fetchable, a fix for that problem is still
needed.  (I did, BTW, try appending a "/" to the URI if the path part
was determined to be null, but that too did not work.)

Karl


Re: Web crawler does not follow the robots meta tag rules

Posted by Erlend Garåsen <e....@usit.uio.no>.
Honestly, I haven't modified any crawler code at all. Are you sure you 
entered a URL with a trailing slash in the seed list? I tried to skip 
that slash, and then the crawler began to act strangely. I cannot 
reproduce your results.

These are my settings:
Seed: http://ridder.uio.no/
Inclusions: ^http://ridder.uio.no/.* (marked "include only host matching 
...")

Everything works like a dream. The only problem I have with the PDF 
document is that the Norwegian characters in it are not parsed 
correctly, but this may be a Tika bug, since all other document formats 
are parsed correctly.

BTW: I did an svn update and an ant clean -> build, and now the document 
with the noindex rule is skipped. Great. Thanks a zillion!

And regarding the Solr trick with the jar files that I had to move 
manually, since they were excluded from solr.jar (my last home lesson):
- When Solr is running in a servlet container such as Resin, you have to 
move the following jars manually into the <solr.home>/lib directory in 
order to enable the ExtractingRequestHandler:
   - apache-solr-cell-*.jar
   - the other Tika jars

You will find the same information in the following file:
solr_trunk/solr/contrib/extraction/CHANGES.txt.

Erlend





Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
Turns out Java doesn't like the form of those URLs; it doesn't think they're proper:

WEB: Can't use url 'dokument.pdf' because it is badly formed: Relative
path in absolute URI: http://ridder.uio.nodokument.pdf
WEB: In html document 'http://ridder.uio.no', found an unincluded URL
'dokument.pdf'

This is the java.net.URI class:

        java.net.URI parentURL = new java.net.URI(parentIdentifier);
        url = parentURL.resolve(rawURL);

... and this is throwing a java.net.URISyntaxException.

I'm going to have to go look at the standards to figure out what we
should do here.  Perhaps the right approach is to note the exception
and retry with a "/" glommed on the front if we get it.

But clearly you must have modified the web connector in order to get
it to crawl your stuff in the first place.

Karl
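
The retry idea sketched above can be expressed directly against java.net.URI: give the parent a "/" path before resolving, so a relative link like 'dokument.pdf' no longer produces "http://ridder.uio.nodokument.pdf". This is only an illustrative sketch of the idea (the class and method names are invented, and this is not the actual ManifoldCF fix):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlResolver {
    // Resolve rawURL against parentIdentifier, first normalizing the
    // parent so it has a "/" path when it has none (e.g. the seed
    // "http://ridder.uio.no" has an empty path component).
    public static URI resolveAgainstBase(String parentIdentifier, String rawURL) {
        try {
            URI parent = new URI(parentIdentifier);
            String path = parent.getPath();
            if (path == null || path.isEmpty()) {
                // Rebuild the parent as e.g. "http://ridder.uio.no/"
                parent = new URI(parent.getScheme(), parent.getAuthority(),
                                 "/", null, null);
            }
            return parent.resolve(rawURL);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(
                "Cannot resolve '" + rawURL + "' against '" + parentIdentifier + "'", e);
        }
    }
}
```

With this normalization, resolving 'dokument.pdf' against 'http://ridder.uio.no' yields 'http://ridder.uio.no/dokument.pdf' rather than an exception.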


Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
Hmm.  I get 701 bytes from your seed, but no parseable links.  Investigating...
Karl


Re: Web crawler does not follow the robots meta tag rules

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 28.01.11 14.32, Karl Wright wrote:
> Thanks.  I tested my changes enough so that I was confident in
> committing the patch, so the changes are in trunk.

I'm afraid that it doesn't work properly. I downloaded the latest 
version from trunk and started the crawler.

Try to use the following address in your seed list and the following 
rule in the includes list:
^http://ridder.uio.no/.*
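
(The inclusion rule is an ordinary regular expression. A minimal sketch of applying such a rule in Java, assuming full-match semantics; the connector's actual matching code may differ:)

```java
import java.util.regex.Pattern;

public class IncludeRule {
    // The includes rule quoted in this thread, used as a standard Java
    // regex. Note the unescaped "." characters match any character,
    // which is usually harmless for a rule like this.
    static final Pattern INCLUDE = Pattern.compile("^http://ridder.uio.no/.*");

    static boolean included(String url) {
        return INCLUDE.matcher(url).matches();
    }
}
```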

The following document was fetched and sent to Solr for indexing even 
though it includes a robots noindex rule:
http://ridder.uio.no/test_closed/

Here's the line from the history telling me that Solr should index it:
02-02-2011 16:12:33.283	document ingest (Solr)	http://ridder.uio.no/test_closed/	200

I can try to modify the code you have added in order to get around this 
tomorrow. I guess I can find the relevant check somewhere in the 
following folder?
mcf-trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler

Erlend


Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
Thanks.  I tested my changes enough so that I was confident in
committing the patch, so the changes are in trunk.

Karl


Re: Web crawler does not follow the robots meta tag rules

Posted by Erlend Garåsen <e....@usit.uio.no>.
On 27.01.11 18.43, Karl Wright wrote:
> I've written the necessary code for ManifoldCF, so if you create the
> ticket, I can attach a patch.  I don't know if it works yet, but I
> presume you will be in a position to try it out?

Great! Sure, I will test it and try it out, probably next week, when 
you will also get the other information you requested from me.

Anyway, I have created a ticket:
https://issues.apache.org/jira/browse/CONNECTORS-153

Erlend

Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
I've written the necessary code for ManifoldCF, so if you create the
ticket, I can attach a patch.  I don't know if it works yet, but I
presume you will be in a position to try it out?

Karl


Re: Web crawler does not follow the robots meta tag rules

Posted by Erlend Garåsen <e....@usit.uio.no>.
Thanks for your reply.

OK, now I have two home lessons:
- Create a Jira issue about this
- Explain how it is possible to use ExtractingRequestHandler with Solr 
1.4.1 by copying jars etc.

BTW, I just figured out that Tika parses all the meta tag information, 
so I can rewrite the ExtractingRequestHandler classes in order to skip 
files with these meta directives. The following was included in my 
index the last time I started the ManifoldCF job:
<arr name="ignored_meta">
<str>robots</str>
<str>noindex,nofollow</str>
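
Since Tika surfaces the robots value as document metadata, the skip Erlend describes can be sketched as a simple metadata check at ingestion time. The method name and metadata key below are hypothetical, not actual ExtractingRequestHandler code:

```java
import java.util.Map;

public class RobotsSkipFilter {
    // Drop a document before indexing when its Tika-extracted robots
    // metadata contains "noindex". The flat String->String metadata map
    // is an assumption for illustration.
    static boolean shouldSkip(Map<String, String> tikaMetadata) {
        String robots = tikaMetadata.get("robots");
        return robots != null && robots.toLowerCase().contains("noindex");
    }
}
```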

I have already rewritten some of these classes in order to implement 
language detection, so it seems that we can implement all the 
functionality we need by using ManifoldCF. :)

Erlend




Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
There's also ordering; the meta tag must precede all links on the page
that you don't want the crawler to follow.  Hope this is OK.

Karl
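
In code terms, honoring the tag amounts to parsing the content attribute into index/follow flags once it is seen, then consulting those flags before indexing the page or emitting any links found after the tag. A minimal illustrative sketch (not ManifoldCF's actual implementation):

```java
public class RobotsMeta {
    // Flags derived from <meta name="robots" content="...">.
    // Both default to true when no directive is present.
    public boolean index = true;
    public boolean follow = true;

    // Parse a comma-separated content value such as "noindex, nofollow".
    public static RobotsMeta parse(String content) {
        RobotsMeta m = new RobotsMeta();
        if (content == null)
            return m;
        for (String token : content.toLowerCase().split(",")) {
            switch (token.trim()) {
                case "noindex":  m.index = false; break;
                case "nofollow": m.follow = false; break;
                case "none":     m.index = false; m.follow = false; break;
                case "all":      m.index = true;  m.follow = true;  break;
            }
        }
        return m;
    }
}
```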


Re: Web crawler does not follow the robots meta tag rules

Posted by Karl Wright <da...@gmail.com>.
Sure, please open a ticket.
Interpreting the tag should not be difficult.  The main issues will be
around noting the crawler's decision to skip documents or content in
the activities history.  And, of course, this will not be available in
the ManifoldCF-0.1-incubating release.

Please specify what variants of the tag you think should be supported,
and if supported, how you think it should work.  For example,
including "nofollow" does not usually block crawlers from reaching
your linked documents from other directions; if you want that
functionality, you probably won't find that anywhere.  This is why
most people use robots.txt rather than the meta tag.

Karl

