Posted to dev@nutch.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/05/17 22:56:45 UTC

Following tags

Gang,

I had a webmaster complain that our crawler was following his <form action> links. Although he admits that his use of the GET method is a bit unorthodox, he feels strongly that form submissions with input fields shouldn't be followed by crawlers. Would it make sense to modify the HTML parser so that it checked to see whether such input fields exist before following <form action> links?
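
To make the proposal concrete, here is a minimal sketch of the kind of check I have in mind. It is not the existing Nutch parser code: the class FormLinkPolicy and its methods are hypothetical names, it uses the JDK's XML parser on well-formed lowercase markup rather than the parser Nutch actually uses, and the list of input types treated as "not user-entry fields" is my own assumption. The method="post" and rel="nofollow" checks reflect the behavior described further down in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
final class FormLinkPolicy {

    // Assumption: these input types do not ask the user for data, so a form
    // containing only these behaves like a plain link or button.
    private static final Set<String> NON_ENTRY_TYPES =
        new HashSet<>(Arrays.asList("submit", "button", "reset", "image", "hidden"));

    // Returns true if the form's action could reasonably be treated as an outlink.
    static boolean shouldFollow(Element form) {
        // Never follow POST forms.
        if ("post".equalsIgnoreCase(form.getAttribute("method"))) {
            return false;
        }
        // Honor rel="nofollow" (rel may hold several space-separated values).
        String rel = form.getAttribute("rel").toLowerCase(Locale.ROOT).trim();
        if (Arrays.asList(rel.split("\\s+")).contains("nofollow")) {
            return false;
        }
        // Any textarea or select means the form expects user input.
        if (form.getElementsByTagName("textarea").getLength() > 0
                || form.getElementsByTagName("select").getLength() > 0) {
            return false;
        }
        // An input with a missing type defaults to a text field, so it counts too.
        NodeList inputs = form.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            String type = ((Element) inputs.item(i)).getAttribute("type")
                .toLowerCase(Locale.ROOT);
            if (!NON_ENTRY_TYPES.contains(type)) {
                return false;
            }
        }
        return true;
    }

    // Parses a well-formed <form> fragment and returns its root element.
    static Element parseForm(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        // Shaped like the webmaster's form: GET by default, two text fields.
        String rebuild = "<form action='1921/rebuild' class='command rebuild'>"
            + "<input type='text' name='username' />"
            + "<input type='text' name='comments' />"
            + "<input type='submit' value='Rebuild' /></form>";
        // A button-only form that merely loads another page.
        String next = "<form action='next.html'><input type='submit' value='Next' /></form>";

        System.out.println(shouldFollow(parseForm(rebuild))); // prints false
        System.out.println(shouldFollow(parseForm(next)));    // prints true
    }
}
```

With a rule like this, button-only forms that just load a second page would still be harvested, while forms that ask for text would be skipped. Whether hidden fields should also block following is a judgment call I have not settled.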

- Chris

At 1:47 PM -0700 5/17/06, Chris Schneider wrote:
>Mark,
>
>At 8:15 AM +1200 5/18/06, Mark Rowe wrote:
>>On 18/05/2006, at 5:39 AM, Chris Schneider wrote:
>>>Thanks for providing the technical details about your pages. After reviewing the HTML parser used by Nutch, it appears that specifying either the rel="nofollow" attribute or the method="post" attribute would prevent our crawler (and other Nutch crawlers) from following these <form action...> links. If I understand your HTML correctly, it seems like you really are making this call to retrieve information, so method="get" (the default) does appear to be more appropriate.
>>
>>You understand incorrectly.  While the page in question abuses the GET method to perform a mutating action, I still feel that it is incorrect for it to be followed in this situation.  (The mutating action in this case is for another computer to spend up to an hour to download, build and compile a piece of software -- definitely not information retrieval.)
>
>I agree that the page in question should not be crawled. The remaining question is how to prevent that from happening.
>
>>>Thus, I humbly suggest that you add a rel="nofollow" to these links. This will not only prevent our crawler from following them, but solve the problem for Nutch and other crawler technologies that honor this attribute. Here's some technical information about it:
>>>
>>>http://microformats.org/wiki/rel-nofollow
>>
>>It is my understanding that 'rel="nofollow"' is only valid for <a> tags, and furthermore does not prevent the crawling of such links. 
>
>The rel attribute is valid for both <a> and <form action=...> tags. I see nothing in the specification restricting rel="nofollow" to <a> tags, so I would assume that it is valid for <form action=...> tags as well.
>
>>According to the link you mentioned, it "indicates that the destination of that hyperlink SHOULD NOT be afforded any additional weight or ranking by user agents which perform link analysis upon web pages".  This doesn't stop a crawler from following the link, only from inferring a relationship between the source and destination.
>
>You are correct. However, it does in point of fact prevent the Nutch crawler's HTML parser from following <a> and <form action...> tags that have this attribute. I would imagine that this would prevent other crawler technologies from following these links as well.
>
>>>If you have specific suggestions for other ways that Nutch might differentiate links like yours from other <form action...> links that *are* of potential interest when crawling, then I could post this to the Nutch developer group mailing list.
>>
>>In my opinion it seems bizarre to submit a form with empty input fields in the hope that you will get a valid page out the other end. 
>
>Perhaps, but many HTML pages still do use this technique, allowing a button or some other control to load a second page, etc.
>
>>Submitting a form is, in my mind, a much stronger action than following a hyperlink.  This applies doubly to forms with associated input fields.  I can think of very few examples of forms off the top of my head where it would be desirable to crawl the resultant page after submitting with all inputs empty.
>
>I will post a message to the nutch developer mailing list describing your suggestion about not following these <form action...> links if there are input fields in the form. However, it seems like a lot of work for the parser. Although I have absolutely no control over the behavior of other Nutch crawlers out there, I will consider making a change to our Nutch installation to avoid following such links.
>
>Best Regards,
>
>- Chris

At 9:51 AM +1200 5/17/06, mrowe@bdash.net.nz wrote:
>Hi Chris,
>
>An example of the type of form is visible at
>http://build.webkit.org/post-commit-powerpc-mac-os-x/builds/1921.  The
>markup relevant to the form is:
>
><form action="1921/rebuild" class="command rebuild">
><div class="row">
>  <span class="label">Your name:</span>
>  <span class="field"><input type='text' name='username' /></span>
></div>
><div class="row">
>  <span class="label">Reason for re-running build:</span>
>  <span class="field"><input type='text' name='comments' /></span>
></div><input type="submit" value="Rebuild" />
></form>
>
> When the /rebuild link is activated it causes several machines within our
>build system to download + recompile our application.  As you can
>probably appreciate, this is computationally intensive and is best
>avoided.
>
>Thanks,
>
>Mark
>
>> Mark,
>>
>> We're using the Nutch OpenSource crawler technology for our crawls and
>> have not modified the algorithm controlling which areas of HTML pages are
>> searched while harvesting outlinks. Our URL filter should be preventing us
>> from following links that include queries (i.e., those containing a "?"
>> character), though. Could you provide some specific details about the
>> <form> tag and the embedded URLs within it that our crawler seems to be
>> following?
>>
>> Thanks,
>>
>> Chris Schneider
>>
>> At 10:07 AM +1200 5/13/06, Mark Rowe wrote:
>>>Hi,
>>>
>>>Your crawler is doing the most insanely stupid thing possible.  It is
>>> following the URLs in <form> tags.  It is the *only* web crawler that I
>>> have seen do such a thing, and it is ridiculous.  Some functionality is
>>> behind form tags for the reason that web crawlers follow hyperlinks in A
>>> tags, but do not submit forms.  I will be preventing your crawler's IP
>>> range from accessing my server by firewall rules until you change this
>>> braindead behaviour.
>>>
>>>Regards,
>>>
>>>Mark Rowe
>>><http://bdash.net.nz/>

-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: Following tags

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> I read through your email exchange, and setting aside all emotional 
>> content I think this is a valid request - indeed, as far as I can 
>> tell other major crawlers don't follow these links. We could either 
>> remove this, or make it optional (default not to use them).
>
> Is this as simple as deleting line 60 from DOMContentUtils.java (in 
> the html-parser plugin)?

Yes.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Following tags

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I read through your email exchange, and setting aside all emotional 
> content I think this is a valid request - indeed, as far as I can tell 
> other major crawlers don't follow these links. We could either remove 
> this, or make it optional (default not to use them).

Is this as simple as deleting line 60 from DOMContentUtils.java (in the 
html-parser plugin)?

Doug

Re: Following tags

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Schneider wrote:
> Gang,
>
> I had a webmaster complain that our crawler was following his <form action> links. Although he admits that his use of the GET method is a bit unorthodox, he feels strongly that form submissions with input fields shouldn't be followed by crawlers. Would it make sense to modify the HTML parser so that it checked to see whether such input fields exist before following <form action> links?
>
>   

I read through your email exchange, and setting aside all emotional 
content I think this is a valid request - indeed, as far as I can tell 
other major crawlers don't follow these links. We could either remove 
this, or make it optional (default not to use them).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com