Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/11/22 20:30:56 UTC

[Fwd: Spider Causing Contact Form Submissions]

It looks as though Nutch is inadvertently submitting forms.

At DOMContentUtils.java:58 we specify that the "action" attribute of an 
HTML form should be extracted as a link.  Yet we ignore the "method" 
attribute of the form.  I think we should only follow these when the 
method is "get", not when it is "post".
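
To make the proposal concrete, here is a minimal, standalone sketch of 
the check. It is not the actual DOMContentUtils code: the class 
FormLinkSketch and its methods are hypothetical names, and it parses a 
well-formed snippet with the JDK's XML parser rather than Nutch's HTML 
parser. It also assumes that a form with no "method" attribute counts 
as "get", which is the HTML default.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Nutch.
class FormLinkSketch {

  /** True if a form with this "method" value may be followed as a link. */
  static boolean isGetForm(String method) {
    // Assumption: a missing or empty method attribute means GET.
    return method == null || method.trim().isEmpty()
        || "get".equalsIgnoreCase(method.trim());
  }

  /** Collects the "action" targets of GET forms only, in document order. */
  static List<String> formLinks(Document doc) {
    List<String> links = new ArrayList<>();
    NodeList forms = doc.getElementsByTagName("form");
    for (int i = 0; i < forms.getLength(); i++) {
      Element form = (Element) forms.item(i);
      String method = form.hasAttribute("method") ? form.getAttribute("method") : null;
      if (isGetForm(method) && form.hasAttribute("action")) {
        links.add(form.getAttribute("action"));
      }
    }
    return links;
  }

  /** Parses a well-formed markup snippet; wraps checked exceptions for brevity. */
  static Document parse(String markup) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<form action=\"/search\" method=\"get\"></form>"
        + "<form action=\"/contact\" method=\"POST\"></form>"
        + "<form action=\"/browse\"></form>"
        + "</body></html>";
    System.out.println(formLinks(parse(page)));
  }
}
```

On the sample page in main, this keeps /search and /browse and drops 
/contact, so a contact form that posts would no longer be requested.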

Do others agree?

Doug

-------- Original Message --------
Subject: Spider Causing Contact Form Submissions
Date: Tue, 22 Nov 2005 10:23:16 -0000
From: Jane de Silva <ma...@indigographics.co.uk>
Reply-To: nutch-agent@lucene.apache.org,	"Jane de Silva" 
<ma...@indigographics.co.uk>
To: <nu...@lucene.apache.org>

Hi

For the past couple of weeks I have been receiving blank contact form 
submissions caused by sitesell.com's use of your software. When this 
first started happening I contacted sitesell and was assured that the 
problem would be fixed. For a few days it appeared that it had been 
attended to, but the emails began to appear again this weekend and are 
still continuing to arrive. All sitesell will now say is that I should 
edit the robots.txt file, and they have essentially washed their hands of 
the problem. I run a very small part-time concern, do not have the 
foggiest idea about this file or how to edit it, and in any case I 
consider this a point of principle...

If sitesell or any other company finds value in using your software - 
and it is certainly of no value to me! - they should have the courtesy 
to properly address issues such as this when they arise. If I was forced 
to leave open the door to my home because I could not locate my key or 
was unable, for one reason or another, to operate the locking mechanism, 
it does not make it right for a person to trespass on my property and 
meddle with its contents. Although I have no problem in principle with 
the spider accessing my site (given that their reasons for doing so are 
not against my best interests), it should not cause me inconvenience by 
so doing.

Sitesell's apparent lack of interest in my problem displays a cavalier 
attitude and is discourteous and unneighbourly. I would be grateful if 
you would help me put an end to their unwelcome intrusion into my life.

Kind regards

Jane de Silva
www.indigographics.co.uk
mail@indigographics.co.uk





Re: [Fwd: Spider Causing Contact Form Submissions]

Posted by Ben Halsted <bh...@gmail.com>.
Hi Doug,

I'm not a 'nutch dev' but I agree with you. I'm not 100% sure, but I think
even the Google Web Accelerator does it this way.

Cheers,
Ben

On 11/22/05, Doug Cutting <cu...@nutch.org> wrote:
>
> It looks as though Nutch is inadvertently submitting forms.
>
> At DOMContentUtils.java:58 we specify that the "action" attribute of an
> HTML form should be extracted as a link. Yet we ignore the "method"
> attribute of the form. I think we should only follow these when the
> method is "get", not when it is "post".
>
> Do others agree?

Re: Small bug in Generator

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:

> Not to mention that the code uses a local variable with the same name 
> and different type, to obscure the picture... I'll fix it. Thanks!

We were both too late... ;-)

------------------------------------------------------------------------
r332371 | cutting | 2005-11-10 22:03:16 +0100 (Thu, 10 Nov 2005) | 3 lines

Fix to not increment count of urls when urls are filtered by
maxPerHost limit.  Patch contributed by Rod Taylor.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Small bug in Generator

Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:

>There is a small bug in the method Generator.Selector.reduce.
>
>Currently it is:
>...
>while (values.hasNext() && ++count < limit) {
>...
>
>It should be:
>...
>while (values.hasNext() && ++count <= limit) {
>...
>

Not to mention that the code uses a local variable with the same name 
and different type, to obscure the picture... I'll fix it. Thanks!

-- 
Best regards,
Andrzej Bialecki     <><



Small bug in Generator

Posted by an...@orbita1.ru.
There is a small bug in the method Generator.Selector.reduce.

Currently it is:
...
while (values.hasNext() && ++count < limit) {
...

It should be:
...
while (values.hasNext() && ++count <= limit) {
...
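
A minimal, standalone sketch of the off-by-one follows. It is not the 
Generator code itself: LimitLoopSketch and emitted are hypothetical 
names, and it assumes count starts at 0 before the loop, which the 
excerpt above does not show. Under that assumption the "<" form stops 
one entry short of limit, while "<=" lets exactly limit entries through.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
class LimitLoopSketch {

  /** Returns how many values the loop body processes for the given comparison. */
  static int emitted(List<String> urls, long limit, boolean inclusive) {
    Iterator<String> values = urls.iterator();
    long count = 0;  // assumption: the counter starts at 0
    int seen = 0;
    // inclusive == false mirrors "++count < limit"; true mirrors "++count <= limit"
    while (values.hasNext() && (inclusive ? ++count <= limit : ++count < limit)) {
      values.next();
      seen++;
    }
    return seen;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList("a", "b", "c", "d", "e");
    System.out.println("<  emits " + emitted(urls, 3, false)); // 2
    System.out.println("<= emits " + emitted(urls, 3, true));  // 3
  }
}
```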





Re: [Fwd: Spider Causing Contact Form Submissions]

Posted by Doug Cutting <cu...@nutch.org>.
Doug Cutting wrote:
> At DOMContentUtils.java:58 we specify that the "action" attribute of an 
> HTML form should be extracted as a link.  Yet we ignore the "method" 
> attribute of the form.  I think we should only follow these when the 
> method is "get", not when it is "post".

Here's a patch to the mapred branch that implements this.

I have not yet tested it.  If anyone tries it, please let me know whether it works.

Doug