Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/11/22 20:30:56 UTC
[Fwd: Spider Causing Contact Form Submissions]
It looks as though Nutch is inadvertently submitting forms.
At DOMContentUtils.java:58 we specify that the "action" parameter of an
HTML form should be extracted as a link. Yet we ignore the "method"
parameter of the form. I think we should only follow these when the
method is "get", not when it is "post".
Do others agree?
Doug
-------- Original Message --------
Subject: Spider Causing Contact Form Submissions
Date: Tue, 22 Nov 2005 10:23:16 -0000
From: Jane de Silva <ma...@indigographics.co.uk>
Reply-To: nutch-agent@lucene.apache.org, "Jane de Silva"
<ma...@indigographics.co.uk>
To: <nu...@lucene.apache.org>
Hi
For the past couple of weeks I have been receiving blank contact form
submissions caused by sitesell.com's use of your software. When this
first started happening I contacted sitesell and was assured that the
problem would be fixed. For a few days it appeared that it had been
attended to, but the emails began to appear again this weekend and are
still continuing to arrive. All sitesell will now say is that I should
edit the robots.txt file, and they have essentially washed their hands of
the problem. I run a very small part-time concern, do not have the
foggiest idea about this file or how to edit it, and in any case I
consider this a point of principle...
If sitesell or any other company finds value in using your software -
and it is certainly of no value to me! - they should have the courtesy
to properly address issues such as this when they arise. If I was forced
to leave open the door to my home because I could not locate my key or
was unable, for one reason or another, to operate the locking mechanism,
it does not make it right for a person to trespass on my property and
meddle with its contents. Although I have no problem in principle with
the spider accessing my site (given that their reasons for doing so are
not against my best interests), it should not cause me inconvenience by
so doing.
Sitesell's apparent lack of interest in my problem displays a cavalier
attitude and is discourteous and unneighbourly. I would be grateful if
you would help me put an end to their unwelcome intrusion into my life.
Kind regards
Jane de Silva
www.indigographics.co.uk
mail@indigographics.co.uk
Re: [Fwd: Spider Causing Contact Form Submissions]
Posted by Ben Halsted <bh...@gmail.com>.
Hi Doug,
I'm not a 'nutch dev' but I agree with you. I'm not 100% sure, but I think
even the Google accelerator does it this way.
Cheers,
Ben
On 11/22/05, Doug Cutting <cu...@nutch.org> wrote:
>
> It looks as though Nutch is inadvertently submitting forms.
>
> At DOMContentUtils.java:58 we specify that the "action" parameter of an
> HTML form should be extracted as a link. Yet we ignore the "method"
> parameter of the form. I think we should only follow these when the
> method is "get", not when it is "post".
>
> Do others agree?
Re: Small bug in Generator
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki wrote:
> Not to mention that the code uses a local variable with the same name
> and different type, to obscure the picture... I'll fix it. Thanks!
We were both too late... ;-)
------------------------------------------------------------------------
r332371 | cutting | 2005-11-10 22:03:16 +0100 (Thu, 10 Nov 2005) | 3 lines
Fix to not increment count of urls when urls are filtered by
maxPerHost limit. Patch contributed by Rod Taylor.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Small bug in Generator
Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:
>There is a small bug in the method Generator.Selector.reduce.
>
>Currently it is:
>...
>while (values.hasNext() && ++count < limit) {
>...
>
>It should be:
>...
>while (values.hasNext() && ++count <= limit) {
>...
>
>
>
Not to mention that the code uses a local variable with the same name
and different type, to obscure the picture... I'll fix it. Thanks!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Small bug in Generator
Posted by an...@orbita1.ru.
There is a small bug in the method Generator.Selector.reduce.
Currently it is:
...
while (values.hasNext() && ++count < limit) {
...
It should be:
...
while (values.hasNext() && ++count <= limit) {
...
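To make the difference between the two conditions concrete, here is a minimal, standalone sketch. It is not the Nutch code: the class LimitLoopDemo and the method countProcessed are hypothetical, and it assumes that count starts at 0 and that limit is the maximum number of values the loop should handle.

```java
import java.util.Iterator;

// Hypothetical demo class; it only mimics the shape of the loop quoted above.
class LimitLoopDemo {

  /**
   * Returns how many values the loop body handles.
   * inclusive == false uses "++count < limit", inclusive == true uses "++count <= limit".
   */
  static int countProcessed(Iterator<String> values, int limit, boolean inclusive) {
    int count = 0;      // assumption: the counter starts at zero
    int processed = 0;  // number of times the loop body actually ran
    while (values.hasNext()
        && (inclusive ? ++count <= limit : ++count < limit)) {
      values.next();
      processed++;
    }
    return processed;
  }
}
```

Under these assumptions, five values and a limit of 3 give 2 processed values with "<" and 3 with "<=", so the strict comparison stops one value short of the limit.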
Re: [Fwd: Spider Causing Contact Form Submissions]
Posted by Doug Cutting <cu...@nutch.org>.
Doug Cutting wrote:
> At DOMContentUtils.java:58 we specify that the "action" parameter of an
> HTML form should be extracted as a link. Yet we ignore the "method"
> parameter of the form. I think we should only follow these when the
> method is "get", not when it is "post".
Here's a patch to the mapred branch that implements this.
I have not yet tested it. If anyone tries it, please let me know whether it works.
Doug
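For readers following the thread, the rule proposed above can be sketched roughly as follows. This is an illustration only, not the patch itself: the class FormLinkPolicy and its methods isFollowable and extractFormLink are hypothetical names that do not exist in Nutch. It also assumes that a form without a "method" attribute should be treated as "get", which is the HTML default.

```java
import java.util.Locale;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch or of the patch mentioned above.
class FormLinkPolicy {

  /** Returns true when a form with this "method" value may be treated as a plain link. */
  static boolean isFollowable(String method) {
    // A missing or empty method attribute means GET in HTML.
    if (method == null || method.trim().isEmpty()) {
      return true;
    }
    // Attribute values are case-insensitive here, so normalise before comparing.
    return "get".equals(method.trim().toLowerCase(Locale.ROOT));
  }

  /** Returns the form's action URL, or null when the form should not be followed. */
  static String extractFormLink(Element form) {
    // Element.getAttribute returns "" when the attribute is absent.
    if (!isFollowable(form.getAttribute("method"))) {
      return null; // e.g. method="post": following it would submit the form
    }
    String action = form.getAttribute("action");
    return action.isEmpty() ? null : action;
  }
}
```

With this policy a crawler would still discover pages behind search-style GET forms, while a POST contact form like the one in the forwarded complaint would no longer be fetched as if it were a link.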