Posted to agent@nutch.apache.org by bluebrit <bl...@blue-candy.com> on 2007/12/07 21:55:42 UTC

Fw: Blocked nutch spider accessing pages

Hi,

I just saw that my emails to you appear on this page http://www.nabble.com/Fw:-Blocked-nutch-spider-accessing-pages-t4877480.html

It was not my intent for these emails to be made available for everybody. This was a personal email between myself and you and was considered private.

Please remove all details regarding me from your database and remove the emails from the above domain and relevant pages.

For your additional information, however, in response to your reply on this page, you state:

"Hi 

I am sory for your bandwidth consumption but I do not think that is my 
spider. You are right about robot.txt files. I did not read them because I 
do not know existence of that file. Thank you for advice. But I never send 
my spider to your domain. I crawled Only small amount of host and none of 
them about your host. How can you be sure about that is my spider. Please 
warn me if the problem continues and you sure about the spider. 

Thank you"

I can be sure it is your spider because my log file detects you and gives me a link with a live url that I can follow. The link is live in this email at the moment and it was in the past emails but I see you have removed the link on the above page.

      Nutch 608+203 3.02 MB 07 Dec 2007 - 01:57 


If this isn't you, then you should have a safeguard in place that stops this happening or gives each user a distinct number that has to remain in the software for it to work. At least that way a method of tracking the users would be available.

I also think your comment about not knowing of the existence of robots.txt files is a bit unrealistic, as you obviously have the knowledge to use software such as this.

At the end of the day, there is no reason for software such as this to ignore a robots.txt file unless it is attempting something it shouldn't.

Please take note of my above request to remove these emails from your pages, as I can see a future problem coming from this where my address is harvested and I end up getting spammed unnecessarily. I see that as a far worse problem than the one I have with you at the moment.

regards
owner blue-candy.com



----- Original Message ----- 
From: bluebrit 
To: nutch-agent@lucene.apache.org 
Sent: Monday, November 26, 2007 5:13 AM
Subject: Fw: Blocked nutch spider accessing pages


I sent the original email below to you two weeks ago without reply and, as you can see, my domain is still being crawled by your spider.
Please advise me how to block it permanently from my domain, or I will seek avenues to report your spider for its intrusive behaviour to the major search engines, possibly resulting in your domain's removal from their listings.


      Nutch 1733+29 10.79 MB 23 Nov 2007 - 11:06 



regards
owner blue-candy.com


----- Original Message ----- 
From: bluebrit 
To: nutch-agent@lucene.apache.org 
Sent: Monday, November 12, 2007 12:43 PM
Subject: Blocked nutch spider accessing pages


Hello, I am writing this email to you because of the following.

Blocked spider in robots.txt found in log file.

User-agent: Nutch
Disallow: /

To date this month, Nutch has appeared in my site log an unreasonable number of times, bearing in mind it is supposed to be blocked. It is obvious that your spider is not reading the robots.txt file. As my domain contains a copyright warning, can I assume you will be able to ensure that your spider, or the user of your spider, will stop the repeated visits and possible copying of text / graphics as well?

Below is a copy of the log entries from the last six months. Although the bandwidth usage is not large, it does constitute a problem, as it seems to show increasing demand.

      NutchCVS 588+8 2.37 MB 28 Jun 2007 - 06:43 

      Nutch 807+21 2.25 MB 28 Jun 2007 - 15:46 


      Nutch 324+223 1.35 MB 31 Jul 2007 - 04:11 


      Nutch 105+18 657.46 KB 31 Aug 2007 - 18:41 

      NutchCVS 712+12 2.73 MB 15 Aug 2007 - 00:38 


      Nutch 42+13 315.86 KB 30 Sep 2007 - 04:34 


      Nutch 30+12 182.74 KB 24 Oct 2007 - 19:56 


      Nutch 977+15 6.87 MB 08 Nov 2007 - 22:41 


My domain is http://www.blue-candy.com

Please note this is an adult domain and ALL of the images / video clips are also copyright protected by the sponsoring companies.

Thank you for your reply regarding the above and for any additional information you can supply regarding steps that can be taken to block Nutch once and for all from spidering my domain.

Regards
Owner blue-candy.com


Re: Fw: Blocked nutch spider accessing pages

Posted by "Ricardo J. Méndez" <me...@gmail.com>.
Nutch-agent is a mailing list related to the usage of Nutch as a search
agent, not a person.  The reason your messages are showing up on Nabble is
because they're being sent to a public list that is indexed by many sites.


-- Ricardo

Re: Fw: Blocked nutch spider accessing pages

Posted by Martin Kuen <ma...@gmail.com>.
Hi,

a few things should be said in order to clarify the situation:

1. Nutch is NO SERVICE. Nutch is a free software project which is
subject to the "Apache 2.0" license.
2. Nutch can be seen as a TOOLKIT to build a search application. To
create the search index a spider (the nutch spider) may be used.
3. The software (Nutch spider) forces a given user to supply a
"customized" agent-name using a configuration file. Without modifying
the source code it is not possible to advertise only "Nutch" as
agent-name. It would be something like "me/Nutch" or "you/Nutch".
4. It is a pity that somebody is using this software in this way.
However, if this is bothering you that much you will have to take
steps against the person/party sending the spider to your domain
(IP-address?).
5. Unfortunately the nutch spider is sometimes (too often) used as a
site scraping tool. The spider can be used without the search/index
capabilities of Nutch.
6. A properly configured nutch robot will obey your robots.txt file.
By "properly" I mean "configured as intended".
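
As a rough illustration of the robots.txt check mentioned in the last
point above, here is a minimal Python sketch using the standard
library's urllib.robotparser. This is not Nutch's own code. The two
rules are the ones quoted in the original message of this thread; the
agent name "OtherBot", the helper name is_allowed and the example page
path are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in the original message of this thread.
ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""


def is_allowed(agent_name: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits agent_name to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_name, url)


if __name__ == "__main__":
    page = "http://www.blue-candy.com/index.html"  # hypothetical path
    print("Nutch allowed:", is_allowed("Nutch", page))
    print("OtherBot allowed:", is_allowed("OtherBot", page))
```

The check only helps if the crawler actually performs it, and a record
that names one agent says nothing about agents it does not name. A
"User-agent: *" record would be needed to address every robot that
honours the file.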

citation:
"Thank you for your reply regarding the above and for any additional
information you can supply regarding steps that can be taken to block
Nutch once and for all from spidering my domain."
Well, you could take your server offline ;). I really don't want to
insult you, but that's the only solution. Next time somebody will
modify "wget" to show the same kind of misbehaviour. "Nutch is like
giving TNT-sticks to children" (quote).

citation:
"Please advise me how to block it permanently from my domain or i will
seek avenues to report your spider for its intrusive behaviour to the
major search engines possibly resulting in your domains removal from
their listings."
I really don't want to comment on that one . . .


However, regarding your site - I want to point out something:
citation:
"ALL pages on blue-candy.com are copyright protected. Copying of any
page for any use is not allowed."
First, looking at your front page I found the following meta-tag:
"<meta name="Robots" content="index,follow">"
Ahm . . . Well this will make any robot copy your site's content. At
least add "noarchive" to it (page is indexed, but the page itself is
not stored).
Second, you should add a rel="nofollow" attribute if you want a robot
not to follow a given link (for images . . . )
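
To illustrate how an indexing robot might read such a meta tag, here is
a small Python sketch using only the standard library's html.parser.
The class and function names are hypothetical, and real crawlers
(including Nutch) may interpret the tag differently; this only shows the
general idea of collecting the directives and checking for "noarchive".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # HTMLParser lowercases attribute names but not their values,
        # so compare the name value case-insensitively.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives declared in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    current = '<meta name="Robots" content="index,follow">'
    suggested = '<meta name="Robots" content="index,follow,noarchive">'
    print("noarchive" in robots_directives(current))
    print("noarchive" in robots_directives(suggested))
```

As with robots.txt, these directives are only a request: a robot that
does not look at them is not stopped by them.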

You're not alone:
http://johannburkard.de/blog/www/spam/this-much-nutch-is-too-much-nutch.html

The people developing nutch are serious people. Sorry that you are the
victim of some . . . well . . . script-kiddie. Probably some people
are more comfortable with modifying existing, well behaving code than
with using a mouse (or a download manager?).

I am just an individual and cannot/must not speak for the Apache
organisation. I am not affiliated in any way with them. This is just
my own private opinion.


Just my two cents,

Martin


PS: I hope your request for removal of your messages is approved


