Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/14 17:57:25 UTC

When a crawl goes bad...

> Dear @80legs stop crushing metafilter.com from 2226 distinct IP  
> addresses.
> Your bots are DDOSing the site with thousands of requests. Stop.
> <http://twitter.com/mathowie/status/20326707535>

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: When a crawl goes bad...

Posted by Julien Nioche <li...@gmail.com>.
It's probably more an issue with DNS resolution than robots.txt. Even if you
respect the robots.txt instructions, you can still have N host names, or even
domain names, pointing to a single server, so per-host limits do not cap the
load on that server. This can be avoided in Nutch by setting
'partition.url.mode' and 'fetcher.queue.mode' to 'byIP'.
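
To illustrate why the grouping key matters, here is a minimal sketch in
Python. It is not Nutch code: the DNS table, the URLs and the helper names
are all hypothetical, and only the mode names 'byHost' and 'byIP' mirror the
Nutch setting values. It shows how several host names that resolve to one
address end up in separate queues when keyed by host, but in a single queue
when keyed by IP.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical DNS table standing in for real lookups (assumption for this
# sketch; the addresses come from the reserved documentation ranges).
FAKE_DNS = {
    "www.example.com": "203.0.113.7",
    "blog.example.com": "203.0.113.7",
    "shop.example.net": "203.0.113.7",
    "other.example.org": "198.51.100.20",
}

# Hypothetical URL list used only for the demonstration.
URLS = [
    "http://www.example.com/a",
    "http://blog.example.com/b",
    "http://shop.example.net/c",
    "http://other.example.org/d",
]


def queue_key(url, mode):
    """Return the key a URL is queued under for the given mode."""
    host = urlsplit(url).hostname
    if mode == "byHost":
        return host
    if mode == "byIP":
        return FAKE_DNS[host]
    raise ValueError("unknown mode: " + mode)


def build_queues(urls, mode):
    """Group URLs into queues; politeness limits would apply per queue."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode)].append(url)
    return dict(queues)


if __name__ == "__main__":
    for mode in ("byHost", "byIP"):
        queues = build_queues(URLS, mode)
        print(mode, "->", len(queues), "queues")
```

With a per-queue delay, the 'byHost' grouping would let three of these
queues hit the same machine in parallel, while the 'byIP' grouping puts
those three URLs behind one shared delay.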


On 16 August 2010 08:06, CatOs Mandros <ca...@gmail.com> wrote:

> Rather amusing :)
>
> Something similar was what made Grub gain a bit of bad reputation...
> thank god we have the robots.txt file.
>
> On Sat, Aug 14, 2010 at 7:48 PM, Mattmann, Chris A (388J)
> <ch...@jpl.nasa.gov> wrote:
> > LOL...
> >
> >
> > On 8/14/10 8:57 AM, "Ken Krugler" <kk...@transpac.com> wrote:
> >
> > Dear @80legs stop crushing metafilter.com from 2226 distinct IP addresses.
> > Your bots are DDOSing the site with thousands of requests. Stop.
> > <http://twitter.com/mathowie/status/20326707535>
> >
> > -- Ken
> >
> >
> > --------------------------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://bixolabs.com
> > e l a s t i c   w e b   m i n i n g
> >
> >
> >
> >
> >
> >
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: Chris.Mattmann@jpl.nasa.gov
> > WWW:   http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Re: When a crawl goes bad...

Posted by CatOs Mandros <ca...@gmail.com>.
Rather amusing :)

Something similar was what made Grub gain a bit of bad reputation...
thank god we have the robots.txt file.

On Sat, Aug 14, 2010 at 7:48 PM, Mattmann, Chris A (388J)
<ch...@jpl.nasa.gov> wrote:
> LOL...
>
>
> On 8/14/10 8:57 AM, "Ken Krugler" <kk...@transpac.com> wrote:
>
> Dear @80legs stop crushing metafilter.com from 2226 distinct IP addresses.
> Your bots are DDOSing the site with thousands of requests. Stop.
> <http://twitter.com/mathowie/status/20326707535>
>
> -- Ken
>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

Re: When a crawl goes bad...

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
LOL...


On 8/14/10 8:57 AM, "Ken Krugler" <kk...@transpac.com> wrote:

Dear @80legs stop crushing metafilter.com from 2226 distinct IP addresses.
Your bots are DDOSing the site with thousands of requests. Stop.
<http://twitter.com/mathowie/status/20326707535>

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g








++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++