Posted to user@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/06/29 17:56:53 UTC

Nutch spider trap detection

Hi!

I guess this is implemented, but I cannot find it in the Nutch API
docs or on the wiki :-/ ... Is there any mechanism in Nutch to
detect spider traps [1]?

Thanks,
Roman

[1] http://en.wikipedia.org/wiki/Spider_trap

Re: Nutch spider trap detection

Posted by brainstorm <br...@gmail.com>.

Thanks! I guess you mean this rule:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

That one is in conf/regex-urlfilter.txt, or am I wrong?
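
For anyone who wants to see what that rule actually catches, here is a
minimal sketch in Python (not Nutch code). It assumes the leading '-'
means "exclude" and that the pattern is applied as a search over the
whole URL; the helper name is_rejected and the sample URLs are made up
for illustration. The pattern itself only uses a backreference (\1),
which behaves the same way in Python and Java regexes.

```python
import re

# The rule quoted above, without the leading '-' that marks it as an exclusion.
# Group 1 captures one "/segment"; the two \1 backreferences require that same
# segment to appear twice more, each time after exactly one other segment.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_rejected(url):
    """Return True if the loop rule matches the URL (hypothetical helper)."""
    return LOOP_RULE.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/a/b/a/c/a/",                   # segment "a" seen 3 times
        "http://example.com/cal/2008/cal/2008/cal/2008/",  # typical calendar loop
        "http://example.com/a/b/c/",                       # no repetition
        "http://example.com/a/b/a/",                       # only 2 repetitions
    ]
    for url in samples:
        print("reject" if is_rejected(url) else "keep  ", url)
```

Note that the rule is fairly narrow: the repeated segment has to show up
three times with one other segment in between each time, and the last
occurrence must be followed by a slash. A URL like /a/b/a/c/a (no
trailing slash) is not matched by it.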

The DomContentUtils code under
/nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing
to me, and I cannot see the recursion "protection" code in it.

Thanks!

On Mon, Jun 30, 2008 at 12:21 AM, Dennis Kubes <ku...@apache.org> wrote:
> There are some regexes in the url normalizers and there is some code in
> DomContentUtils for recursion.
>
> Dennis

Re: Nutch spider trap detection

Posted by Dennis Kubes <ku...@apache.org>.

There are some regexes in the URL normalizers, and there is some code in
DomContentUtils for recursion.

Dennis
