Posted to user@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/06/29 17:56:53 UTC
Nutch spider trap detection
Hi!
I guess it is implemented, but I cannot find it in the Nutch API
docs or on the wiki :-/ ... Is there any mechanism in Nutch to
detect spider traps [1]?
Thanks,
Roman
[1] http://en.wikipedia.org/wiki/Spider_trap
Re: Nutch spider trap detection
Posted by brainstorm <br...@gmail.com>.
Thanks! I guess you mean:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
in conf/regex-urlfilter.txt, or am I wrong?
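To see what that rule catches, here is a minimal sketch that applies the same pattern to a few made-up URLs. It assumes the filter rejects a URL when the pattern is found anywhere in it (find-style matching); the class name TrapFilterSketch, the method isRejected and the example.com URLs are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not Nutch code: it only illustrates the regex from
// the rule quoted above, assuming find-style matching.
final class TrapFilterSketch {

    // Same expression as the "-" rule, without the leading "-" (which marks
    // the rule as an exclusion). \1 is a backreference to the first captured
    // path segment, so that segment must appear three times, each time
    // separated from the next by exactly one other segment.
    private static final Pattern REPEATED_SEGMENT =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Returns true when the URL would be skipped under the assumption above.
    static boolean isRejected(String url) {
        return REPEATED_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/b/a/b/a/b/",    // "/a" appears three times
            "http://example.com/a/b/a/b/",        // "/a" appears only twice
            "http://example.com/docs/index.html"  // no repeated segment
        };
        for (String url : urls) {
            System.out.println(url + " rejected=" + isRejected(url));
        }
    }
}
```

Note that the pattern ends with a slash, so under this reading a looping URL is only caught once the third repetition is followed by a further "/".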
The DomContentUtils code under
/nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing
to me, and I cannot see the recursion "protection" code.
Thanks!
On Mon, Jun 30, 2008 at 12:21 AM, Dennis Kubes <ku...@apache.org> wrote:
> There are some regexes in the url normalizers and there is some code in
> DomContentUtils for recursion.
>
> Dennis
Re: Nutch spider trap detection
Posted by Dennis Kubes <ku...@apache.org>.
There are some regexes in the url normalizers and there is some code in
DomContentUtils for recursion.
Dennis