Posted to user@nutch.apache.org by Emilijan Mirceski <em...@cpuedge.com> on 2005/06/30 21:40:29 UTC
recursion: see recursion
Lately, I've been getting thousands of variations of the following:
050630 153456 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315216/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk
050630 153457 Response content length is not known
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Kultura/315299/text/kultura/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/index.htm
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315408/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/kategorija
The mt.net.mk segment just keeps repeating forever for a bunch of URLs, at
different depths, etc.
Any ideas on how to prevent or fix it?
regex url filter
Posted by Emilijan Mirceski <em...@cpuedge.com>.
My regex-urlfilter contains this rule:
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
If I remove '?' and '=' from it, I will get more pages in my database.
Is there any strong reason why such URLs are skipped in the release version?
(My segments hold about 100 thousand pages in total, which is barely 1.2 GB.)
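To make the effect of that rule concrete, here is a minimal Python sketch. It rests on assumptions that are not stated in this thread: that rules are tried in order, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is dropped. The function name accepts and both example.com URLs are hypothetical.

```python
import re

def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule.

    rules: strings such as "-[?*!@=]" or "+."; the first character is
    the sign, the rest is the regular expression.
    Assumption: a URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Rule as quoted above: skip URLs containing any of ? * ! @ =
default_rules = ["-[?*!@=]", "+."]
# Relaxed variant: '?' and '=' removed from the character class
relaxed_rules = ["-[*!@]", "+."]

query_url = "http://www.example.com/story.php?id=42"  # hypothetical URL
plain_url = "http://www.example.com/story/42.html"    # hypothetical URL

print(accepts(query_url, default_rules))  # query URL under the quoted rule
print(accepts(query_url, relaxed_rules))  # same URL under the relaxed rule
print(accepts(plain_url, default_rules))  # URL without query characters
```

A likely motivation for skipping query URLs is that dynamic pages can produce a practically unbounded number of distinct URLs, so the relaxed rule may grow the database with near-duplicate pages.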
Regards,
Emilijan
RE: recursion: see recursion
Posted by Emilijan Mirceski <em...@cpuedge.com>.
Problem solved with an appropriate regex rule. The cause of the problem is
some strange combination of Java code and URLs.
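The exact rule is not given in this message. As a hedged sketch of one way such a loop can be caught, the Python snippet below flags URLs in which the same slash-delimited segment occurs three or more times in a row. The pattern, the threshold of three, and the name has_repeating_segment are assumptions for illustration, not the fix that was actually applied.

```python
import re

# A segment ("/" plus non-slash characters) followed immediately by at
# least two more copies of itself, ending at a "/" or the end of the URL.
REPEATED_SEGMENT = re.compile(r"(/[^/]+)\1{2,}(?=/|$)")

def has_repeating_segment(url):
    """Return True if some path segment repeats 3+ times consecutively."""
    return REPEATED_SEGMENT.search(url) is not None

# Shortened form of one of the looping URLs from the fetch log
looping = ("http://www.idividi.com.mk/vesti/makedonija/Politika/315216/"
           "mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk")
# The same URL trimmed back to the part before the loop starts
normal = "http://www.idividi.com.mk/vesti/makedonija/Politika/315216/"

print(has_repeating_segment(looping))
print(has_repeating_segment(normal))
```

In a regex-urlfilter file, an equivalent expression would go on a line starting with '-'. Whether back-references such as \1 behave the same way there depends on the regex engine the filter uses.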
-----Original Message-----
From: Emilijan Mirceski [mailto:emilijan@cpuedge.com]
Sent: Thursday, June 30, 2005 3:40 PM
To: nutch-user@lucene.apache.org
Subject: recursion: see recursion