Posted to user@nutch.apache.org by Emilijan Mirceski <em...@cpuedge.com> on 2005/06/30 21:40:29 UTC

recursion: see recursion

Lately, I've been seeing thousands of variations of the following:

050630 153456 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315216/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk
050630 153457 Response content length is not known
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Kultura/315299/text/kultura/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/index.htm
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315408/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/kategorija


The mt.net.mk segment just repeats forever for a bunch of URLs, at different
depths, etc.

Any ideas how to prevent it / fix it?


regex url filter

Posted by Emilijan Mirceski <em...@cpuedge.com>.
If I take '?' and '=' out of this rule in my regex-urlfilter:

>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]

I will have more pages in my database.

Is there any strong reason why such URLs are skipped by default in the release
version? (My segments hold about 100 thousand pages in total, which is barely
1.2 GB.)
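
To make the effect of that rule concrete, here is a minimal Python sketch of
how such a rule list might behave. It assumes the filter tries rules from top
to bottom, applies the first pattern found anywhere in the URL, and rejects
URLs that match no rule. The helper filter_url, the trailing accept-everything
rule, and the sample URLs are hypothetical; this is not Nutch's own code.

```python
import re

def filter_url(url, rules):
    """Return True if url is accepted by the first matching rule.

    rules is a list of (sign, pattern) pairs; '+' accepts, '-' rejects.
    Assumption: first match wins, and no match means the URL is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# The rule as quoted above, followed by a hypothetical accept-everything rule.
default_rules = [("-", r"[?*!@=]"), ("+", r".")]

# The same rule with '?' and '=' taken out of the character class.
relaxed_rules = [("-", r"[*!@]"), ("+", r".")]

query_url = "http://www.example.com/news.php?id=42"  # hypothetical URL
plain_url = "http://www.example.com/news/42.html"    # hypothetical URL

print(filter_url(query_url, default_rules))  # False
print(filter_url(query_url, relaxed_rules))  # True
print(filter_url(plain_url, default_rules))  # True
```

With the relaxed rule, URLs that carry a query string get through, which would
explain the larger page count. It would also let through any site that
generates endless query-string variations of the same page.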

Regards,
Emilijan


RE: recursion: see recursion

Posted by Emilijan Mirceski <em...@cpuedge.com>.
Problem solved with an appropriate regex. The cause of the problem is some
strange combination of Java code and URLs.
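
The exact regex is not given here. As one possible sketch in Python, assuming
the goal is to reject any URL in which the same slash-delimited path segment
occurs three or more times in a row, a back-reference pattern is enough. The
pattern and the helper has_repeating_segment are illustrative assumptions, not
the rule that was actually used.

```python
import re

# Hypothetical pattern: one path segment repeated three or more times in a row.
# (/[^/]+) captures a segment with its leading slash, \1{2,} requires at least
# two further copies, and the lookahead makes sure the last copy is a whole
# segment rather than a prefix of a longer one.
REPEATING_SEGMENT = re.compile(r"(/[^/]+)\1{2,}(?=/|$)")

def has_repeating_segment(url):
    """Return True if url contains a path segment repeated 3+ times in a row."""
    return REPEATING_SEGMENT.search(url) is not None

# One of the looping URLs from the log, and its non-looping prefix.
looping = ("http://www.idividi.com.mk/vesti/makedonija/Kultura/315299/"
           "text/kultura/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/"
           "index.htm")
normal = "http://www.idividi.com.mk/vesti/makedonija/Kultura/315299/text/kultura/"

print(has_repeating_segment(looping))  # True
print(has_repeating_segment(normal))   # False
```

In a filter file that uses the +/- convention quoted earlier in this thread,
such a pattern would presumably go on a line starting with '-'. A legitimate
URL that repeats a segment three times in a row would be rejected as well, so
the threshold is a judgment call.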

-----Original Message-----
From: Emilijan Mirceski [mailto:emilijan@cpuedge.com] 
Sent: Thursday, June 30, 2005 3:40 PM
To: nutch-user@lucene.apache.org
Subject: recursion: see recursion


Lately, I've been seeing thousands of variations of the following:

050630 153456 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315216/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk
050630 153457 Response content length is not known
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Kultura/315299/text/kultura/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/index.htm
050630 153458 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315408/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/kategorija


The mt.net.mk segment just repeats forever for a bunch of URLs, at different
depths, etc.

Any ideas how to prevent it / fix it?