Posted to user@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/03/31 12:32:23 UTC

Re: [Nutch-general] What's the difference between crawl-urlfilter.txt and regex-urlfilter.txt?

Hello Steve

I think I can explain more.

regex-urlfilter.txt is used by the RegexURLFilter plugin,
while crawl-urlfilter.txt is used by CrawlTool (via crawl-tool.xml).
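
To illustrate why the two files look so similar, here is a rough sketch of how a rule file in this format can be evaluated. It is not Nutch's actual code, only a minimal Python illustration under some assumptions: each rule line starts with + (include) or - (exclude) followed by a regex, lines starting with # are comments, the first matching rule decides, and a URL that matches no rule is ignored. The names parse_rules and accepts, the example.com domain, and both sample rule sets are hypothetical.

```python
import re


def parse_rules(text):
    """Parse rule lines of the form '+regex' or '-regex' into (include, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule matching the URL is an include rule."""
    for include, pattern in rules:
        if pattern.search(url):  # assumption: a match anywhere in the URL counts
            return include
    return False  # assumption: no matching rule means the URL is ignored


# Hypothetical rules in the spirit of a restricted (intranet-style) crawl.
INTRANET_RULES = parse_rules(r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept hosts under example.com (hypothetical domain)
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
""")

# Hypothetical rules in the spirit of an open (whole-web) crawl.
WEB_RULES = parse_rules(r"""
-^(file|ftp|mailto):
+.
""")

if __name__ == "__main__":
    for url in ("http://www.example.com/index.html", "http://other.org/"):
        print(url, accepts(INTRANET_RULES, url), accepts(WEB_RULES, url))
```

The point of the sketch is that both files can share one syntax and one filter; only the rule set handed to the filter differs between the two kinds of crawl.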

I think it is clear


Best regards, 

/Jack  
======= At 2005-03-31, 17:51:21 you wrote: =======

>Olaf, cheers. I am still confused, though.
>How does nutch know which of the two to use,
>that is, how do I tell nutch whether it's doing an intranet
>or internet crawl? Do I rename regex-urlfilter.txt
>to crawl-urlfilter.txt if I want to do internet crawls?
>Steve
>
>
>-----Original Message-----
>From: Olaf Thiele [mailto:olaf.thiele@gmail.com] 
>Sent: Thursday, March 31, 2005 4:40 PM
>To: nutch-user@incubator.apache.org
>Subject: Re: What's the difference between crawl-urlfilter.txt and
>regex-urlfilter.txt?
>
>
>Hi Steve,
>the crawl-urlfilter is for intranet crawling while regex-urlfilter is
>for internet crawling.
>
>Kind regards,
>Olaf
>
>
>
>On Thu, 31 Mar 2005 12:01:19 +0800, Steve Follmer <sf...@meer.net>
>wrote:
>> 
>> What's the difference between crawl-urlfilter.txt and
>> regex-urlfilter.txt? They look very similar. Why does nutch have both,
>> and what do they do differently?
>>
>> My best guess is that the first is used only by the crawl tool and the
>> second is used by nutch proper. The crawl tool and nutch proper also
>> seem to have separate .xml config files. I further guess that this is
>> just an artifact of having two separate tools that need separate but
>> equal configuration?
>> 
>> -Poindexter
>> 
>> 
>
>
>-- 
>
><SimpleHuman gender="male">
>   <Physical name="Olaf Thiele" />
>   <Virtual adress="http://www.olafthiele.de" />
></SimpleHuman>
>
>
>