Posted to dev@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/09/11 05:45:42 UTC

Nutch-87 Setup

hi Matt:

Your Nutch-87 patch is a good idea, and I believe it provides
a solution for a good-sized controlled domain, say
hundreds of thousands of sites.

I am currently trying to apply it to Nutch 0.7.

I have several questions I would like clarified:

1)
Should I create two plug-in classes in Nutch,
i.e. one for "WhitelistURLFilter" and
one for "WhitelistWriter"?

2)
I found that Whitelist.java refers to
"import epile.util.LogLevel;"

and WhitelistURLFilter.java refers to
"import epile.crawl.util.StringURL;
import epile.util.LogLevel;"

Do these packages already exist in the Nutch lib? If not,
should we import a new epile*.jar?

3)
If we want to use Nutch-87, should we change the
Nutch core code?

I plan to replace every occurrence of
RegexURLFilter with WhitelistURLFilter.

Is that the right approach?

thanks,

Michael Ji,



	
		

Re: Nutch-87 Setup

Posted by Matt Kangas <ka...@gmail.com>.

Michael, this looks like an error in your Nutch configuration, or
possibly your CLASSPATH. I'd guess it's the former. Take a look at
the following nutch-site.xml (or nutch-default.xml) properties, and make
sure they reference (a) the right place on disk, (b) plugins that
actually exist:

- plugin.folders
- plugin.includes
- urlfilter.order
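
To illustrate why plugin.includes deserves a close look, here is a minimal sketch. It assumes the property is applied as a whole-string regular expression against a plugin's name; that matching rule, the class name and the isIncluded helper are assumptions for illustration, not Nutch code. The pattern is the plugin.includes value posted earlier in this thread, and "WhitelistURLFilter" is the directory name from the "not including" log line.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {
    // Pattern copied from the plugin.includes value posted in this thread.
    static final Pattern INCLUDES = Pattern.compile(
        "epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)");

    // Assumption: a plugin is loaded only if its whole name matches the pattern.
    static boolean isIncluded(String pluginName) {
        return INCLUDES.matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "epile-whitelisturlfilter", "urlfilter-regex",
            "index-basic", "WhitelistURLFilter"
        };
        for (String name : names) {
            System.out.println(name + " -> "
                + (isIncluded(name) ? "included" : "not including"));
        }
    }
}
```

Under that assumption, a plugin built under the name "WhitelistURLFilter" would be skipped, because only "epile-whitelisturlfilter" appears in the pattern. Either the name or the pattern would need to change so the two agree.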

If you're still stuck, email me privately and we'll try to work  
through this.

--Matt

On Sep 13, 2005, at 7:14 PM, Michael Ji wrote:

> hi Matt:
>
> Thanks for your advice.
>
> I can launch URLFilterChecker now; however, I
> get the following error, which complains about the indexing filter.
> Could you let me know where the problem might be?
>
> "
> 050921 191015 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
>
> 050921 191015 not including: E:\programs\cygwin\home\fji\versionControl\nutch_V07_P87\nutch\build\plugins\WhitelistURLFilter
>
> 050921 191015 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>     at org.apache.nutch.net.URLFilterChecker.checkAll(URLFilterChecker.java:93)
>     at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
> Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
>     at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
>     at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
>     ... 2 more
> Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
>     at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
>     at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
>     at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
>     ... 3 more
> "
>
> thanks,
>
> Michael Ji

--
Matt Kangas / kangas@gmail.com

Re: Nutch-87 Setup

Posted by Michael Ji <fj...@yahoo.com>.

hi Matt:

Thanks for your advice.

I can launch URLFilterChecker now; however, I
get the following error, which complains about the indexing filter.
Could you let me know where the problem might be?

"
050921 191015 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter

050921 191015 not including: E:\programs\cygwin\home\fji\versionControl\nutch_V07_P87\nutch\build\plugins\WhitelistURLFilter

050921 191015 SEVERE org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.nutch.net.URLFilterChecker.checkAll(URLFilterChecker.java:93)
	at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
Caused by: java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
	at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:147)
	at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:40)
	... 2 more
Caused by: org.apache.nutch.plugin.PluginRuntimeException: extension point: org.apache.nutch.indexer.IndexingFilter does not exist.
	at org.apache.nutch.plugin.PluginRepository.installExtensions(PluginRepository.java:78)
	at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:61)
	at org.apache.nutch.plugin.PluginRepository.getInstance(PluginRepository.java:144)
	... 3 more
"

thanks,

Michael Ji


--- Matt Kangas <ka...@gmail.com> wrote:

> Hi Michael,
>
> Ordinarily there's no need to edit bin/nutch to run a specific class.
> If the class is in a JAR in <nutch-home>/lib, you can just say "nutch
> <full class name>". For example, the following two commands are
> equivalent:
>
> $ nutch crawl
> $ nutch org.apache.nutch.tools.CrawlTool
>
> However, the situation is a little different for plugins. Ordinarily
> the classes for a plugin are placed in <nutch-home>/plugins/<plugin-name>,
> not <nutch-home>/lib. To instantiate the plugin class, you
> must run *another* class which calls the appropriate plugin factory. For
> URLFilter plugins, the factory class is
> org.apache.nutch.net.URLFilters. This class does not have a main()
> method, but there is a helper class to test filters,
> URLFilterChecker. You can run it as follows:
>
> $ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
>
> Hope that helps. Let me know if that doesn't work for you.
>
> --Matt

Re: Nutch-87 Setup

Posted by Matt Kangas <ka...@gmail.com>.

Hi Michael,

Ordinarily there's no need to edit bin/nutch to run a specific class.  
If the class is in a JAR in <nutch-home>/lib, you can just say "nutch  
<full class name>". For example, the following two commands are  
equivalent:

$ nutch crawl
$ nutch org.apache.nutch.tools.CrawlTool

However, the situation is a little different for plugins. Ordinarily
the classes for a plugin are placed in <nutch-home>/plugins/<plugin-name>,
not <nutch-home>/lib. To instantiate the plugin class, you
must run *another* class which calls the appropriate plugin factory. For
URLFilter plugins, the factory class is
org.apache.nutch.net.URLFilters. This class does not have a main()
method, but there is a helper class to test filters,
URLFilterChecker. You can run it as follows:

$ nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls.txt
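
To make the -allCombined idea concrete, here is a minimal sketch. It assumes each URL filter returns the URL when it accepts it and null when it rejects it, and that the combined check passes every URL through all filters in order. The UrlFilter interface, the two sample filters and the "+"/"-" output are stand-ins invented for illustration, not the Nutch classes themselves.

```java
import java.util.Arrays;
import java.util.List;

public class FilterChainSketch {
    /** Stand-in for a URL filter: returns the URL to keep it, or null to reject it. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Runs the URL through every filter in order; stops at the first rejection. */
    static String filterAll(List<UrlFilter> filters, String url) {
        String current = url;
        for (UrlFilter f : filters) {
            if (current == null) {
                break;
            }
            current = f.filter(current);
        }
        return current;
    }

    // Hypothetical filters, for illustration only.
    static final UrlFilter HTTP_ONLY = url -> url.startsWith("http://") ? url : null;
    static final UrlFilter NO_QUERY = url -> url.contains("?") ? null : url;

    public static void main(String[] args) {
        List<UrlFilter> chain = Arrays.asList(HTTP_ONLY, NO_QUERY);
        String[] urls = {
            "http://example.com/a.html",
            "http://example.com/a?x=1",
            "ftp://example.com/a"
        };
        for (String url : urls) {
            // Mark accepted URLs with "+" and rejected ones with "-".
            System.out.println((filterAll(chain, url) != null ? "+" : "-") + url);
        }
    }
}
```

The point of the sketch is that a plugin filter is never run on its own; something has to build the chain and feed URLs to it, which is the role the checker class plays here.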

Hope that helps. Let me know if that doesn't work for you.

--Matt

On Sep 11, 2005, at 3:20 PM, Michael Ji wrote:

> hi Matt:
>
> I applied and compiled your patch against Nutch 0.7
> successfully.
>
> However, I ran into a runtime problem when I tried to test
> the patch manually by calling its class.
>
> I edited bin/nutch and added these lines:
> "
> elif [ "$COMMAND" = WhitelistFilterTester ] ; then
>   CLASS=epile.crawl.plugin.WhitelistURLFilter
> "
>
> But when I call it, I get this error:
> "
> Exception in thread "main"
> java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
> "
>
> I guess the classpath is not defined properly.
>
> My environment is set up as follows:
>
> 1. In the nutch build.xml,
> I added "<ant dir="epile" target="deploy"/>".
>
> 2. In nutch/src/plugin/,
> I created the directory "epile-basic/src/java",
> then copied the unzipped epile/crawl.. sources from nutch-87 into that directory.
>
> 3. I created a plugin.xml in epile-basic/, the same as the
> one you attached to the patch,
> and a new build.xml:
> "
> <?xml version="1.0"?>
>
> <project name="WhitelistURLFilter" default="jar">
>
>   <import file="../build-plugin.xml"/>
>
> </project>
>
> "
>
> 4. In nutch, I can run "ant" successfully;
> in nutch/build/, a new WhitelistURLFilter/ directory is created
> with WhitelistURLFilter.class inside.
>
> Did I miss something important?
>
> thanks,
>
> Michael Ji

--
Matt Kangas / kangas@gmail.com

Re: Nutch-87 Setup

Posted by Michael Ji <fj...@yahoo.com>.

hi Matt:

I applied and compiled your patch against Nutch 0.7
successfully.

However, I ran into a runtime problem when I tried to test
the patch manually by calling its class.

I edited bin/nutch and added these lines:
"
elif [ "$COMMAND" = WhitelistFilterTester ] ; then
  CLASS=epile.crawl.plugin.WhitelistURLFilter
"

But when I call it, I get this error:
"
Exception in thread "main"
java.lang.NoClassDefFoundError: epile/crawl/plugin/WhitelistURLFilter
"

I guess the classpath is not defined properly.

My environment is set up as follows:

1. In the nutch build.xml,
I added "<ant dir="epile" target="deploy"/>".

2. In nutch/src/plugin/,
I created the directory "epile-basic/src/java",
then copied the unzipped epile/crawl.. sources from nutch-87 into that directory.

3. I created a plugin.xml in epile-basic/, the same as the
one you attached to the patch,
and a new build.xml:
"
<?xml version="1.0"?>

<project name="WhitelistURLFilter" default="jar">

  <import file="../build-plugin.xml"/>

</project>

"

4. In nutch, I can run "ant" successfully;
in nutch/build/, a new WhitelistURLFilter/ directory is created
with WhitelistURLFilter.class inside.

Did I miss something important?

thanks,

Michael Ji

=====================================================
--- Matt Kangas <ka...@gmail.com> wrote:

> Hi Michael,
>
> Only WhitelistURLFilter is a plugin class. WhitelistWriter is a
> utility for creating the on-disk hash used at fetch/inject time by
> WhitelistURLFilter. Sorry for the confusion. I will add a sample
> plugin.xml file to the ticket, which should help make things clearer.
>
> Also, "epile.util.*" are our proprietary classes. LogLevel simply
> retrieves a value from a file other than nutch-site.xml. You can
> safely replace the references to epile.util.LogLevel with:
>
> > import org.apache.nutch.util.LogFormatter;
> > private static final Logger LOG =
> >     LogFormatter.getLogger(WhitelistURLFilter.class.getName());
>
> StringURL is another utility class, probably not of high value. It
> just applies regexes to URL strings. The only references to it that I
> see are:
>
> > $ grep StringURL WhitelistURLFilter.java
> > import epile.crawl.util.StringURL;
> >     String hostname = StringURL.extractHostname(url);
> >       String strippedURL = StringURL.removeHostname(url);
> >         String domain = StringURL.extractDomainFromHostname(hostname);
> >       if (StringURL.isCGI(url))
>
> extractHostname() and removeHostname() can be replaced with calls to
> java.net.URL.getHost() and getPath(), respectively. The other two are
> simple to replicate, and can probably be commented out for basic use.
>
> Finally, to use this "new" plugin, you need to:
>
> a) make sure a suitable directory is created under "plugins",
> including a plugin.xml and a jar with the WhitelistURLFilter class
>
> b) modify your nutch-site.xml to include the new filter:
>
> > <property>
> >   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
> >   <value>false</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.whitelist.file</name>
> >   <value>/var/epile/crawl/whitelist_map</value>
> >   <description>Name of file containing the location of the on-disk
> >   whitelist map directory.</description>
> > </property>
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > </property>
> >
> > <property>
> >   <name>urlfilter.order</name>
> >   <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
> > </property>
>
> c) run WhitelistWriter before attempting to fetch, so the filter has
> some rules to work with.
>
> I may have left out a crucial step or two here (0.5 wink ;), so feel
> free to ask if anything seems unclear. I'll go update the ticket now
> to clarify these points.
>
> --Matt

Re: Nutch-87 Setup

Posted by Matt Kangas <ka...@gmail.com>.

Hi Michael,

Only WhitelistURLFilter is a plugin class. WhitelistWriter is a  
utility for creating the on-disk hash used at fetch/inject time by  
WhitelistURLFilter. Sorry for the confusion. I will add a sample  
plugin.xml file to the ticket, which should help make things clearer.

Also, "epile.util.*" are our proprietary classes. LogLevel simply  
retrieves a value from a file other than nutch-site.xml. You can  
safely replace the references to epile.util.LogLevel with:

> import org.apache.nutch.util.LogFormatter;
> private static final Logger LOG =
>     LogFormatter.getLogger(WhitelistURLFilter.class.getName());

StringURL is another utility class, probably not of high value. It  
just applies regexes to URL strings. The only references to it that I  
see are:

> $ grep StringURL WhitelistURLFilter.java
> import epile.crawl.util.StringURL;
>     String hostname = StringURL.extractHostname(url);
>       String strippedURL = StringURL.removeHostname(url);
>         String domain = StringURL.extractDomainFromHostname(hostname);
>       if (StringURL.isCGI(url))

extractHostname() and removeHostname() can be replaced with calls to  
java.net.URL.getHost() and getPath(), respectively. The other two are  
simple to replicate, and can probably be commented out for basic use.
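
A minimal sketch of that substitution, using only java.net.URL. The helper names mirror the StringURL methods from the grep output above, but their bodies are guesses at the intended behavior, not the original epile code, and returning null for a malformed URL is a choice made for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlPartsSketch {
    // Stand-in for StringURL.extractHostname(url); returns null if the URL is malformed.
    static String extractHostname(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // Stand-in for StringURL.removeHostname(url); getPath() drops the query string.
    static String removeHostname(String url) {
        try {
            return new URL(url).getPath();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/docs/index.html?lang=en";
        System.out.println(extractHostname(url)); // www.example.com
        System.out.println(removeHostname(url));  // /docs/index.html
    }
}
```

If the query string needs to survive, URL.getFile() returns the path together with the query, which may be closer to what removeHostname() originally did.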

Finally, to use this "new" plugin, you need to:

a) make sure a suitable directory is created under "plugins",  
including a plugin.xml and a jar with the WhitelistURLFilter class

b) modify your nutch-site.xml to include the new filter:

> <property>
>   <name>epile.crawl.whitelist.enableUndirectedCrawl</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>urlfilter.whitelist.file</name>
>   <value>/var/epile/crawl/whitelist_map</value>
>   <description>Name of file containing the location of the on-disk
>   whitelist map directory.</description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>epile-whitelisturlfilter|urlfilter-(prefix|regex)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> </property>
>
> <property>
>   <name>urlfilter.order</name>
>   <value>org.apache.nutch.net.RegexURLFilter epile.crawl.plugin.WhitelistURLFilter</value>
> </property>

c) run WhitelistWriter before attempting to fetch, so the filter has  
some rules to work with.

I may have left out a crucial step or two here (0.5 wink ;), so feel  
free to ask if anything seems unclear. I'll go update the ticket now  
to clarify these points.

--Matt

On Sep 10, 2005, at 11:45 PM, Michael Ji wrote:

> hi Matt:
>
> Your Nutch-87 patch is a good idea, and I believe it provides
> a solution for a good-sized controlled domain, say
> hundreds of thousands of sites.
>
> I am currently trying to apply it to Nutch 0.7.
>
> I have several questions I would like clarified:
>
> 1)
> Should I create two plug-in classes in Nutch,
> i.e. one for "WhitelistURLFilter" and
> one for "WhitelistWriter"?
>
> 2)
> I found that Whitelist.java refers to
> "import epile.util.LogLevel;"
>
> and WhitelistURLFilter.java refers to
> "import epile.crawl.util.StringURL;
> import epile.util.LogLevel;"
>
> Do these packages already exist in the Nutch lib? If not,
> should we import a new epile*.jar?
>
> 3)
> If we want to use Nutch-87, should we change the
> Nutch core code?
>
> I plan to replace every occurrence of
> RegexURLFilter with WhitelistURLFilter.
>
> Is that the right approach?
>
> thanks,
>
> Michael Ji,
>

--
Matt Kangas / kangas@gmail.com