Posted to user@nutch.apache.org by John Mendenhall <jo...@surfutopia.net> on 2008/10/13 23:28:23 UTC

nutch mergedb filter does not appear to be filtering

We are using nutch version nutch-2008-07-22_04-01-29.
We have a crawldb with over 1 million urls.
We need to remove (filter) 17000 urls.

We have created a new nutch configuration directory.
The only difference between this configuration directory
and the normal one is the automaton-urlfilter.txt,
crawl-urlfilter.txt, and regex-urlfilter.txt files.

We have added the urls we would like removed before the normal
patterns in the url filter files.

Here is how we list the urls to be removed (in the
regex-urlfilter.txt file):

  -^http://www.domain.com/path1/path2/file1$
  -^http://www.domain.com/path1/path2/file2$

The normal patterns are listed as follows:

  +^http://www.domain.com/path3/
  +^http://www.domain.com/.*fileending1$
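
For clarity, here is roughly how the rules are ordered in our
regex-urlfilter.txt (the domain, paths, and file endings above are
placeholders, and the rest of the file, including whatever catch-all
rule ends it, is unchanged from our normal configuration).  My
understanding is that the regex url filter uses the first rule that
matches, so the exclusions have to come before the accepts:

  # urls to remove from the crawldb
  -^http://www.domain.com/path1/path2/file1$
  -^http://www.domain.com/path1/path2/file2$

  # normal accept patterns
  +^http://www.domain.com/path3/
  +^http://www.domain.com/.*fileending1$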

We run CrawlDbMerger command as follows:

  bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter
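
(For reference, the argument order as I understand the usage message
is the output crawldb first, then one or more input crawldbs, then the
options:

  bin/nutch mergedb <output_crawldb> <crawldb1> [<crawldb2> ...] [-normalize] [-filter]

so the command above should write the filtered copy to
/full/patch/newcrawldb.)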

I modified the log4j.properties file entry for CrawlDbMerger as follows:

  log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout
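
For that line to print anything to the console, log4j.properties also
has to define the cmdstdout appender.  If I remember right, the stock
Nutch conf defines it roughly like this (ours is unchanged apart from
the logger line above):

  log4j.appender.cmdstdout=org.apache.log4j.ConsoleAppender
  log4j.appender.cmdstdout.layout=org.apache.log4j.PatternLayout
  log4j.appender.cmdstdout.layout.ConversionPattern=%m%n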

It takes less than a couple of minutes to run.  It does not output any
debug statements.

When I run bin/nutch readdb <crawldb> -stats for the original
crawldb (/full/patch/crawldb) and the new crawldb (/full/patch/newcrawldb),
the stats for both crawldbs are the same.
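
Roughly what I ran to compare the two (with the default log settings
the stats appear to go to stdout, so I just redirected and diffed):

  bin/nutch readdb /full/patch/crawldb -stats    > stats-old.txt
  bin/nutch readdb /full/patch/newcrawldb -stats > stats-new.txt
  diff stats-old.txt stats-new.txt    # reports no differences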

It appears it is doing a copy with no filtering.

I will continue trying different things.  I will post when I determine
the problem.  I am hoping it is just something stupid I am doing.

Please let me know if there is anything specific I should be looking
at first.  Thanks in advance for any guidance or ideas provided.

Thanks!

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch mergedb filter does not appear to be filtering

Posted by John Mendenhall <jo...@surfutopia.net>.
> > We have created a new nutch configuration directory.
> > The only difference between this configuration directory
> > and the normal one is the automaton-urlfilter.txt,
> > crawl-urlfilter.txt, and regex-urlfilter.txt files.
> 
> To use the new nutch configuration directory, we set the env
> var NUTCH_CONF_DIR.  I know the nutch script is picking up this
> value; I put in some debug statements and can see the directory
> is added properly to the CLASSPATH.  I also set HADOOP_CONF_DIR,
> but that has no effect either.
> 
> I also checked the access times on the regex-urlfilter.txt files.
> The new regex-urlfilter.txt is never accessed; the process only
> reads the regex-urlfilter.txt file in the $NUTCH_HOME/conf
> directory.  It does not appear to be using NUTCH_CONF_DIR at all.
> 
> Does anyone have any thoughts or ideas for what we can do to
> get this to work with the NUTCH_CONF_DIR?  Thank you in
> advance for any pointers.

I fixed the problem.  It appears the hadoop daemons only pick up
NUTCH_CONF_DIR and HADOOP_CONF_DIR when they are started, so they
have to be restarted for the change to take effect.

Before modifying NUTCH_CONF_DIR and HADOOP_CONF_DIR, stop the
hadoop processes.  Then set both variables to the special
configuration directory and start the hadoop processes again.
Once the filtering is done, stop the hadoop processes, unset the
NUTCH_CONF_DIR and HADOOP_CONF_DIR environment variables, and
restart the hadoop processes with the normal configuration.
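
For anyone else who runs into this, the sequence was roughly the
following (the start/stop scripts are the stock hadoop ones, and the
conf path is a placeholder for our special configuration directory):

  bin/stop-all.sh                         # stop the hadoop daemons
  export NUTCH_CONF_DIR=/path/to/special/conf
  export HADOOP_CONF_DIR=$NUTCH_CONF_DIR
  bin/start-all.sh                        # daemons start with the new conf

  bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter

  bin/stop-all.sh
  unset NUTCH_CONF_DIR HADOOP_CONF_DIR
  bin/start-all.sh                        # back to the normal configuration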

Everything works like a charm now.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services

Re: nutch mergedb filter does not appear to be filtering

Posted by John Mendenhall <jo...@surfutopia.net>.
> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 1 million urls.
> We need to remove (filter) 17000 urls.
> 
> We have created a new nutch configuration directory.
> The only difference between this configuration directory
> and the normal one is the automaton-urlfilter.txt,
> crawl-urlfilter.txt, and regex-urlfilter.txt files.
> 
> We have added the urls we would like removed before the normal
> patterns in the url filter files.
> 
> Here is how we list the urls to be removed (in the
> regex-urlfilter.txt file):
> 
>   -^http://www.domain.com/path1/path2/file1$
>   -^http://www.domain.com/path1/path2/file2$
> 
> The normal patterns are listed as follows:
> 
>   +^http://www.domain.com/path3/
>   +^http://www.domain.com/.*fileending1$
> 
> We run CrawlDbMerger command as follows:
> 
>   bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter
> 
> I modified the log4j.properties file entry for CrawlDbMerger as follows:
> 
>   log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout
> 
> It takes less than a couple of minutes to run.  It does not output any
> debug statements.
> 
> When I run bin/nutch readdb <crawldb> -stats for the original
> crawldb (/full/patch/crawldb) and the new crawldb (/full/patch/newcrawldb),
> the stats for both crawldbs are the same.
> 
> It appears it is doing a copy with no filtering.
> 
> I will continue trying different things.  I will post when I determine
> the problem.  I am hoping it is just something stupid I am doing.
> 
> Please let me know if there is anything specific I should be looking
> at first.  Thanks in advance for any guidance or ideas provided.

I found the problem.  I do not know how to fix the problem.

To use the new nutch configuration directory, we set the env
var NUTCH_CONF_DIR.  I know the nutch script is picking up this
value; I put in some debug statements and can see the directory
is added properly to the CLASSPATH.  I also set HADOOP_CONF_DIR,
but that has no effect either.

I also checked the access times on the regex-urlfilter.txt files.
The new regex-urlfilter.txt is never accessed; the process only
reads the regex-urlfilter.txt file in the $NUTCH_HOME/conf
directory.  It does not appear to be using NUTCH_CONF_DIR at all.
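
In case it helps anyone reproduce this, I was checking which file gets
read roughly like this (ls -lu shows the access time, so this assumes
the filesystem is not mounted with noatime):

  ls -lu $NUTCH_CONF_DIR/regex-urlfilter.txt $NUTCH_HOME/conf/regex-urlfilter.txt
  bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter
  ls -lu $NUTCH_CONF_DIR/regex-urlfilter.txt $NUTCH_HOME/conf/regex-urlfilter.txt

Only the copy under $NUTCH_HOME/conf shows a new access time.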

Does anyone have any thoughts or ideas for what we can do to
get this to work with the NUTCH_CONF_DIR?  Thank you in
advance for any pointers.

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services