Posted to commits@nutch.apache.org by ab...@apache.org on 2006/03/08 15:10:13 UTC

svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Author: ab
Date: Wed Mar  8 06:10:12 2006
New Revision: 384219

URL: http://svn.apache.org/viewcvs?rev=384219&view=rev
Log:
Don't generate URLs that don't pass URLFilters.

Modified:
    lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?rev=384219&r1=384218&r2=384219&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Wed Mar  8 06:10:12 2006
@@ -28,6 +28,8 @@
 import org.apache.hadoop.mapred.*;
 import org.apache.hadoop.mapred.lib.*;
 
+import org.apache.nutch.net.URLFilterException;
+import org.apache.nutch.net.URLFilters;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.nutch.util.NutchJob;
 
@@ -45,11 +47,13 @@
     private HashMap hostCounts = new HashMap();
     private int maxPerHost;
     private Partitioner hostPartitioner = new PartitionUrlByHost();
+    private URLFilters filters;
 
     public void configure(JobConf job) {
       curTime = job.getLong("crawl.gen.curTime", System.currentTimeMillis());
       limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
       maxPerHost = job.getInt("generate.max.per.host", -1);
+      filters = new URLFilters(job);
     }
 
     public void close() {}
@@ -58,6 +62,14 @@
     public void map(WritableComparable key, Writable value,
                     OutputCollector output, Reporter reporter)
       throws IOException {
+      UTF8 url = (UTF8)key;
+      // don't generate URLs that don't pass URLFilters
+      try {
+        if (filters.filter(url.toString()) == null)
+          return;
+      } catch (URLFilterException e) {
+        LOG.warning("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
+      }
       CrawlDatum crawlDatum = (CrawlDatum)value;
 
       if (crawlDatum.getStatus() == CrawlDatum.STATUS_DB_GONE)

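For context, the contract that the new map() code relies on: URLFilters chains all configured URLFilter plugins, returning null as soon as one of them rejects the URL (or the possibly rewritten URL string if all of them accept it), and throwing URLFilterException if a plugin fails. A minimal sketch of that contract, using a made-up example URL, might look like this:

  // Sketch of the behavior the patch above depends on (not part of the commit).
  // "job" is a JobConf, as in Selector.configure(); the URL is a made-up example.
  URLFilters filters = new URLFilters(job);
  try {
    String accepted = filters.filter("http://example.com/some/page.html");
    if (accepted == null) {
      // at least one URLFilter plugin rejected the URL -- the Generator now skips it
    } else {
      // all plugins passed; 'accepted' may be a rewritten form of the input URL
    }
  } catch (URLFilterException e) {
    // a plugin failed; the patch logs a warning and lets the URL through in this case
  }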


Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> Stefan Groschupf wrote:
> 
>> I notice filtering urls is done in the output format until parsing. 
>> Wouldn't it be better to filter it until updating crawlDb?
> 
> 
> "Until" == "during" ?
> 
> As you observed, doing it at this stage saves space in segment data, and 
> in consequence saves on processing time (no CPU/IO needed to process 
> useless data, throw away junk as soon as possible).

I think it is better to not filter at parse time, but at db insert time.
This way, if desired urls are accidentally filtered out, then one only
has to re-update the db to include them rather than re-parse and re-update.

Doug

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Stefan Groschupf <sg...@media-style.com>.
>> I notice filtering urls is done in the output format until  
>> parsing. Wouldn't it be better to filter it until updating crawlDb?
>
> "Until" == "during" ?
Sorry, yes during!
>
> As you observed, doing it at this stage saves space in segment  
> data, and in consequence saves on processing time (no CPU/IO needed  
> to process useless data, throw away junk as soon as possible).
Makes sense, thanks for the hint. I guess that with a published db
filter tool for Nutch 0.7 and 0.8, people will now be able to clean up
web and crawl databases.

Stefan 

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> I notice filtering urls is done in the output format until parsing. 
> Wouldn't it be better to filter it until updating crawlDb?

"Until" == "during" ?

As you observed, doing it at this stage saves space in segment data, and 
in consequence saves on processing time (no CPU/IO needed to process 
useless data, throw away junk as soon as possible).

> Sure it would require some more disk space, but since parsing
> is done until fetching it may improve fetching speed.

Parsing is not always done at fetching stage (Fetcher.parsing == false).
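For reference, whether the fetcher parses inline is driven by a boolean configuration property; a hedged sketch of the check (assuming the property is named "fetcher.parse" -- verify against the nutch-default.xml of your revision) looks like:

  // Hedged sketch: how the parse-while-fetching switch is typically read.
  // The property name "fetcher.parse" is an assumption here.
  boolean parsing = job.getBoolean("fetcher.parse", true);
  if (!parsing) {
    // the fetcher only stores raw content; parsing (and any filtering done in
    // the parse output format) happens in a separate, later parse job
  }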

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Stefan Groschupf <sg...@media-style.com>.
I notice filtering urls is done in the output format until parsing.  
Wouldn't it be better to filter it until updating crawlDb?
Sure it would require some more disk space, but since parsing
is done until fetching it may improve fetching speed.

Stefan

Am 08.03.2006 um 18:53 schrieb Doug Cutting:

> ab@apache.org wrote:
>> Don't generate URLs that don't pass URLFilters.
>
> Just to be clear, this is to support folks changing their filters  
> while they're crawling, right?  We already filter before we put  
> things into the db, so we're filtering twice now, no?  If so, then  
> perhaps there should be an option to disable this second filtering  
> for folks who don't change their filters?
>
> Doug
>
>

---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> IMHO doing this here has a minimal impact while preventing a common 
> problem, but if you think this would harm many users then we should of 
> course make it optional.

Let's just leave it as-is for now.  Thanks!

Doug

Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Matt Kangas <ka...@gmail.com>.
Rod, I just posted my PruneDB.java file to:
http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

(104 lines, nutch 0.7 only.)

License granted anyone to hack/copy this as they wish. Should be easy  
to adapt to 0.8.

> Usage: PruneDB <db> -s
> Where: db is the path of the nutch db to prune
> Usage: -s simulate: parses the db, but doesn't delete any pages

--Matt

On Mar 8, 2006, at 1:47 PM, Rod Taylor wrote:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> ab@apache.org wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>

--
Matt Kangas / kangas@gmail.com



CrawlDb Filter tool, was Re: svn commit: r384219 -

Posted by Stefan Groschupf <sg...@media-style.com>.
Rod,
a few days ago I wrote a small tool that filters a crawlDb.
You can find it here now:
http://issues.apache.org/jira/browse/NUTCH-226
Give it a try and let me know if it works for you; in any case,
back up your crawlDb first!!!
I tested it only with a small crawlDb, so use it at your own risk. :)
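
To make the idea concrete (this is not the actual NUTCH-226 code, just a sketch): a filtering pass over an existing crawlDb can be a map-only job that reuses the same URLFilters call the Generator patch above uses, copying only the entries that still pass the current filters. The class name and the choice to drop URLs that throw URLFilterException are assumptions made for this sketch; imports mirror those in Generator.java above.

  // Hypothetical sketch, not the actual NUTCH-226 tool.
  public static class CrawlDbFilterMapper implements Mapper {
    private URLFilters filters;

    public void configure(JobConf job) {
      filters = new URLFilters(job);
    }

    public void close() {}

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
      throws IOException {
      String url = key.toString();           // crawlDb keys are the URLs
      try {
        if (filters.filter(url) == null)
          return;                            // rejected: drop the entry
      } catch (URLFilterException e) {
        return;                              // assumption: drop URLs we cannot filter
      }
      output.collect(key, value);            // accepted: keep the entry in the new db
    }
  }

The driver would simply read the old crawldb directory and write a filtered copy to a new one.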

Stefan

Am 08.03.2006 um 19:47 schrieb Rod Taylor:

> On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>>> ab@apache.org wrote:
>>>> Don't generate URLs that don't pass URLFilters.
>>>
>>> Just to be clear, this is to support folks changing their filters
>>> while they're crawling, right?  We already filter before we
>>
>> Yes, and this seems to be the most common case. This is especially
>> important since there are no tools yet to clean up the DB.
>
> I have this situation now. There are over 100M urls in my DB from crap
> domains that I want to get rid of.
>
> Adding a --refilter option to updatedb seemed like the most obvious
> course of action.
>
> A completely separate command so it could be initiated by hand would
> also work for me.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Rod Taylor <rb...@sitesell.com>.
On Wed, 2006-03-08 at 19:15 +0100, Andrzej Bialecki wrote:
> Doug Cutting wrote:
> > ab@apache.org wrote:
> >> Don't generate URLs that don't pass URLFilters.
> >
> > Just to be clear, this is to support folks changing their filters 
> > while they're crawling, right?  We already filter before we 
> 
> Yes, and this seems to be the most common case. This is especially 
> important since there are no tools yet to clean up the DB.

I have this situation now. There are over 100M urls in my DB from crap
domains that I want to get rid of.

Adding a --refilter option to updatedb seemed like the most obvious
course of action.

A completely separate command so it could be initiated by hand would
also work for me.

-- 
Rod Taylor <rb...@sitesell.com>


Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> ab@apache.org wrote:
>> Don't generate URLs that don't pass URLFilters.
>
> Just to be clear, this is to support folks changing their filters 
> while they're crawling, right?  We already filter before we 

Yes, and this seems to be the most common case. This is especially 
important since there are no tools yet to clean up the DB.

> put things into the db, so we're filtering twice now, no?  If so, then 
> perhaps there should be an option to disable this second filtering for 
> folks who don't change their filters?

IMHO doing this here has a minimal impact while preventing a common 
problem, but if you think this would harm many users then we should of 
course make it optional.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Posted by Doug Cutting <cu...@apache.org>.
ab@apache.org wrote:
> Don't generate URLs that don't pass URLFilters.

Just to be clear, this is to support folks changing their filters while 
they're crawling, right?  We already filter before we put things into 
the db, so we're filtering twice now, no?  If so, then perhaps there 
should be an option to disable this second filtering for folks who don't 
change their filters?
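
One hypothetical shape for such an opt-out, sketched against the Selector code in the commit above; the property name "generate.filter" is invented here for illustration and is not an existing option at the time of this commit:

  // Hypothetical sketch of the opt-out suggested above (not part of the commit).
  // In Selector:
  private boolean filter;                    // whether to re-filter at generate time

  // in configure(JobConf job), after the existing settings:
  filter = job.getBoolean("generate.filter", true);   // invented property name
  if (filter)
    filters = new URLFilters(job);

  // in map(), the new filtering block becomes conditional:
  if (filter) {
    try {
      if (filters.filter(url.toString()) == null)
        return;
    } catch (URLFilterException e) {
      LOG.warning("Couldn't filter url: " + url + " (" + e.getMessage() + ")");
    }
  }

Defaulting the property to true keeps the behavior of this commit; sites that never change their filters could set it to false and skip the second pass.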

Doug