Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/12/04 18:38:25 UTC

purging low-scoring urls

Is it possible to purge low-scoring urls from the crawldb? My news crawl has many thousands of zero-scoring urls and also many thousands of urls with scores less than 0.03. These urls will never be fetched because they will never make it into the generator's topN by score. So, all they do is make the process slower.

It seems like something an urlfilter could do, but I have not found any documentation for any urlfilter that does it.
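(For context, one way to see how many low-scoring entries the crawldb holds is the readdb statistics output. A minimal sketch, assuming a standard Nutch 1.x layout with the crawldb at crawl/crawldb; the path is illustrative:

    # Print CrawlDb statistics, including status counts and min/avg/max score
    bin/nutch readdb crawl/crawldb -stats
)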

RE: purging low-scoring urls

Posted by Yossi Tamari <yo...@pipl.com>.
Forgot to say: a urlfilter can't do that, since its input is just the URL, without any metadata such as the score.

> -----Original Message-----
> From: Yossi Tamari [mailto:yossi.tamari@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey' <mc...@yahoo.com>
> Subject: RE: purging low-scoring urls
> 
> Hi Michael,
> 
> I think one way you can do it is using `readdb <crawldb> -dump new_crawldb
> -format crawldb -expr "score>0.03"`.
> You would then need to use hdfs commands to replace the existing
> <crawldb>/current with new_crawldb.
> Of course, I strongly recommend backing up the current crawldb before
> replacing it...
> 
> 	Yossi.



RE: purging low-scoring urls

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Michael,

I think one way you can do it is using `readdb <crawldb> -dump new_crawldb -format crawldb -expr "score>0.03" `.
You would then need to use hdfs commands to replace the existing <crawldb>/current with new_crawldb.
Of course, I strongly recommend backing up the current crawldb before replacing it...
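For reference, a minimal sketch of the full sequence, assuming a Nutch 1.x install with the crawldb at crawl/crawldb and the Hadoop fs shell available; names and paths are illustrative, adjust them to your setup:

    # 1. Dump only the entries with score > 0.03 into a new crawldb-format directory
    bin/nutch readdb crawl/crawldb -dump new_crawldb -format crawldb -expr "score>0.03"

    # 2. Keep the old data as a backup, then swap in the filtered copy
    hadoop fs -mv crawl/crawldb/current crawl/crawldb/current.bak
    hadoop fs -mv new_crawldb crawl/crawldb/current

Afterwards you can re-run `readdb <crawldb> -stats` to confirm the number of entries has dropped as expected, and remove the backup once you are satisfied.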

	Yossi. 

> -----Original Message-----
> From: Michael Coffey [mailto:mcoffey@yahoo.com.INVALID]
> Sent: 04 December 2017 20:38
> To: User <us...@nutch.apache.org>
> Subject: purging low-scoring urls
> 
> Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> many thousands of zero-scoring urls and also many thousands of urls with
> scores less than 0.03. These urls will never be fetched because they will never
> make it into the generator's topN by score. So, all they do is make the process
> slower.
> 
> It seems like something an urlfilter could do, but I have not found any
> documentation for any urlfilter that does it.