Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2018/05/01 20:47:15 UTC

random sampling of crawlDb urls

I want to extract a random sample of URLs from my big crawldb. I think I should be able to do this using readdb -dump with a Jexl expression, but I haven't been able to get it to work.

I have tried several variations of the following command:
$NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr "((Math.random())>=0.1)"


Typically, it produces zero records. I know the expression is getting through to the CrawlDbReader (without quotes) because I get this message:
18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: ((Math.random())>=0.1)

Even when I use the expression "((Math.random())>=0.0)" I get zero output records.

If I use the expression "((Math.random())>=.99)" it lets all records pass through to the output. I guess it has something to do with the lack of a leading zero on the numeric constant.

Does anyone know a good way to extract a random sample of records from a crawlDb?

RE: random sampling of crawlDb urls

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Michael,

If you are using 1.14, there is a parameter -sample that allows you to request a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463.

	Yossi.
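
For illustration, here is a rough sketch of how the -sample parameter might be combined with the readdb -dump command from the original question. The thread only confirms that -sample exists in 1.14; that it takes a fraction between 0 and 1, that it can be combined with -format crawldb, and the output directory name are all assumptions to check against the readdb usage message. The helper function build_readdb_sample_cmd is hypothetical and only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: print a readdb -dump command that uses -sample (NUTCH-2463).
# Assumptions not confirmed in the thread: -sample takes a fraction in (0, 1],
# and the output directory used below is a made-up example path.

build_readdb_sample_cmd() {
    crawldb=$1    # path to the CrawlDb
    out_dir=$2    # directory the dump is written to
    fraction=$3   # assumed meaning: share of records to keep, e.g. 0.1
    printf '%s readdb %s -dump %s -format crawldb -sample %s\n' \
        "${NUTCH_HOME:-.}/runtime/deploy/bin/nutch" "$crawldb" "$out_dir" "$fraction"
}

# Print the command for inspection; paste and run it once it looks right.
build_readdb_sample_cmd /crawls/pop2/data/crawldb /crawls/pop2/data/crawldb_sample 0.1
```

Printing rather than executing keeps the sketch safe to try on a machine without a Hadoop deployment. Since this approach drops the -expr option, it sidesteps the Jexl evaluation problem described in the question.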
