Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2016/10/25 18:25:06 UTC

Re: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status

Thanks a lot, Markus, for your answer.

For now I probably need to use a JEXL expression, because I have too many unfetched documents and it is important for me to crawl them first.
I have been using this command: bin/crawl urls/ crawl/ 5
Can you tell me how to use the JEXL parameter? An example using that command would be appreciated.

Later I will write my own custom scoring, perhaps dedicating one percentage of the topN parameter to the CrawlDb status (unfetched)
and the remaining percentage to normal scoring. This is to avoid traps.
Thanks a lot.

----- Original message -----
From: "Markus Jelsma" <ma...@openindex.io>
To: user@nutch.apache.org
Sent: Tuesday, 25 October 2016 12:48:06
Subject: ***UNCHECKED*** [MASSMAIL]RE: generator conditional by crawldb status

Yes, you can, using the -expr option with a JEXL expression, e.g. -expr '(status == "db_fetched")'

Fields are here: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
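
A minimal sketch of how the -expr option might be applied to the unfetched case asked about in this thread. It calls the generate step directly rather than through bin/crawl, since the thread does not say how bin/crawl would pass the option along. The paths crawl/crawldb and crawl/segments and the topN value are assumptions for illustration; db_unfetched is the CrawlDb status name for pages not yet fetched, and == is JEXL's comparison operator.

```shell
#!/bin/sh
# Assumed locations: the layout that "bin/crawl urls/ crawl/ 5" would create under crawl/.
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"
TOPN=50000   # assumed value, adjust to your crawl size

# JEXL expression: keep only CrawlDb entries whose status is db_unfetched.
EXPR='status == "db_unfetched"'

if [ -x bin/nutch ]; then
  # Run the generate step by hand with the expression applied.
  bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" -expr "$EXPR"
else
  # Nutch not found in the current directory: just show the command.
  echo "bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN -expr '$EXPR'"
fi
```

Run it from the Nutch runtime directory. Restricting generation to unfetched entries this way is unconditional, so the spider-trap caveat in this thread applies.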

But you can also achieve this with a custom scoring filter, which is a much more elegant solution. Beware of spider traps: if you prioritize unfetched pages unconditionally, you can easily fall into such a trap and never get out of it.
 
-----Original message-----
> From: Eyeris Rodriguez Rueda <er...@uci.cu>
> Sent: Tuesday 25th October 2016 18:34
> To: user@nutch.apache.org
> Subject: generator conditional by crawldb status
> 
> Hi all.
> I am using Nutch 1.12 and Solr 4.10.3 on Linux Mint 18.
> I want to crawl pages from the CrawlDb in this order:
> 
> 1-unfetched
> 2-modified
> 3-gone
> and others
> 
> I know that the generator is the process that decides which pages are selected from the CrawlDb.
> Any help or advice on crawling pages in that order would be appreciated.
> 
> Greetings.
>