Posted to user@nutch.apache.org by Roannel Fernández Hernández <ro...@uci.cu> on 2015/10/09 16:34:07 UTC

Re: [MASSMAIL]Nutch only fetch and parse the third part of urls

Hi Andres,

Check your rules in the URL filters.

Roannel
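
For readers who want to test their rules offline, here is a minimal Python sketch of how a regex URL filter file is commonly evaluated: each rule line starts with `+` (accept) or `-` (reject), the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. The rules and URLs below are hypothetical examples, not the poster's configuration, and the helper names `parse_rules` and `accepts` are made up for illustration.

```python
import re

# Hypothetical rules written in the style of a regex URL filter file.
# They are illustrative assumptions, not taken from this thread.
RULES_TEXT = r"""
# reject URLs containing characters that usually mark queries
-[?*!@=]
# reject a few image suffixes
-\.(gif|jpg|png)$
# accept anything else
+.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, regex = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(regex)))
    return rules

def accepts(url, rules):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules(RULES_TEXT)
for url in ("http://example.com/page.html",
            "http://example.com/view?id=7",
            "http://example.com/logo.png"):
    print(url, "->", "accepted" if accepts(url, rules) else "rejected")
```

Running a sample of the seed list through a check like this shows which rule, if any, is dropping URLs before they reach the fetcher.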

----- Original message -----
> From: "Andrés Rincón Pacheco" <ar...@gmail.com>
> To: user@nutch.apache.org
> Sent: Thursday, October 8, 2015 9:26:11
> Subject: [MASSMAIL]Nutch only fetch and parse the third part of urls
> 
> Hi,
> 
> I am using Nutch 1.9. After reviewing the URLs added by the Injector, the
> total number of URLs is 25146.
> (Log evidence)
> crawl.Injector - Injector: Total number of urls after normalization: 25146
> 
> When I checked the log file, only 7003 URLs had been fetched and 6727 URLs
> had been parsed.
> 
> And these are the statistics:
> 
> CrawlDb statistics start: ../crawlInfo/crawldb
> Statistics for CrawlDb: ../crawlInfo/crawldb
> TOTAL urls:     30914
> retry 0:        30913
> retry 1:        1
> min score:      0.0
> avg score:      0.4359605
> max score:      100.002
> status 1 (db_unfetched):        23912
> status 2 (db_fetched):  6727
> status 3 (db_gone):     8
> status 4 (db_redir_temp):       266
> status 5 (db_redir_perm):       1
> CrawlDb statistics: done
> 
> Why is only about a third of the URLs fetched and parsed?
> 
> Thanks.
> 
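
To put the quoted CrawlDb statistics in proportion, the following Python sketch parses the status lines and prints each status as a share of the total. The regular expression assumes the line layout shown in the message above; the function name `parse_status_counts` is a hypothetical helper.

```python
import re

# Status lines copied from the CrawlDb statistics quoted above.
STATS = """
status 1 (db_unfetched):        23912
status 2 (db_fetched):  6727
status 3 (db_gone):     8
status 4 (db_redir_temp):       266
status 5 (db_redir_perm):       1
"""

def parse_status_counts(text):
    """Return {status_name: count} for lines like 'status 2 (db_fetched):  6727'."""
    pattern = re.compile(r"status\s+\d+\s+\((\w+)\):\s+(\d+)")
    return {name: int(count) for name, count in pattern.findall(text)}

counts = parse_status_counts(STATS)
total = sum(counts.values())  # should equal the reported TOTAL urls
fetched_share = counts["db_fetched"] / total
unfetched_share = counts["db_unfetched"] / total

for name, count in counts.items():
    print(f"{name:15s} {count:6d} {count / total:7.2%}")
print(f"{'total':15s} {total:6d}")
```

The status counts add up to the reported total of 30914, and most of the entries in the CrawlDb are still `db_unfetched`.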

Re: [MASSMAIL]Nutch only fetch and parse the third part of urls

Posted by Andrés Rincón Pacheco <ar...@gmail.com>.
Hi Roannel,

After reviewing the URL filter configuration and the log, I found the
following evidence in the log file:

crawl.Injector - Injector: Total number of urls rejected by filters: 1413
crawl.Injector - Injector: Total number of urls after normalization: 25146

crawl.Generator - Generator: topN: 26559

So with these values it is not possible to infer that the problem is related
to the URL filter.

Is there any other solution for this problem?

Thanks for your help.
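
The reasoning in this message can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers quoted in the thread; the variable names are illustrative, and reading the sum as "URLs submitted to the Injector" is an assumption based on the two log lines.

```python
# Counts copied from the Injector and Generator log lines quoted above.
rejected_by_filters = 1413
after_normalization = 25146
generator_top_n = 26559
fetched = 6727  # db_fetched from the CrawlDb statistics in the first message

# Assumption: rejected + accepted is the number of URLs given to the Injector.
submitted = rejected_by_filters + after_normalization
rejected_share = rejected_by_filters / submitted
fetched_share = fetched / after_normalization

print(f"submitted to Injector: {submitted}")
print(f"equals topN:           {submitted == generator_top_n}")
print(f"rejected by filters:   {rejected_share:.1%}")
print(f"fetched of injected:   {fetched_share:.1%}")
```

Under that assumption, topN equals the number of submitted URLs, and the filters rejected only about 5% of them at injection time, which is far too few to account for the URLs that were never fetched.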


