Posted to user@nutch.apache.org by Andrés Rincón Pacheco <ar...@gmail.com> on 2015/10/08 15:26:11 UTC

Nutch only fetch and parse the third part of urls

Hi,

I am using Nutch 1.9. After reviewing the URLs added by the Injector, the
total number of URLs is 25146.
(Log evidence)
crawl.Injector - Injector: Total number of urls after normalization: 25146

When I checked the log file, only 7003 URLs had been fetched and 6727 URLs
had been parsed.

And these are the statistics:

CrawlDb statistics start: ../crawlInfo/crawldb
Statistics for CrawlDb: ../crawlInfo/crawldb
TOTAL urls:     30914
retry 0:        30913
retry 1:        1
min score:      0.0
avg score:      0.4359605
max score:      100.002
status 1 (db_unfetched):        23912
status 2 (db_fetched):  6727
status 3 (db_gone):     8
status 4 (db_redir_temp):       266
status 5 (db_redir_perm):       1
CrawlDb statistics: done

Why is only about a third of the URLs fetched and parsed?

Thanks.
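
The figures quoted above can be cross-checked with a little shell arithmetic. This is only a sketch: every number is copied from the message, and the variable names are made up for illustration. It confirms that the per-status counts add up to the reported total and shows what share of the URLs was actually fetched. The CrawlDb total is larger than the injected count, presumably because links discovered during the crawl were added to it (an assumption, not something the logs shown here state).

```shell
#!/bin/sh
# Status counts copied from the CrawlDb statistics in the message above.
unfetched=23912
fetched=6727
gone=8
redir_temp=266
redir_perm=1
total=30914
injected=25146   # "Total number of urls after normalization" from the Injector log

# The per-status counts should add up to "TOTAL urls".
sum=$((unfetched + fetched + gone + redir_temp + redir_perm))
echo "sum of statuses: $sum (reported total: $total)"

# Integer percentages (rounded down) of fetched URLs.
pct_of_total=$((fetched * 100 / total))
pct_of_injected=$((fetched * 100 / injected))
echo "fetched: ${pct_of_total}% of the CrawlDb, ${pct_of_injected}% of the injected URLs"
```

By either measure the fetched share is closer to a quarter than to a third.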

Re: [MASSMAIL]Nutch only fetch and parse the third part of urls

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Hello Andrés.
This could have many causes. Please share your log with us so we can see the details. I suggest checking your URL normalizer, because it can skip problematic URLs. Also check the lines below in your nutch script and increase the parameter (I have 1000), because it is the total number of URLs fetched in every round of the crawl.
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 1000`

Tell me if this helps you.
Greetings. 
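
To see why the per-round limit matters, here is a rough sketch of the arithmetic, assuming (as the reply describes) that sizeFetchlist caps the number of URLs fetched in one round. The value numSlaves=1 and the names to_fetch and rounds are assumptions for illustration; only the factor 1000 and the URL count come from the thread.

```shell
#!/bin/sh
# Assumed for illustration: a single-node setup.
numSlaves=1
# Same result as the `expr $numSlaves \* 1000` line in the script.
sizeFetchlist=$((numSlaves * 1000))

# URLs injected, taken from the original message.
to_fetch=25146

# Ceiling division: rounds needed if every round fetched a full list.
rounds=$(( (to_fetch + sizeFetchlist - 1) / sizeFetchlist ))
echo "at most $sizeFetchlist URLs per round: at least $rounds rounds for $to_fetch URLs"
```

Under these assumptions, a crawl that stops after fewer rounds leaves the remaining URLs in the db_unfetched state.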





Re: [MASSMAIL]Nutch only fetch and parse the third part of urls

Posted by Andrés Rincón Pacheco <ar...@gmail.com>.
Hi Roannel,

After reviewing the URL filter configuration and the log, I found the
following evidence in the log file:

crawl.Injector - Injector: Total number of urls rejected by filters: 1413
crawl.Injector - Injector: Total number of urls after normalization: 25146

crawl.Generator - Generator: topN: 26559

So with these values it is not possible to conclude that the problem is
related to the URL filters.

Is there any other solution to this problem?

Thanks for your help.
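
The reasoning in this message can be checked with a short sketch. The three numbers come from the log lines quoted above; the variable names are illustrative. One reading, stated here as an assumption, is that the rejected and accepted URLs together make up the whole seed list, which happens to equal the topN value.

```shell
#!/bin/sh
# Values copied from the Injector and Generator log lines above.
rejected=1413
normalized=25146
topN=26559

# Rejected plus accepted URLs gives the number of seeds processed.
seeds=$((rejected + normalized))
echo "seeds processed: $seeds (topN: $topN)"

# Share of seeds dropped by the filters, rounded down to a whole percent.
pct_rejected=$((rejected * 100 / seeds))
echo "rejected by filters: ${pct_rejected}% of the seeds"
```

The filters removed only a small share of the seeds at injection time, far too few to account for the roughly 24000 URLs left unfetched.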



2015-10-09 9:34 GMT-05:00 Roannel Fernández Hernández <ro...@uci.cu>:

> Hi Andres,
>
> Check your rules in the URL filters.
>
> Roannel

Re: [MASSMAIL]Nutch only fetch and parse the third part of urls

Posted by Roannel Fernández Hernández <ro...@uci.cu>.
Hi Andres,

Check your rules in the URL filters.

Roannel
