Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2017/05/11 20:30:34 UTC

Nutch not indexing all seed URLs

I'm using Nutch 1.12 to index a local site. To keep Nutch from indexing the uninteresting navigation pages on my site, I've made a list of all the URLs I want crawled; the current list has 2522 URLs. However, the indexer stopped after just 1077 of these URLs. My generate.max.count is set to -1. What would cause my URLs to be skipped?

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740-3840  USA
Tel: +1 301-209-3180
Email: ccalhoun@aip.org
https://www.aip.org/history-programs/niels-bohr-library

Re: [MASSMAIL]Nutch not indexing all seed URLs

Posted by Yongyao Jiang <j....@gmail.com>.

Hi Chip,

Another possible reason is that some websites state in their robots.txt that
crawlers are not allowed to access them. I had the same problem before.
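
One way to test this for a given path is to compare it against the Disallow lines of the site's robots.txt. The sketch below is only a rough approximation and is not how Nutch itself evaluates robots.txt: the function name robots_blocks is hypothetical, and it looks only at the "User-agent: *" group, ignoring Allow rules, wildcards, and agent-specific groups.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# $1 = path to a locally saved robots.txt, $2 = URL path to test.
# Prints "blocked" if the path starts with a Disallow prefix listed
# under "User-agent: *", otherwise prints "allowed".
robots_blocks() {
  awk -v path="$2" '
    # Strip a trailing carriage return and any comment on every line.
    { sub(/\r$/, ""); sub(/#.*/, "") }
    # Track whether the current group applies to all crawlers.
    tolower($1) == "user-agent:" { in_star = ($2 == "*"); next }
    # An empty Disallow value blocks nothing, so it is skipped.
    in_star && tolower($1) == "disallow:" && $2 != "" {
      if (index(path, $2) == 1) blocked = 1
    }
    END { print (blocked ? "blocked" : "allowed") }
  ' "$1"
}
```

For example, with the site's robots.txt saved locally, `robots_blocks robots.txt /some/page.html` prints either blocked or allowed for that path.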

Yongyao

On Fri, May 12, 2017 at 10:28 AM, Chip Calhoun <cc...@aip.org> wrote:

> Thank you. The problem was right below that; I had the default
> "timeLimitFetch=180", and it stopped after 3 hours. I'll bump that up to
> something ridiculous and try again.
>
> Chip


-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University

RE: [MASSMAIL]Nutch not indexing all seed URLs

Posted by Chip Calhoun <cc...@aip.org>.

Thank you. The problem was right below that; I had the default "timeLimitFetch=180" (the value is in minutes), so the fetch stopped after 3 hours. I'll bump that up to something ridiculous and try again.
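
For a rough sense of how much larger the limit needs to be, the figures in this thread (1077 of 2522 URLs fetched within the 180-minute default) can be extrapolated. This is a minimal sketch that assumes the fetch rate stays roughly constant, which was not measured here; every variable name other than timeLimitFetch is illustrative.

```shell
#!/bin/sh
# Figures taken from this thread.
timeLimitFetch=180   # minutes, the default mentioned above
fetched=1077         # URLs fetched before the time limit was reached
total=2522           # URLs in the seed list

# Assumption: constant fetch rate. Integer arithmetic, rounded up.
needed=$(( (total * timeLimitFetch + fetched - 1) / fetched ))
echo "estimated minutes needed: $needed"
```

With these numbers the estimate comes to 422 minutes, so any limit comfortably above 7 hours should cover the whole list if the rate holds.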

Chip

-----Original Message-----
From: Eyeris Rodriguez Rueda [mailto:erueda@uci.cu] 
Sent: Thursday, May 11, 2017 4:46 PM
To: user@nutch.apache.org
Subject: Re: [MASSMAIL]Nutch not indexing all seed URLs

Hi.
One possible cause:
Have you looked at the topN (fetch list size) parameter in the bin/crawl script (line 117)? It is set by sizeFetchlist=`expr $numSlaves \* 50`, and this number could limit your URL list.

Also check your filters.

Tell me if you have solved the problem


Re: [MASSMAIL]Nutch not indexing all seed URLs

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Hi.
One possible cause:
Have you looked at the topN (fetch list size) parameter in the bin/crawl script (line 117)?
sizeFetchlist=`expr $numSlaves \* 50`
This number could limit your URL list.
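
To see how that cap interacts with a seed list of the size in the original question, the arithmetic can be worked through directly. This is a sketch under stated assumptions: numSlaves=1 (a single-node setup) is assumed, the seeds and rounds variables are illustrative, and the multiplier of 50 is simply the value quoted above.

```shell
#!/bin/sh
numSlaves=1                             # assumed: single-node setup
sizeFetchlist=`expr $numSlaves \* 50`   # the line quoted above
seeds=2522                              # seed list size from the original question

# Rounds needed if each generate/fetch round takes at most
# sizeFetchlist URLs (integer arithmetic, rounded up).
rounds=$(( (seeds + sizeFetchlist - 1) / sizeFetchlist ))
echo "fetch list size per round: $sizeFetchlist"
echo "rounds needed for $seeds seeds: $rounds"
```

Under these assumptions each round takes at most 50 URLs, so covering all 2522 seeds would take 51 rounds.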

Also check your filters.

Let me know whether that solves the problem.