Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2012/12/03 20:24:43 UTC

hung threads in big nutch crawl process

Hi all.
I have noticed that in a big Nutch crawl (depth: 10, topN: 100,000) some threads hang at certain points of the crawl cycle, for example while normalizing URLs by regex, and also while fetching URLs.
I'm using Nutch 1.5.1 and Solr 3.6.
RAM: 2 GB
CPU: Core i3
OS: Ubuntu 12.04 (server)

I have a question: how does Nutch manage threads during a crawl cycle?
Are the generate, fetch, and parse steps multithreaded?

P.S. Sorry for my English; it is not my native language.

10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

Re: hung threads in big nutch crawl process

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Thanks, Markus.
I will try it step by step.

RE: hung threads in big nutch crawl process

Posted by Markus Jelsma <ma...@openindex.io>.

This page explains the individual steps:
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
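
For readers following the tutorial link above, here is a minimal sketch of one crawl cycle run as separate, timed steps, so that a slow or hung job is easy to spot. It is written in Python purely for illustration. The `crawl` and `urls` directories, the topN value, and the Solr URL come from the crawl command posted in this thread. The helper names (`cycle_commands`, `newest_segment`, `run_cycle`) are hypothetical, and the argument order of each `bin/nutch` command is an assumption based on the Nutch 1.x tutorial, so check it against the usage message your version prints.

```python
import os
import subprocess
import time

# Placeholder for the segment path; it is only known after `generate` has run.
SEGMENT = "{segment}"


def cycle_commands(crawl_dir="crawl", urls_dir="urls", top_n=100000,
                   solr_url="http://localhost:8080/solr", nutch="bin/nutch"):
    """Return (step name, command line) pairs for one crawl cycle, in order.

    Argument order is assumed from the Nutch 1.x tutorial; verify it locally.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    return [
        ("inject", [nutch, "inject", crawldb, urls_dir]),
        ("generate", [nutch, "generate", crawldb, segments, "-topN", str(top_n)]),
        ("fetch", [nutch, "fetch", SEGMENT]),
        ("parse", [nutch, "parse", SEGMENT]),
        ("updatedb", [nutch, "updatedb", crawldb, SEGMENT]),
        ("invertlinks", [nutch, "invertlinks", linkdb, "-dir", segments]),
        ("solrindex", [nutch, "solrindex", solr_url, crawldb,
                       "-linkdb", linkdb, SEGMENT]),
    ]


def newest_segment(crawl_dir):
    """Segment directories are named by timestamp, so the last one sorted is newest."""
    segments_dir = os.path.join(crawl_dir, "segments")
    return os.path.join(segments_dir, sorted(os.listdir(segments_dir))[-1])


def run_cycle(crawl_dir="crawl", **options):
    """Run each step as its own process and print how long it took."""
    segment = None
    for name, command in cycle_commands(crawl_dir=crawl_dir, **options):
        if SEGMENT in command:
            if segment is None:
                segment = newest_segment(crawl_dir)
            command = [segment if arg == SEGMENT else arg for arg in command]
        started = time.monotonic()
        subprocess.run(command, check=True)  # stops at the first failing step
        print(f"{name}: {time.monotonic() - started:.1f} s")
```

Called from the directory that contains `bin/nutch`, `run_cycle()` performs a single pass. The one-shot `crawl` command with `-depth 10` roughly corresponds to repeating the generate, fetch, parse, and updatedb steps ten times before inverting links and indexing.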
 
 
RE: hung threads in big nutch crawl process

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.

Thanks, Markus, for your answer.
I have always used Nutch from the console, running a complete cycle:
bin/nutch crawl urls -dir crawl -depth 10 -topN 100000 -solr http://localhost:8080/solr
Could you explain how to run the steps as separate processes? I was reading the wiki, but it did not work for me because I don't understand the commands. I also want to use Nutch in distributed mode; could you point me to good documentation for it?

_____________________________________________________________________
Ing. Eyeris Rodriguez Rueda
Phone: 837-3370
Universidad de las Ciencias Informáticas
_____________________________________________________________________


RE: hung threads in big nutch crawl process

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - Hadoop manages some threads of its own, but within Nutch the only job that uses threads is the fetcher. Parsing is done through an executor service.

It is quite possible that some of your regexes are very complex and that Nutch takes a long time to process them, especially if you parse in the fetcher job.

You should run the Nutch jobs separately to find out which job is giving you trouble.
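
To illustrate how a complex regex can make a thread look hung, here is a small sketch that times a pattern with nested quantifiers against a simpler equivalent. It uses Python's `re` module as a stand-in; like Java's `java.util.regex`, it is a backtracking engine, but the exact timings will differ. Both patterns are made-up examples and are not taken from any Nutch configuration. In a Nutch 1.x install the patterns worth reviewing would normally be in `conf/regex-urlfilter.txt` and `conf/regex-normalize.xml`.

```python
import re
import time

# Illustrative patterns only (assumptions, not from any Nutch configuration).
SLOW_PATTERN = r"^(a+)+$"  # nested quantifiers: exponential backtracking on a near-miss
FAST_PATTERN = r"^a+$"     # accepts the same strings without nested quantifiers


def time_match(pattern, text):
    """Return (matched, seconds) for one anchored match attempt."""
    compiled = re.compile(pattern)
    started = time.perf_counter()
    matched = compiled.match(text) is not None
    return matched, time.perf_counter() - started


def demo(lengths=(14, 16, 18, 20)):
    """Print how match time grows with input length for both patterns."""
    for n in lengths:
        # The trailing "!" makes the input almost match, which forces the
        # nested pattern to try every way of splitting the run of "a"s.
        text = "a" * n + "!"
        _, slow = time_match(SLOW_PATTERN, text)
        _, fast = time_match(FAST_PATTERN, text)
        print(f"n={n}: nested={slow:.4f} s  flat={fast:.6f} s")


if __name__ == "__main__":
    demo()
```

The time for the nested pattern should grow sharply with each extra pair of characters while the flat pattern stays fast. A URL that nearly matches such a pattern can keep one thread busy long enough to look like a hang.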
