Posted to user@nutch.apache.org by consultas <co...@qualidade.eng.br> on 2009/04/01 21:47:36 UTC

Nutch 1.0 experience

Hi,

I have been using Nutch for some years now. I run it under Cygwin on Windows XP, with 2 GB of memory and a nominal bandwidth of 6 Megs, on a single server, with about 300,000 pages, for a vertical semi-production engine. I use 60 threads, using the crawl method for the initial crawl and the whole-web method afterwards. Until the last release, during the fetching phase I had on my screen a steadily rolling list of the pages being indexed. Everything worked quite smoothly almost 100% of the time.

Then I tried the new version and got some strange output on the screen, like the excerpt below, and unfortunately at a turtle-like speed:

fetch of http://www.greenpeace.org/brasil/transgenicos/noticias/text/javascript failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetch of http://www.greenpeace.org/international/press/reports/nuclear-waste-crisis-france failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
Unable to resolve: www.fishunlimited.org, skipping.
fetching http://www.forests.org/archived_site/today/recent/1997/forfadef_files/filelist.xml
fetching http://www.rpi.edu/news/podcasts.html
fetching http://www.news24.com/Beeld/Gallery/Home/0,,,00.html
fetching http://www.epo.org/
-activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0
fetching http://vcforum.eagle.org/banning.cfm
fetching http://cdn.socialtwist.com/2009022511095/script.js
fetching http://www.lrqa.com.br/treinamento/
-activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0
fetching http://www.processingtalk.com/news/eme/eme416.html
fetching http://www.sciencedaily.com/releases/2009/03/090324111600.htm
fetching http://www.asnt-glas.org/meetings.htm
-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
fetching http://www.embrapa.gov.br/destaques_imagem/brasil-visto-do-espaco
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetching http://www.uscg.mil/comdt/blog/2009/01
fetching http://www1.eere.energy.gov/inventions/energytechnet/includes/opera/5
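
One rough way to quantify the slowdown in an excerpt like the one above is to summarize the fetcher status lines. The following is a minimal sketch in Python, not part of Nutch; the function name `summarize_status` and the sample list are hypothetical, and it assumes that `spinWaiting` counts threads that are idle while waiting for a URL to fetch.

```python
import re

# Matches fetcher status lines such as:
#   -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
STATUS_RE = re.compile(
    r"-activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def summarize_status(lines):
    """Return (matching status lines, mean fraction of threads spin-waiting).

    Assumption: spinWaiting / activeThreads approximates the idle share.
    Status lines that report zero active threads are skipped.
    """
    fractions = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue
        active = int(match.group(1))
        waiting = int(match.group(2))
        if active > 0:
            fractions.append(waiting / active)
    if not fractions:
        return 0, 0.0
    return len(fractions), sum(fractions) / len(fractions)


# Hypothetical sample built from lines in the excerpt.
sample = [
    "-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0",
    "fetching http://www.epo.org/",
    "-activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0",
]
count, mean_idle = summarize_status(sample)
print(f"{count} status lines, {mean_idle:.0%} of threads idle on average")
```

If most threads are spin-waiting while `fetchQueues.totalSize` stays at 0, the threads have little to work on, which would be consistent with the slow crawl described here.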

More than this, very often the fetch is aborted with 60 hung threads, and when it does succeed, it seems that the `topN` option is sometimes not respected, with fewer pages fetched than intended. (I am not absolutely sure about this, but I have a very strong feeling, considering the size of the resulting segment.)

So I am relating my own experience, as a simple user of Nutch, hoping that the problems I faced can be corrected so that I can use Nutch 1.0, which is not feasible now.
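
The `topN` suspicion can be checked more directly than by segment size, by counting the URLs announced in the fetch log and comparing that with the requested value. This is a minimal sketch in Python under stated assumptions: the helper names `count_fetched` and `shortfall` are hypothetical, the log is assumed to be available as a list of lines, and a `fetching <url>` line only shows that a fetch was attempted, not that it succeeded.

```python
PREFIX = "fetching "


def count_fetched(lines):
    """Count distinct URLs announced by 'fetching <url>' log lines."""
    urls = set()
    for line in lines:
        line = line.strip()
        # 'fetch of <url> failed ...' lines do not start with PREFIX.
        if line.startswith(PREFIX):
            urls.add(line[len(PREFIX):].strip())
    return len(urls)


def shortfall(lines, top_n):
    """Return how many fewer URLs were attempted than top_n asked for."""
    return max(0, top_n - count_fetched(lines))


# Hypothetical sample built from lines in the excerpt.
sample = [
    "fetching http://www.rpi.edu/news/podcasts.html",
    "-activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0",
    "fetching http://www.epo.org/",
    "Unable to resolve: www.fishunlimited.org, skipping.",
    "fetching http://www.epo.org/",
]
print(count_fetched(sample), shortfall(sample, 5))
```

A non-zero shortfall on a full log would support the impression that fewer pages were fetched than requested, although it would not by itself show where the missing URLs were dropped.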

Thank you


Re: Nutch 1.0 experience

Posted by consultas <co...@qualidade.eng.br>.
Thank you, Dogacan, for your very prompt reply (I was truly amazed, thanks).

I would like to point out, however, that apart from the very slow behaviour
of the fetcher (it reminds me of when version 0.8 was launched), the fetch
phases seem to end with hung threads, and the fetcher also seems, sometimes,
not to respect the "topN" choice made. It may be that the fetcher 2 is
optimized for someone using several servers, but for a single server (and I
think this is a very large portion of Nutch users) it does not work very
well, at least in the couple of experiments I did. I remind you once again
that with the previous version (including some recent nightly builds)
everything worked quite well.

Thanks again for your attention. I really wish Nutch great success.








Re: Nutch 1.0 experience

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Apr 1, 2009 at 22:47, consultas <co...@qualidade.eng.br> wrote:

> Hi,
>
> I have been using Nutch for some years now. I run it under Cygwin on
> Windows XP, with 2 GB of memory and a nominal bandwidth of 6 Megs, on a
> single server, with about 300,000 pages, for a vertical semi-production
> engine. I use 60 threads, using the crawl method for the initial crawl and
> the whole-web method afterwards. Until the last release, during the
> fetching phase I had on my screen a steadily rolling list of the pages
> being indexed. Everything worked quite smoothly almost 100% of the time.
>
> Then I tried the new version and got some strange output on the screen,
> like the excerpt below, and unfortunately at a turtle-like speed:
>
> fetch of http://www.greenpeace.org/brasil/transgenicos/noticias/text/javascript failed with: java.net.SocketTimeoutException: Read timed out
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> fetch of http://www.greenpeace.org/international/press/reports/nuclear-waste-crisis-france failed with: java.net.SocketTimeoutException: Read timed out
> -activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
> Unable to resolve: www.fishunlimited.org, skipping.
> fetching http://www.forests.org/archived_site/today/recent/1997/forfadef_files/filelist.xml
> fetching http://www.rpi.edu/news/podcasts.html
> fetching http://www.news24.com/Beeld/Gallery/Home/0,,,00.html
> fetching http://www.epo.org/
> -activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0
> fetching http://vcforum.eagle.org/banning.cfm
> fetching http://cdn.socialtwist.com/2009022511095/script.js
> fetching http://www.lrqa.com.br/treinamento/
> -activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0
> fetching http://www.processingtalk.com/news/eme/eme416.html
> fetching http://www.sciencedaily.com/releases/2009/03/090324111600.htm
> fetching http://www.asnt-glas.org/meetings.htm
> -activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
> fetching http://www.embrapa.gov.br/destaques_imagem/brasil-visto-do-espaco
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> fetching http://www.uscg.mil/comdt/blog/2009/01
> fetching http://www1.eere.energy.gov/inventions/energytechnet/includes/opera/5
>
> More than this, very often the fetch is aborted with 60 hung threads, and
> when it does succeed, it seems that the `topN` option is sometimes not
> respected, with fewer pages fetched than intended. (I am not absolutely
> sure about this, but I have a very strong feeling, considering the size of
> the resulting segment.)
>
> So I am relating my own experience, as a simple user of Nutch, hoping that
> the problems I faced can be corrected so that I can use Nutch 1.0, which
> is not feasible now.
>

This log:

-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0

is no big deal. This is Nutch showing you information you probably
don't need :)

During Nutch 1.0 development, a new fetcher was developed and
it replaced the old fetcher, because the new fetcher has a better, more
flexible code base. However, you are not the first person to report
problems with it. You may find it useful to track this issue while this
is sorted out:

https://issues.apache.org/jira/browse/NUTCH-721


>
> Thank you
>
>


-- 
Doğacan Güney