Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/09/01 22:33:25 UTC

Nutch 1.1 Crawl is slow, hangs and aborts eventually

All,

 

I am crawling a site that is heavy in RTF, TXT and PDF documents, in
addition to pages that embed a lot of images. I am using Nutch 1.1,
running on Windows 7. I am seeing the following errors in my Hadoop
logs.

 

 

2010-09-01 15:01:26,509 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/html, but they are not mapped to it  in the parse-plugins.xml
file

2010-09-01 15:01:38,969 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.pdf.PdfParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file

2010-09-01 15:12:56,444 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.text.TextParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/plain, but they are not mapped to it  in the parse-plugins.xml
file

2010-09-01 15:13:09,611 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
system property, and all claim to support the content type
application/x-tika-msoffice, but they are not mapped to it  in the
parse-plugins.xml file

 

I am using the basic crawl command with a depth of 4, and during the
crawl Nutch seems to hang at different places for a long time,
eventually aborting with an "Aborted with N hung threads" message. For
example, in one hang it sat on the last line below ("activeThreads=0")
for a long time (more than 5 minutes, I think) before taking off again.
After fetching for some more time it started to hang again, eventually
aborting with the "Aborted with 9 hung threads" message.

 

fetching http://abc.xyz.com/research/briefing_books/20

-finishing thread FetcherThread, activeThreads=7

-finishing thread FetcherThread, activeThreads=8

-activeThreads=9, spinWaiting=3, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=9

-finishing thread FetcherThread, activeThreads=5

-finishing thread FetcherThread, activeThreads=6

-finishing thread FetcherThread, activeThreads=4

-finishing thread FetcherThread, activeThreads=3

-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=2

-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=1

-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0

-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=0

-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0

-activeThreads=0

 

My understanding is that Tika is supposed to handle all MIME types, so
I am not sure why these errors are coming up. I have also seen the
"Aborted with N hung threads" message when the crawl depth is
increased. During a hang my CPU is at 100% (suggesting a tight loop or
something similar).

 

My plugin.includes is as follows:

 

<property>
  <name>plugin.includes</name>
  <value>subcollection|protocol-http|urlfilter-regex|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-tika</value>
</property>

 

Can you all please advise? I am not sure where to go from here. I have
read about the timeout.patch that Andrzej Bialecki implemented, which
may address the above issue. Is that true?

Also, how can I apply this patch if it does fix my issue? I am running
Nutch on Windows 7, so I am not sure what to do with the .patch file.

 

I appreciate your help

 

Thanks

Raj

 


Trying to apply timeout.patch on 1.1 source

Posted by "Nemani, Raj" <Ra...@turner.com>.
As part of the problem in my earlier post ("Nutch 1.1 Crawl is slow,
hangs and aborts eventually" -- I have posted this already and would
appreciate any help), I am trying to apply timeout.patch using
patch.exe (from Unix Utils) on Windows 7 64-bit.
Both patch.exe and timeout.patch are in the top-level folder of the 1.1
source tree (i.e., the folder that contains the conf, src, lib and site
folders, build.xml, etc.).

Here is the command I am using, redirecting the output to result.txt:

C:\temp\PatchFilestest\apache-nutch-1.1>patch -cl -p1 < timeout.patch > result.txt

I am getting the following weird error

patch: **** Only garbage was found in the patch input.

Has anybody seen this? Can anybody please shed more light on this
error, or on what I am doing wrong?
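For what it's worth, "Only garbage" usually means patch could not
recognize the diff format at all -- commonly because the file has
Windows (CRLF) line endings, or because -c forces context-diff mode on
a unified diff. A small self-contained sketch (all file names here are
made up for the demo) of stripping carriage returns and letting patch
auto-detect the format:

```shell
# Build a toy unified diff, strip any carriage returns (as a Windows
# download may need), and apply it with -p1. All names are demo-only.
mkdir -p a b
printf 'hello\n' > a/greeting.txt
printf 'goodbye\n' > b/greeting.txt
diff -u a/greeting.txt b/greeting.txt > demo.patch || true  # diff exits 1 when files differ
tr -d '\r' < demo.patch > demo.unix.patch            # remove CRLF line endings
(cd a && patch -p1 --dry-run < ../demo.unix.patch)   # verify it would apply cleanly
(cd a && patch -p1 < ../demo.unix.patch)             # a/greeting.txt now says "goodbye"
```

On the real timeout.patch, the equivalent would be dropping -cl (so
patch auto-detects unified vs. context format) and running
tr -d '\r' over the patch file first; whether -p0 or -p1 is right
depends on the paths recorded inside the patch.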

Thanks
Raj



RE: Nutch 1.1 Crawl is slow, hangs and aborts eventually

Posted by "Nemani, Raj" <Ra...@turner.com>.
Somehow my previous message did not include the body.

Thank you, and thanks Volli, for the help.

I have already downloaded branch 1.2. I assumed that the timeout is
always set to 30 seconds, but after reading your comment I noticed that
it is read from the config. Can you please tell me where I should set
this value? Is it in nutch-site.xml?

I do have http.content.limit set to -1. Do you recommend not setting it
to -1?
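For reference, a minimal sketch of what both settings could look like
in conf/nutch-site.xml. The property name parser.timeout is assumed
from the 1.2 defaults (verify against nutch-default.xml in your
checkout), and the 1 MB content limit is only an example value:

```xml
<!-- Sketch for conf/nutch-site.xml; names and values assumed, not
     verified against your copy of nutch-default.xml. -->
<property>
  <name>parser.timeout</name>
  <!-- seconds a single document may spend in the parser; -1 disables -->
  <value>30</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- example 1 MB cap instead of -1 (unlimited) -->
  <value>1048576</value>
</property>
```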

Also, I have been using the generic crawl command. Would it be better
if I used a generate/fetch (with noParsing)/parse/update script?

Also, with 1.2 I noticed that the subcollection field is not getting
into the index. I also noticed that in 1.2 there is a field for it in
the schema, as opposed to adding the field through code. How does this
work in 1.2?

Thanks for your help again
Raj

Re: Nutch 1.1 Crawl is slow, hangs and aborts eventually

Posted by Julien Nioche <li...@gmail.com>.
> Concerning hung up threads. I had this messages whenever I changed the
> value of property "http.content.limit" to "-1".
>

Yep - that's related to the parsing hanging. It has been discussed on
the list quite a lot; Andrzej has provided a patch which has been
included in branch 1.2.


>
> Concerning "not mapped to tika": Change your "parse-tika" to
> "parse-(text|html|tika|pdf)"
>

These messages can be ignored. The Tika parser is not mapped to a
specific MIME type, unlike the other parsers; that's all.


>
> I would try "parse-(text|html|tika|pdf|msexcel|msword|msexcel)", too.
>

Tika being the default parser, it will be tried before the other
parsers specified for a MIME type, so this is not going to change much.

Just give 1.2 a try and set the parsing timeout to a reasonable value.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com



Re: Nutch 1.1 Crawl is slow, hangs and aborts eventually

Posted by Volli <il...@web.de>.
Concerning hung threads: I had these messages whenever I changed the
value of the property "http.content.limit" to "-1".

Concerning "not mapped to tika": change your "parse-tika" to
"parse-(text|html|tika|pdf)".

I would try
"parse-(text|html|tika|pdf|msexcel|msword|msexcel)", too.
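Alternatively, the "not mapped" messages can be silenced by mapping
those content types to the Tika parser explicitly in
conf/parse-plugins.xml. A sketch, with the element layout assumed from
the stock file (check the aliases section of your own parse-plugins.xml
for the exact extension id):

```xml
<!-- Sketch for conf/parse-plugins.xml; structure assumed from the
     stock file. Each logged content type gets an explicit mapping. -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="text/plain">
  <plugin id="parse-tika" />
</mimeType>
<!-- parse-tika must also appear in the <aliases> section, pointing at
     the Tika parser's extension id as declared in its plugin.xml. -->
```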

