Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/09/02 15:43:18 UTC

Trying to apply timeout.patch on 1.1 source

As part of the following problem (which I have already posted and would
appreciate any help with), I am trying to apply timeout.patch using
patch.exe (from Unix Utils) on 64-bit Windows 7.
Both patch.exe and timeout.patch are in the top-level folder of the 1.1
source tree (i.e., the folder that contains the conf, src, lib, and site
folders, build.xml, etc.).

Here is the command I am using, with the output redirected to
result.txt:


C:\temp\PatchFilestest\apache-nutch-1.1>patch -cl -p1 < timeout.patch > result.txt

I am getting the following weird error:

patch: **** Only garbage was found in the patch input.

Has anybody seen this?  Can anybody please shed more light on this
error or on what I am doing wrong?
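
One way to narrow this down is to look at what the patch file actually
contains. If patch.exe behaves like GNU patch (an assumption), the -c
option asks it to read the input as a context diff, so a unified diff
may be rejected; a file saved as an HTML page or with Windows line
endings is another possible cause. The Python sketch below is only a
diagnostic aid under those assumptions, not a fix. The function name
describe_patch and the sample diff are hypothetical.

```python
def describe_patch(data: bytes) -> dict:
    """Report a few basic facts about the raw bytes of a patch file."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()

    # Unified diffs carry "@@ ... @@" hunk headers; context diffs
    # separate hunks with a line of asterisks.
    if any(line.startswith("@@ ") for line in lines):
        fmt = "unified"
    elif any(line.startswith("***************") for line in lines):
        fmt = "context"
    else:
        fmt = "unknown"

    return {
        "format": fmt,
        # Windows (CRLF) line endings anywhere in the file.
        "crlf": b"\r\n" in data,
        # A UTF-16 byte-order mark, e.g. after saving from some editors.
        "utf16_bom": data[:2] in (b"\xff\xfe", b"\xfe\xff"),
        # The file is a saved web page rather than the raw patch.
        "html": text.lstrip().lower().startswith(("<!doctype", "<html")),
    }


# Hypothetical sample input, only to show the shape of the result.
sample = (
    b"--- a/example.txt\n"
    b"+++ b/example.txt\n"
    b"@@ -1 +1 @@\n"
    b"-old\n"
    b"+new\n"
)
print(describe_patch(sample))
```

To check the real file, read it in binary mode, for example
describe_patch(open("timeout.patch", "rb").read()). If the format comes
back as "unified", dropping -c from the command may be worth trying.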

Thanks
Raj


-----Original Message-----
From: Nemani, Raj [mailto:Raj.Nemani@turner.com] 
Sent: Wednesday, September 01, 2010 4:33 PM
To: user@nutch.apache.org
Subject: Nutch 1.1 Crawl is slow,hangs and aborts eventually

All,

 

I am crawling a site that is heavy in rtf, txt and pdf documents in
addition to pages that embed a lot of images. I am using Nutch 1.1 and
running on Windows 7.  I am seeing the following errors in my hadoop
logs.  

 

 

2010-09-01 15:01:26,509 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/html, but they are not mapped to it  in the parse-plugins.xml
file

2010-09-01 15:01:38,969 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.pdf.PdfParser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/pdf, but they are not mapped to it  in the
parse-plugins.xml file

2010-09-01 15:12:56,444 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.text.TextParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/plain, but they are not mapped to it  in the parse-plugins.xml
file

2010-09-01 15:13:09,611 INFO  parse.ParserFactory - The parsing plugins:
[org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
system property, and all claim to support the content type
application/x-tika-msoffice, but they are not mapped to it  in the
parse-plugins.xml file

 

 I am using the basic Crawl command with a depth of 4. During the crawl,
Nutch seems to hang at different places for a long time, eventually
aborting with an "Aborted with 9 (or some n) number of threads" message.
For example, in one hang it sat on the last line below
("-activeThreads=0") for a long time (more than 5 minutes, I think)
before taking off again.  After fetching for some more time it started
to hang again, eventually aborting with the "Aborted with 9 hung
threads" message.

 

fetching http://abc.xyz.com/research/briefing_books/20

-finishing thread FetcherThread, activeThreads=7

-finishing thread FetcherThread, activeThreads=8

-activeThreads=9, spinWaiting=3, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=9

-finishing thread FetcherThread, activeThreads=5

-finishing thread FetcherThread, activeThreads=6

-finishing thread FetcherThread, activeThreads=4

-finishing thread FetcherThread, activeThreads=3

-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=2

-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=1

-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0

-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0

-finishing thread FetcherThread, activeThreads=0

-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0

-activeThreads=0

 

My understanding is that Tika is supposed to handle all MIME types, so I
am not sure why the errors are coming up.  I have also seen messages
like "Aborted with 9 (or some n) number of threads" when the crawl depth
is increased.  During the hang my CPU sits at 100% (suggesting a tight
loop or something similar).

 

My plugin.includes is as follows:

 

<property>

<name>plugin.includes</name>

<value>subcollection|protocol-http|urlfilter-regex|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-tika</value>

</property>
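
To see which plugin ids this value would select, the expression can be
tested outside Nutch. The Python sketch below assumes that Nutch matches
each whole plugin id against the plugin.includes regular expression, and
that the ids parse-html, parse-pdf and parse-text correspond to the
parsers named in the log messages above; both points are assumptions,
and the helper name is_included is hypothetical.

```python
import re

# The plugin.includes value from the property above, joined into one string.
PLUGIN_INCLUDES = (
    "subcollection|protocol-http|urlfilter-regex|index-(basic|anchor)|"
    "query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)|parse-tika"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(pattern, plugin_id) is not None


# Assumed plugin ids, used here only for illustration.
for plugin_id in ("parse-tika", "parse-html", "parse-pdf", "parse-text"):
    print(plugin_id, is_included(plugin_id))
```

Under these assumptions only parse-tika matches. If the log still
reports the HTML, PDF and text parsers as enabled, it may be worth
checking whether a different plugin.includes value (for example from
another configuration file) is the one actually in effect.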

 

Can you all please advise? I am not sure where to go from here.  I have
read that the timeout.patch Andrzej Bialecki implemented may address the
above issue.  Is that true?

Also, how can I apply this patch if it does fix my issue?  I am running
Nutch on Windows 7, so I am not sure what to do with the .patch file.

 

I appreciate your help

 

Thanks

Raj