Posted to user@nutch.apache.org by al...@aim.com on 2011/03/14 19:21:15 UTC

Re: nutch crawl command takes 98% of cpu

Hello,

Which version is this patch applicable to?

Thanks.
Alex.

 

 


 

 

-----Original Message-----
From: Alexis <al...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Tue, Feb 8, 2011 9:59 am
Subject: Re: nutch crawl command takes 98% of cpu


Hi,



Thanks for all the feedback. It looks like there is not much you can
do if you give the FLV parser corrupted data. From a practical point
of view, this is extremely annoying: the parse takes up all the CPU
and prevents other threads from doing their work until the timeout
occurs, kills the thread, and frees up the CPU.



This happens when an FLV file is truncated (because the
http.content.limit property is lower than the file's Content-Length,
in bytes). So the suggestion is to hint to the parser that it is
likely to get stuck, and to skip parsing when the downloaded content
size does not match the Content-Length header.
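
To make the suggestion concrete, here is a minimal sketch of such a
check. It is not the NUTCH-965 patch; the class and method names
(TruncationCheck, isTruncated) are hypothetical, and it assumes the
caller already has the downloaded bytes and the raw Content-Length
header value at hand.

```java
// Hypothetical helper for illustration; not taken from the NUTCH-965 patch.
final class TruncationCheck {

    /**
     * Returns true when the Content-Length header announces more bytes
     * than were actually downloaded, which suggests the content was cut
     * off (for example by http.content.limit).
     */
    static boolean isTruncated(byte[] content, String contentLengthHeader) {
        if (content == null || contentLengthHeader == null) {
            return false; // nothing to compare against: do not skip on a guess
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false; // unparsable header: treat as "not known to be truncated"
        }
        return declared > content.length;
    }

    public static void main(String[] args) {
        byte[] downloaded = new byte[65536];
        System.out.println(isTruncated(downloaded, "1048576")); // true
        System.out.println(isTruncated(downloaded, "65536"));   // false
    }
}
```

The sketch only flags the "fewer bytes than announced" case rather than
any mismatch, on the assumption that a body which was decompressed after
download can legitimately be larger than its Content-Length.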



Besides, I often see errors in the HTML parser when the content is
truncated (https://issues.apache.org/jira/browse/TIKA-307). So skipping
the parse in that case does no harm: it saves time and avoids errors.



I created the issue here: https://issues.apache.org/jira/browse/NUTCH-965

See attached patch.



Alexis.



On Mon, Feb 7, 2011 at 12:00 PM, Ken Krugler

<kk...@transpac.com> wrote:

> Hi Kirby & others,

>

> On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:

>

>> On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler

>> <kk...@transpac.com> wrote:

>>>

>>> Some comments below.

>>>

>>> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:

>>>

>>>> Hi,

>>>>

>>>> This shows the state of the various threads within a Java process. Most of
>>>> them seem to be busy parsing zip archives with Tika. The interesting part is
>>>> that the main thread is at the Generation step:

>>>>

>>>> *  at org.apache.nutch.crawl.Generator.generate(Generator.java:431)

>>>>  at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

>>>> *

>>>> with the "Thread-415331" normalizing the URLs as part of the generation.

>>>>

>>>> So why do we see threads busy parsing these archives? I think this is a
>>>> result of the timeout mechanism
>>>> (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>>>> Before it, we used to have the parsing step loop on a single document and
>>>> never complete. Thanks to Andrzej's patch, the parsing is done in separate
>>>> threads which are abandoned if more than X seconds have passed (default 30,
>>>> I think). Obviously these threads are still lurking around in the background
>>>> and consuming CPU.
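
As an illustration of why an abandoned parse thread keeps burning CPU,
here is a small sketch using java.util.concurrent. It is not Nutch's
actual timeout code; the class ParseTimeoutDemo, its methods, and the
"stuck parser" loop are hypothetical stand-ins. The point it tries to
show is that Future.cancel(true) only interrupts the worker, so a loop
that never checks the interrupt flag simply carries on.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not Nutch's actual implementation.
final class ParseTimeoutDemo {

    /** Runs the task and returns true if it ended before the timeout. */
    static boolean runWithTimeout(ExecutorService pool, Runnable task, long millis) {
        Future<?> future = pool.submit(task);
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // the task failed, but it did end in time
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; cannot force it to stop
            return false;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return false;
        }
    }

    /** Returns true if a loop that ignores interrupts is still running after cancel. */
    static boolean stuckTaskSurvivesCancel() {
        AtomicBoolean stop = new AtomicBoolean(false);
        AtomicBoolean running = new AtomicBoolean(false);
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parser");
            t.setDaemon(true); // lets the demo JVM exit regardless
            return t;
        });
        Runnable stuckParser = () -> {
            running.set(true);
            try {
                while (!stop.get()) { } // busy loop that never checks the interrupt flag
            } finally {
                running.set(false);
            }
        };
        boolean finished = runWithTimeout(pool, stuckParser, 200);
        boolean stillRunning = !finished && running.get();
        stop.set(true); // demo-only escape hatch; a hung parser has none
        pool.shutdown();
        return stillRunning;
    }

    public static void main(String[] args) {
        System.out.println("stuck parser still running after cancel: "
                + stuckTaskSurvivesCancel());
    }
}
```

In a one-shot command the JVM exit cleans such threads up; in a single
long-lived process nothing does, which matches the behaviour described
for the Crawl command.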

>>>>

>>>> This is an issue when calling the Crawl command only. When using the
>>>> separate commands for the various steps, the runaway threads die with the
>>>> main process; however, since the Crawl command uses a single process, these
>>>> timed-out threads keep going.

>>>>

>>>> I am not an expert in multithreading and don't have an idea of whether these
>>>> threads could be killed somehow. Andrzej, any clue?

>>>

>>> This is a fundamental problem with run-away threads - there is no safe,

>>> reliable way to kill them off.

>>>

>>> And if you parse enough documents, you will run into a number that currently
>>> cause Tika to hang. Zip files for sure, but we ran into the same issue with
>>> FLV files.

>>>

>>> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs

>>> parsers there. See https://issues.apache.org/jira/browse/TIKA-416

>>>

>>> -- Ken

>>>

>>

>> All,

>>

>>  Just an observation, but the general approach to this problem is to
>> use Thread.interrupt().  Virtually all code in the JDK treats the
>> thread being interrupted as a request to cancel.  Java Concurrency in
>> Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
>> any general purpose library code that swallows "InterruptedException"
>> and isn't implementing the thread cancellation policy has a bug in it
>> (the cancellation policy can only be implemented by the owner of the
>> thread; unless the library is a task/thread library, it cannot be
>> implementing the cancellation policy).  Any place you see:

>

> [snip]

>

>> One exception is that
>> socket read/write operations don't operate this way: the socket must
>> be closed to interrupt a read/write. The approach JCIP suggests is to
>> tie the socket and thread together in such a way that interrupt() closes
>> the sockets that would be reading/writing inside that thread.
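
A rough sketch of that idea, for readers who have not seen it: a thread
subclass that owns its socket and overrides interrupt() to close it.
The class name SocketReaderThread is hypothetical and the read loop
just discards the bytes; this follows the general pattern described
above rather than any code from Nutch or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Hypothetical example class: ties a socket to the thread that reads from it.
class SocketReaderThread extends Thread {
    private final Socket socket;

    SocketReaderThread(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void interrupt() {
        try {
            socket.close(); // unblocks a read() that is stuck in this thread
        } catch (IOException ignored) {
            // closing failed; still fall through to the normal interrupt
        } finally {
            super.interrupt();
        }
    }

    @Override
    public void run() {
        byte[] buf = new byte[8192];
        try (InputStream in = socket.getInputStream()) {
            while (in.read(buf) >= 0) {
                // a real reader would hand the bytes to a consumer here
            }
        } catch (IOException e) {
            // socket closed (possibly by interrupt()): let the thread exit
        }
    }
}
```

With this in place, whoever owns the thread can call interrupt() as
usual and a blocked read ends with an exception instead of waiting for
data that may never arrive.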

>

> Excellent input, as I need to solve some issues with needing to abort HTTP

> requests.

>

> [snip]

>

>> Not sure exactly what the problems inside of Tika are, but getting it

>> to respect interruption would be a wonderful thing for everybody that

>> uses it.  The problem might be getting all underlying libraries it

>> uses to do so.

>

> Yes, that's exactly the issue in the cases I've seen. The libraries used to
> do the actual parsing can get caught in loops when processing unexpected
> data. There are no checks for interrupts, e.g. it's code that is walking some
> data structure and doesn't realize that it's in a loop (e.g. the offset to the
> next chunk is set to zero, so the same chunk is endlessly reprocessed).
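
To illustrate the kind of loop described here, the sketch below walks
a made-up chunk format in which each chunk starts with a 4-byte offset
to the next chunk. The format and the ChunkWalker class are
assumptions for the example, not any real parser's code. It shows the
two guards that are typically missing: rejecting an offset that does
not advance, and checking the interrupt flag on every iteration.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.nio.ByteBuffer;

// Hypothetical chunk format: each chunk begins with a 4-byte offset to the next chunk.
final class ChunkWalker {

    /** Counts chunks, failing fast on corrupt offsets or on interruption. */
    static int countChunks(ByteBuffer data) throws IOException {
        int pos = 0;
        int chunks = 0;
        while (pos + 4 <= data.limit()) {
            // Cooperative cancellation: honour an interrupt from the owning thread.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("parse cancelled at offset " + pos);
            }
            int offsetToNext = data.getInt(pos);
            // An offset below 4 would not even skip this chunk's own header
            // (zero means the same chunk forever); one past the end is corrupt too.
            if (offsetToNext < 4 || offsetToNext > data.limit() - pos) {
                throw new IOException("corrupt chunk at " + pos
                        + ": offset to next is " + offsetToNext);
            }
            chunks++;
            pos += offsetToNext;
        }
        return chunks;
    }
}
```

Without the two checks, a zero offset leaves pos unchanged and the loop
spins until the process dies, which is the hang being discussed.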

>

> Occasionally we can get the underlying libraries to fix issues, but each new

> release has the potential for new and exciting hangs.

>

> That's why Jukka went down the admittedly hard-core and heavy-weight path of

> providing an option to run parses in a child JVM.

>

> If there's another solution, we'd love to hear about it :)

>

> Thanks,

>

> -- Ken

>

> --------------------------

> Ken Krugler

> +1 530-210-6378

> http://bixolabs.com

> e l a s t i c   w e b   m i n i n g
