Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/04 14:24:07 UTC

Parser hangs

Hi,

Another large crawl seems to lead to problems, this time in the parser. I've 
added logging to the parser so I can follow its progress; it outputs the key 
of the document it's processing.

It now seems to hang. The process continues to use CPU time (it fluctuates 
normally) and I can confirm that the document in question is parsable, both 
with ParserChecker and with a complete crawl cycle of that one URL.

I don't know whether the parse job is finishing up, as I can't see it, but this 
is the last output in the log:

2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing: http://<HOST>
2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default

As you can see, it has already been doing `nothing` for 45 minutes. What is it 
doing? Will it ever finish?

Thanks
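As a side note on diagnosing this kind of apparent hang: a thread dump shows where the CPU time is going. Besides running `jstack <pid>` externally, the same information is available from inside the JVM via `Thread.getAllStackTraces()`. A minimal, self-contained sketch (not part of Nutch; the class name is illustrative):

```java
import java.util.Map;

/** Illustrative helper (not part of Nutch): dumps the stack of every live
 *  thread, similar to what the external jstack tool prints. */
public class ThreadDumper {

    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print a jstack-like dump of the current JVM.
        System.out.print(dumpAllThreads());
    }
}
```

Taking two or three dumps a few seconds apart and comparing them shows whether the "hung" threads are actually moving, which is exactly what is done later in this thread.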

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.
I only have very small production crawls running on Hadoop. This large-scale 
test is in the process of migrating to a Hadoop cluster. I'll keep an eye on 
your comments about the reducer once the migration has completed.

Thanks for explaining.



On Monday 04 July 2011 16:28:40 Julien Nioche wrote:
> > On Monday 04 July 2011 15:52:36 Julien Nioche wrote:
> > > no problem. Like most Hadoop jobs the output of the mapper is written, then
> > > there is the shuffle etc... finally it goes through the reducer
> > > (ParseSegment l137) - mostly IO bound
> > 
> > I see. The log line is written in the mapper so it's the reduce phase
> > that takes ages to complete. I didn't see much IO-wait though. IO was
> > very little when compared to the total run time of the reduce phase.
> 
> the reducer itself does very little but the time could be spent
> deserializing when the objects are read to be sent to the reducer- in which
> case it would be CPU bound
> 
> > Any advice on how to provide log output to show progress there? It seems
> > parser suffers from the same problem as the fetcher since both reducers
> > take a lot of time.
> 
> That's not something that I've experienced and I'm surprised that the
> reduce step takes that long.
> Again the Hadoop webapps are the best way of monitoring a crawl + they also
> add loads of status info (# docs per Mimetype, errors, etc...). IMHO
> running Nutch in local mode is only useful for testing / debugging /
> running very small crawls
> 
> > > You can check the status of the job on the Hadoop webapps, assuming
> > > that you're running Nutch in (pseudo) distributed mode of course which
> > > is preferable for large crawls
> > 
> > This large-scale test runs locally atm. Hadoop has been set up but hasn't
> > been migrated yet.
> > 
> > > On 4 July 2011 14:29, Markus Jelsma <ma...@openindex.io> wrote:
> > > > Julien, and others,
> > > > 
> > > > This was a wild goose chase! The parser just now finished. In this case i
> > > > rephrase the question: what is it doing after all docs have been parsed?
> > > > The entire parse took less than whatever it was doing after it parsed the
> > > > last document.
> > > > 
> > > > Thanks!
> > > > 
> > > > (Sorry Julien ;)
> > > > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser hangs

Posted by Julien Nioche <li...@gmail.com>.
>
>
> On Monday 04 July 2011 15:52:36 Julien Nioche wrote:
> > no problem. Like most Hadoop jobs the output of the mapper is written,
> then
> > there is the shuffle etc... finally it goes through the reducer
> > (ParseSegment l137) - mostly IO bound
>
> I see. The log line is written in the mapper so it's the reduce phase that
> takes ages to complete. I didn't see much IO-wait though. IO was very little
> when compared to the total run time of the reduce phase.
>

The reducer itself does very little, but the time could be spent deserializing
the objects as they are read to be sent to the reducer, in which case it would
be CPU-bound.
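For context on that point: a Hadoop reducer's input values are decoded one by one from the sorted map output, so even a near-no-op reducer pays the full decoding cost. A rough stand-alone illustration of that read path using plain java.io (illustrative code only, not Hadoop's actual Writable machinery; the record layout is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Serializes n small key/value records, then decodes every one of them,
 *  mimicking the read path that feeds a reducer: each record is fully
 *  deserialized no matter how little the reducer itself does with it. */
public class DeserCost {

    public static byte[] write(int n) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (int i = 0; i < n; i++) {
                out.writeUTF("http://example.com/doc" + i); // record key (hypothetical URL)
                out.writeInt(i);                            // record value
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with in-memory streams
        }
    }

    public static int readAll(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int count = 0;
            while (in.available() > 0) {
                in.readUTF();  // CPU spent here, per record...
                in.readInt();  // ...before the reducer sees anything
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Decode 100k records; with large segments, this is where time can go
        // even when the reduce function itself is trivial.
        System.out.println(readAll(write(100000)));
    }
}
```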


>
> Any advice on how to provide log output to show progress there? It seems
> parser suffers from the same problem as the fetcher since both reducers
> take a lot of time.
>

That's not something that I've experienced and I'm surprised that the reduce
step takes that long.
Again, the Hadoop webapps are the best way of monitoring a crawl; they also
add loads of status info (# docs per MIME type, errors, etc.). IMHO running
Nutch in local mode is only useful for testing / debugging / running very
small crawls.



>
> >
> > You can check the status of the job on the Hadoop webapps, assuming that
> > you're running Nutch in (pseudo) distributed mode of course which is
> > preferable for large crawls
>
> This large-scale test runs locally atm. Hadoop has been set up but hasn't
> been migrated yet.
>
> >
> > On 4 July 2011 14:29, Markus Jelsma <ma...@openindex.io> wrote:
> > > Julien, and others,
> > >
> > > This was a wild goose chase! The parser just now finished. In this case i
> > > rephrase the question: what is it doing after all docs have been parsed?
> > > The entire parse took less than whatever it was doing after it parsed the
> > > last document.
> > >
> > > Thanks!
> > >
> > > (Sorry Julien ;)
> > >



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 04 July 2011 15:52:36 Julien Nioche wrote:
> no problem. Like most Hadoop jobs the output of the mapper is written, then
> there is the shuffle etc... finally it goes through the reducer
> (ParseSegment l137) - mostly IO bound

I see. The log line is written in the mapper, so it's the reduce phase that 
takes ages to complete. I didn't see much IO-wait though; IO was very little 
compared to the total run time of the reduce phase.

Any advice on how to provide log output to show progress there? It seems the 
parser suffers from the same problem as the fetcher, since both reducers take 
a lot of time.

> 
> You can check the status of the job on the Hadoop webapps, assuming that
> you're running Nutch in (pseudo) distributed mode of course which is
> preferable for large crawls

This large-scale test runs locally at the moment. Hadoop has been set up but 
hasn't been migrated yet.

> 
> On 4 July 2011 14:29, Markus Jelsma <ma...@openindex.io> wrote:
> > Julien, and others,
> > 
> > This was a wild goose chase! The parser just now finished. In this case i
> > rephrase the question: what is it doing after all docs have been parsed?
> > The
> > entire parse took less than whatever it was doing after it parsed the
> > last document.
> > 
> > Thanks!
> > 
> > (Sorry Julien ;)
> > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser hangs

Posted by Julien Nioche <li...@gmail.com>.
No problem. Like most Hadoop jobs, the output of the mapper is written, then
there is the shuffle, etc.; finally it goes through the reducer
(ParseSegment l137), which is mostly IO-bound.

You can check the status of the job on the Hadoop webapps, assuming that
you're running Nutch in (pseudo-)distributed mode of course, which is
preferable for large crawls.



On 4 July 2011 14:29, Markus Jelsma <ma...@openindex.io> wrote:

> Julien, and others,
>
> This was a wild goose chase! The parser just now finished. In this case i
> rephrase the question: what is it doing after all docs have been parsed?
> The
> entire parse took less than whatever it was doing after it parsed the last
> document.
>
> Thanks!
>
> (Sorry Julien ;)
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.
Julien, and others,

This was a wild goose chase! The parser just now finished. In that case I 
rephrase the question: what was it doing after all docs had been parsed? The 
entire parse took less time than whatever it was doing after it parsed the 
last document.

Thanks!

(Sorry Julien ;)


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.
None of these. All of these URLs work fine with ParserChecker. I've also tried 
several more that are not in the snippet below; all parse well, as does the 
PDF, though it's slow.

On Monday 04 July 2011 15:21:53 Julien Nioche wrote:
> Which is the one that loops with the ParserChecker?
> 
> On 4 July 2011 14:18, Markus Jelsma <ma...@openindex.io> wrote:
> > These are the last few lines of the currently running parse job:
> > 
> > 2011-07-04 11:43:15,450 INFO  parse.ParseSegment - Parsing:
> > http://www.elseviergezondheidszorg.nl/1068128/Stappenplan-Zorgvisie-
> > Opleidingwijzer.pdf
> > 2011-07-04 11:43:16,173 INFO  parse.ParseSegment - Parsing:
> > http://www.elseviergezondheidszorg.nl/1128911/Aanmelden-nieuwsbrief.html
> > 2011-07-04 11:43:16,316 INFO  parse.ParseSegment - Parsing:
> > http://www.elsevieropleidingen.nl/applicaties/alfabetische-opleidinglijst.aspx
> > 2011-07-04 11:43:16,324 INFO  parse.ParseSegment - Parsing:
> > http://www.elsgulpen.nl
> > 2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing:
> > http://www.elshaarzaak.nl/
> > 2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for
> > scope 'outlink', using default
> > 2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for
> > scope 'fetcher', using default
> > 
> > I see no text file, all HTML and one PDF. The elshaarzaak.nl is confirmed to
> > parse nicely in a small test crawl on another machine using same Nutch
> > 1.4-dev version and config.
> > 
> > On Monday 04 July 2011 15:13:10 Julien Nioche wrote:
> > > Only the last one is likely to correspond to that document as the first
> > > 2 are for a .txt document.
> > > 
> > > Can you tell me what the URL is so that I can check whether the issue
> > > is reproducible?
> > > 
> > > Thanks
> > > 
> > > > > try calling jstack to see where it is stuck?
> > > > 
> > > > I've obtained a thread dump but need some assistance on how to
> > > > interpret it. It is actually doing something, as some threads' traces
> > > > change between jstack calls.
> > > > 
> > > > 
> > > > These three threads change. Note the calls to Tika. I'm no longer
> > > > sure what it's processing now; I'm only sure the last log line
> > > > `Parsing: URL` is for a plain old HTML page.
> > > > 
> > > > Thanks
> > > > 
> > > > 
> > > > "Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable [0x00007ff77b5f4000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > > >        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > > >        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
> > > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
> > > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > and
> > > > 
> > > > "Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable [0x00007ff77b7f5000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
> > > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > and
> > > > 
> > > > "Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable [0x00007ff780c14000]
> > > >   java.lang.Thread.State: RUNNABLE
> > > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > > >        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
> > > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > > >        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
> > > >        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> > > >        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
> > > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> > > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> > > >        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> > > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > > >        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > > >        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > > >        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > > >        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > > >        at java.lang.Thread.run(Thread.java:662)
> > > > 
> > > > Thanks
> > > > 
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
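The traces quoted above all bottom out in StringBuffer.append via Arrays.copyOf, i.e. the parser is busy growing a character buffer while appending DOM text. A tiny stand-alone illustration (not Nutch code; names are made up, and StringBuilder is used here, StringBuffer behaves the same but synchronized) of that growth pattern, and how pre-sizing avoids the repeated copies:

```java
/** Illustrative only (not Nutch code): appending character data chunk by
 *  chunk, as the DOM builders in the traces above do. Each time the backing
 *  array fills up, expandCapacity() allocates a larger array and
 *  Arrays.copyOf copies everything accumulated so far -- the frames seen in
 *  the thread dumps. Pre-sizing the buffer avoids those copies. */
public class BufferGrowth {

    static String build(int chunks, boolean presize) {
        // 7 characters are appended per iteration ("chunk-" plus one digit).
        StringBuilder sb = presize ? new StringBuilder(chunks * 7) : new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append("chunk-").append(i % 10);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same result either way; only the number of internal array copies
        // (and hence the CPU time in Arrays.copyOf) differs.
        String grown = build(100000, false);
        String sized = build(100000, true);
        System.out.println(grown.length() + " " + grown.equals(sized));
    }
}
```

This is why a thread can look "stuck" yet stay RUNNABLE: it is copying ever-larger arrays, which is pure CPU work with no IO to wait on.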

Re: Parser hangs

Posted by Julien Nioche <li...@gmail.com>.
Which is the one that loops with the ParserChecker?

On 4 July 2011 14:18, Markus Jelsma <ma...@openindex.io> wrote:

> These are the last few lines of the currently running parse job:
>
> 2011-07-04 11:43:15,450 INFO  parse.ParseSegment - Parsing:
> http://www.elseviergezondheidszorg.nl/1068128/Stappenplan-Zorgvisie-
> Opleidingwijzer.pdf
> 2011-07-04 11:43:16,173 INFO  parse.ParseSegment - Parsing:
> http://www.elseviergezondheidszorg.nl/1128911/Aanmelden-nieuwsbrief.html
> 2011-07-04 11:43:16,316 INFO  parse.ParseSegment - Parsing:
>
> http://www.elsevieropleidingen.nl/applicaties/alfabetische-opleidinglijst.aspx
> 2011-07-04 11:43:16,324 INFO  parse.ParseSegment - Parsing:
> http://www.elsgulpen.nl
> 2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing:
> http://www.elshaarzaak.nl/
> 2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for
> scope 'outlink', using default
> 2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for
> scope 'fetcher', using default
>
> I see no text file, all HTML and one PDF. The elshaarzaak.nl is confirmed to
> parse nicely in a small test crawl on another machine using same Nutch
> 1.4-dev version and config.
>
>
>
> On Monday 04 July 2011 15:13:10 Julien Nioche wrote:
> > Only the last one is likely to correspond to that document as the first 2
> > are for a .txt document.
> >
> > Can you tell me what the URL is so that I can check whether the issue is
> > reproducible?
> >
> > Thanks
> >
> > > > try calling jstack to see where it is stuck?
> > >
> > > I've obtained a thread dump but need some assistance on how to
> > > interpret it. It is actually doing something, as some threads' traces
> > > change between jstack calls.
> > >
> > >
> > > These three threads change. Note the calls to Tika. I'm no longer sure
> > > what it's processing now; I'm only sure the last log line `Parsing: URL`
> > > is for a plain old HTML page.
> > >
> > > Thanks
> > >
> > >
> > > "Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable [0x00007ff77b5f4000]
> > >   java.lang.Thread.State: RUNNABLE
> > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > >        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
> > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > >        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
> > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
> > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > >        at java.lang.Thread.run(Thread.java:662)
> > >
> > > and
> > >
> > > "Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable [0x00007ff77b7f5000]
> > >   java.lang.Thread.State: RUNNABLE
> > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> > >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> > >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> > >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> > >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
> > >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> > >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> > >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> > >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > >        at java.lang.Thread.run(Thread.java:662)
> > >
> > > and
> > >
> > > "Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable [0x00007ff780c14000]
> > >   java.lang.Thread.State: RUNNABLE
> > >        at java.util.Arrays.copyOf(Arrays.java:2882)
> > >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> > >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> > >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> > >        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
> > >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> > >        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
> > >        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> > >        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
> > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> > >        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> > >        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> > >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> > >        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> > >        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> > >        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> > >        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > >        at java.lang.Thread.run(Thread.java:662)
> > >
> > > Thanks
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.
These are the last few lines of the currently running parse job:

2011-07-04 11:43:15,450 INFO  parse.ParseSegment - Parsing: 
http://www.elseviergezondheidszorg.nl/1068128/Stappenplan-Zorgvisie-
Opleidingwijzer.pdf
2011-07-04 11:43:16,173 INFO  parse.ParseSegment - Parsing: 
http://www.elseviergezondheidszorg.nl/1128911/Aanmelden-nieuwsbrief.html
2011-07-04 11:43:16,316 INFO  parse.ParseSegment - Parsing: 
http://www.elsevieropleidingen.nl/applicaties/alfabetische-opleidinglijst.aspx
2011-07-04 11:43:16,324 INFO  parse.ParseSegment - Parsing: 
http://www.elsgulpen.nl
2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing: 
http://www.elshaarzaak.nl/
2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules for 
scope 'outlink', using default
2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules for 
scope 'fetcher', using default

I see no text file, just HTML and one PDF. The elshaarzaak.nl URL is confirmed to 
parse nicely in a small test crawl on another machine using the same Nutch 1.4-dev 
version and config.



On Monday 04 July 2011 15:13:10 Julien Nioche wrote:
> Only the last one is likely to correspond to that document as the first 2
> are for a .txt document.
> 
> Can you tell me what the URL is so that I can check whether the issue is
> reproducible?
> 
> Thanks
> 
> > > try calling jstack to see where it is stuck?
> > 
> > I've obtained a thread dump but need some assistance on how to
> > interpret it. It's actually doing something, as some threads' traces
> > change between jstack calls.
> > 
> > 
> > These three threads change. Note the calls to Tika. I'm no longer sure
> > what it's processing now. I'm only sure the last log line `Parsing: URL`
> > is a plain old HTML page.
> > 
> > Thanks
> > 
> > 
> > "Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable
> > [0x00007ff77b5f4000]
> >   java.lang.Thread.State: RUNNABLE
> >        at java.util.Arrays.copyOf(Arrays.java:2882)
> >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> >        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
> >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> >        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
> >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
> >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at java.lang.Thread.run(Thread.java:662)
> >
> > and
> >
> > "Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable
> > [0x00007ff77b7f5000]
> >   java.lang.Thread.State: RUNNABLE
> >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> >        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
> >        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> >        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> >        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> >        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
> >        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> >        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> >        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
> >        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at java.lang.Thread.run(Thread.java:662)
> >
> > and
> >
> > "Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable
> > [0x00007ff780c14000]
> >   java.lang.Thread.State: RUNNABLE
> >        at java.util.Arrays.copyOf(Arrays.java:2882)
> >        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
> >        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
> >        at java.lang.StringBuffer.append(StringBuffer.java:224)
> >        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
> >        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> >        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
> >        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> >        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
> >        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> >        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> >        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> >        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> >        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> >        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
> >        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
> >        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at java.lang.Thread.run(Thread.java:662)
> > 
> > Thanks
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser hangs

Posted by Julien Nioche <li...@gmail.com>.
Only the last one is likely to correspond to that document as the first 2
are for a .txt document.

Can you tell me what the URL is so that I can check whether the issue is
reproducible?

Thanks




> >
> > try calling jstack to see where it is stuck?
>
> I've obtained a thread dump but need some assistance on how to interpret
> it. It's actually doing something, as some threads' traces change between
> jstack calls.
>
>
> These three threads change. Note the calls to Tika. I'm no longer sure what
> it's processing now. I'm only sure the last log line `Parsing: URL` is a
> plain old HTML page.
>
> Thanks
>
>
> "Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable
> [0x00007ff77b5f4000]
>   java.lang.Thread.State: RUNNABLE
>        at java.util.Arrays.copyOf(Arrays.java:2882)
>        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>        at java.lang.StringBuffer.append(StringBuffer.java:224)
>        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
>        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
>        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
>        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
>        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
>        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
>        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
>        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
>        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.lang.Thread.run(Thread.java:662)
>
> and
>
> "Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable
> [0x00007ff77b7f5000]
>   java.lang.Thread.State: RUNNABLE
>        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
>        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
>        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
>        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
>        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
>        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
>        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
>        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
>        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.lang.Thread.run(Thread.java:662)
>
> and
>
> "Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable
> [0x00007ff780c14000]
>   java.lang.Thread.State: RUNNABLE
>        at java.util.Arrays.copyOf(Arrays.java:2882)
>        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>        at java.lang.StringBuffer.append(StringBuffer.java:224)
>        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
>        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
>        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
>        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
>        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
>        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
>        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
>        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
>        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
>        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
>        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
>        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
>        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
>        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.lang.Thread.run(Thread.java:662)
>
> Thanks
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Parser hangs

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Julien,

On Monday 04 July 2011 14:40:58 Julien Nioche wrote:
> Markus,
> 
> What is the mime-type of the document? Can you reproduce the problem with
> Tika directly?

Plain HTML, handled by parse-html.

> I assume you've set parser.timeout to a value > 0

You assume correctly; I use the default of 30.
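For reference, the property lives in conf/nutch-site.xml (overriding nutch-default.xml); check your checkout's nutch-default.xml for the exact wording, but it should look roughly like this:

```xml
<!-- Abandon a single document's parse after this many seconds.
     30 is the shipped default; -1 should disable the timeout
     (verify against your version's nutch-default.xml). -->
<property>
  <name>parser.timeout</name>
  <value>30</value>
</property>
```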

> 
> try calling jstack to see where it is stuck?

I've obtained a thread dump but need some assistance on how to interpret 
it. It's actually doing something, as some threads' traces change between 
jstack calls. 


These three threads change. Note the calls to Tika. I'm no longer sure what 
it's processing now. I'm only sure the last log line `Parsing: URL` is a plain 
old HTML page.

Thanks


"Thread-91065" prio=10 tid=0x00007ff788146000 nid=0x2a30 runnable [0x00007ff77b5f4000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        - locked <0x00000000dc200000> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
        at org.apache.tika.sax.SafeContentHandler.writeReplacement(SafeContentHandler.java:143)
        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:105)
        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

and

"Thread-91016" prio=10 tid=0x00007ff788ad2800 nid=0x2952 runnable [0x00007ff77b7f5000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.tika.DOMBuilder.characters(DOMBuilder.java:405)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:101)
        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:132)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

and

"Thread-57923" prio=10 tid=0x00000000422ef000 nid=0x1fbe runnable [0x00007ff780c14000]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        - locked <0x00000000f5e57810> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:147)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)

Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser hangs

Posted by Julien Nioche <li...@gmail.com>.
Markus,

Another large crawl seems to lead to problems, this time the parser. I've
> added logging to the parser so I can follow its progress; it outputs the
> key of the document it's processing.
>

What is the mime-type of the document? Can you reproduce the problem with
Tika directly?


>
> It now seems to hang. The process continues to use CPU time (it fluctuates
> normally) and I can confirm that the document in question is parsable. Both
> with ParserChecker and a complete crawl cycle of that one URL.
>

I assume you've set parser.timeout to a value > 0
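
The mechanism behind parser.timeout can be sketched in plain Java. This is an illustrative stand-in, not Nutch's actual ParseUtil code, but the pattern is the same one visible in the traces (ParseCallable running inside a FutureTask): the caller waits on the Future with a timeout, and timing out merely abandons the task, so a runaway parse thread stays RUNNABLE and keeps burning CPU.

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a parse guarded by a timeout, as in Nutch's ParseCallable
// pattern. Note what happens on timeout: the Future is abandoned, but
// the worker thread itself is never killed.
public class ParseTimeoutSketch {
    public static void main(String[] args) throws Exception {
        FutureTask<String> parse = new FutureTask<>(() -> {
            // Stand-in for a parse that never returns.
            long spins = 0;
            while (!Thread.currentThread().isInterrupted()) {
                spins++;
            }
            return "parsed after " + spins + " spins";
        });
        Thread worker = new Thread(parse, "parse-worker");
        worker.setDaemon(true); // let the JVM exit despite the runaway thread
        worker.start();
        try {
            // Analogue of parser.timeout (1 second here instead of 30).
            System.out.println(parse.get(1, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("parse abandoned; worker still alive: " + worker.isAlive());
        }
    }
}
```

Running this prints "parse abandoned; worker still alive: true", which would explain a parse job that has "timed out" yet still eats CPU.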


>
> I don't know if the parse job is finishing up as I can't see it, but this is
> the last output of the log:
>
> 2011-07-04 11:43:16,328 INFO  parse.ParseSegment - Parsing: http://<HOST>
> 2011-07-04 11:44:53,197 WARN  regex.RegexURLNormalizer - can't find rules
> for
> scope 'outlink', using default
> 2011-07-04 11:45:02,877 WARN  regex.RegexURLNormalizer - can't find rules
> for
> scope 'fetcher', using default
>
> As you can see, it has already been doing `nothing` for 45 minutes. What is
> it doing? Will it ever finish?
>
>
try calling jstack to see where it is stuck?
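
If attaching jstack to the task JVM is awkward (e.g. short-lived Hadoop child processes), roughly the same snapshot can be taken from inside the JVM with the standard ThreadMXBean API. A minimal sketch, purely illustrative (the class name is made up):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Programmatic equivalent of a jstack snapshot, e.g. to log periodically
// from inside a long-running task.
public class StackSnapshot {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true => include locked monitor / synchronizer info where supported
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("\tat " + frame);
            }
            System.out.println();
        }
    }
}
```

Comparing two such snapshots a few seconds apart shows whether a thread is truly stuck or just looping, same as with repeated jstack calls.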




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com