
Posted to java-user@lucene.apache.org by Fred Toth <ft...@synernet.com> on 2004/09/23 04:42:55 UTC

demo HTML parser question

Hi,

I've been working with the HTML parser demo that comes with
Lucene and I'm trying to understand why it's multi-threaded,
and, more importantly, how to exit gracefully on errors.

I've discovered that if I throw an exception in the front-end static
code (main(), etc.), the JVM hangs instead of exiting. Presumably
this is because there are threads hanging around doing something.
But I'm not sure what!

Any pointers? I just want to exit gracefully on an error such as
a missing required meta tag or similar.

Thanks,

Fred
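
One plausible explanation for the hang, sketched under assumptions: if the parser thread is a non-daemon thread blocked writing to a pipe that nobody reads any more, the JVM will not exit when main() dies. The class and method names below (DaemonPipeDemo, startBlockedProducer) are hypothetical and are not part of the Lucene demo; the sketch only shows how a daemon flag changes the exit behaviour.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.UncheckedIOException;

class DaemonPipeDemo {
    /**
     * Starts a thread that writes more characters than the pipe can buffer.
     * Nobody ever reads from the pipe, so the write blocks indefinitely,
     * much like a parser thread whose consumer has gone away.
     */
    static Thread startBlockedProducer(boolean daemon) {
        try {
            PipedWriter out = new PipedWriter(new PipedReader());
            Thread producer = new Thread(() -> {
                try {
                    // Larger than the pipe's default buffer, so this call blocks.
                    out.write(new char[4096]);
                } catch (IOException e) {
                    // The pipe was closed or broken; let the thread end.
                }
            }, "parser-thread");
            // A daemon thread does not keep the JVM alive on its own.
            producer.setDaemon(daemon);
            producer.start();
            return producer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With 'true' the JVM exits after the exception below;
        // with 'false' the blocked producer would keep it alive.
        startBlockedProducer(true);
        throw new IllegalStateException("required meta tag is missing");
    }
}
```

Another option, if the thread creation cannot be changed, is to catch the error at the top of main() and call System.exit() with a non-zero status.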


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: demo HTML parser question

Posted by ro...@xemaps.com.

On Thu, 23 Sep 2004 10:53:26 -0700, Doug Cutting wrote
> roy-lucene-user@xemaps.com wrote:
> > We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
> > you know, it's for a demo.  I think it's threaded to optimize on time, to allow
> > the calling thread to grab the title or top message even though it's not done
> > parsing the entire HTML document.
> 
> That's almost right.  I originally wrote it that way to avoid having 
> to ever buffer the entire text of the document.  The document is 
> indexed while it is parsed.  But, as observed, this has lots of 
> problems and was probably a bad idea.
> 
> Could someone provide a patch that removes the multi-threading?  
> We'd simply use a StringBuffer in HTMLParser.jj to collect the text. 
>  Calls to pipeOut.write() would be replaced with text.append().  
> Then have the HTMLParser's constructor parse the page before 
> returning, rather than spawn a thread, and getReader() would return 
> a StringReader.  The public API of HTMLParser need not change at all 
> and lots of complex threading code would be thrown away.  Anyone 
> interested in coding this?

While we're on the subject...

When using the HTMLParser I tend to get a lot of token manager errors that
basically kill the thread (usually unexpected EOF).  Even if we were to remove
the multi-threading of the HTMLParser, these token manager errors would pretty
much kill the calling app, since they are thrown as an Error rather than an
Exception.  Any idea how to get around this?

Perhaps this question really belongs on the javacc list?

Roy.
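
One way to keep such an error from taking down the caller is to catch it at the parse boundary and rethrow it as an ordinary exception. The sketch below is only an illustration: DemoTokenError is a hypothetical stand-in for the JavaCC-generated TokenMgrError (which extends java.lang.Error), and parse() is a toy that fails on an unterminated tag rather than a real JavaCC parser.

```java
import java.io.IOException;

class TokenErrorBoundary {
    /** Hypothetical stand-in for a generated TokenMgrError; like it, this extends Error. */
    static class DemoTokenError extends Error {
        DemoTokenError(String message) {
            super(message);
        }
    }

    /** Toy parse step (an assumption): strips tags, fails on an unterminated one. */
    static String parse(String html) {
        int open = html.lastIndexOf('<');
        if (open >= 0 && html.indexOf('>', open) < 0) {
            throw new DemoTokenError("Unexpected EOF inside tag at offset " + open);
        }
        return html.replaceAll("<[^>]*>", "");
    }

    /** Converts the Error into a checked exception that callers can handle normally. */
    static String parseSafely(String html) throws IOException {
        try {
            return parse(html);
        } catch (DemoTokenError e) {
            throw new IOException("Malformed HTML: " + e.getMessage(), e);
        }
    }
}
```

Catching the specific error type, rather than Error or Throwable in general, leaves genuinely fatal conditions such as OutOfMemoryError untouched.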


Re: demo HTML parser question

Posted by Doug Cutting <cu...@apache.org>.

roy-lucene-user@xemaps.com wrote:
> We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
> you know, it's for a demo.  I think it's threaded to optimize on time, to allow
> the calling thread to grab the title or top message even though it's not done
> parsing the entire HTML document.

That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?

Doug
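
The single-threaded design described above could look roughly like the following. This is a minimal sketch, not the actual patch: EagerHtmlText and extract() are hypothetical names, the crude tag-skipping loop stands in for the grammar in HTMLParser.jj, and a StringBuilder is used in place of the StringBuffer mentioned since no other thread touches it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EagerHtmlText {
    // Collects the document text; replaces the pipe between two threads.
    private final StringBuilder text = new StringBuilder();

    /** Parses the whole page before returning, instead of spawning a thread. */
    EagerHtmlText(Reader html) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = html.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append((char) c); // where pipeOut.write() used to be
            }
        }
    }

    /** Same kind of accessor as before, now backed by the buffered text. */
    Reader getReader() {
        return new StringReader(text.toString());
    }

    /** Convenience helper (hypothetical) that reads the result back into a String. */
    static String extract(String html) {
        try {
            Reader r = new EagerHtmlText(new StringReader(html)).getReader();
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is the one noted in the message: the entire text is held in memory, but any parse failure surfaces in the constructor on the calling thread.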


Re: demo HTML parser question

Posted by ro...@xemaps.com.

Hi Fred,

We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize on time, to allow
the calling thread to grab the title or top message even though it's not done
parsing the entire HTML document.  That's just a guess; I would love to hear
from others about this.  Anyway, since it is a separate thread, a token error
could kill it and there is no way for the calling thread to know about it.

We had to create our own HTML parser since we only cared about grabbing the
entire text from the HTML document and also we wanted to avoid the extra
thread.  We also do a lot of "SKIP"ping for minimal EOF errors (HTML documents
in email almost never follow standards).  For your HTML needs, you might want
to check out other JavaCC HTML parsers from the JavaCC web site.

Roy.
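
On the point above that a token error can kill the parser thread without the caller noticing: if a separate thread is kept, one general way to surface the failure is to run the work through a Future, which captures anything thrown (including an Error) and hands it back to the caller. This is a sketch of that general technique, not of the demo's code; WorkerFailureDemo and failureOf are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkerFailureDemo {
    /**
     * Runs the task on a separate thread and returns whatever it threw,
     * or null if it completed normally.
     */
    static Throwable failureOf(Callable<String> task) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = pool.submit(task);
            result.get(); // blocks until the worker finishes or fails
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the Throwable raised on the worker thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Simulates a token error on the worker thread.
        Throwable t = failureOf(() -> { throw new Error("Unexpected EOF"); });
        System.out.println("worker failed with: " + t);
    }
}
```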

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
> Hi,
> 
> I've been working with the HTML parser demo that comes with
> Lucene and I'm trying to understand why it's multi-threaded,
> and, more importantly, how to exit gracefully on errors.
> 
> I've discovered that if I throw an exception in the front-end static
> code (main(), etc.), the JVM hangs instead of exiting. Presumably
> this is because there are threads hanging around doing something.
> But I'm not sure what!
> 
> Any pointers? I just want to exit gracefully on an error such as
> a missing required meta tag or similar.
> 
> Thanks,
> 
> Fred




