
Posted to java-user@lucene.apache.org by Keith Gunn <kg...@csd.abdn.ac.uk> on 2002/08/14 18:46:52 UTC

problems with HTML Parser

Has anyone noticed that the HTML parser that comes with
Lucene joins terms together when parsing a file?
I used to think it was my PDFParser, but after fixing that
I found out it was the HTMLParser.

I managed to find a replacement parser that doesn't join terms.

Just wondered if anyone had come across this problem??
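
For anyone unsure what "joined terms" looks like in practice, here is a
minimal sketch of one way it can happen. It is only an illustration and
makes no claim about what Lucene's HTMLParser does internally: the class
TagStripDemo and its methods are hypothetical, and the regex-based tag
removal is an assumption chosen to keep the example short.

```java
import java.util.Arrays;
import java.util.List;

public class TagStripDemo {
    // Naive approach: delete tags outright, so text on either side of a tag is fused.
    static String stripNaive(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Safer approach: replace each tag with a space, then collapse whitespace runs.
    static String stripWithSpaces(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Split extracted text on whitespace, roughly as a simple tokenizer would.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String html = "<td>apple</td><td>banana</td>";
        System.out.println(tokens(stripNaive(html)));      // [applebanana]
        System.out.println(tokens(stripWithSpaces(html))); // [apple, banana]
    }
}
```

Replacing every tag with a space has its own cost: inline markup in the
middle of a word would split that word in two, so a real parser probably
needs to treat block-level and inline tags differently.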




--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: problems with HTML Parser

Posted by Ben Litchfield <be...@csh.rit.edu>.

Maurits,

You can get a PDF parser from http://www.pdfbox.org

-Ben


On Wed, 14 Aug 2002, Maurits van Wijland wrote:

> Keith,
>
> I haven't noticed the problem with the Parser...but you trigger me
> by saying that you have a PDFParser!!!
>
> Are you able to contribute this PDFParser??
>
> Maurits.



Re: problems with HTML Parser

Posted by Keith Gunn <kg...@csd.abdn.ac.uk>.

If you're parsing HTML files, have a look in Lucene
at the terms that were indexed and see if you can
spot any joined terms.
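
A rough way to automate that check, assuming you have already dumped the
indexed terms into a list: flag any term that is not a known word itself
but splits cleanly into two known words. The class JoinedTermCheck, the
method findJoined, and the small word list are all hypothetical, and the
heuristic will miss joins involving words outside the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class JoinedTermCheck {
    // Returns terms that are not known words but split into two known words.
    static List<String> findJoined(List<String> terms, Set<String> known) {
        List<String> suspects = new ArrayList<>();
        for (String term : terms) {
            if (known.contains(term)) {
                continue; // a legitimate word on its own
            }
            // Try every split point; one clean split is enough to flag the term.
            for (int i = 1; i < term.length(); i++) {
                if (known.contains(term.substring(0, i)) && known.contains(term.substring(i))) {
                    suspects.add(term);
                    break;
                }
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Hypothetical word list and term dump, for illustration only.
        Set<String> known = new HashSet<>(Arrays.asList("search", "engine", "index", "lucene"));
        List<String> terms = Arrays.asList("search", "searchengine", "lucene", "indexlucene", "parser");
        System.out.println(findJoined(terms, known)); // [searchengine, indexlucene]
    }
}
```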

The PDF parser, as you can see from the other mail, is from
www.pdfbox.org and I highly recommend it (thanks again Ben!)




On Wed, 14 Aug 2002, Maurits van Wijland wrote:

> Keith,
>
> I haven't noticed the problem with the Parser...but you trigger me
> by saying that you have a PDFParser!!!
>
> Are you able to contribute this PDFParser??
>
> Maurits.



Re: problems with HTML Parser

Posted by Maurits van Wijland <m....@quicknet.nl>.

Keith,

I haven't noticed the problem with the Parser...but you trigger me
by saying that you have a PDFParser!!!

Are you able to contribute this PDFParser??

Maurits.



XML indexing in lucene

Posted by jamin rubio <jr...@jouve.fr>.

Hi all,

I'm a newbie to Lucene, and I have a question. Is it possible for
Lucene to re-index just a modified field in an index without re-indexing the
whole document? Can Lucene do partial indexing?

Cheers

