Posted to java-user@lucene.apache.org by Fred Toth <ft...@synernet.com> on 2004/09/24 19:58:47 UTC

demo IndexHTML parser breaks unicode?

Hi,

I was hoping it wouldn't come to this:

I've got unicode in my source HTML. In particular, within meta tags,
and it's getting broken by the indexer. Note that I'm not trying to
query on any of this, just store and retrieve document titles with
unicode characters.

Has anyone else experienced this? I know this is just a demo, but
it's been working really well and I hate to give it up!

Is this easily fixable? I'm a little worried by this comment in
SimpleCharStream.java:

/**
  * An implementation of interface CharStream, where the stream is assumed to
  * contain only ASCII characters (without unicode processing).
  */

This is likely a show-stopper for me on this parser.

Can anyone recommend the shortest path to another HTML parser
that is unicode friendly?

Thanks for anything.

Fred


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
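
A note on the symptom described above: one common way non-ASCII text gets broken before it ever reaches an index is decoding the file's bytes with the wrong character set. The thread does not confirm that this is the cause here, so the following is only a minimal sketch under that assumption. The class name CharsetDemo, the helper decode, and the sample title are hypothetical and not part of the Lucene demo. It shows that the same UTF-8 bytes round-trip correctly only when the reader is given the matching charset explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the Lucene demo.
class CharsetDemo {
    // Decode all bytes with the given charset and return the resulting text.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample title containing non-ASCII characters.
        String title = "Caf\u00e9 \u65e5\u672c\u8a9e";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the text survives the round trip.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(title));      // prints true
        // Mismatched charset: each multi-byte sequence becomes several wrong characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(title)); // prints false
    }
}
```

If the parser in use accepts a Reader, passing it one built this way keeps the decoding step under the caller's control instead of relying on the platform default encoding.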

Re: demo IndexHTML parser breaks unicode?

Posted by Fred Toth <ft...@synernet.com>.

Sorry, that didn't cure it.

Again, anyone want to point me to the quickest replacement
HTML parser (that's unicode clean)?

Thanks,

Fred

At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
>Please try again with the code from CVS, Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
>  Daniel
>
>--
>http://www.danielnaber.de




IndexHTML parser + Constructor

Posted by Karthik N S <ka...@controlnet.co.in>.


Hi


Apologies for the off-topic question.

Could somebody please tell me how to add a constructor to
'org.apache.lucene.demo.html.HtmlParser.java' that takes a String
argument, strips the HTML tags, and returns the String without tags?
Currently 'org.apache.lucene.demo.html.HtmlParser.java' accepts the
full path of a file and then reads the content to strip the tags.




Thx in Advance
Karthik
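
For reference, the behaviour asked for above (String in, tag-free String out) can be sketched without touching the demo parser at all. The following is a rough illustration only: it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) rather than the Lucene demo class, and the names TagStripper and stripTags are hypothetical. Whitespace between text runs is not preserved exactly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; uses the JDK's HTML parser, not the Lucene demo parser.
class TagStripper {
    // Parse the HTML held in a String and return only its text content.
    static String stripTags(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; join runs with a space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // The input is already a String, so no byte decoding happens here;
        // the third argument tells the parser to ignore any charset declaration.
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(stripTags("<html><body><p>Caf\u00e9 <b>world</b></p></body></html>"));
    }
}
```

Working from a Reader rather than a file path is the key design choice: any parser that can read from a Reader can be fed a String through StringReader.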


-----Original Message-----
From: Daniel Naber [mailto:daniel.naber@t-online.de]
Sent: Saturday, September 25, 2004 12:47 AM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?


On Friday 24 September 2004 19:58, Fred Toth wrote:

> I've got unicode in my source HTML. In particular, within meta tags,
> and it's getting broken by the indexer. Note that I'm not trying to
> query on any of this, just store and retrieve document titles with
> unicode characters.

Please try again with the code from CVS, Christoph Goller committed a fix
for this problem (at least I think it was this problem) 1-3 weeks ago.

Regards
 Daniel

--
http://www.danielnaber.de


Re: demo IndexHTML parser breaks unicode?

Posted by Daniel Naber <da...@t-online.de>.

On Friday 24 September 2004 19:58, Fred Toth wrote:

> I've got unicode in my source HTML. In particular, within meta tags,
> and it's getting broken by the indexer. Note that I'm not trying to
> query on any of this, just store and retrieve document titles with
> unicode characters.

Please try again with the code from CVS, Christoph Goller committed a fix 
for this problem (at least I think it was this problem) 1-3 weeks ago.

Regards
 Daniel

-- 
http://www.danielnaber.de
