Posted to java-user@lucene.apache.org by Karthik N S <ka...@controlnet.co.in> on 2004/10/01 10:38:31 UTC

IndexHTML parser + Constructor


Hi,


Apologies.

    Can somebody please tell me how to add a constructor to
'org.apache.lucene.demo.html.HtmlParser.java',
    so that it reads a String argument, strips the HTML tags, and
returns the String without the tags?
    Currently 'org.apache.lucene.demo.html.HtmlParser.java' accepts the
full path of a file and then reads
    its content to strip the tags.
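
    To make the request concrete, here is a minimal sketch of the kind of
constructor described above. It is a standalone stand-in, not the demo
parser itself: the class name StringTagStripper and the method getText()
are hypothetical and do not exist in Lucene. It only drops anything between
'<' and '>' (respecting quoted attribute values); it does not decode
entities or skip comment and script content.

```java
/**
 * Hypothetical stand-in for illustration only; NOT part of the Lucene demo.
 * Takes the HTML as a String in the constructor and exposes the text
 * with the tags removed.
 */
final class StringTagStripper {
    private final String text;

    /** Reads the HTML from a String instead of from a file path. */
    StringTagStripper(String html) {
        if (html == null) {
            throw new IllegalArgumentException("html must not be null");
        }
        this.text = strip(html);
    }

    /** Returns the input with everything between '<' and '>' removed. */
    String getText() {
        return text;
    }

    private static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        char quote = 0; // quote character currently open inside a tag, or 0
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (quote != 0) {
                    // Inside a quoted attribute value: only the matching quote ends it.
                    if (c == quote) {
                        quote = 0;
                    }
                } else if (c == '"' || c == '\'') {
                    quote = c;
                } else if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

    Usage would look like: new StringTagStripper(html).getText(). If the
copy of the demo parser in use also has a constructor that takes a
java.io.Reader (an assumption to check against the source), wrapping the
String in a java.io.StringReader may avoid writing any stripping code at all.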




Thanks in advance,
Karthik


-----Original Message-----
From: Daniel Naber [mailto:daniel.naber@t-online.de]
Sent: Saturday, September 25, 2004 12:47 AM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?


On Friday 24 September 2004 19:58, Fred Toth wrote:

> I've got unicode in my source HTML. In particular, within meta tags,
> and it's getting broken by the indexer. Note that I'm not trying to
> query on any of this, just store and retrieve document titles with
> unicode characters.

Please try again with the code from CVS, Christoph Goller committed a fix
for this problem (at least I think it was this problem) 1-3 weeks ago.

Regards
 Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

