Posted to java-user@lucene.apache.org by Sulman Sarwar <su...@gmail.com> on 2010/04/08 05:04:08 UTC

preserving markup of content ?

Hi All,

I am working on some language data that I need to index and search. I
have used Lucene to index plain text documents before (no fancy
tricks, just plain text indexing). The data I have now is transcribed
text and is heavily marked up (it is mostly conversations and
interviews). I could easily strip the markup, extract the text, and
feed it to a Lucene indexer, but I need to preserve some of the
important markup so that the text still makes sense at retrieval time.
If I leave the required markup intact and index the documents, I fear
the markup will be tokenized too and become searchable. I don't want
the markup to be searchable, but I need to keep it attached to the
actual text somehow to make retrieval easy. Can you suggest what to
do, and how? Correct me if I am wrong. :)

Thanks for the help.

Sulman.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: preserving markup of content ?

Posted by Uwe Schindler <uw...@thetaphi.de>.
The "simple" solution is straightforward:
Index the markup-free text as a field with Field.Index.ANALYZED and Field.Store.NO, so it is searchable but not stored. Then add the same data (this time with the markup intact) as a second field with Field.Store.YES but Field.Index.NO, so it is stored but not searchable. If you like, you can even use the same field name for both additions.
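In code, the two field additions could look roughly like this -- a minimal sketch against the Lucene 3.0-era API that was current when this thread was written (the constructors below were deprecated in later releases); the field name "content" and the variables plainText and markedUpText are placeholders for your own data:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// Analyzed but not stored: the markup-free text, searchable only
doc.add(new Field("content", plainText,
        Field.Store.NO, Field.Index.ANALYZED));
// Stored but not indexed: the marked-up text, retrievable only
doc.add(new Field("content", markedUpText,
        Field.Store.YES, Field.Index.NO));
writer.addDocument(doc);
writer.close();
```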

This works as long as you don't need query highlighting, because the token offsets from the analyzed (markup-free) field cannot be used for highlighting inside the stored text with markup. In that case, you have to write your own analyzer that removes the markup in the tokenizer but preserves the original offsets. One example of this is the Wikipedia contrib module in Lucene, which has a hand-crafted analyzer that can handle MediaWiki markup syntax.
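On the retrieval side of the two-field recipe, queries match against the analyzed tokens, while Document.get() hands back the stored value, i.e. the marked-up text. A rough sketch, again against the Lucene 3.0-era API; "dir" is the Directory the documents were indexed into and the query string is just an example:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

IndexSearcher searcher = new IndexSearcher(dir, true);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
        new StandardAnalyzer(Version.LUCENE_30));
// The query runs against the analyzed, markup-free tokens
TopDocs hits = searcher.search(parser.parse("interview"), 10);
for (int i = 0; i < hits.scoreDocs.length; i++) {
    // get("content") returns the stored value: the marked-up text
    String markedUp = searcher.doc(hits.scoreDocs[i].doc).get("content");
    System.out.println(markedUp);
}
searcher.close();
```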

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Sulman Sarwar [mailto:sulmansarwar@gmail.com]
> Sent: Thursday, April 08, 2010 5:04 AM
> To: java-user@lucene.apache.org
> Subject: preserving markup of content ?


