Posted to java-user@lucene.apache.org by Ahmet Arslan <io...@yahoo.com> on 2010/06/07 23:09:00 UTC

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

> I need to index HTML documents and one of the requirements
> is to highlight
> documents while maintaining all of the original formatting.
> The documents
> are relatively simple HTML, meaning no JavaScript code that
> changes elements
> at runtime or too fancy CSS styling.
> 
> I think it should be possible to write a tokenizer that
> strips out the HTML
> tags but maintains the original offsets within the HTML
> document so they
> can be used for highlighting the original HTML document,
> not just the
> text representation.
> 
> Does anybody know any tokenizers that can do this? It seems
> it's something
> other people may need too.
> 
> I am fairly new to Lucene so I may have chosen the wrong
> terminology but I
> hope this makes sense.

You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to add one or more org.apache.lucene.analysis.CharFilter(s) before the tokenizer in your analyzer.
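
To illustrate the general idea of skipping markup while still reporting offsets into the original HTML, here is a minimal plain-Java sketch. It does not use Lucene and does not reproduce how HTMLStripCharFilter works internally; the class HtmlOffsetSketch and its methods are hypothetical names made up for this example. Tag detection is deliberately naive: everything between '<' and '>' is skipped, and entities, comments and scripts are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Solr.
final class HtmlOffsetSketch {

    /** A term plus its start/end offsets in the ORIGINAL HTML string. */
    static final class Token {
        public final String term;
        public final int start; // inclusive offset in the original HTML
        public final int end;   // exclusive offset in the original HTML

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Emits runs of letters/digits found outside of tags, with original offsets. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        int start = -1;
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true; // naive assumption: '<' always opens a tag
            }
            boolean isTermChar = !inTag && Character.isLetterOrDigit(c);
            if (isTermChar) {
                if (term.length() == 0) {
                    start = i; // remember where the term begins in the HTML
                }
                term.append(c);
            } else if (term.length() > 0) {
                tokens.add(new Token(term.toString(), start, i));
                term.setLength(0);
            }
            if (c == '>') {
                inTag = false;
            }
        }
        if (term.length() > 0) {
            tokens.add(new Token(term.toString(), start, html.length()));
        }
        return tokens;
    }

    /** Wraps each token equal to query (ignoring case) in <em> tags in the original HTML. */
    static String highlight(String html, String query) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(html)) {
            if (t.term.equalsIgnoreCase(query)) {
                out.append(html, last, t.start)
                   .append("<em>")
                   .append(html, t.start, t.end)
                   .append("</em>");
                last = t.end;
            }
        }
        return out.append(html.substring(last)).toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // prints <p>Hello <b><em>world</em></b></p>
        System.out.println(highlight(html, "world"));
    }
}
```

Because the offsets refer to the original string, the highlight markers can be inserted without disturbing the existing tags. A word that is interrupted by an inline tag would come out as two tokens in this sketch.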


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

Posted by Uwe Schindler <uw...@thetaphi.de>.
> Hi Ahmet,
> 
> I am using Lucene.NET with C# so I can't test this quickly.
> Will HTMLStripCharFilter maintain the character offsets or does it just
> extract the plain text?

Yes, the CharFilter maintains the original character offsets!

Uwe




Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

Posted by Hans Merkl <hm...@rightonpoint.us>.
Hi Ahmet,

I am using Lucene.NET with C# so I can't test this quickly.
Will HTMLStripCharFilter maintain the character offsets or does it just
extract the plain text?

Hans


> You can use org.apache.solr.analysis.HTMLStripCharFilter. It is possible to
> add one or more org.apache.lucene.analysis.CharFilter(s) before the tokenizer in
> your analyzer.