Posted to java-user@lucene.apache.org by Doug Cutting <cu...@apache.org> on 2005/06/03 19:06:25 UTC

Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

Fred Toth wrote:
> I'm thinking we need something like "HTMLTokenizer" which bridges the
> gap between StandardAnalyzer and an external HTML parser. Since so
> many of us are dealing with HTML, I would think this would be generally
> useful for many problems. It could work this way:
> 
> Given this input:
> 
> <html><head><title>Howdy there</title></head><body>Hello world</body></html>
> 
> An HTMLTokenizer would deliver something like this sort of token stream
> (the numbers represent the start/end offsets for the token):
> 
> TAG, <html>, 0, 6
> TAG, <head>, 6, 12
> TAG, <title>, 12, 19
> WORD, Howdy, 19, 24
> WORD, there, 25, 30
> TAG, </title>, 30, 38
> etc.
> 
> Given the above, a filter could then strip out the HTML but pass the WORDs
> on to Lucene, preserving the offsets in the source file. These would be used
> later during highlighting. Clever filters could be selective about what gets
> stripped and what gets passed on.

For what it's worth, I think that's a good design and would love to see 
this as a contribution.
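
To make the idea concrete, here is a minimal sketch in plain Java. It is
only an illustration under a few assumptions: it does not use Lucene's
Tokenizer or TokenFilter classes, the names HtmlTokenSketch, Token, tokenize
and stripTags are made up for this example, and the tag scan is deliberately
naive (no handling of comments, scripts, entities, or '>' inside attribute
values). Offsets are start-inclusive and end-exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of Lucene and not Lucene's Tokenizer API.
class HtmlTokenSketch {

    /** A token with offsets into the original HTML (start inclusive, end exclusive). */
    static final class Token {
        final String type; // "TAG" or "WORD"
        final String text;
        final int start;
        final int end;

        Token(String type, String text, int start, int end) {
            this.type = type;
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + ", " + text + ", " + start + ", " + end;
        }
    }

    /** Splits HTML into TAG and WORD tokens, recording source offsets. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                // Naive tag scan: everything up to the next '>' is one TAG.
                int close = html.indexOf('>', i);
                int end = (close < 0) ? html.length() : close + 1;
                tokens.add(new Token("TAG", html.substring(i, end), i, end));
                i = end;
            } else if (Character.isLetterOrDigit(c)) {
                // A WORD is a maximal run of letters and digits.
                int end = i;
                while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
                    end++;
                }
                tokens.add(new Token("WORD", html.substring(i, end), i, end));
                i = end;
            } else {
                i++; // skip whitespace and punctuation
            }
        }
        return tokens;
    }

    /** The "filter" step: drop TAG tokens, keep WORDs with their original offsets. */
    static List<Token> stripTags(List<Token> tokens) {
        List<Token> words = new ArrayList<>();
        for (Token t : tokens) {
            if ("WORD".equals(t.type)) {
                words.add(t);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Howdy there</title></head>"
                + "<body>Hello world</body></html>";
        for (Token t : stripTags(tokenize(html))) {
            System.out.println(t);
        }
    }
}
```

Because each WORD keeps its position in the original file, a highlighter
could mark up html.substring(start, end) directly, without re-parsing the
HTML. A more selective filter would replace the type check in stripTags.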

Doug
