Posted to dev@lucene.apache.org by "Mark Miller (JIRA)" <ji...@apache.org> on 2008/12/04 13:39:45 UTC

[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

    [ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653283#action_12653283 ] 

Mark Miller commented on LUCENE-1286:
-------------------------------------

Hey Koji, I actually have some ideas to come back to this with, but I won't have time to work on it for a while.

bq. Can you elaborate this - "rebuild the document by running through the query terms by using their offsets"?

Part of the problem with the Highlighter and large docs is that it runs through every token in the doc and scores each one, building the original highlighted doc as it goes. For a large doc, that can be a bit slow. What Ronnie's highlighter did was look only at the offsets of the query terms (hence the need for term vectors), which allows you to rebuild the original highlighted document in big, quick chunks (stitching things together between query term offsets).
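
To make the "stitching" idea concrete, here is a minimal sketch in plain Java. It is not the code from the patch or from Ronnie's highlighter, and it uses no Lucene classes: the class name OffsetHighlightSketch, the highlight method, and the pre/post tag parameters are all made up for illustration. It assumes the query term offsets are already available as [start, end) character pairs (in practice they would come from term vectors) and simply copies the untouched text between them in whole chunks.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only: names and signature are assumptions,
// not part of Lucene or of the LUCENE-1286 patch.
final class OffsetHighlightSketch {

    /**
     * Rebuilds the highlighted text by visiting only the query term offsets.
     * Each offset is an int[]{start, end} pair with an exclusive end.
     */
    static String highlight(String text, int[][] offsets, String pre, String post) {
        // Work on a sorted copy so the caller's array is left untouched.
        int[][] sorted = offsets.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] o) -> o[0]));

        StringBuilder out = new StringBuilder(text.length());
        int cursor = 0; // end of the text already copied to the output

        for (int[] o : sorted) {
            int start = o[0];
            int end = o[1];
            // Skip empty, out-of-range, or overlapping offsets.
            if (start < cursor || start >= end || end > text.length()) {
                continue;
            }
            // Copy the whole unhighlighted chunk in one step.
            out.append(text, cursor, start);
            // Wrap the matched term.
            out.append(pre).append(text, start, end).append(post);
            cursor = end;
        }
        // Copy whatever follows the last match.
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

The loop runs once per query term occurrence rather than once per token in the document, which is where the expected win for large docs comes from. Position-sensitive matching (phrases, spans) is deliberately left out of this sketch.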

I was attempting a similar thing here with phrase and span support, but I couldn't match the speed of the current Span highlighter, because it can highlight non-position-sensitive terms very fast. My method required getting non-position-sensitive terms from the MemoryIndex as well (via getSpans), and the cost ruined any benefit. I have come up with a few things to try since then but haven't had the time to dedicate to it yet. It's hard to get around requiring term vectors (for the offsets), and I'd like to avoid that. At the same time, if you don't require term vectors, it's probably going to be pretty slow re-analyzing the documents anyway...

> LargeDocHighlighter - another span highlighter optimized for large documents
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1286
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>    Affects Versions: 2.4
>            Reporter: Mark Miller
>            Priority: Minor
>
> The existing Highlighter API is rich and well designed, but the approach taken is not very efficient for large documents.
> I believe that this is because the current Highlighter rebuilds the document by running through and scoring every token in the tokenstream.
> With a break in the current API, an alternate approach can be taken: rebuild the document by running through the query terms by using their offsets. The benefit is clear - a large doc will have a large tokenstream, but a query will likely be very small in comparison.
> I expect this approach to be quite a bit faster for very large documents, while still supporting Phrase and Span queries.
> First rough patch to follow shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

