Posted to dev@lucene.apache.org by "Chris Harris (JIRA)" <ji...@apache.org> on 2010/05/13 00:13:42 UTC

[jira] Created: (SOLR-1910) Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path

Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path
-----------------------------------------------------------------------------------------------------

Key: SOLR-1910
URL: https://issues.apache.org/jira/browse/SOLR-1910
Project: Solr
Issue Type: Improvement
Components: highlighter
Affects Versions: 1.4
Reporter: Chris Harris
Attachments: SOLR-1910.patch

Summary: Patch adds an hl.df parameter to help with (some) situations where the highlighter currently uses the "wrong" analyzer for highlighting.

What: hl.df is like the normal df parameter, except that it takes effect only during highlighting. (In fact the implementation is basically to temporarily mess with the normal df parameter at the start of highlighting, and then revert to the original value when highlighting is complete.) When hl.df is specified, we make sure not to use the Query object that was parsed by QueryComponent, but rather make our own. In the right circumstances anyway, this means that a more appropriate analyzer gets used for highlighting.
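
A minimal sketch of the "temporarily override df, then revert" idea described above. This is not the code from the attached patch: the Map of parameters is a hypothetical stand-in for the request parameters, and HlDfSketch and withHighlightDefaultField are names invented for illustration.

```java
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a plain Map stands in for the request parameters.
// None of these names come from Solr or from SOLR-1910.patch.
final class HlDfSketch {
    static final String DF = "df";
    static final String HL_DF = "hl.df";

    /**
     * Runs the highlighting step with "df" temporarily replaced by the value
     * of "hl.df" (when present), then restores the original "df" value.
     */
    static <T> T withHighlightDefaultField(Map<String, String> params,
                                           Function<Map<String, String>, T> highlightStep) {
        String hlDf = params.get(HL_DF);
        if (hlDf == null) {
            // No override requested: highlight using the normal default field.
            return highlightStep.apply(params);
        }
        boolean hadDf = params.containsKey(DF);
        String originalDf = params.get(DF);
        params.put(DF, hlDf);
        try {
            // The step is expected to build its own query from params here,
            // rather than reuse the query parsed earlier for search.
            return highlightStep.apply(params);
        } finally {
            // Revert so that later processing sees the original default field.
            if (hadDf) {
                params.put(DF, originalDf);
            } else {
                params.remove(DF);
            }
        }
    }
}
```

The try/finally is one way to make sure the original df value comes back even if the highlighting step throws; whether the patch handles that case the same way is not stated here.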

Motivation: Currently, in a normal query+highlighting request, the highlighter re-uses the Query object parsed by the QueryComponent. This can result in incorrect highlights if the field being highlighted is of a different type than the field being queried. In my particular case:
* My queries don't explicitly specify field names; they always rely on the default field
* My default field for search is "body"
* body is a unigram-plus-bigram field. So, e.g. input "audit trail" gets turned into tokens "audit / audit trail / trail". (This is a performance optimization.)
* If I try to highlight directly on "body", the highlights get screwed up. (This is because the highlighter doesn't really support the kind of "continuously overlapping" tokens generated by my analysis chain. In short, the bigrams confuse the TokenGroup class.)
* To avoid these highlighting problems, I don't directly highlight "body", but rather a "highlight" field, which has no bigram tokens. ("highlight" is populated from "body" with a copyField directive.)
* Without hl.df, I have a new class of highlighting problems. In particular, if the user enters a phrase search (e.g. "audit trail"), then that phrase appears unhighlighted in the highlighter output. In short, the analyzer used to parse the query produced a Query object that contains bigrams, but the text being highlighted doesn't contain bigrams.
* With hl.df, the analyzers match up for highlight; the Query object used for highlighting does _not_ contain bigrams, just like the "highlight" field.
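
The mismatch in the bullets above can be illustrated with a toy model. This is only a sketch: the two methods below are simplified stand-ins for the analysis of the "body" and "highlight" fields, not the real analyzers, and the real highlighter works on Query objects and token positions rather than plain token lists. All names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model of the two analysis paths; not Lucene or Solr code.
final class BigramMismatchSketch {
    // Lowercased whitespace tokens, standing in for the "highlight" field.
    static List<String> unigrams(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Unigram-plus-bigram tokens, standing in for the "body" field.
    static List<String> unigramsPlusBigrams(String text) {
        List<String> words = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            tokens.add(words.get(i));
            if (i + 1 < words.size()) {
                tokens.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return tokens;
    }

    // True if every query token also occurs among the field's tokens.
    static boolean allQueryTokensPresent(List<String> queryTokens, List<String> fieldTokens) {
        return fieldTokens.containsAll(queryTokens);
    }
}
```

In this model, a query analyzed the "body" way carries the token "audit trail", which never occurs among the unigram-only tokens of the "highlight" text, so allQueryTokensPresent returns false. Analyzing the query the "highlight" way yields only "audit" and "trail", which do occur.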

(I realize it may help to expand the description of this use case, but I'm a bit hurried right now.)

I wanted to throw this out there, partly in case people have any better solutions. One variation on hl.df option that might be worth considering is hl.UseHighlightedFieldAsDefaultField, which would create a new Query object not just once at the start of highlighting, but separately for each particular field that's getting highlighted.
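
A rough sketch of how the per-field variation could look, under the assumption that some function can parse a query string against a given default field. The parse argument is hypothetical and does not correspond to a real Solr API; PerFieldQuerySketch and queriesPerField are likewise invented names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative only; the parser is passed in as a function so that no
// real Solr or Lucene API is assumed.
final class PerFieldQuerySketch {
    /**
     * Builds one parsed query per highlighted field, using that field as the
     * default field, instead of one query for the whole highlighting pass.
     * "parse" is a hypothetical (queryString, defaultField) -> query function.
     */
    static <Q> Map<String, Q> queriesPerField(String queryString,
                                              List<String> highlightFields,
                                              BiFunction<String, String, Q> parse) {
        Map<String, Q> result = new LinkedHashMap<>();
        for (String field : highlightFields) {
            result.put(field, parse.apply(queryString, field));
        }
        return result;
    }
}
```

Compared with a single hl.df value, this would let each highlighted field be matched against a query analyzed with its own field type, at the cost of parsing the query once per field.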

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Updated: (SOLR-1910) Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path from search

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-1910:
-------------------------------

    Summary: Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path from search  (was: Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path)



[jira] Updated: (SOLR-1910) Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path from search

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-1910:
-------------------------------

    Attachment: SOLR-1910.patch

