Posted to dev@lucene.apache.org by "Zac Smith (Created) (JIRA)" <ji...@apache.org> on 2012/02/12 09:06:59 UTC

[jira] [Created] (SOLR-3127) Dismax to honor the KeywordTokenizerFactory when querying with multi word strings

Dismax to honor the KeywordTokenizerFactory when querying with multi word strings
---------------------------------------------------------------------------------

                 Key: SOLR-3127
                 URL: https://issues.apache.org/jira/browse/SOLR-3127
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis, search
    Affects Versions: 3.5
            Reporter: Zac Smith
            Priority: Minor


When using the KeywordTokenizerFactory with a multi word search string, the dismax query created is not very useful. Although the query analyzer doesn't tokenize the search input, each word of the input is included in the search.

e.g. if searching for 'chicken stock' the dismax query created would be:
+(DisjunctionMaxQuery((ingredient_synonyms:chicken^0.6)~0.01) DisjunctionMaxQuery((ingredient_synonyms:stock^0.6)~0.01)) DisjunctionMaxQuery((ingredient_synonyms:chicken stock^0.6)~0.01)

Note that although the query analyzer does not tokenize the term 'chicken stock' into 'chicken' and 'stock', they are still included and required in the search term.
I think the query created should be just:
DisjunctionMaxQuery((ingredient_synonyms:chicken stock)~0.01)
(or at least have the individual terms as should match rather than must match, so you could configure this with MM).

Example field type:
<fieldType name="keyword_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
	<analyzer type="index">
		<tokenizer class="solr.KeywordTokenizerFactory" />
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.KeywordTokenizerFactory" />
	</analyzer>
</fieldType>




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-3127) Dismax to honor the KeywordTokenizerFactory when querying with multi word strings

Posted by "Hoss Man (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235940#comment-13235940 ] 

Hoss Man commented on SOLR-3127:
--------------------------------

whitespace is a significant meta character to dismax (and for that matter, the main lucene QueryParser as well) ... it indicates the separation between optional clauses.

the query parsing structure is independent of the analyzer used, so the fact that a KeywordTokenizerFactory is used on the field in question is irrelevant. you might have another qf that doesn't have KeywordTokenizerFactory, so even if dismax tried to guess that it should treat the entire input as all one string, it couldn't do that for other fields.
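
To make the point above concrete, here is a toy model in Python. It is not Solr's or Lucene's actual code: the function names are hypothetical, and `shlex.split` is only a stand-in for the parser's whitespace and quote handling. It shows why a keyword-style analyzer never sees the full string unless the input is quoted: the split into clauses happens before any analyzer runs.

```python
import shlex


def split_into_clauses(user_input):
    """Toy stand-in for the parser step: split on whitespace, keep quoted runs whole."""
    return shlex.split(user_input)


def keyword_analyze(text):
    """Toy stand-in for a keyword tokenizer: emit the whole chunk as a single token."""
    return [text]


def build_clauses(user_input, field):
    """Analyze each chunk separately, as the parser hands chunks over one at a time."""
    clauses = []
    for chunk in split_into_clauses(user_input):
        for token in keyword_analyze(chunk):
            clauses.append(f"{field}:{token}")
    return clauses


unquoted = build_clauses('chicken stock', 'ingredient_synonyms')
quoted = build_clauses('"chicken stock"', 'ingredient_synonyms')

print(unquoted)  # two separate clauses, one per whitespace-separated chunk
print(quoted)    # one clause holding the whole string
```

In this model the analyzer is never asked whether the input should be split; by the time it runs, the unquoted input has already become two chunks.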

if you want your entire input to be treated as a literal, without treating whitespace as a meta-character, it needs to be quoted, or consider using an alternative parser (ie: the "field" QParser is designed for this type of "i want to query a single field for a specific value" situation).
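
As a rough sketch of the two options just described, the Python below builds request parameters for each. The field name is taken from the example in the issue; the variable names are hypothetical, and the snippet only assembles a query string, it does not talk to a Solr server. It assumes the `defType` request parameter and the `{!field f=...}` local-params syntax are available in your Solr version.

```python
from urllib.parse import urlencode

FIELD = "ingredient_synonyms"   # field from the issue's example
user_input = "chicken stock"    # raw text typed by the user

# Option 1: keep dismax, but wrap the input in double quotes so the
# whitespace is not treated as a clause separator.
# Note: input that itself contains double quotes is not handled here.
dismax_params = {
    "defType": "dismax",
    "qf": FIELD,
    "q": f'"{user_input}"',
}

# Option 2: use the "field" QParser via local params, which passes the
# whole value to the analyzer of a single field.
field_params = {
    "q": f"{{!field f={FIELD}}}{user_input}",
}

dismax_qs = urlencode(dismax_params)
field_qs = urlencode(field_params)

print(dismax_qs)
print(field_qs)
```

Option 2 only queries one field, so it fits the "single field, specific value" case; Option 1 keeps the rest of the dismax configuration in play.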
                
> Dismax to honor the KeywordTokenizerFactory when querying with multi word strings
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-3127
>                 URL: https://issues.apache.org/jira/browse/SOLR-3127
>             Project: Solr
>          Issue Type: Improvement
>          Components: query parsers
>    Affects Versions: 3.5
>            Reporter: Zac Smith
>            Priority: Minor
>              Labels: dismax


[jira] [Resolved] (SOLR-3127) Dismax to honor the KeywordTokenizerFactory when querying with multi word strings

Posted by "Hoss Man (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man resolved SOLR-3127.
----------------------------

    Resolution: Not A Problem

resolving since the issue here just seems to be a misunderstanding of how dismax works.

if you have questions about this, please start a thread on solr-user.  if you have specific suggestions for how to change dismax to work better in situations like yours (w/o breaking existing use cases obviously) or suggestions on improving the documentation, then by all means: please open a new issue with your suggestions
                