Posted to solr-dev@lucene.apache.org by "Preetam Rao (JIRA)" <ji...@apache.org> on 2008/07/16 08:13:31 UTC

[jira] Created: (SOLR-633) Requesthandler for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Requesthandler for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
-------------------------------------------------------------------------------------------------------------------------------------------

Key: SOLR-633
URL: https://issues.apache.org/jira/browse/SOLR-633
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.3
Environment: All
Reporter: Preetam Rao
Priority: Minor
Fix For: 1.3

Create a request handler (actually a QParser) for use with user-entered queries, with the following features:
a) Take a user query string and try to match it against multiple fields, while recognizing sub-phrase matches.
b) For each field, accept the parameters below:
1) phraseBoost - the factor that decides how much better an n-token sub-phrase match is than an (n-1)-token sub-phrase match.
2) maxScoreOnly - If there are multiple sub-phrase matches, pick only the highest-scoring one.
3) ignoreDuplicates - If the same sub-phrase query matches multiple times, pick only one.
4) disableOtherScoreFactors - Ignore tf, query norm, idf and any other parameters which are not relevant.
c) Try to provide all the parameters similar to dismax. Reuse or extend dismax.

Other suggestions and feedback appreciated :-)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Posted by "Preetam Rao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744117#action_12744117 ] 

Preetam Rao commented on SOLR-633:
----------------------------------

Hi, sorry for such a delay.

Let me take the example of a real estate site on which I tried to implement free-text search using a dismax query.

Also, when I say sub phrase, I mean adjacent terms appearing in a bigger phrase.
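
To make that definition concrete, here is a minimal Python sketch that enumerates the sub phrases of a tokenized query. The function name sub_phrases is made up for illustration and is not part of Solr or Lucene.

```python
def sub_phrases(tokens):
    """Yield every run of adjacent tokens, longest runs first."""
    n = len(tokens)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            yield tuple(tokens[start:start + length])


# Whitespace tokenization is an assumption; a real analyzer would differ.
phrases = list(sub_phrases("new york homes".split()))
```

Note that ("new", "homes") is not produced, because those two terms are not adjacent in the query.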

The index has the fields below, shown with an example record.
Let's say there are about 4 million records.

city - New York
state - NY
beds (multivalued or synonyms) - 3 beds, beds 3
baths (multivalued or synonyms) - 4 baths, baths 4
description - newly built with swimming pool, new furniture, car parking etc
sales type - new home

Let's say the user enters a query like "homes in new york for price 400k with 3 beds 4 baths with swimming pool car parking"

I played with dismax for a few days, trying out various boosts and factors. The phrase options of dismax are not very useful here because they require all terms of the phrase to appear in a given field (that's what it appeared like). A word like "new" appearing in the description field multiple times, or a city term like "york", seemed to cause some variation in the results.

The nature of the problem here is that sub phrases like "new york", "3 beds", "price 400k" and "car parking" become very important and must be matched in different fields without overlapping across fields.

This can best be solved by a SubPhraseQuery, which a DisMax-like query then uses to combine multiple fields.

Hence this is what I proposed:

SubPhraseQuery:
- Scores based on the longest sub phrases matched. Also gives a factor to boost based on match length. For example, a 4-word match gets a score of 16 vs. 9 for a 3-word match.
- Gives an option to score only one match per field. For example, a term like "new home" gets scored only once even if it occurs N times in the description field.
- Gives an option to score only the longest match. For example, an occurrence of "swimming pool" and some other "pool" scores only "swimming pool".
- As usual, the ability to ignore IDF, norms and any other factors, and just use the phrase match.
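
The scoring rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not an implementation of the proposed query: the function names are hypothetical, tokens are plain strings, and phrase_boost is assumed to be an exponent on the match length (which reproduces the 16 vs. 9 example). Only match length feeds the score, so tf, idf and norms are ignored by construction.

```python
def maximal_runs(query, field):
    """Return each maximal run of adjacent query tokens that also
    appears as adjacent tokens in the field (one entry per occurrence)."""
    runs = []
    for i in range(len(query)):
        for j in range(len(field)):
            if query[i] != field[j]:
                continue
            # Skip starts that merely continue a longer run to the left.
            if i > 0 and j > 0 and query[i - 1] == field[j - 1]:
                continue
            length = 0
            while (i + length < len(query) and j + length < len(field)
                   and query[i + length] == field[j + length]):
                length += 1
            runs.append(tuple(query[i:i + length]))
    return runs


def score_field(query, field, phrase_boost=2.0,
                ignore_duplicates=True, max_score_only=False):
    """Score one field from sub-phrase match lengths only."""
    runs = maximal_runs(query, field)
    if ignore_duplicates:
        runs = list(dict.fromkeys(runs))  # keep one copy of each sub phrase
    scores = [len(run) ** phrase_boost for run in runs]
    if max_score_only:
        return max(scores, default=0.0)
    return sum(scores, 0.0)


query = "swimming pool car parking".split()
description = ("newly built with swimming pool new furniture "
               "car parking pool").split()
total = score_field(query, description)
best_only = score_field(query, description, max_score_only=True)
```

With max_score_only=True, the stray "pool" at the end of the description no longer contributes and only the best sub-phrase match counts for the field.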

And a DisMax-like query that uses the above:
- Each field can be configured with the above query.
- Options to ignore matches in other fields when some field matches.
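
One possible reading of "ignore matches in other fields" is a greedy assignment in which each query token is credited to at most one field. The Python sketch below illustrates that reading only; the function names, the longest-first order and the field priority (dict order) are all assumptions, not part of the proposal.

```python
def contains_run(field, run):
    """True if the tokens in run appear adjacently, in order, in field."""
    n = len(run)
    return any(field[j:j + n] == run for j in range(len(field) - n + 1))


def assign_sub_phrases(query, fields, phrase_boost=2.0):
    """Greedily give each query token to at most one field.

    Longer sub phrases are tried first; fields are tried in dict order.
    Returns {field_name: [(sub_phrase, score), ...]}.
    """
    claimed = [False] * len(query)
    result = {name: [] for name in fields}
    for length in range(len(query), 0, -1):
        for start in range(len(query) - length + 1):
            if any(claimed[start:start + length]):
                continue  # part of this run already belongs to a field
            run = query[start:start + length]
            for name, tokens in fields.items():
                if contains_run(tokens, run):
                    result[name].append((" ".join(run), length ** phrase_boost))
                    claimed[start:start + length] = [True] * length
                    break
    return result


fields = {
    "city": ["new", "york"],
    "beds": ["3", "beds"],
    "baths": ["4", "baths"],
    "description": "newly built swimming pool new furniture car parking".split(),
    "sales_type": ["new", "home"],
}
query = "homes in new york with 3 beds 4 baths swimming pool car parking".split()
result = assign_sub_phrases(query, fields)
```

In this example "new york" is claimed by the city field, so the "new" in description and sales_type earns nothing extra, which is the non-overlapping behaviour described earlier.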

I feel this kind of use case will be encountered when form searches are migrated to free-text search, since we are trying to use Solr's free-text search on structured data where different fields have different meanings.

Probably dismax is meant for that use case. I spent a few days fine-tuning dismax for the above use case. I just felt that I had to play a lot with various factors; it looked like a lot of trial and error, and I still was not sure what the end results would look like. I felt that I needed more control over individual fields and over how a sub-phrase match would be scored in those fields.

Let me know your thoughts or alternatives and I will be glad to look at them.

> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>             Fix For: 1.5
>
>


[jira] Updated: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Posted by "Preetam Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Preetam Rao updated SOLR-633:
-----------------------------

        Fix Version/s:     (was: 1.3)
    Affects Version/s:     (was: 1.3)

Removed 1.3 as fix version

> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>


[jira] Updated: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-633:
---------------------------------------

         Fix Version/s: 1.4
     Affects Version/s: 1.4
    Remaining Estimate:     (was: 336h)
     Original Estimate:     (was: 336h)

> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>             Fix For: 1.4
>
>


[jira] Commented: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700422#action_12700422 ] 

Otis Gospodnetic commented on SOLR-633:
---------------------------------------

This description could sure use an example! :)  I read it 3 times and still don't have a good picture of what this is really about.


> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>             Fix For: 1.5
>
>


[jira] Updated: (SOLR-633) QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis

Posted by "Preetam Rao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Preetam Rao updated SOLR-633:
-----------------------------

    Summary: QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis  (was: Requesthandler for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis)

Changed Request handler to QParser in the title...

> QParser for use with user-entered query which recognizes subphrases as well as allowing some other customizations on per field basis
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-633
>                 URL: https://issues.apache.org/jira/browse/SOLR-633
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>         Environment: All
>            Reporter: Preetam Rao
>            Priority: Minor
>             Fix For: 1.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
