Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2010/07/24 05:51:50 UTC

[jira] Created: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

add a config hook for autoGeneratePhraseQueries
-----------------------------------------------

                 Key: SOLR-2015
                 URL: https://issues.apache.org/jira/browse/SOLR-2015
             Project: Solr
          Issue Type: New Feature
    Affects Versions: 3.1, 4.0
            Reporter: Koji Sekiguchi
            Priority: Minor
             Fix For: 3.1, 4.0


Now that LUCENE-2458 has been committed, a hook for autoGeneratePhraseQueries would be convenient in some situations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-2015:
-------------------------------

    Attachment: SOLR-2015.patch

OK, here's a prototype patch.
I'll add some tests next.

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Resolved: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-2015.
--------------------------------

    Resolution: Fixed

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892117#action_12892117 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

bq. Aside: some of these people would like multiple languages in the same field, which is part of the reason why I always felt that the information about how two tokens are related should be produced by the tokenizer/filter creating such tokens.

I don't think we should design our APIs around such hacks, especially unproven ones. I don't think the auto phrase generation actually helps English at all, and no one has shown results anywhere that it helps. The reason I don't think it helps is that any improvement in precision is accompanied by a decrease in recall: e.g. in this example from the user list, not using the phrase query would find the document, but if you use the phrase query, it doesn't. http://www.lucidimagination.com/search/document/bacf34995067e3cb/worddelimiterfilter_and_phrase_queries

Furthermore, I don't think we should try to build complicated support for multiple languages. Instead we should support simple, proven approaches such as simple language-independent tokenization or n-gram analysis that actually work, rather than trying to support fine-grained detection and fancy stuff that overly complicates APIs and only provides worse results: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6844


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891972#action_12891972 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

bq. How can I implement "on a per-field basis"?

For per-field control, you must do it in your subclass instead of the flag.
The easiest way is this:

{code}
@Override
protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
  // shouldAutoGeneratePhrasesFor is a helper you define in your subclass.
  // If we should generate phrases for this field, hardcode 'true' as quoted,
  // so all whitespace-separated parts of the query are treated as quoted.
  if (shouldAutoGeneratePhrasesFor(field)) {
    return super.getFieldQuery(field, queryText, true);
  }
  return super.getFieldQuery(field, queryText, quoted);
}
{code}
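
For illustration only, here is a minimal, Lucene-free sketch of the per-field decision that the override relies on. The class name PhraseFieldPolicy, the method effectiveQuoted, and the field names text_en and text_cjk are all hypothetical; only the name shouldAutoGeneratePhrasesFor comes from the snippet in this comment. It computes the value that would be passed as 'quoted' to super.getFieldQuery.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr: decides per field whether
// whitespace-separated query parts should be treated as quoted.
public class PhraseFieldPolicy {
    private final Set<String> phraseFields;

    public PhraseFieldPolicy(Collection<String> phraseFields) {
        this.phraseFields = new HashSet<>(phraseFields);
    }

    // Plays the role of shouldAutoGeneratePhrasesFor(field) in a QueryParser subclass.
    public boolean shouldAutoGeneratePhrasesFor(String field) {
        return phraseFields.contains(field);
    }

    // The value a subclass would pass to super.getFieldQuery as 'quoted':
    // forced to true for configured fields, otherwise the caller's value.
    public boolean effectiveQuoted(String field, boolean quoted) {
        return shouldAutoGeneratePhrasesFor(field) || quoted;
    }

    public static void main(String[] args) {
        // "text_en" and "text_cjk" are made-up field names.
        PhraseFieldPolicy policy = new PhraseFieldPolicy(Arrays.asList("text_en"));
        System.out.println(policy.effectiveQuoted("text_en", false));  // forced to quoted
        System.out.println(policy.effectiveQuoted("text_cjk", false)); // left unquoted
        System.out.println(policy.effectiveQuoted("text_cjk", true));  // user quoted it
    }
}
```

Keeping the field list in a set means the subclass itself stays unchanged when the set of phrase-generating fields changes.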


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892319#action_12892319 ] 

Yonik Seeley commented on SOLR-2015:
------------------------------------

What would the fieldType for a generic international field look like?
If we can decide on that, we could add it at least.


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892029#action_12892029 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

Though I disagree with a significant number of the statements you made,
I don't think we would ever come to agreement anyway.

but, my concerns about this default basically disappear if we could
have example configs for other languages: first-class in the example
schema.xml and not tucked away and difficult to find. could even be
commented out.

because my problem with the default is all about making it more
difficult to get reasonable behavior, forcing people to go through
unnecessary hoops when all this shit can easily work.

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Updated: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated SOLR-2015:
---------------------------------

    Attachment: SOLR-2015.patch

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892114#action_12892114 ] 

Yonik Seeley commented on SOLR-2015:
------------------------------------

bq. Ie such that on install you must make an explicit choice and copy the right files over, before starting Solr?

Solr doesn't have an installer though... you unzip and  "cd example; java -jar start.jar".
And there are also some people interested in multiple languages in the same index.  Aside: some of these people would like multiple languages in the same field, which is part of the reason why I always felt that the information about how two tokens are related should be produced by the tokenizer/filter creating such tokens.

bq. Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

I'm not sure what that would accomplish by itself though... it's not like solr is much of an out-of-the-box solution for anything.
We have a default example so that people can easily run through the tutorial, and execute examples on wiki pages.
If there is a single field type that is good for many non-whitespace languages, it seems like we should just add it to the example schema.
And if there is enough demand to demonstrate Solr's international capabilities, we could add a few different-language docs to example/exampledocs and perhaps even to the tutorial.

More OOTB support for many languages is related to SOLR-1860 too.



> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Updated: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-2015:
-------------------------------

    Attachment: SOLR-2015.patch

Here's an updated patch that adds a simple test, along with a note about autoGeneratePhraseQueries="true" not working well for non-whitespace-delimited languages.

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892015#action_12892015 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

bq. I'm upping this to the highest priority and taking it since the default behavior for our solr example server now really stinks. 

I don't think the default behavior stinks at all. As stated before, it now works with languages such as Thai, where it formerly didn't really work at all (all queries were phrase queries).
If you don't think the behavior for English is perfect, that's fine, but an open source product should work reasonably well for all languages.
So I don't think we should default with this behavior on, this behavior that is tied to whitespace-tokenization.


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892291#action_12892291 ] 

Michael McCandless commented on SOLR-2015:
------------------------------------------

Don't forget that this auto-phrase-gen is buggy: if the user's query
is wi fi, then this will *not* turn into a phrase.

Really, it's QueryParser that's buggy: it should not assume it can
pre-split on whitespace.
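
To make the asymmetry concrete, here is a toy sketch in plain Java (it is not Lucene's QueryParser, and the analyzer is a simplified stand-in, not WordDelimiterFilter). The class and method names are made up for illustration. The parser pre-splits on whitespace and analyzes each chunk in isolation, so the analyzer sees wi-fi as one chunk but never sees wi and fi together.

```java
import java.util.ArrayList;
import java.util.List;

public class PreSplitDemo {
    // Toy analyzer (an assumption for this sketch): lowercases and splits
    // on anything that is not a letter or digit.
    public static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy parser: pre-splits on whitespace, then analyzes each chunk alone.
    // A chunk that yields several tokens becomes a phrase only when the flag is on.
    public static List<String> parse(String query, boolean autoGeneratePhraseQueries) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            List<String> tokens = analyze(chunk);
            if (tokens.size() > 1 && autoGeneratePhraseQueries) {
                clauses.add("\"" + String.join(" ", tokens) + "\"");
            } else {
                clauses.addAll(tokens);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("wi-fi", true));  // one phrase clause
        System.out.println(parse("wi fi", true));  // two separate term clauses
        System.out.println(parse("wi-fi", false)); // two separate term clauses
    }
}
```

In this sketch no setting of the flag can make wi fi a phrase, because the decision is taken per whitespace-separated chunk.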

As Robert has pointed out, even if the feature weren't buggy, there's
no evidence auto-phrase-gen actually improves relevance even for
English.

Yet it's most definitely disastrous for non-whitespace languages (CJK,
Thai, etc.).

This is why, in my opinion, if we must pick a single global default
(for the 'text' field in Solr's example schema.xml), it should be
disabled by default: it's buggy for English and catastrophic for
non-whitespace languages.

To fix this "correctly", we somehow need a better QueryParser/Analyzer
interaction, such that all variants of wifi (WiFi, wifi, wi fi, wi-fi)
are consistently mapped during indexing and searching.  Just adding a
new per-token attr doesn't fix it (the wi fi example, above).

{quote}
I'm not sure what that would accomplish by itself though... it's not like solr is much of an out-of-the-box solution for anything.
We have a default example so that people can easily run through the tutorial, and execute examples on wiki pages.
{quote}

I suspect many apps take the default solrconfig/schema and run with
it / iteratively tweak it.

bq. Solr doesn't have an installer though... you unzip and "cd example; java -jar start.jar".

Maybe we insert a "cp {english,cjk}schema.xml schema.xml" in between
those two steps?  This would avoid the global default, ie, force an
explicit choice.

Or maybe we make separate default fieldTypes in schema.xml
(text_whitespace, text_non_whitespace -- need better names)?

Or, maybe we make this setting take three values: unset, on, off.  It
defaults to unset, but Solr refuses to run with this value, throwing
an exception saying you must set it?

Something along these lines would let us avoid having to agree on a
global default, ie, make the choice explicit.

This is just like what we did with maxFieldLength a while back.  Previously
it silently truncated after 10K terms, which was a dangerous default.  So, we
forced the choice, by making it a required param in IW.   (Later we
changed the default to no truncation, and made it no longer required.)
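
A rough sketch of the three-valued idea, in plain Java with no Solr dependency. Everything here (the PhraseSetting class, the AutoPhrase enum, the parse and resolve methods) is hypothetical and only illustrates "default to unset, refuse to run until it is set"; it is not how Solr reads its schema.

```java
import java.util.Locale;

public class PhraseSetting {
    // Hypothetical tri-state for the setting: nothing configured, on, or off.
    public enum AutoPhrase { UNSET, ON, OFF }

    // Maps a raw attribute value (possibly absent) to the tri-state.
    public static AutoPhrase parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return AutoPhrase.UNSET;
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "true":
                return AutoPhrase.ON;
            case "false":
                return AutoPhrase.OFF;
            default:
                throw new IllegalArgumentException("expected true or false, got: " + raw);
        }
    }

    // Refuses to pick a default: an unset value is a configuration error.
    public static boolean resolve(String fieldType, AutoPhrase value) {
        if (value == AutoPhrase.UNSET) {
            throw new IllegalStateException(
                "fieldType '" + fieldType + "': autoGeneratePhraseQueries must be set explicitly");
        }
        return value == AutoPhrase.ON;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text", parse("true")));
        try {
            resolve("text", parse(null));
        } catch (IllegalStateException e) {
            System.out.println("startup refused: " + e.getMessage());
        }
    }
}
```

Failing at startup rather than at query time keeps the choice visible to whoever edits the schema.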


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892024#action_12892024 ] 

Yonik Seeley commented on SOLR-2015:
------------------------------------

autoGeneratePhrase=true has been the behavior forever (before July 19th)... this just allows the behavior to be configured per-field.  Changing the default to false would only make sense if it was a better choice for the majority of our users... and I don't think it is.
Although back compat is not the primary concern here, it is nice that someone can switch to the newest version and cut-n-paste some of their previous field definitions that worked well for them.

Our example schema is english oriented.
All of the example docs are in english, the "text" field has an english stemmer, the tutorial is in english, and people must know english in order to collaborate with our development.  English is the international language and we shouldn't make relevancy worse for it and other whitespace delimited languages by default.

I do also want to make things work better for other international languages - but not at the cost of european languages.  Given our existing user base, I think that's an acceptable position.  Now that we have both the ability to turn off autoGeneratePhrase, and the ability to configure it per-field,  what international field types should we add to the example schema to improve the situation?


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Issue Comment Edited: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892319#action_12892319 ] 

Yonik Seeley edited comment on SOLR-2015 at 7/26/10 10:23 AM:
--------------------------------------------------------------

What would the fieldType for a generic international field look like?
If we can decide on that, we could add it at least.

edit: paths crossed - I see you answered that above.

      was (Author: yseeley@gmail.com):
    What would the fieldType for a generic international field look like?
If we can decide on that, we could add it at least.

  
> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891971#action_12891971 ] 

Koji Sekiguchi commented on SOLR-2015:
--------------------------------------

How can I implement "on a per-field basis"? The flag seems to affect globally.

> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892101#action_12892101 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

{quote}
Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

Ie such that on install you must make an explicit choice and copy the right files over, before starting Solr?
{quote}

+1, the config shouldn't be in English; English isn't the international language, and it's not special.

It might be important to Lucid or someone else, but I don't give a shit about it. 

This is an open source project; one language doesn't get to be held in higher esteem than another.


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892019#action_12892019 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

Yonik, I just don't think the default for autoGeneratePhraseQueries should be "true"; it should be "false" instead.
This is no problem for older existing schemas as the Version constant is respected already.
And I think it should be documented (e.g. in the example type text) that this option might not be suitable for non-whitespace separated languages.

Other than these concerns, I think doing this in the fieldtype is a good approach.


> add a config hook for autoGeneratePhraseQueries
> -----------------------------------------------
>
>                 Key: SOLR-2015
>                 URL: https://issues.apache.org/jira/browse/SOLR-2015
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Assignee: Yonik Seeley
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2015.patch, SOLR-2015.patch
>
>
> After committed LUCENE-2458, a hook for autoGeneratePhraseQueries will be convenient for some situation.



[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892330#action_12892330 ] 

Michael McCandless commented on SOLR-2015:
------------------------------------------

bq. The problem is not that we have an incorrect solution, but an incomplete solution.

True, but... I think you're splitting hairs ;)

From the user's standpoint, auto-phrase is flaky -- in some cases it
works, in others it doesn't.

{quote}
Let's assume we had a QP that didn't split on whitespace (or whatever our optimal solution is).
IMO, I would still want tokens joined by a dash to form a phrase query, just like tokens surrounded by quotes.
It's important information and shouldn't be discarded.
{quote}

I agree we shouldn't discard a user's dashes -- they are important.
Google also treats wizard-of-oz as a phrase query (Uwe seems
particularly fond of this!).

Hmm though I just tried wizard-of-oz, wizard of oz, and "wizard of
oz", and got 3 different sets of results, from Google... hmmm.

bq. We can have different text field types in a single schema - it's just a matter of adding another one that's good for non-whitespace delimited languages?

OK this seems like a good solution for now, until we fix QP/Analyzer
to do this "privately".





[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892314#action_12892314 ] 

Yonik Seeley commented on SOLR-2015:
------------------------------------

bq. is wi fi, then this will not turn into a phrase.

Right - but there's just a lack of information that can't be helped?
So while one might want stuff like this as a phrase, I don't think it's a bug that it's not.

What *is* a problem though is the lack of ability for the user to add additional context to fix the issue (i.e. a SynonymFilter to manually map "wi fi" wouldn't work, since it would get "wi" and then "fi" in separate runs).
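
As a rough illustration of the "separate runs" point, here is a toy sketch in plain Java. It does not use Lucene or Solr; the class name SeparateRunsSketch, the SYNONYMS map, and both methods are hypothetical stand-ins. It assumes a parser that splits the query on whitespace before analysis, so a two-word synonym rule never sees both words together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Lucene/Solr code. All names here are hypothetical. */
class SeparateRunsSketch {

    // Hypothetical multi-word synonym rule: "wi fi" should become "wifi".
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("wi fi", "wifi");
    }

    /**
     * Stand-in for analysis: applies the synonym rules to the text it is
     * handed (naive substring replacement), then splits on whitespace.
     */
    static List<String> analyze(String text) {
        String rewritten = text;
        for (Map.Entry<String, String> rule : SYNONYMS.entrySet()) {
            rewritten = rewritten.replace(rule.getKey(), rule.getValue());
        }
        List<String> tokens = new ArrayList<>();
        for (String token : rewritten.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Mimics a parser that splits on whitespace first and analyzes each chunk alone. */
    static List<String> analyzePerChunk(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            tokens.addAll(analyze(chunk));
        }
        return tokens;
    }
}
```

In this sketch, analyze("wi fi") sees both words and yields a single "wifi" token, while analyzePerChunk("wi fi") hands over "wi" and "fi" one at a time, so the rule never fires.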

Another problem is that if the original doc contained "wifi", then a query of "wi-fi" won't match (since it queries for "wi fi").  We work around this today (for people that really need it) by indexing a second field that catenates, instead of splits, the parts of a split token.  It's certainly not ideal, but people tend to be happy with the cases we can match.
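
The second-field workaround can be sketched with a toy model in plain Java. This is not the actual filter implementation; CatenateFieldSketch and its methods are hypothetical names, and the only assumption is that terms are split on dashes in one field and joined back together in the other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only: not Lucene/Solr code. All names here are hypothetical. */
class CatenateFieldSketch {

    /** "Split" field behavior: "wi-fi" becomes [wi, fi]; "wifi" stays [wifi]. */
    static List<String> splitParts(String term) {
        List<String> parts = new ArrayList<>();
        for (String part : term.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    /** "Catenate" field behavior: "wi-fi" becomes [wifi]; "wifi" stays [wifi]. */
    static List<String> catenateParts(String term) {
        String joined = String.join("", splitParts(term));
        List<String> tokens = new ArrayList<>();
        if (!joined.isEmpty()) {
            tokens.add(joined);
        }
        return tokens;
    }

    /** True if the query tokens occur as consecutive tokens in the document. */
    static boolean phraseMatches(List<String> docTokens, List<String> queryTokens) {
        if (queryTokens.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }
}
```

With a document containing "wifi" and a query of "wi-fi", the split field compares [wifi] against the phrase [wi, fi] and misses, while the catenated field compares [wifi] against [wifi] and matches.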

So our current system is far from perfect (and we should work on improving it).
The problem is not that we have an incorrect solution, but an incomplete solution.
Let's assume we had a QP that didn't split on whitespace (or whatever our optimal solution is).
IMO, I would still want tokens joined by a dash to form a phrase query, just like tokens surrounded by quotes.
It's important information and shouldn't be discarded.

bq.  there's no evidence auto-phrase-gen actually improves relevance even for English.

IMO, it's a case of "the customer is always right".   Many people have asked how to do this sort of matching over the years and I think there is plenty of evidence that it increases relevancy.

bq. Maybe we insert a "cp {english,cjk}schema.xml schema.xml" in between those two steps? This would avoid the global default, ie, force an explicit choice.

And the tutorial that's in English would tell them to copy the English one... that only hurts English speakers and doesn't help anyone else.
We can have different text field types in a single schema - it's just a matter of adding another one that's good for non-whitespace delimited languages?





[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892094#action_12892094 ] 

Michael McCandless commented on SOLR-2015:
------------------------------------------

Can we make different example config/schema XML files for whitespace vs non-whitespace languages?

I.e., such that on install you must make an explicit choice and copy the right files over, before starting Solr?




[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Koji Sekiguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891987#action_12891987 ] 

Koji Sekiguchi commented on SOLR-2015:
--------------------------------------

I see, thanks.




[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892318#action_12892318 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

bq. Many people have asked how to do this sort of matching over the years and I think there is plenty of evidence that it increases relevancy.

You still haven't provided any evidence.

bq. it's just a matter of adding another one that's good for non-whitespace delimited languages?

There isn't a single tokenizer that is good for all these languages. ICUTokenizer is OK on average for these, but it's not integrated.
I think we should add examples for all languages instead. The problem affects some whitespace-delimited languages, too.





[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891935#action_12891935 ] 

Yonik Seeley commented on SOLR-2015:
------------------------------------

This should really be on a per-field basis at a minimum.
Even better, it should be in the token stream itself (i.e. some produced groups of tokens should be treated as a phrase, and some shouldn't... only the filter producing them knows for sure).
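
A minimal sketch of what a per-field switch could look like, in plain Java. This is a toy model, not the patch attached to this issue; the class PerFieldAutoPhraseSketch, the AUTO_PHRASE map, the field names text_en and text_cjk, and the dash-splitting analyze method are all hypothetical. It only shows the decision being made per field: the same multi-token chunk becomes a phrase for one field and a set of separate term clauses for another.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Lucene/Solr code. All names here are hypothetical. */
class PerFieldAutoPhraseSketch {

    // Hypothetical per-field settings, standing in for a schema-level option.
    static final Map<String, Boolean> AUTO_PHRASE = new HashMap<>();
    static {
        AUTO_PHRASE.put("text_en", true);
        AUTO_PHRASE.put("text_cjk", false);
    }

    /** Stand-in analyzer: splits one whitespace-delimited chunk on dashes. */
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String part : chunk.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Renders the query for one chunk as a string, using the field's setting. */
    static String buildQuery(String field, String chunk) {
        List<String> tokens = analyze(chunk);
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        // Fields without an explicit setting fall back to no auto-phrase.
        if (AUTO_PHRASE.getOrDefault(field, false)) {
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        List<String> clauses = new ArrayList<>();
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return "(" + String.join(" ", clauses) + ")";
    }
}
```

Putting the decision in the token stream itself, as suggested above, would go further than this sketch: the component producing the tokens would mark which groups form a phrase, instead of one flag covering the whole field.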





[jira] Updated: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-2015:
-------------------------------

    Assignee: Yonik Seeley
    Priority: Blocker  (was: Minor)

I'm upping this to the highest priority and taking it, since the default behavior for our Solr example server now really stinks.




[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892304#action_12892304 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

Even for the euro-languages where people think this is helpful, it's sometimes a disaster.

I noticed a French case here where it caused a serious problem (enough for them to write custom code to try to get around it): http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance

Finally, I think this dictates behavior to the end user, and doesn't consider their information need at all.
Since Google etc. have become popular, I think users are familiar with putting things in quotes themselves.
So a user who wants this behavior (causing a phrase) can always trigger it by putting the query in quotes.

This allows them to refine the query themselves like they would do in any other situation; it's way more user-friendly
and consistent.





[jira] Commented: (SOLR-2015) add a config hook for autoGeneratePhraseQueries

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892321#action_12892321 ] 

Robert Muir commented on SOLR-2015:
-----------------------------------

bq. What would the fieldType for a generic international field look like?

All I am asking for is to add *commented out* text_XX examples for the languages we support?
This shouldn't affect the time it takes to start up Solr and would resolve my concerns.



