Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2012/06/03 14:25:22 UTC

[jira] [Created] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Erick Erickson created SOLR-3503:
------------------------------------

             Summary: Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware
                 Key: SOLR-3503
                 URL: https://issues.apache.org/jira/browse/SOLR-3503
             Project: Solr
          Issue Type: Improvement
            Reporter: Erick Erickson
            Assignee: Erick Erickson
            Priority: Minor
             Fix For: 4.0, 5.0


It seems to me that all the stemmers could be MultiTermAware. Does anyone know of a reason why not?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Jack Krupansky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288202#comment-13288202 ] 

Jack Krupansky commented on SOLR-3503:
--------------------------------------

Ultimately it may simply come down to better documentation of the interactions between stemming and wildcards. After all, the stemmer does do its thing at index time, so even if the stemmer is not called at all at query time, the user who wants to use wildcards needs to know what rules the stemmer used at index time.

In any case, I'll think about this a little more before proceeding. And as I said, the restriction is that the results can't be worse than they are today.

                


[jira] [Resolved] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Erick Erickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson resolved SOLR-3503.
----------------------------------

    Resolution: Invalid

Gaaah! That'll teach me to put up a JIRA when I haven't had enough coffee. I was just thinking about it in terms of the stemmer producing a single token, which would be fine.

The question of _what_ the stem wound up being, and the impossibility of "doing the right thing" given that transformation, completely escaped my not-yet-awake brain. Or what remains of it.

Especially when you consider embedded wildcards (e.g. bil*et) as you pointed out.

So I'm closing this as "invalid". I don't think it's worth the effort. If someone _really_ wants to do this, they can try it with the "multiterm" analysis chain definition and suffer the consequences...
                


[jira] [Commented] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288155#comment-13288155 ] 

Robert Muir commented on SOLR-3503:
-----------------------------------

Most stemmers use the length of the string or the syllable count. In general this won't work... I don't think we should do it.
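
One way to see the point about string length and syllable count: the original Porter algorithm gates many suffix rules on a "measure" m, roughly the number of vowel-consonant sequences in the candidate stem. The sketch below implements only that measure and a single rule, (m > 1) EMENT ->, from step 4 of the original Porter paper. It is not the Snowball code, and the function names are made up for illustration. It suggests why a wildcard fragment, which hides part of the word, can be treated differently from the full term.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_ement(word):
    """Apply one rule only: (m > 1) EMENT -> ''. Hypothetical helper."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word


# Full word: the part before "ement" has m = 2, so the suffix is stripped.
print(strip_ement("replacement"))  # replac
# Short word: the part before "ement" has m = 0, so nothing is stripped.
print(strip_ement("cement"))       # cement
# The literal text of a pattern like "*ement": nothing precedes the suffix.
print(strip_ement("ement"))        # ement
```

The same suffix is removed or kept depending on text that a wildcard would hide from the stemmer, so the analyzed fragment need not line up with what was indexed.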
                


[jira] [Reopened] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Erick Erickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson reopened SOLR-3503:
----------------------------------


Changing to "won't fix"
                


[jira] [Resolved] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Erick Erickson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson resolved SOLR-3503.
----------------------------------

    Resolution: Won't Fix

Jack:

Go ahead and have a whack at it if you want, but given that if one really _wants_ to, one can just define a "multiterm" section in the schema and put whatever one wants in there, I'm not inclined to spend time on this. The intent of the whole MultiTermAware bit was to do the safe, easily-explainable stuff. I suspect that this would just be a lot of effort for, arguably, no net benefit (by the time we had to explain all the caveats, whether it worked for language/stemmer X, Y or Z, etc.). But I'll be happy for you to prove me wrong....
                


[jira] [Commented] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Jack Krupansky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288174#comment-13288174 ] 

Jack Krupansky commented on SOLR-3503:
--------------------------------------

Or are you just trying to trick me into doing it?! (I may.)

I'm at least half-convinced that it would not be harmful, at least for some stemmers, and the changes would be stemmer-specific anyway, so it would give an incremental improvement even if it did not solve all issues for all stemmers.

How about changing the status to "Won't Fix" rather than "Invalid"?

                


[jira] [Commented] (SOLR-3503) Make SnowballPorterFilterFactory (and other stemmers?) MultiTermAware

Posted by "Jack Krupansky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-3503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288167#comment-13288167 ] 

Jack Krupansky commented on SOLR-3503:
--------------------------------------

It could be tricky, but it could work, provided users are made aware of how wildcards can interfere or interact with stemming. Testing is essential, as is good user documentation of how to navigate the stemming vs. wildcards minefield.

Unless the user actually knows what the stemmed term will be, even simple trailing wildcards can be tricky, since the stem could be much shorter than the user expects. For example, consider "investment*" where the actual stemmed and indexed term might be "invest" for a particular stemmer.

Leading wildcards can sometimes be okay, but that is completely dependent on the particular stemmer. For example, "*ment".

And simple embedded wildcards can be a real wildcard, once again depending on the specific stemmer. For example, "inve*ment".
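
The three cases above can be sketched with a toy model, assuming the wildcard pattern is matched as typed against the stemmed terms in the index. The `toy_stem` function is a hypothetical stand-in that strips a few suffixes; it is not Snowball or any real stemmer, and the matching uses Python's `fnmatch` rather than anything from Lucene.

```python
from fnmatch import fnmatchcase


def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips a few suffixes only."""
    for suffix in ("ments", "ment", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Index-time analysis: only the stemmed forms are stored."""
    return {toy_stem(w) for w in words}


def wildcard_hits(pattern, index_terms):
    """Match the raw pattern against indexed terms, with no query-time stemming."""
    return sorted(t for t in index_terms if fnmatchcase(t, pattern))


index_terms = build_index(["investment", "investments", "investing", "invests"])

for pattern in ("investment*", "*ment", "inve*ment", "invest*"):
    print(pattern, "->", wildcard_hits(pattern, index_terms))
```

With this toy stemmer every source word is indexed as "invest", so only a pattern that happens to fit the stem finds anything. A real stemmer will differ in its details, which is why the behaviour has to be checked per stemmer.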

But, I don't think any or all of those concerns are any worse than the situation we have today.

But, some robust tests would be needed to persuade me that this improvement is actually okay.

Right now, I say go for it, including the test examples for various stemmers and documentation for issues that users must be aware of (call it "safe wildcards in the presence of stemming.") I think the only restriction is that query results should not be worse than without this improvement.

Unfortunately, the documentation may be stemmer-dependent, and separate tests would be needed for each stemmer.

The bottom line is to reduce the surprise factor for the user.

As a side note, it would be nice if Solr had a mechanism to return "informative notes and warnings" with a query response. For example, "Wildcard term inves*ment matches no indexed terms".

                