Posted to solr-dev@lucene.apache.org by "Mike Klaas (JIRA)" <ji...@apache.org> on 2007/07/09 20:20:04 UTC

[jira] Created: (SOLR-293) Add "minPartLength" to WordDelimiterFilter

Add "minPartLength" to WordDelimiterFilter
------------------------------------------

                 Key: SOLR-293
                 URL: https://issues.apache.org/jira/browse/SOLR-293
             Project: Solr
          Issue Type: New Feature
          Components: update
    Affects Versions: 1.3
            Reporter: Mike Klaas
            Assignee: Mike Klaas
            Priority: Minor
             Fix For: 1.3


WDF is handy but over-tokenizes when faced with short word parts:

A9
R2D2
mp3

This creates one- or two-character tokens, which are extremely slow to query because their doc freq is so high (this accounts for a significant portion of our slowest queries).

This patch adds a "minPartLength" option that disables generation of parts below a certain length.  It is recommended to use it with catenateAll, so that no tokens are lost.
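
As a rough illustration of the option described above (this is not the attached patch), the following Python sketch mimics the part splitting with a simplified, ASCII-only regex and then drops parts below a minimum length. The name `word_parts`, its parameters, and the regex are hypothetical stand-ins, and always emitting the catenated token regardless of its length is an assumption.

```python
import re

# Simplified stand-in for WDF's splitting rules (ASCII only): a part is a run
# of digits, an optionally capitalized run of lowercase letters, or a run of
# uppercase letters. Anything else is treated as a delimiter.
_PART = re.compile(r"[0-9]+|[A-Z]?[a-z]+|[A-Z]+")


def word_parts(token, min_part_length=1, catenate_all=False):
    """Return the parts of `token` that are at least `min_part_length` long.

    With `catenate_all`, the concatenation of all parts is appended whenever
    the token was split into more than one part (assumed behavior).
    """
    parts = _PART.findall(token)
    kept = [p for p in parts if len(p) >= min_part_length]
    if catenate_all and len(parts) > 1:
        kept.append("".join(parts))
    return kept


if __name__ == "__main__":
    for tok in ("A9", "R2D2", "mp3", "PowerShot-mp3"):
        print(tok, word_parts(tok), word_parts(tok, 3, catenate_all=True))
```

Under these assumptions, `word_parts("mp3", 3)` returns an empty list, which is the token loss that catenateAll is meant to prevent; with `catenate_all=True` it returns `["mp3"]`.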

I'll add factory options and tests if we decide to include this (and are happy with the parameter name).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-293) Add "minPartLength" to WordDelimiterFilter

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-293:
----------------------------

    Fix Version/s:     (was: 1.3)



[jira] Commented: (SOLR-293) Add "minPartLength" to WordDelimiterFilter

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511306 ] 

Mike Klaas commented on SOLR-293:
---------------------------------

> Would it be useful to be able to configure this separately for words and numbers? 

I think it would, but I wasn't sure.  Trivial to implement in either case.

> Is there anything that can be done along the same lines, when not catenating for the query analyzer, so "foo-bar" will still become "foo bar", but "A9" would stay as "A9"?

There are a couple of ways to approach this (though I'm not exactly sure what your question is):

- instead of minimum part length, restrict analysis to tokens with length < some value.  With N=3, this would let "HiFi/hi-fi" -> "hi fi" but "hi8" -> "hi8".  This makes the setting dependent on separator characters.

- ensure character inclusion.  If any letter/number character was not included in any generated subpart, ensure that a larger containing token is generated.

"high-figh-888" -> "high figh 888" (and not "highfigh888")
"hi-fi-8" -> "hifi8"

- approach the delimiter question differently.  Currently, parts are delimited on case change, alpha->num (and vice versa), and delimiter chars.  The last is a much, much stronger lexical delimiter, and it would be nice to recognize the difference between "java5", "mp3", "4x4" and "99-bottle", "20-cent-piece", etc.

Save for the first, I can't think of easy, efficient implementations.  Perhaps WDF shouldn't get too sophisticated.
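
To make the first two approaches concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a proposed implementation: the splitting regex is a simplified ASCII-only stand-in for WDF, both function names are hypothetical, the first approach is read from its examples as "split only tokens longer than N", and the "larger containing token" in the second approach is assumed to be the catenation of all parts.

```python
import re

# Simplified, ASCII-only stand-in for WDF's splitting rules.
_PART = re.compile(r"[0-9]+|[A-Z]?[a-z]+|[A-Z]+")


def split_if_longer_than(token, n):
    """Approach 1, as read from the examples: tokens of length <= n are left
    whole; longer tokens are split. The length includes delimiter characters,
    which is why the setting depends on the separators used."""
    if len(token) <= n:
        return [token]
    return _PART.findall(token)


def parts_with_inclusion(token, min_part_length):
    """Approach 2: drop short parts, but if any part was dropped, also emit
    the catenation of all parts so that no character is lost (assumed choice
    of the containing token)."""
    parts = _PART.findall(token)
    kept = [p for p in parts if len(p) >= min_part_length]
    if len(kept) < len(parts):
        kept.append("".join(parts))
    return kept


if __name__ == "__main__":
    print(split_if_longer_than("hi-fi", 3), split_if_longer_than("hi8", 3))
    print(parts_with_inclusion("high-figh-888", 3))
    print(parts_with_inclusion("hi-fi-8", 3))
```

With a minimum part length of 3 (an assumed value), `parts_with_inclusion` reproduces the two examples given for the second approach; lowercasing is left to a later filter.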



[jira] Commented: (SOLR-293) Add "minPartLength" to WordDelimiterFilter

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511248 ] 

Yonik Seeley commented on SOLR-293:
-----------------------------------

Would it be useful to be able to configure this separately for words and numbers?
minWordLength
minNumberLength

On the indexing side, it makes sense to index "A9" and not "A" or "9"
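
A minimal Python sketch of what separate thresholds could look like follows. The parameter names `min_word_length` and `min_number_length` mirror the names suggested above, but the function, the simplified ASCII-only splitting regex, and the absence of any catenation step are assumptions made purely for illustration.

```python
import re

# Simplified, ASCII-only stand-in for WDF's splitting rules.
_PART = re.compile(r"[0-9]+|[A-Z]?[a-z]+|[A-Z]+")


def word_parts(token, min_word_length=1, min_number_length=1):
    """Keep alphabetic parts and numeric parts according to separate
    minimum lengths. No catenated token is produced in this sketch."""
    kept = []
    for part in _PART.findall(token):
        limit = min_number_length if part.isdigit() else min_word_length
        if len(part) >= limit:
            kept.append(part)
    return kept


if __name__ == "__main__":
    print(word_parts("java5", min_word_length=2, min_number_length=1))
    print(word_parts("20-cent-piece", min_word_length=3, min_number_length=1))
    print(word_parts("A9", min_word_length=2, min_number_length=2))
```

Without catenation, "A9" yields no tokens at all when both minimums are 2, so indexing "A9" itself would still need something like catenateAll.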

> It is recommended to use it with catenateAll

Is there anything that can be done along the same lines, when not catenating for the query analyzer, so "foo-bar" will still become "foo bar", but "A9" would stay as "A9"?


