Posted to solr-dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/01/08 14:08:54 UTC

[jira] Created: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

convert worddelimiterfilter to new tokenstream API
--------------------------------------------------

                 Key: SOLR-1710
                 URL: https://issues.apache.org/jira/browse/SOLR-1710
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
            Reporter: Robert Muir


This one was a doozy. Attached is a patch to convert it to the new TokenStream API.

Some of the logic was split into WordDelimiterIterator, which exposes a BreakIterator-like API for iterating over subwords.
The filter is much more efficient now: no cloning.

Before applying the patch, rename the existing WordDelimiterFilter to OriginalWordDelimiterFilter.
The patch includes a test case (TestWordDelimiterBWComp) that generates random strings from various subword combinations.
For each random string, it compares the output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently tests only 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here.
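
As a rough illustration of the comparison described above (not the actual TestWordDelimiterBWComp, which exists only in the patch), the following sketch enumerates every combination of nine boolean flags, since 2^9 = 512, and compares two implementations on random strings. The class name, the fragment list, and the BiFunction stand-ins for the old and new filters are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

class BackCompatSketch {
    // Nine boolean parameters give 2^9 = 512 combinations.
    static final int NUM_FLAGS = 9;

    // Hypothetical subword fragments; the real test's fragments are not shown in this thread.
    static final String[] FRAGMENTS = {"foo", "Bar", "123", "-", "'s", "_", "BAZ", "4x"};

    /** Decodes bit i of mask into flags[i]. */
    static boolean[] decode(int mask) {
        boolean[] flags = new boolean[NUM_FLAGS];
        for (int i = 0; i < NUM_FLAGS; i++) {
            flags[i] = (mask & (1 << i)) != 0;
        }
        return flags;
    }

    /** Builds a random string by concatenating one to five fragments. */
    static String randomString(Random random) {
        int parts = 1 + random.nextInt(5);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts; i++) {
            sb.append(FRAGMENTS[random.nextInt(FRAGMENTS.length)]);
        }
        return sb.toString();
    }

    /** Runs both implementations for every flag combination and collects mismatches. */
    static List<String> compare(BiFunction<String, boolean[], List<String>> oldImpl,
                                BiFunction<String, boolean[], List<String>> newImpl,
                                int iterations, long seed) {
        Random random = new Random(seed);
        List<String> mismatches = new ArrayList<>();
        for (int iter = 0; iter < iterations; iter++) {
            String input = randomString(random);
            for (int mask = 0; mask < (1 << NUM_FLAGS); mask++) {
                boolean[] flags = decode(mask);
                if (!oldImpl.apply(input, flags).equals(newImpl.apply(input, flags))) {
                    mismatches.add(input + " flags=" + Arrays.toString(flags));
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Stand-in "filters": both split on '-', so comparing one against itself finds nothing.
        BiFunction<String, boolean[], List<String>> impl =
            (s, flags) -> Arrays.asList(s.split("-"));
        System.out.println(compare(impl, impl, 10, 42L).size() + " mismatches"); // prints "0 mismatches"
    }
}
```

Fixing the seed keeps any failure reproducible, which matters when a mismatch only shows up for one string and one flag combination.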


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1710:
------------------------------

    Attachment: SOLR-1710.patch


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798260#action_12798260 ] 

Chris Male commented on SOLR-1710:
----------------------------------

I am working with this patch with the goal of simplifying its logic and increasing readability.  Seems great thus far though.


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798241#action_12798241 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Chris, not really. As the description says:
before applying the patch, rename the existing WordDelimiterFilter to OriginalWordDelimiterFilter

I guess this should say instead: make a copy of... I will fix.

Obviously OriginalWordDelimiterFilter should not be committed, nor this random test that compares results against it.

But for now it's convenient, while working on the issue, to simply blast random strings against the old filter for testing.


[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1710:
------------------------------

    Description: 
This one was a doozy. Attached is a patch to convert it to the new TokenStream API.

Some of the logic was split into WordDelimiterIterator, which exposes a BreakIterator-like API for iterating over subwords.
The filter is much more efficient now: no cloning.

Before applying the patch, copy the existing WordDelimiterFilter to OriginalWordDelimiterFilter.
The patch includes a test case (TestWordDelimiterBWComp) that generates random strings from various subword combinations.
For each random string, it compares the output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently tests only 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here.


  was:
This one was a doozy. Attached is a patch to convert it to the new TokenStream API.

Some of the logic was split into WordDelimiterIterator, which exposes a BreakIterator-like API for iterating over subwords.
The filter is much more efficient now: no cloning.

Before applying the patch, rename the existing WordDelimiterFilter to OriginalWordDelimiterFilter.
The patch includes a test case (TestWordDelimiterBWComp) that generates random strings from various subword combinations.
For each random string, it compares the output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently tests only 256 of these combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798268#action_12798268 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Chris, yeah, it's supposed to be similar to http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29

I started by mimicking this API somewhat. I guess a future improvement would be if this somehow truly were a real BreakIterator.
Then, say, you could create a RuleBasedBreakIterator or DictionaryBasedBreakIterator (which are fast compiled DFAs) and customize how words are delimited.
Currently, you can only do this by customizing the charTypeTable, which cannot take any context into account, so it's rather limited.

All of the above is really just theoretical and not anything we should worry about. For practical purposes I mimicked the BreakIterator API (but diverged somewhat), just because I am used to working with it and found it was one way to separate out a lot of the logic.
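
To make the idea concrete, here is a hypothetical, much-simplified iterator in the same spirit: a per-character type table drives the splitting, and next() returns the end offset of the next subword or DONE. It is not the WordDelimiterIterator from the patch; the class name, the four character classes, and the splitting rule are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class SubwordIteratorSketch {
    /** Returned by next() when no subwords remain, mirroring BreakIterator.DONE. */
    static final int DONE = -1;

    // Character classes (assumed for this sketch).
    static final byte DELIM = 0, LOWER = 1, UPPER = 2, DIGIT = 3;

    // One entry per ASCII character. Each lookup sees a single char only,
    // so the table itself cannot take surrounding context into account.
    static final byte[] CHAR_TYPE_TABLE = new byte[128];
    static {
        for (int c = 0; c < CHAR_TYPE_TABLE.length; c++) {
            if (Character.isLowerCase(c)) CHAR_TYPE_TABLE[c] = LOWER;
            else if (Character.isUpperCase(c)) CHAR_TYPE_TABLE[c] = UPPER;
            else if (Character.isDigit(c)) CHAR_TYPE_TABLE[c] = DIGIT;
            // everything else keeps the default value 0, i.e. DELIM
        }
    }

    private final char[] text;
    /** Start (inclusive) and end (exclusive) of the current subword. */
    int current;
    int end;

    SubwordIteratorSketch(String s) {
        this.text = s.toCharArray();
    }

    static byte charType(char c) {
        return c < CHAR_TYPE_TABLE.length ? CHAR_TYPE_TABLE[c] : DELIM;
    }

    /** Advances to the next subword and returns its end offset, or DONE. */
    int next() {
        current = end;
        // Skip delimiters in front of the subword.
        while (current < text.length && charType(text[current]) == DELIM) {
            current++;
        }
        if (current >= text.length) {
            end = text.length;
            return DONE;
        }
        // Extend while the type is unchanged; an uppercase letter followed by
        // lowercase letters ("Power") also stays in one subword.
        byte last = charType(text[current]);
        end = current + 1;
        while (end < text.length) {
            byte type = charType(text[end]);
            if (type != last && !(last == UPPER && type == LOWER)) {
                break;
            }
            last = type;
            end++;
        }
        return end;
    }

    /** Collects all subwords; the caller only tests for DONE and reads the offsets from the fields. */
    static List<String> subwords(String s) {
        SubwordIteratorSketch it = new SubwordIteratorSketch(s);
        List<String> out = new ArrayList<>();
        while (it.next() != DONE) {
            out.add(s.substring(it.current, it.end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(subwords("Wi-Fi500")); // prints [Wi, Fi, 500]
    }
}
```

Swapping in a different table changes which characters count as delimiters, letters, or digits, but no table entry can express a rule such as "treat a hyphen as part of the word only between two digits".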



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798266#action_12798266 ] 

Chris Male commented on SOLR-1710:
----------------------------------

Just wondering what the return value of WordDelimiterIterator#next() is supposed to indicate? I see that it returns either the end index or DONE, but this value never seems to be used by the filter. Does it have a role?
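
For comparison, this is how the return value of next() is typically consumed with the JDK's own java.text.BreakIterator: each call returns the offset of the following boundary, or BreakIterator.DONE when the text is exhausted. The demo class and method names are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorDemo {
    /** Splits text at every boundary reported by a word-instance BreakIterator. */
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() yields the offset of the following boundary, or DONE at the end.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segments("Wi-Fi 500GB"));
    }
}
```

Here the returned offset does double duty as the loop condition and as the end of the current segment. An iterator that also exposes its offsets as fields only needs the return value for the DONE check.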


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798239#action_12798239 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Yonik, thanks. Again, I have a hesitation: the SOLR-1706 problem.

If I could fix this bug in the original code, I would be able to enable the problematic combinations in backwards testing:
* catenateNumbers != catenateWords
* generateWordParts != generateNumberParts

I was unable to figure this one out, though, so excluding these from the test makes me a little nervous... what is there to do?



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798245#action_12798245 ] 

Chris Male commented on SOLR-1710:
----------------------------------

Ah right, sorry, I missed that in the description.


[jira] Resolved: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-1710.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 3.1
         Assignee: Mark Miller

This was resolved in revision 922957.


[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated SOLR-1710:
-----------------------------

    Attachment: SOLR-1710-readable.patch

Attaching a first pass at improving the readability of this code.

Focused mostly on breaking up #incrementToken, extracting common behavior into helper methods, documenting each method, putting fields in a consistent place, trimming if/else blocks, etc.

I imagine there might be a small performance improvement from these changes, but the compiler could have made all of them too.


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798234#action_12798234 ] 

Yonik Seeley commented on SOLR-1710:
------------------------------------

> For each random string, it compares output against the existing WordDelimiterFilter for all 512 combinations of boolean parameters

Whew... nice thorough work.


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798248#action_12798248 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Chris, no problem, I created this confusion until the patch is OK'ed.

Once this happens, I can include some additional test cases that I had problems with.
I have all 7 revisions I made of this filter locally, so I can see which scenarios fail on each previous iteration. I think these are good tests.



[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1710:
------------------------------

    Attachment: SOLR-1710.patch

For the 'WDF is only modifying a single word with punctuation' case, don't clearAttributes() if it's the first token, even though it's modified... unless preserveOriginal is on (in that case the preserved original already contained the attributes, and we must clear).

This is a little confusing, since the behavior for custom attributes depends on the preserveOriginal value, but I think it makes sense.
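
The rule above can be written as a small truth table. This is only a sketch of the stated rule for the single-word-with-punctuation case; the helper name and its two parameters are hypothetical, and treating every non-first token as needing a clear is an assumption, not something taken from the patch.

```java
class ClearAttributesRule {
    /**
     * Decides whether attributes should be cleared before emitting a token
     * in the single-word-with-punctuation case.
     */
    static boolean shouldClear(boolean firstToken, boolean preserveOriginal) {
        // The first token keeps the incoming attributes, unless the preserved
        // original has already carried them (preserveOriginal is on).
        // Assumption: any later token is always cleared.
        return !firstToken || preserveOriginal;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean firstToken : values) {
            for (boolean preserveOriginal : values) {
                System.out.println("firstToken=" + firstToken
                    + " preserveOriginal=" + preserveOriginal
                    + " -> clear=" + shouldClear(firstToken, preserveOriginal));
            }
        }
    }
}
```

Only one of the four rows skips the clear: first token with preserveOriginal off. That is the row where custom attributes set by earlier filters survive on the modified token.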


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798240#action_12798240 ] 

Chris Male commented on SOLR-1710:
----------------------------------

Hi,

I notice in the patch that it references OriginalWordDelimiterFilter in TestWordDelimiterBWComp.  Is this an error?

Cheers


[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Chris Male (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated SOLR-1710:
-----------------------------

    Attachment: SOLR-1710-readable.patch

Updated patch with method name changes: doXYZ is now shouldXYZ, and writeClear is now writeAndClear.


[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798261#action_12798261 ] 

Robert Muir commented on SOLR-1710:
-----------------------------------

Thanks in advance, Chris. I will help with testing and benchmarking anything you can do.
I think I may have taken it as far as I can go; my head almost exploded.

