
Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/03/24 14:42:24 UTC

[jira] [Created] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

improve BaseTokenStreamTestCase random string generation
--------------------------------------------------------

                 Key: LUCENE-3911
                 URL: https://issues.apache.org/jira/browse/LUCENE-3911
             Project: Lucene - Java
          Issue Type: Task
          Components: general/test
    Affects Versions: 3.6, 4.0
            Reporter: Robert Muir
         Attachments: LUCENE-3911.patch

Most analysis tests use MockTokenizer (which splits on whitespace), but
it's rare that we generate a string with 'many tokens'. So I think we should
try to generate more realistic test strings.
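
A minimal sketch of the idea, not the code from the attached patch: build each
test string out of several whitespace-separated words, so that a tokenizer
which splits on whitespace sees many tokens. The class name, method name, and
parameters are hypothetical, and plain java.util.Random with ASCII letters
stands in for Lucene's own test utilities.

```java
import java.util.Random;

final class ManyTokenStrings {
  /**
   * Returns 1..maxWords words joined by single spaces.
   * Each word has 1..maxWordLength lowercase ASCII letters.
   */
  static String randomMultiTokenString(Random random, int maxWords, int maxWordLength) {
    int numWords = 1 + random.nextInt(maxWords);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numWords; i++) {
      if (i > 0) {
        sb.append(' '); // token separator for a whitespace-splitting tokenizer
      }
      int length = 1 + random.nextInt(maxWordLength);
      for (int j = 0; j < length; j++) {
        sb.append((char) ('a' + random.nextInt(26)));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Random random = new Random(42L); // fixed seed so a run can be reproduced
    for (int i = 0; i < 3; i++) {
      System.out.println(randomMultiTokenString(random, 20, 8));
    }
  }
}
```

A fixed seed is used only to make the printed strings reproducible; a real
randomized test would take its seed from the test framework.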

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237526#comment-13237526 ] 

Robert Muir commented on LUCENE-3911:
-------------------------------------

One bug is that this generates overly short words, since the maxWordLength we pass in is treated as a maximum,
but we want it to be the exact number of elements. I'll improve this.
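
To illustrate the distinction (a sketch with hypothetical helper names, not
the patch itself): if the length argument is only an upper bound, the helper
picks a random length at or below it, so words come out shorter than the
caller intended. Treating the argument as an exact length avoids that.

```java
import java.util.Random;

final class WordLengthSketch {
  /** Treats maxLength as an upper bound: the result has 0..maxLength letters. */
  static String wordUpTo(Random random, int maxLength) {
    return wordOfLength(random, random.nextInt(maxLength + 1));
  }

  /** Treats length as exact: the result always has exactly `length` letters. */
  static String wordOfLength(Random random, int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + random.nextInt(26)));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    System.out.println("upper bound 8: " + wordUpTo(random, 8));
    System.out.println("exactly 8:     " + wordOfLength(random, 8));
  }
}
```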
                


[jira] [Resolved] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3911.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

I think this is much better. If you want to see what the test strings look like now, have a look at:

    ant test -Dtestcase=TestMockAnalyzer -Dtestmethod=testRandomStrings -Dtests.verbose=true
                


[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237561#comment-13237561 ] 

Robert Muir commented on LUCENE-3911:
-------------------------------------

I committed this. I have one more minor improvement (to make the randomRealistic more realistic).
I'll put up a patch.

Long term (for another day, another issue), I think we should actually refactor this stuff with LineDocs
so that LineDocs can return 'synthetic' linedocs too; that way non-analysis tests can use this as well.
                


[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3911:
--------------------------------

    Attachment: LUCENE-3911.patch

Attached is a patch. It also fixes off-by-one length bugs in all the _TestUtil string generation methods :)
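
The thread does not show the bugs themselves. One common form of off-by-one
in length selection, shown here purely as an assumption about what such a bug
can look like, is that Random.nextInt(n) excludes n, so a "max length"
parameter is never actually reached. The class and method names below are
hypothetical.

```java
import java.util.Random;

final class LengthBounds {
  /**
   * Off-by-one form: nextInt(maxLength) returns 0..maxLength-1,
   * so maxLength itself is never chosen (and maxLength == 0 throws).
   */
  static int lengthExclusive(Random random, int maxLength) {
    return random.nextInt(maxLength);
  }

  /** Inclusive form: returns 0..maxLength. */
  static int lengthInclusive(Random random, int maxLength) {
    return random.nextInt(maxLength + 1);
  }

  public static void main(String[] args) {
    Random random = new Random(7L);
    int maxLength = 5;
    int largestExclusive = 0;
    int largestInclusive = 0;
    for (int i = 0; i < 10_000; i++) {
      largestExclusive = Math.max(largestExclusive, lengthExclusive(random, maxLength));
      largestInclusive = Math.max(largestInclusive, lengthInclusive(random, maxLength));
    }
    System.out.println("largest length seen, exclusive form: " + largestExclusive);
    System.out.println("largest length seen, inclusive form: " + largestInclusive);
  }
}
```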
                


[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3911:
--------------------------------

    Attachment: LUCENE-3911_more.patch

Trivial patch: forces us to pass minLength as well to randomRealistic, so in that case we get whole words in the same Unicode block (good for stemmers). It also sometimes uses randomRegexpIshString, so we get lots of punctuation (good for tokenizers, filters, etc.).
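
A rough sketch of the "whole words in the same Unicode block" idea, under
stated assumptions: the letter ranges are hand-picked for illustration, the
class and method names are hypothetical, and this is not how the Lucene test
utilities are implemented. Each word draws all of its code points from one
range, with a length between minLength and maxLength inclusive. The
punctuation-heavy variant mentioned above is left out.

```java
import java.util.Random;

final class SameBlockWords {
  // Hypothetical inclusive letter ranges; each lies inside a single Unicode block.
  static final int[][] RANGES = {
    {0x0061, 0x007A}, // Basic Latin: a..z
    {0x03B1, 0x03C9}, // Greek: alpha..omega
    {0x0430, 0x044F}, // Cyrillic: a..ya
    {0x05D0, 0x05EA}, // Hebrew: alef..tav
  };

  /** Returns a word of minLength..maxLength code points, all from one range. */
  static String randomWord(Random random, int minLength, int maxLength) {
    int[] range = RANGES[random.nextInt(RANGES.length)];
    int length = minLength + random.nextInt(maxLength - minLength + 1);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
      sb.appendCodePoint(range[0] + random.nextInt(range[1] - range[0] + 1));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 5; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(randomWord(random, 3, 8));
    }
    System.out.println(sb);
  }
}
```

Passing both bounds means a caller can ask for words of a fixed, realistic
length by setting minLength equal to maxLength.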
                


[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237547#comment-13237547 ] 

Michael McCandless commented on LUCENE-3911:
--------------------------------------------

Looks great!
                


[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase random string generation

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3911:
--------------------------------

    Attachment: LUCENE-3911.patch
    
