Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/03/24 14:42:24 UTC
[jira] [Created] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
improve BaseTokenStreamTestCase random string generation
--------------------------------------------------------
Key: LUCENE-3911
URL: https://issues.apache.org/jira/browse/LUCENE-3911
Project: Lucene - Java
Issue Type: Task
Components: general/test
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
Attachments: LUCENE-3911.patch
Most analysis tests use MockTokenizer (which splits on whitespace), but
it's rare that we generate a string with 'many tokens'. So I think we should
try to generate more realistic test strings.
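As a rough illustration of the idea (not the code in the attached patch), a test string with many whitespace-separated tokens can be built by joining several short random words. The class and method names below are hypothetical and are not part of Lucene's test framework.

```java
import java.util.Random;

/** Hypothetical helper for illustration; not the code from the LUCENE-3911 patch. */
final class ManyTokenStrings {

  /** Returns a random lowercase ASCII word with length in [1, maxWordLength]. */
  static String randomWord(Random random, int maxWordLength) {
    int length = 1 + random.nextInt(maxWordLength);
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + random.nextInt(26)));
    }
    return sb.toString();
  }

  /** Joins exactly numWords random words with single spaces. */
  static String randomSentence(Random random, int numWords, int maxWordLength) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numWords; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(randomWord(random, maxWordLength));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Random random = new Random(42L);
    String text = randomSentence(random, 20, 8);
    System.out.println(text);
    // A whitespace-splitting tokenizer would see this many tokens.
    System.out.println(text.split(" ").length + " tokens");
  }
}
```

A single random string of the same total length would only rarely contain this many spaces, which is the gap the issue describes.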
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237526#comment-13237526 ]
Robert Muir commented on LUCENE-3911:
-------------------------------------
One bug is that this generates overly short words, since the maxWordLength we pass in is really a max,
but we would want that to be the exact number of elements. I'll improve this.
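One way to read this comment: if the caller has already chosen a random length and the callee treats that value as an upper bound and picks again, the lengths skew short. The sketch below compares the two interpretations. The names upTo and exactly are hypothetical, and this is an assumption about the kind of problem meant, not a reproduction of the Lucene code.

```java
import java.util.Random;

/** Hypothetical illustration; these names are not Lucene APIs. */
final class WordLengthSketch {

  /** Treats the argument as an upper bound: length is uniform in [0, maxLength]. */
  static String upTo(Random random, int maxLength) {
    return exactly(random, random.nextInt(maxLength + 1));
  }

  /** Treats the argument as the exact number of characters. */
  static String exactly(Random random, int length) {
    char[] chars = new char[length];
    for (int i = 0; i < length; i++) {
      chars[i] = (char) ('a' + random.nextInt(26));
    }
    return new String(chars);
  }

  public static void main(String[] args) {
    Random random = new Random(1L);
    int trials = 10000;
    long pickedOnce = 0;
    long pickedTwice = 0;
    for (int i = 0; i < trials; i++) {
      int length = random.nextInt(11); // the caller picks a length in [0, 10]
      pickedOnce += exactly(random, length).length();
      pickedTwice += upTo(random, length).length(); // the callee picks again
    }
    System.out.println("average length, exact: " + (double) pickedOnce / trials);
    System.out.println("average length, max:   " + (double) pickedTwice / trials);
  }
}
```

With a length drawn from [0, 10], the "exact" variant averages about half the maximum, while the "max" variant averages about a quarter of it.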
[jira] [Resolved] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-3911.
---------------------------------
Resolution: Fixed
Fix Version/s: 4.0, 3.6
I think this is much better. If you want to see what the test strings look like now, have a look at:
ant test -Dtestcase=TestMockAnalyzer -Dtestmethod=testRandomStrings -Dtests.verbose=true
[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237561#comment-13237561 ]
Robert Muir commented on LUCENE-3911:
-------------------------------------
I committed this. I have one more minor improvement (to make the randomRealistic more realistic).
I'll put up a patch.
Long term (for another day, another issue), I think we should actually refactor this stuff with LineDocs
so that LineDocs can return 'synthetic' linedocs too; that way non-analysis tests can use this as well.
[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3911:
--------------------------------
Attachment: LUCENE-3911.patch
Attached is a patch. It also fixes off-by-one length bugs in all the _TestUtil string generation methods :)
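The patch itself is not shown in this thread, so the exact bugs are not known here. One common shape for an off-by-one in random length selection, given purely as an assumed example, is passing the maximum straight to java.util.Random.nextInt, whose bound is exclusive. The class and method names below are hypothetical.

```java
import java.util.Random;

/** Hypothetical illustration of an exclusive-bound off-by-one; not the actual patch. */
final class OffByOneSketch {

  /** Buggy variant: yields [0, maxLength - 1] and throws if maxLength is 0. */
  static int lengthExclusive(Random random, int maxLength) {
    return random.nextInt(maxLength);
  }

  /** Fixed variant: yields [0, maxLength], so the maximum can actually occur. */
  static int lengthInclusive(Random random, int maxLength) {
    return random.nextInt(maxLength + 1);
  }

  public static void main(String[] args) {
    Random random = new Random(0L);
    int maxLength = 5;
    int largestExclusive = 0;
    int largestInclusive = 0;
    for (int i = 0; i < 10000; i++) {
      largestExclusive = Math.max(largestExclusive, lengthExclusive(random, maxLength));
      largestInclusive = Math.max(largestInclusive, lengthInclusive(random, maxLength));
    }
    System.out.println("largest length with exclusive bound: " + largestExclusive);
    System.out.println("largest length with inclusive bound: " + largestInclusive);
  }
}
```

The exclusive variant can never return maxLength itself, so strings of the full requested length are never tested.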
[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3911:
--------------------------------
Attachment: LUCENE-3911_more.patch
Trivial patch: forces us to pass minLength as well to randomRealistic, so in that case we get whole words in the same Unicode block (good for stemmers). It also sometimes uses randomRegexpIshString, so we get lots of punctuation (good for tokenizers, filters, etc.).
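A minimal sketch of the two ideas in this comment, assuming a small hard-coded set of letter ranges and a made-up punctuation alphabet. It is not the implementation of randomRealistic or randomRegexpIshString, and all names and the one-in-five ratio below are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch; not the Lucene implementation. */
final class SameBlockWords {

  // Inclusive [start, end] ranges of lowercase letters, one per Unicode block.
  private static final int[][] LETTER_RANGES = {
    {0x0061, 0x007A}, // Basic Latin: a to z
    {0x03B1, 0x03C9}, // Greek: alpha to omega
    {0x0430, 0x044F}, // Cyrillic: a to ya
  };

  private static final String PUNCTUATION = ".,;:!?()[]{}*+|\\-";

  /** Picks a length uniformly in [minLength, maxLength]. */
  private static int randomLength(Random random, int minLength, int maxLength) {
    return minLength + random.nextInt(maxLength - minLength + 1);
  }

  /** Returns a word of length [minLength, maxLength] drawn from a single range. */
  static String randomWord(Random random, int minLength, int maxLength) {
    int[] range = LETTER_RANGES[random.nextInt(LETTER_RANGES.length)];
    int length = randomLength(random, minLength, maxLength);
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.appendCodePoint(range[0] + random.nextInt(range[1] - range[0] + 1));
    }
    return sb.toString();
  }

  /** Returns a string of the given length drawn only from PUNCTUATION. */
  static String randomPunctuation(Random random, int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append(PUNCTUATION.charAt(random.nextInt(PUNCTUATION.length())));
    }
    return sb.toString();
  }

  /** About one time in five returns punctuation, otherwise a same-block word. */
  static String randomToken(Random random, int minLength, int maxLength) {
    if (random.nextInt(5) == 0) {
      return randomPunctuation(random, randomLength(random, minLength, maxLength));
    }
    return randomWord(random, minLength, maxLength);
  }

  public static void main(String[] args) {
    Random random = new Random(11L);
    for (int i = 0; i < 10; i++) {
      System.out.println(randomToken(random, 3, 7));
    }
  }
}
```

Passing a minimum length keeps every word long enough to be a plausible stemmer input, and keeping each word inside one block avoids mixed-script words that no language-specific filter would expect.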
[jira] [Commented] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237547#comment-13237547 ]
Michael McCandless commented on LUCENE-3911:
--------------------------------------------
Looks great!
[jira] [Updated] (LUCENE-3911) improve BaseTokenStreamTestCase
random string generation
Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3911:
--------------------------------
Attachment: LUCENE-3911.patch