Posted to issues@commons.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/01/26 18:30:38 UTC

[jira] [Created] (CODEC-132) BeiderMorseEncoder OOM issues

BeiderMorseEncoder OOM issues
-----------------------------

                 Key: CODEC-132
                 URL: https://issues.apache.org/jira/browse/CODEC-132
             Project: Commons Codec
          Issue Type: Bug
    Affects Versions: 1.6
            Reporter: Robert Muir
         Attachments: CODEC-132_test.patch

In Lucene/Solr, we integrated this encoder into the latest release.

Our tests use a variety of random strings, and we have had recent Jenkins failures
in which some input streams (of length <= 10) used huge amounts of memory (e.g. > 64MB),
resulting in OOM.

I've created a test case (length is 30 here) that will OOM with -Xmx256M. 

I haven't dug into this much as to what's causing it, but I suspect there might be a bug
revolving around certain punctuation characters: we didn't see this happening until
we beefed up our random string generation to start producing "html-like" strings.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Gary D. Gregory (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222097#comment-13222097 ] 

Gary D. Gregory commented on CODEC-132:
---------------------------------------

Thank you for digging, Thomas.

Feel free to provide a patch :) with tests.

It would be great if Matthew P. could comment here as well.

Gary
                

[jira] [Updated] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated CODEC-132:
------------------------------

    Attachment: CODEC-132_test.patch

Attached is the first test I came up with... blows up easily with something like -Xmx64M or -Xmx128M. 

It will eventually fail with -Xmx256M too, but takes minutes to do this.

                

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232327#comment-13232327 ] 

Thomas Neidhart commented on CODEC-132:
---------------------------------------

I have tested it with junit-benchmarks and my results are as follows:

With HashSet:

{noformat}
BeiderMorseEncoderTest.testAllChars: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 2.85 [+- 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 1213, GC.time: 0.29, time.total: 43.66, time.warmup: 15.20, time.bench: 28.45
BeiderMorseEncoderTest.testAsciiEncodeNotEmpty1Letter: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.00, time.total: 0.06, time.warmup: 0.02, time.bench: 0.04
BeiderMorseEncoderTest.testAsciiEncodeNotEmpty2Letters: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 72, GC.time: 0.02, time.total: 2.79, time.warmup: 0.98, time.bench: 1.82
BeiderMorseEncoderTest.testEncodeAtzNotEmpty: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.02, time.warmup: 0.01, time.bench: 0.01
BeiderMorseEncoderTest.testEncodeGna: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLangIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLangIllegalStateException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLanguageIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testLongestEnglishSurname: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.00, time.total: 0.10, time.warmup: 0.04, time.bench: 0.07
BeiderMorseEncoderTest.testNegativeIndexForRuleMatchIndexOutOfBoundsException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testOOM: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 2, GC.time: 0.00, time.total: 0.12, time.warmup: 0.04, time.bench: 0.08
BeiderMorseEncoderTest.testSetConcat: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetNameTypeAsh: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetRuleTypeExact: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetRuleTypeToRulesIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSpeedCheck: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.41 [+- 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 91, GC.time: 0.04, time.total: 6.13, time.warmup: 2.04, time.bench: 4.09
BeiderMorseEncoderTest.testSpeedCheck2: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.28 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 48, GC.time: 0.01, time.total: 4.14, time.warmup: 1.38, time.bench: 2.76
BeiderMorseEncoderTest.testSpeedCheck3: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.54 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 140, GC.time: 0.05, time.total: 8.09, time.warmup: 2.69, time.bench: 5.39
BeiderMorseEncoderTest.testEncodeEmpty: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testEncodeNull: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testEncodeWithInvalidObject: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testLocaleIndependence: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00
{noformat}

With LinkedHashSet:

{noformat}
BeiderMorseEncoderTest.testAllChars: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 2.87 [+- 0.01], round.gc: 0.00 [+- 0.00], GC.calls: 1194, GC.time: 0.29, time.total: 44.02, time.warmup: 15.35, time.bench: 28.67
BeiderMorseEncoderTest.testAsciiEncodeNotEmpty1Letter: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.00, time.total: 0.06, time.warmup: 0.02, time.bench: 0.04
BeiderMorseEncoderTest.testAsciiEncodeNotEmpty2Letters: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.18 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 66, GC.time: 0.02, time.total: 2.85, time.warmup: 1.01, time.bench: 1.84
BeiderMorseEncoderTest.testEncodeAtzNotEmpty: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.00, time.total: 0.02, time.warmup: 0.01, time.bench: 0.01
BeiderMorseEncoderTest.testEncodeGna: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLangIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLangIllegalStateException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testInvalidLanguageIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testLongestEnglishSurname: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 1, GC.time: 0.00, time.total: 0.10, time.warmup: 0.03, time.bench: 0.07
BeiderMorseEncoderTest.testNegativeIndexForRuleMatchIndexOutOfBoundsException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testOOM: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.01 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 3, GC.time: 0.00, time.total: 0.12, time.warmup: 0.04, time.bench: 0.08
BeiderMorseEncoderTest.testSetConcat: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetNameTypeAsh: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetRuleTypeExact: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSetRuleTypeToRulesIllegalArgumentException: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testSpeedCheck: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.31 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 97, GC.time: 0.03, time.total: 4.59, time.warmup: 1.54, time.bench: 3.06
BeiderMorseEncoderTest.testSpeedCheck2: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.27 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 81, GC.time: 0.02, time.total: 4.03, time.warmup: 1.34, time.bench: 2.69
BeiderMorseEncoderTest.testSpeedCheck3: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.46 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 149, GC.time: 0.04, time.total: 6.92, time.warmup: 2.31, time.bench: 4.61
BeiderMorseEncoderTest.testEncodeEmpty: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testEncodeNull: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testEncodeWithInvalidObject: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.00, time.warmup: 0.00, time.bench: 0.00
BeiderMorseEncoderTest.testLocaleIndependence: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00, time.total: 0.01, time.warmup: 0.00, time.bench: 0.00
{noformat}

The speed tests are even faster with LinkedHashSet for some reason, but in general I do not think the data structure makes much of a difference, as the number of phonemes is now limited to 20 by default anyway.
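
For readers less familiar with the two collections compared above: both reject duplicates, and the documented difference is iteration order. The following is only a minimal sketch of that general Java behaviour. The class name, helper method and strings are made up for illustration, and nothing in this thread states whether the patch relies on the ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical demo class, not part of Commons Codec. */
final class SetOrderDemo {

    /** Returns the first {@code limit} elements in the set's own iteration order. */
    static List<String> firstN(Set<String> set, int limit) {
        List<String> out = new ArrayList<>();
        for (String s : set) {
            if (out.size() == limit) {
                break;
            }
            out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> inserted = Arrays.asList("zeta", "alpha", "mu", "beta");
        // LinkedHashSet iterates in insertion order, so a cut-off is predictable.
        System.out.println(firstN(new LinkedHashSet<>(inserted), 2)); // [zeta, alpha]
        // HashSet makes no ordering guarantee, so which two come first is unspecified.
        System.out.println(firstN(new HashSet<>(inserted), 2));
    }
}
```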
                

[jira] [Updated] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Gary D. Gregory (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary D. Gregory updated CODEC-132:
----------------------------------

    Fix Version/s:     (was: 1.6.1)
                   1.7
    

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224619#comment-13224619 ] 

Robert Muir commented on CODEC-132:
-----------------------------------

Thomas: I haven't tested your patch with Lucene/Solr, but I'm +1 on premise.

In reality the random testing we do may seem absurd... yes, in a way it's totally unrealistic.

On the other hand, if someone is indexing/crawling data, oftentimes this type-detection of either file-type or character set
or whatever is really just a heuristic: it's really impossible to ultimately prevent the indexing of some binary file
(e.g. misdetected character set or simply a video file or whatever). This is part of why we do the testing we do.

Thanks again everyone for digging in and reviewing.

                

[jira] [Issue Comment Edited] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223729#comment-13223729 ] 

Thomas Neidhart edited comment on CODEC-132 at 3/6/12 10:13 PM:
----------------------------------------------------------------

Hi,

please find attached a patch for the outlined solution: adding a maximum phoneme parameter to the engine that limits the number of phonemes processed / returned.

For now, I have assumed a default of 20 if the user does not provide a value. I would like to hear some feedback from the original author on that.

Ah, test coverage improved from 91% to 92% ;-)
                

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224729#comment-13224729 ] 

Thomas Neidhart commented on CODEC-132:
---------------------------------------

@Matthew: thanks for your feedback. I had some experience with similar rule-based systems before, and knew that they can become very fragile with unforeseen input as the number of rules grows (especially generic ones). Anyway, your code was easy to debug, and nice to read!

@Robert: your tests make perfect sense imo, thanks for reporting back.

@Gary: thanks for applying the patch
                

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221933#comment-13221933 ] 

Thomas Neidhart commented on CODEC-132:
---------------------------------------

I dug into this problem, and it is not related to punctuation or other special characters.

There are some generic rules defined that blow up the set of possible phonemes, e.g.:

"a" "" "" "(e|o|a)" // hat | call | part

Considering you provide random data as input, this single rule will most likely match every single 'a' in the input, and triple the set of phonemes on every occasion. This quickly leads to very large sets and, of course, to OOMs.
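
To put rough numbers on that growth, here is a minimal sketch. It assumes, as a simplification, that every match multiplies the candidate set by the number of alternatives and that no duplicates collapse, so it gives an upper bound. The class and method names are hypothetical and are not part of Commons Codec.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper, not part of Commons Codec. */
final class PhonemeGrowth {

    /**
     * Upper bound on the number of candidate phonemes when every occurrence
     * of {@code trigger} expands into {@code alternatives} choices.
     */
    static BigInteger candidateCount(String input, char trigger, int alternatives) {
        long matches = input.chars().filter(c -> c == trigger).count();
        return BigInteger.valueOf(alternatives).pow((int) matches);
    }

    public static void main(String[] args) {
        // Worst case for a 30-character input: every character matches the rule.
        char[] chars = new char[30];
        Arrays.fill(chars, 'a');
        String worstCase = new String(chars);
        System.out.println(candidateCount(worstCase, 'a', 3)); // 3^30 = 205891132094649
    }
}
```

Even a much smaller exponent is enough to exhaust a heap of a few hundred megabytes once each candidate is a separate string.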

I would not consider touching the rules, but would instead add a parameter to the PhoneticEngine that defines the maximum number of different phonemes I want in the result, limiting the number of new phonemes in PhonemeBuilder.apply to this maximum.

For normal text, the number of phonemes is usually small anyway, so a default of 20 sounds reasonable, but it should be user-controllable.
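
The idea of capping the expansion can be sketched as follows. This is a self-contained illustration only, not the attached patch: the class, the method signature and the choice of simply dropping the remaining combinations once the limit is reached are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a capped expansion; all names are illustrative. */
final class CappedExpansion {

    /** Default limit proposed in this thread. */
    static final int DEFAULT_MAX_PHONEMES = 20;

    /**
     * Appends every alternative to every current candidate, but stops
     * adding new candidates once maxPhonemes have been collected.
     */
    static Set<String> apply(Set<String> current, List<String> alternatives, int maxPhonemes) {
        Set<String> next = new LinkedHashSet<>();
        for (String prefix : current) {
            for (String alt : alternatives) {
                if (next.size() >= maxPhonemes) {
                    return next; // cap reached: the remaining combinations are dropped
                }
                next.add(prefix + alt);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<String> alternatives = Arrays.asList("e", "o", "a");
        Set<String> phonemes = new LinkedHashSet<>();
        phonemes.add(""); // start with a single empty candidate
        for (int i = 0; i < 30; i++) { // thirty matches of the generic rule
            phonemes = apply(phonemes, alternatives, DEFAULT_MAX_PHONEMES);
        }
        System.out.println(phonemes.size()); // 20, instead of 3^30 without the cap
    }
}
```

With the cap in place, memory use is bounded by the limit rather than by the input, at the cost of discarding some candidates for pathological inputs.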

By the way, you could also consider setting the parameter concat to false; in that case each word is treated separately, which should mitigate the problem a bit, as single words are shorter and thus do not suffer so much from the phoneme explosion.
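
A rough way to see why treating words separately helps, under the same simplifying assumption that each 'a' triples the candidate set: the counts multiply within one string but only add up across separately processed words. The class and method names below are hypothetical, and the code does not call the encoder itself.

```java
/** Hypothetical illustration; the tripling rule is the assumption described above. */
final class ConcatComparison {

    /** Candidates for one string if each 'a' triples the set: 3^(number of 'a'). */
    static long candidates(String text) {
        long count = 1; // a long is enough for this small example
        for (char c : text.toCharArray()) {
            if (c == 'a') {
                count *= 3;
            }
        }
        return count;
    }

    /** Total candidates when each whitespace-separated word is expanded on its own. */
    static long candidatesPerWord(String text) {
        long total = 0;
        for (String word : text.trim().split("\\s+")) {
            total += candidates(word);
        }
        return total;
    }

    public static void main(String[] args) {
        String input = "banana bandana";
        System.out.println(candidates(input));        // whole input: 3^6 = 729
        System.out.println(candidatesPerWord(input)); // per word: 27 + 27 = 54
    }
}
```

This only mitigates the problem: a single long word with many matches still explodes on its own.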
                

[jira] [Resolved] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Gary D. Gregory (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary D. Gregory resolved CODEC-132.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.1

Patch applied in SVN, plus some other minor test improvements.
                

[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Matthew Pocock (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224610#comment-13224610 ] 

Matthew Pocock commented on CODEC-132:
--------------------------------------

Hi,

Limiting the size of the set of intermediate phonemes considered is probably a good thing for this kind of random-string testing, and may well have no discernible negative impact in normal use. The rules are not really intended to apply to random strings, and words from languages (and in particular, names) are very much not random.

I've not run a corpus of real names through this code to estimate the normal range of this phoneme set size. If we start seeing incomplete or strange results after this change, perhaps it would be worth doing.

Matthew


                


[jira] [Updated] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Neidhart updated CODEC-132:
----------------------------------

    Attachment: CODEC-132.patch

Hi,

please find attached a patch for the outlined solution: adding a maximum phoneme parameter to the engine that limits the number of phonemes processed / returned.

For now, I have assumed a default of 20 if the user does not provide a value. I would like to hear some feedback from the original author on that.
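
A rough sketch of how such a parameter with a default might be exposed, assuming a constructor overload supplies the default of 20 mentioned above. SketchEngine, its constructors, and the limit method are hypothetical and do not reproduce the attached patch or the Commons Codec API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical engine; the class and method names are illustrative only. */
final class SketchEngine {

    /** Default cap, using the value of 20 proposed in the comment. */
    static final int DEFAULT_MAX_PHONEMES = 20;

    private final int maxPhonemes;

    /** Uses the default cap when the caller does not supply one. */
    SketchEngine() {
        this(DEFAULT_MAX_PHONEMES);
    }

    /** Uses the cap supplied by the caller. */
    SketchEngine(int maxPhonemes) {
        this.maxPhonemes = maxPhonemes;
    }

    int getMaxPhonemes() {
        return maxPhonemes;
    }

    /** Returns at most maxPhonemes entries, preserving their order. */
    List<String> limit(List<String> phonemes) {
        int end = Math.max(0, Math.min(maxPhonemes, phonemes.size()));
        return new ArrayList<>(phonemes.subList(0, end));
    }

    public static void main(String[] args) {
        System.out.println(new SketchEngine().getMaxPhonemes()); // prints 20
        // Made-up data: three candidates, cap of two.
        System.out.println(new SketchEngine(2).limit(Arrays.asList("a", "b", "c"))); // prints [a, b]
    }
}
```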
                


[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232322#comment-13232322 ] 

Thomas Neidhart commented on CODEC-132:
---------------------------------------

Applied a minor change because of the referenced test in Lucene:

Use a LinkedHashSet instead of a HashSet to make the ordering of phonemes deterministic.
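
A small sketch of the difference, using made-up values rather than real phonemes. PhonemeOrdering and insertionOrdered are hypothetical names. The assumption here is that ordering matters because, once the set is capped, the iteration order decides which entries are kept and in what order they are returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of set iteration order. */
final class PhonemeOrdering {

    /** Collects values into a LinkedHashSet and returns them in iteration order. */
    static List<String> insertionOrdered(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("c", "a", "b", "a");

        // HashSet makes no guarantee about iteration order, so code that
        // depends on it can behave differently across JVMs or versions.
        Set<String> hashed = new HashSet<>(values);
        System.out.println(hashed.size()); // prints 3

        // LinkedHashSet iterates in insertion order; the duplicate "a" is dropped.
        System.out.println(insertionOrdered(values)); // prints [c, a, b]
    }
}
```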
                


[jira] [Issue Comment Edited] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Thomas Neidhart (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223729#comment-13223729 ] 

Thomas Neidhart edited comment on CODEC-132 at 3/6/12 10:35 PM:
----------------------------------------------------------------

Hi,

please find attached a patch for the outlined solution: adding a maximum phoneme parameter to the engine that limits the number of phonemes processed / returned.

For now, I have assumed a default of 20 if the user does not provide a value. I would like to hear some feedback from the original author on that.

Ah, test coverage improved from 91% to 92% ;-)
                
      was (Author: tn):
    Hi,

please find attached a patch for the outlined solution: addind a maximum phoneme parameter to the engine that limits the number of phonemes processed / returned.

By now, I have assumed a default of 20, if the user does not provide a value himself. Would like to hear some feedback from the original author on that.

Ah, testcoverage improved from 91% to 92% ;-)
                  


[jira] [Closed] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Gary D. Gregory (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary D. Gregory closed CODEC-132.
---------------------------------


Released in 1.7 today.
                


[jira] [Commented] (CODEC-132) BeiderMorseEncoder OOM issues

Posted by "Gary D. Gregory (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232323#comment-13232323 ] 

Gary D. Gregory commented on CODEC-132:
---------------------------------------

What is the performance impact of this change?
                
