Posted to issues@opennlp.apache.org by "Catalin Mititelu (Created) (JIRA)" <ji...@apache.org> on 2011/11/21 15:24:51 UTC

[jira] [Created] (OPENNLP-397) IndexHashTable can be improved

IndexHashTable can be improved
------------------------------

                 Key: OPENNLP-397
                 URL: https://issues.apache.org/jira/browse/OPENNLP-397
             Project: OpenNLP
          Issue Type: Improvement
          Components: Maxent
    Affects Versions: maxent-3.0.3-incubating
            Reporter: Catalin Mititelu
            Priority: Minor
         Attachments: patch-IndexHashTable.txt

Running a profiler on POSTagger with a maxent model showed a lot of CPU time spent in the IndexHashTable class. This class can be optimized to be faster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154221#comment-13154221 ] 

Joern Kottmann commented on OPENNLP-397:
----------------------------------------

Does ebooks.txt contain one sentence per line?
In my tests I usually got a throughput of 1000 sentences per second, but I only have a Core 2 Duo CPU in my MacBook.

It might also depend on the data file; I usually used the English 300K sentence file from the Leipzig corpus.
                
> IndexHashTable can be improved
> ------------------------------
>
>                 Key: OPENNLP-397
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-397
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Maxent
>    Affects Versions: maxent-3.0.3-incubating
>            Reporter: Catalin Mititelu
>            Priority: Minor
>              Labels: patch
>         Attachments: patch-IndexHashTable.txt
>
>
> Running a profiler on POSTagger with a maxent model showed a lot of CPU time spent in the IndexHashTable class. This class can be optimized to be faster.


        

[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved

Posted by "Catalin Mititelu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154227#comment-13154227 ] 

Catalin Mititelu commented on OPENNLP-397:
------------------------------------------

I used UTF-8 plain-text ebooks from Project Gutenberg without any pre-processing. A sentence can span several lines.
                

        

[jira] [Updated] (OPENNLP-397) IndexHashTable can be improved

Posted by "Catalin Mititelu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Catalin Mititelu updated OPENNLP-397:
-------------------------------------

    Attachment: patch-IndexHashTable.txt

This attachment fixes this issue.
                

        

[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved

Posted by "Catalin Mititelu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154219#comment-13154219 ] 

Catalin Mititelu commented on OPENNLP-397:
------------------------------------------

I used a profiler to find out why POS tagging is "so slow". I also ran some tests before and after the patch. I'm running on an i7 machine with 16 GB of memory, and the model used is en-pos-maxent.bin. The test file is about 13M. The results are as follows:
Before (3 steps): 
1st step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent.txt
Loading POS Tagger model ... done (1.192s)
Average: 3285.3 sent/s 
Total: 281320 sent
Runtime: 85.629s


2nd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent2.txt
Loading POS Tagger model ... done (1.136s)
Average: 3926.6 sent/s 
Total: 281320 sent
Runtime: 71.644s


3rd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent3.txt
Loading POS Tagger model ... done (0.930s)
Average: 3952.2 sent/s 
Total: 281320 sent
Runtime: 71.181s


After patch (using a HashMap) again in 3 steps:

1st step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched.txt
Loading POS Tagger model ... done (0.920s)
Average: 5711.3 sent/s 
Total: 281320 sent
Runtime: 49.257s


2nd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched2.txt
Loading POS Tagger model ... done (0.927s)
Average: 5739.8 sent/s 
Total: 281320 sent
Runtime: 49.012s

3rd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched3.txt
Loading POS Tagger model ... done (0.928s)
Average: 5716.5 sent/s 
Total: 281320 sent
Runtime: 49.212s
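
The three runs before and after the patch can be summarized with a little arithmetic. The sketch below averages the reported "Average" figures and prints their ratio. The class and method names are made up for illustration and are not part of OpenNLP. With the numbers above, the patched build comes out roughly 1.5 times faster, although the first unpatched run is noticeably slower than the other two and may include warm-up effects.

```java
import java.util.Arrays;

final class ThroughputComparison {

    // Mean of the reported "Average" throughput values (sentences per second).
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Ratio of the mean patched throughput to the mean unpatched throughput.
    static double speedup(double[] before, double[] after) {
        return mean(after) / mean(before);
    }

    public static void main(String[] args) {
        // Figures copied from the six runs listed above.
        double[] before = {3285.3, 3926.6, 3952.2};
        double[] after = {5711.3, 5739.8, 5716.5};

        System.out.printf("mean before: %.1f sent/s%n", mean(before));
        System.out.printf("mean after:  %.1f sent/s%n", mean(after));
        System.out.printf("speedup:     %.2fx%n", speedup(before, after));
    }
}
```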


I don't have any information about how much memory is needed.

Regards,
Catalin
                

        

[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154209#comment-13154209 ] 

Joern Kottmann commented on OPENNLP-397:
----------------------------------------

We used a java.util.HashMap before but got a big performance regression because it needs more memory than our current solution. I believe this is because the map no longer fits into the CPU cache.

Did you just run a profiler, or could you also measure an actual speedup in throughput? Which CPU does your machine have? And which model did you use?

I also took measurements to compare java.util.HashMap with our implementation and couldn't measure a difference.
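
To illustrate the memory argument: a java.util.HashMap from String to Integer stores every mapping in its own node object and boxes the int values, whereas a table built on two flat arrays avoids both. The sketch below is a minimal open-addressing table with linear probing. It is an assumption about what such a compact structure might look like, not the actual IndexHashTable source, and the class name CompactIndexTable is hypothetical.

```java
final class CompactIndexTable {

    private final String[] keys;  // null marks an empty slot
    private final int[] values;   // index assigned to the key in the same slot

    // Maps each string in 'mapping' to its array position.
    CompactIndexTable(String[] mapping, double loadFactor) {
        if (loadFactor <= 0 || loadFactor > 1) {
            throw new IllegalArgumentException("loadFactor must be in (0, 1]");
        }
        // Keep at least one slot free so that probing for a missing key terminates.
        int capacity = Math.max(mapping.length + 1,
                (int) Math.ceil(mapping.length / loadFactor));
        keys = new String[capacity];
        values = new int[capacity];
        for (int i = 0; i < mapping.length; i++) {
            int slot = findSlot(mapping[i]);
            keys[slot] = mapping[i];
            values[slot] = i; // a duplicate key keeps its last position
        }
    }

    // Linear probing: start at the hashed slot and walk forward
    // until the key or an empty slot is found.
    private int findSlot(String key) {
        int slot = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[slot] != null && !keys[slot].equals(key)) {
            slot = (slot + 1) % keys.length;
        }
        return slot;
    }

    // Returns the index stored for the key, or -1 if the key is absent.
    int get(String key) {
        int slot = findSlot(key);
        return keys[slot] == null ? -1 : values[slot];
    }

    public static void main(String[] args) {
        CompactIndexTable table =
                new CompactIndexTable(new String[] {"NN", "VB", "DT"}, 0.7);
        System.out.println(table.get("VB"));
        System.out.println(table.get("JJ"));
    }
}
```

Whether the smaller footprint or the lower collision rate of a larger table wins in practice depends on the model size and the CPU, which is why the thread relies on measurements.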


                

        

[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved

Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154234#comment-13154234 ] 

Joern Kottmann commented on OPENNLP-397:
----------------------------------------

The POS Tagger assumes that its input is tokenized (meaning tokens are whitespace-separated) and contains one sentence per line.
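
As a rough illustration of that input format, the sketch below turns hard-wrapped text into whitespace-tokenized lines with one sentence each. It uses naive regular expressions as a stand-in for a real sentence detector and tokenizer, so abbreviations, quotes and similar cases will be handled badly. The class TaggerInputSketch is hypothetical and not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggerInputSketch {

    // Joins hard-wrapped lines, splits after '.', '!' or '?' followed by
    // whitespace, and separates punctuation from the preceding word.
    static List<String> toSentenceLines(String wrappedText) {
        String flat = wrappedText.replaceAll("\\s+", " ").trim();
        List<String> lines = new ArrayList<>();
        if (flat.isEmpty()) {
            return lines;
        }
        for (String sentence : flat.split("(?<=[.!?])\\s+")) {
            // Naive tokenization: insert a space before each punctuation mark.
            String tokenized = sentence.replaceAll("([.,!?;:])", " $1")
                    .replaceAll("\\s+", " ")
                    .trim();
            lines.add(tokenized);
        }
        return lines;
    }

    public static void main(String[] args) {
        String wrapped = "The quick fox\njumped. It ran\naway, fast!";
        for (String line : toSentenceLines(wrapped)) {
            System.out.println(line);
        }
    }
}
```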

Anyway I will redo my measurements on the test data I used before. The actual data shouldn't make a big difference when used for both measurements (as you did).

As far as I know, java.util.HashMap always uses a power of two for the array size, whereas we use the load factor directly. When the array is larger, the map is usually faster because you get fewer collisions.
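
The effect of the two sizing strategies can be sketched with a small calculation. The load factor of 0.7 and the entry count of 1000 are assumptions chosen for illustration, and the helper names are hypothetical. Rounding up to a power of two leaves the array emptier for the same number of entries, which tends to mean fewer collisions at the cost of more memory.

```java
final class TableSizing {

    // Smallest power of two that is >= n.
    static int nextPowerOfTwo(int n) {
        if (n <= 1) {
            return 1;
        }
        return Integer.highestOneBit(n - 1) << 1;
    }

    // Array size when the load factor is applied directly, without rounding.
    static int directSize(int entries, double loadFactor) {
        return (int) Math.ceil(entries / loadFactor);
    }

    // Array size when the direct size is rounded up to a power of two.
    static int powerOfTwoSize(int entries, double loadFactor) {
        return nextPowerOfTwo(directSize(entries, loadFactor));
    }

    public static void main(String[] args) {
        int entries = 1000;       // assumed number of keys
        double loadFactor = 0.7;  // assumed load factor

        int direct = directSize(entries, loadFactor);
        int rounded = powerOfTwoSize(entries, loadFactor);

        System.out.printf("direct size:       %d (occupancy %.2f)%n",
                direct, (double) entries / direct);
        System.out.printf("power-of-two size: %d (occupancy %.2f)%n",
                rounded, (double) entries / rounded);
    }
}
```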
                