Posted to issues@opennlp.apache.org by "Catalin Mititelu (Created) (JIRA)" <ji...@apache.org> on 2011/11/21 15:24:51 UTC
[jira] [Created] (OPENNLP-397) IndexHashTable can be improved
IndexHashTable can be improved
------------------------------
Key: OPENNLP-397
URL: https://issues.apache.org/jira/browse/OPENNLP-397
Project: OpenNLP
Issue Type: Improvement
Components: Maxent
Affects Versions: maxent-3.0.3-incubating
Reporter: Catalin Mititelu
Priority: Minor
Attachments: patch-IndexHashTable.txt
Running a profiler on POSTagger with a maxent model showed a lot of CPU usage in the IndexHashTable class. This class can be optimized to be faster.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved
Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154221#comment-13154221 ]
Joern Kottmann commented on OPENNLP-397:
----------------------------------------
Does ebooks.txt contain one sentence per line?
In my tests I usually got a throughput of about 1000 sentences per second, but I only have a Core 2 Duo CPU in my MacBook.
It might also depend on the data file; I usually used the English 300K sentence file from the Leipzig corpus.
> IndexHashTable can be improved
> ------------------------------
>
> Key: OPENNLP-397
> URL: https://issues.apache.org/jira/browse/OPENNLP-397
> Project: OpenNLP
> Issue Type: Improvement
> Components: Maxent
> Affects Versions: maxent-3.0.3-incubating
> Reporter: Catalin Mititelu
> Priority: Minor
> Labels: patch
> Attachments: patch-IndexHashTable.txt
>
>
> Running a profiler on POSTagger with a maxent model showed a lot of CPU usage in the IndexHashTable class. This class can be optimized to be faster.
[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved
Posted by "Catalin Mititelu (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154227#comment-13154227 ]
Catalin Mititelu commented on OPENNLP-397:
------------------------------------------
I used UTF-8 plain-text ebooks from Project Gutenberg without any pre-processing. A sentence can span several lines.
[jira] [Updated] (OPENNLP-397) IndexHashTable can be improved
Posted by "Catalin Mititelu (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Catalin Mititelu updated OPENNLP-397:
-------------------------------------
Attachment: patch-IndexHashTable.txt
This attachment fixes the issue.
[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved
Posted by "Catalin Mititelu (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154219#comment-13154219 ]
Catalin Mititelu commented on OPENNLP-397:
------------------------------------------
I used a profiler to find out why POS tagging is "so slow". I also ran some tests before and after the patch. I'm running on an i7 machine with 16GB of memory, and the model used is en-pos-maxent.bin. The test file is about 13M; the results follow:
Before (3 steps):
1st step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent.txt
Loading POS Tagger model ... done (1.192s)
Average: 3285.3 sent/s
Total: 281320 sent
Runtime: 85.629s
2nd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent2.txt
Loading POS Tagger model ... done (1.136s)
Average: 3926.6 sent/s
Total: 281320 sent
Runtime: 71.644s
3rd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent3.txt
Loading POS Tagger model ... done (0.930s)
Average: 3952.2 sent/s
Total: 281320 sent
Runtime: 71.181s
After the patch (using a HashMap), again in 3 steps:
1st step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched.txt
Loading POS Tagger model ... done (0.920s)
Average: 5711.3 sent/s
Total: 281320 sent
Runtime: 49.257s
2nd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched2.txt
Loading POS Tagger model ... done (0.927s)
Average: 5739.8 sent/s
Total: 281320 sent
Runtime: 49.012s
3rd step
bin/opennlp POSTagger models/en-pos-maxent.bin <samples/ebooks.txt >samples/ebooks-en-pos-maxent-patched3.txt
Loading POS Tagger model ... done (0.928s)
Average: 5716.5 sent/s
Total: 281320 sent
Runtime: 49.212s
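For reference, the runtimes reported above work out to a speedup of roughly 1.5x. A minimal sketch of that arithmetic follows; the class and method names are illustrative and not part of OpenNLP, and the numbers are copied from the runs listed in this comment.

```java
import java.util.Arrays;

// Illustrative helper (hypothetical name): summarizes the runtimes quoted in this comment.
final class ThroughputSummary {

    // Arithmetic mean of the given values.
    static double mean(double... values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double sentences = 281320;                    // "Total" reported by every run
        double[] before = {85.629, 71.644, 71.181};   // runtimes in seconds, unpatched
        double[] after = {49.257, 49.012, 49.212};    // runtimes in seconds, patched

        double meanBefore = mean(before);
        double meanAfter = mean(after);

        System.out.printf("before: %.3f s mean runtime, %.1f sent/s%n",
                meanBefore, sentences / meanBefore);
        System.out.printf("after:  %.3f s mean runtime, %.1f sent/s%n",
                meanAfter, sentences / meanAfter);
        System.out.printf("speedup: %.2fx%n", meanBefore / meanAfter);
    }
}
```

Note that the first unpatched run is noticeably slower than the other two, so comparing only the second and third runs of each series gives a somewhat smaller ratio.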
I don't have any information about how much memory is needed.
Regards,
Catalin
[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved
Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154209#comment-13154209 ]
Joern Kottmann commented on OPENNLP-397:
----------------------------------------
We used a java.util.HashMap before, but it caused a big performance regression because it needs more memory than our current solution. I believe this is because the map no longer fits into the CPU cache.
Did you just run a profiler, or could you also measure an actual speedup in throughput? Which CPU does your machine have? And which model did you use?
I also ran measurements comparing java.util.HashMap with our implementation and couldn't measure a difference.
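To illustrate the memory argument, here is a minimal sketch of a compact string-to-index table that keeps keys and indices in two flat arrays and resolves collisions by linear probing. It is a hypothetical illustration, not the OpenNLP IndexHashTable source; the class name, constructor, and probing scheme are assumptions. A java.util.HashMap<String, Integer> would additionally allocate one entry object per mapping and store boxed Integer values.

```java
// Hypothetical sketch, NOT the OpenNLP implementation: a read-only table that maps
// each string in an array to its position in that array.
final class CompactIndexTable {

    private final String[] keys;  // null marks an empty slot
    private final int[] values;   // index stored for the key in the same slot

    CompactIndexTable(String[] mapping, double loadFactor) {
        if (loadFactor <= 0 || loadFactor > 1) {
            throw new IllegalArgumentException("loadFactor must be in (0, 1]");
        }
        // Always strictly larger than mapping.length, so at least one slot stays
        // empty and the probing loop below is guaranteed to terminate.
        int capacity = (int) (mapping.length / loadFactor) + 1;
        keys = new String[capacity];
        values = new int[capacity];
        for (int i = 0; i < mapping.length; i++) {
            int slot = findSlot(mapping[i]);
            if (keys[slot] != null) {
                throw new IllegalArgumentException("duplicate key: " + mapping[i]);
            }
            keys[slot] = mapping[i];
            values[slot] = i;
        }
    }

    // Linear probing: start at the hashed slot and walk forward (wrapping around)
    // until the key or an empty slot is found.
    private int findSlot(String key) {
        int slot = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[slot] != null && !keys[slot].equals(key)) {
            slot = (slot + 1) % keys.length;
        }
        return slot;
    }

    // Returns the index of the key in the original array, or -1 if it is absent.
    int get(String key) {
        int slot = findSlot(key);
        return keys[slot] == null ? -1 : values[slot];
    }
}
```

Whether such a layout is actually faster than HashMap depends on the JVM, the CPU cache size, and the model, which is why the throughput measurements requested above matter.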
[jira] [Commented] (OPENNLP-397) IndexHashTable can be improved
Posted by "Joern Kottmann (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/OPENNLP-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154234#comment-13154234 ]
Joern Kottmann commented on OPENNLP-397:
----------------------------------------
The POS Tagger assumes that you give it tokenized input (meaning tokens are whitespace-separated) with one sentence per line.
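As a small illustration of what "whitespace-separated" means here, the sketch below splits a line on whitespace only. The class and method names are hypothetical and this is not OpenNLP code; it only shows why untokenized text yields different tokens than pre-tokenized text.

```java
// Hypothetical helper: splits one input line into whitespace-separated tokens.
final class PosInputFormat {

    static String[] tokens(String line) {
        String trimmed = line.trim();
        // An empty line has no tokens; split() alone would return one empty string.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Raw text: punctuation stays glued to the neighboring word.
        System.out.println(String.join("|", tokens("Hello, world.")));
        // Pre-tokenized text: punctuation is a token of its own.
        System.out.println(String.join("|", tokens("Hello , world .")));
    }
}
```

With raw text, "Hello," and "world." would each be treated as a single token, which is not the form described above.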
Anyway, I will redo my measurements on the test data I used before. The actual data shouldn't make a big difference when the same data is used for both measurements (as you did).
As far as I know, java.util.HashMap always uses a power of two for the array size, whereas we use the load factor directly. When the array is larger, the map is usually faster because you get fewer collisions.
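The sizing difference can be sketched as follows. This is an illustration under assumptions: powerOfTwoCapacity approximates how a power-of-two table (as in the OpenJDK HashMap) ends up sized for a given number of entries, and directCapacity assumes the array length is derived as entries / loadFactor + 1. Both method names are hypothetical, and the exact formula used by IndexHashTable is not stated in this thread.

```java
// Hypothetical sketch comparing two ways of choosing a hash table's array length.
final class TableSizing {

    // Smallest power of two >= n, valid for 1 <= n <= 2^30.
    static int nextPowerOfTwo(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    // Array length when the table must be a power of two and at most
    // loadFactor full after all entries are inserted.
    static int powerOfTwoCapacity(int entries, double loadFactor) {
        return nextPowerOfTwo((int) Math.ceil(entries / loadFactor));
    }

    // Array length when the load factor is applied directly (assumed formula).
    static int directCapacity(int entries, double loadFactor) {
        return (int) (entries / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int entries = 1000;
        System.out.println("power of two, load factor 0.75: "
                + powerOfTwoCapacity(entries, 0.75));
        System.out.println("direct, load factor 0.7: "
                + directCapacity(entries, 0.7));
    }
}
```

For 1000 entries the power-of-two table has 2048 slots while the directly sized one has 1429, so the former is less densely filled and sees fewer collisions, at the cost of a larger array.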