Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2016/04/29 22:53:16 UTC
[jira] [Comment Edited] (OPENNLP-830) Huge runtime improvement on training (POS, Chunk, ...)

    [ https://issues.apache.org/jira/browse/OPENNLP-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264710#comment-15264710 ] 

Joern Kottmann edited comment on OPENNLP-830 at 4/29/16 8:53 PM:
-----------------------------------------------------------------

I ran some tests with the name finder and saw a similar improvement. When the IndexHashTable is replaced with java.util.HashMap, it is around 60-70% faster.
Trove4j and HPPC were both slower than java.util.HashMap, which I found surprising because they offer object-to-int maps, which are more cache friendly (but maybe the Java 8 JVM has some tricks to optimize that).

On my servers I usually run one JVM on each core (I know one JVM with a thread per core would be better); in that case it is still around 40% faster.

The parser is around 5% faster; I didn't test any other components.

I suggest we replace it with java.util.HashMap.
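
To make the suggestion concrete, here is a minimal sketch of what a HashMap-backed replacement could look like. It is not the actual patch: the class name HashMapIndexTable is hypothetical, and the constructor, get() and size() signatures, as well as the convention of returning -1 for an unknown key, are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the OpenNLP class: maps each key to its
// position in the array passed to the constructor.
final class HashMapIndexTable<T> {

    private final Map<T, Integer> indexByKey;

    HashMapIndexTable(T[] keys) {
        // Presize the map so that inserting keys.length entries does not
        // trigger a rehash at the default load factor of 0.75.
        indexByKey = new HashMap<>((int) (keys.length / 0.75f) + 1);
        for (int i = 0; i < keys.length; i++) {
            // put() returns the previous value, which is non-null for a duplicate.
            if (indexByKey.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    // Returns the index of the key, or -1 if the key is unknown
    // (the -1 convention is an assumption of this sketch).
    int get(T key) {
        Integer index = indexByKey.get(key);
        return index != null ? index : -1;
    }

    int size() {
        return indexByKey.size();
    }

    public static void main(String[] args) {
        HashMapIndexTable<String> table =
            new HashMapIndexTable<>(new String[] {"NN", "VB", "JJ"});
        System.out.println(table.get("VB")); // prints 1
        System.out.println(table.get("XX")); // prints -1
    }
}
```

The map boxes each index as an Integer, which is the overhead the object-to-int maps in Trove4j and HPPC are meant to avoid, so the sketch is about simplicity and not about squeezing out the last bit of memory.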


was (Author: joern):
I ran some tests with the name finder and saw a similar improvement. When the IndexHashTable is replaced with java.util.HashMap, it is around 60-70% faster.
Trove4j and HPPC were both slower than java.util.HashMap, which I found surprising because they offer object-to-int maps, which are more cache friendly (but maybe the Java 8 JVM has some tricks to optimize that).

The parser is around 5% faster; I didn't test any other components.

I suggest we replace it with the java.util.HashMap.

> Huge runtime improvement on training (POS, Chunk, ...)
> ------------------------------------------------------
>
>                 Key: OPENNLP-830
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-830
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Machine Learning, POS Tagger
>    Affects Versions: 1.6.0
>         Environment: Any
>            Reporter: Julien Subercaze
>              Labels: performance
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that is used to store index mappings. This hash table is heavily used in opennlp.tools.ml.* (i.e. every model) and leads to disastrous performance.
> This hash table is probably legacy code and is highly inefficient. A simple drop-in replacement by a java.util.HashMap wrapper solves the issue, doesn't break compatibility, and does not add any dependency.
> Training a POS tagger on a large dataset with custom tags, I see a factor of 5 improvement. It also seems to improve the training pipeline of all ML models.
> For a quick fix, see: https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)