You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@joshua.apache.org by mj...@apache.org on 2016/08/30 13:03:55 UTC

[2/4] incubator-joshua git commit: Fixed crashing when using Trie based KenLM models with State Minimization

Fixed crashing when using Trie based KenLM models with State Minimization

There are two separate issues that are causing very rare crashes when using KenLM with Joshua.  The only configuration under which we've seen crashing, is when using more than one Trie-based state-minimizing model.  One of these issues is addressed in this patch, another will have to be contributed to the KenLM repo.

This first patch deals with hash collisions in the mapping between a given KenLM state, and the stored memory address of the state in KenLM\u2019s memory pool.  Previously these collisions were ignored.  The problem with ignoring collisions here is that the two models can have different Vocabulary sizes.  This means if you collide and take the state of a model with a larger vocabulary you could potentially have a unigram word id that is out of range.  When searched for this will cause a crash.

The second issue (to be patched in KenLM) is that the equality operator in state.hh isn't comparing enough values to properly differentiate between states during a collision.


Project: http://git-wip-us.apache.org/repos/asf/incubator-joshua/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-joshua/commit/8e7e7776
Tree: http://git-wip-us.apache.org/repos/asf/incubator-joshua/tree/8e7e7776
Diff: http://git-wip-us.apache.org/repos/asf/incubator-joshua/diff/8e7e7776

Branch: refs/heads/7
Commit: 8e7e7776b8b0fa862f08750b1dddd93f9e5f3487
Parents: 2d106df
Author: Kellen Sunderland <ke...@amazon.com>
Authored: Tue Aug 30 10:52:07 2016 +0200
Committer: Kellen Sunderland <ke...@amazon.com>
Committed: Tue Aug 30 11:04:06 2016 +0200

----------------------------------------------------------------------
 jni/kenlm_wrap.cc | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-joshua/blob/8e7e7776/jni/kenlm_wrap.cc
----------------------------------------------------------------------
diff --git a/jni/kenlm_wrap.cc b/jni/kenlm_wrap.cc
index 97c52f1..11d9c28 100644
--- a/jni/kenlm_wrap.cc
+++ b/jni/kenlm_wrap.cc
@@ -45,13 +45,14 @@ template<> struct StaticCheck<true> {
 
 typedef StaticCheck<sizeof(jint) == sizeof(lm::WordIndex)>::StaticAssertionPassed FloatSize;
 
-typedef std::unordered_map<uint64_t, lm::ngram::ChartState*> PoolHash;
+typedef std::unordered_multimap<uint64_t, lm::ngram::ChartState*> PoolHash;
 
 /**
- * A Chart bundles together a hash_map that maps ChartState signatures to a single object
- * instantiated using a pool. This allows duplicate states to avoid allocating separate
- * state objects at multiple places throughout a sentence, and also allows state to be
- * shared across KenLMs for the same sentence.
+ * A Chart bundles together a unordered_multimap that maps ChartState signatures to a single
+ * object instantiated using a pool. This allows duplicate states to avoid allocating separate
+ * state objects at multiple places throughout a sentence, and also allows state to be shared
+ * across KenLMs for the same sentence.  Multimap is used to avoid hash collisions which can
+ * return incorrect results, and cause out-of-bounds lookups when multiple KenLMs are in use.
  */
 struct Chart {
   // A cache for allocated chart objects
@@ -71,15 +72,27 @@ struct Chart {
   }
 
   lm::ngram::ChartState* put(const lm::ngram::ChartState& state) {
+    lm::ngram::ChartState* state_ptr = nullptr;
     uint64_t hashValue = lm::ngram::hash_value(state);
+    auto state_it = poolHash->find(hashValue);
 
-    if (poolHash->find(hashValue) == poolHash->end()) {
-      lm::ngram::ChartState* pointer = (lm::ngram::ChartState *)pool->Allocate(sizeof(lm::ngram::ChartState));
-      *pointer = state;
-      (*poolHash)[hashValue] = pointer;
+    // Try to retrieve a matching ChartState pointer from our Pool
+    while(state_it != poolHash->end()) {
+      if (state == *(state_it->second)) {
+        state_ptr = state_it->second;
+        break;
+      }
+      state_it++;
+    }
+
+    // Unable to find this ChartState in our pool, allocate new space for it
+    if (!state_ptr) {
+      state_ptr = (lm::ngram::ChartState *) pool->Allocate(sizeof(lm::ngram::ChartState));
+      *state_ptr = state;
+      (*poolHash).insert({hashValue, state_ptr});
     }
 
-    return (*poolHash)[hashValue];
+    return state_ptr;
   }
 };