You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@joshua.apache.org by KellenSunderland <gi...@git.apache.org> on 2016/09/15 17:33:03 UTC

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

GitHub user KellenSunderland opened a pull request:

    https://github.com/apache/incubator-joshua/pull/65

    This PR is a first attempt to minimize ngram array copies by utilizing DirectBuffers.

    This is a work in progress but posted so that Matt + Ken can review (and do some speed comparisons).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KellenSunderland/incubator-joshua master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-joshua/pull/65.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #65
    
----
commit eadfdf3d09bea43fa3af6c664a7f8944b0cc8f86
Author: Kellen Sunderland <ke...@amazon.com>
Date:   2016-09-15T17:06:04Z

    Convert to a DirectBuffer to transfer ngrams during probRule

commit 3917d8503659dd4ac3a55b3181ecb07f32b067da
Author: Kellen Sunderland <ke...@amazon.com>
Date:   2016-09-15T17:30:22Z

    Merge branch 'master' of https://github.com/apache/incubator-joshua

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    @kpu I partially agree about sentence.  The sentence object is meant to be scoped to a single translation of a sentence.  For calls to probRule this helps disambiguate. 
    
    For estimate I'm not so sure.  A rule can have estimateRuleCost called on it even when Joshua is not translation anything. For example as a result of calling sort on a grammar during startup before any requests are processed.  In fact the only usage of estimate being called perviously was passing null as the argument for sentence.  For the time being I'm creating the ChartState object once per call, and then immediately releasing it afterwards.  We'll have to test the performance impact of doing this, but this shouldn't actually be called when decoding so hopefully it won't affect decoding perf.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

Posted by kpu <gi...@git.apache.org>.
Github user kpu commented on a diff in the pull request:

    https://github.com/apache/incubator-joshua/pull/65#discussion_r79491699
  
    --- Diff: src/main/java/org/apache/joshua/decoder/ff/lm/StateMinimizingLanguageModel.java ---
    @@ -132,30 +138,32 @@ public DPState compute(Rule rule, List<HGNode> tailNodes, int i, int j, SourcePa
        * Maps given array of word/class ids to KenLM ids. For estimating cost and computing,
        * state retrieval differs slightly.
        */
    -  private long[] mapToKenLmIds(int[] ids, List<HGNode> tailNodes, boolean isOnlyEstimate) {
    +  private void writeKenLmIds(int[] ids, List<HGNode> tailNodes, boolean isOnlyEstimate,
    +                             KenLMPool poolWrapper) {
    +
    +    poolWrapper.getNgramBuffer().putLong(0, ids.length);
    +
         // The IDs we will to KenLM
    -    long[] kenIds = new long[ids.length];
         for (int x = 0; x < ids.length; x++) {
           int id = ids[x];
     
           if (isNonterminal(id)) {
     
             if (isOnlyEstimate) {
               // For the estimate, we can just mark negative values
    -          kenIds[x] = -1;
    +          poolWrapper.getNgramBuffer().putLong((x + 1) * LONG_SIZE_IN_BYTES, -1);
             } else {
               // Nonterminal: retrieve the KenLM long that records the state
               int index = -(id + 1);
               final KenLMState state = (KenLMState) tailNodes.get(index).getDPState(stateIndex);
    -          kenIds[x] = -state.getState();
    +          poolWrapper.getNgramBuffer().putLong((x + 1) * LONG_SIZE_IN_BYTES, -state.getState());
    --- End diff --
    
    Looks like this is crying out for an abstraction.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    I want to dig in to why this test is returning differing results between linux and osx.  Temporarily closing this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by kpu <gi...@git.apache.org>.
Github user kpu commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    I thought the role of Sentence here is to scope the lifecycle of ChartState objects and disambiguate what the vector the indices refer to.    


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    @kpu there's still a lot of array copying going on here.  We've got some further work coming in this commit https://github.com/KellenSunderland/incubator-joshua/commit/6a88bfbdd130301a73248ed1633397dc7779b210 but we can't be bothered to fix the failing tests tonight. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Re: [GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by Matt Post <po...@cs.jhu.edu>.
All right, testing now, will send out results in the morning assuming they run fine.

matt


> On Sep 19, 2016, at 5:20 PM, KellenSunderland <gi...@git.apache.org> wrote:
> 
> Github user KellenSunderland commented on the issue:
> 
>    https://github.com/apache/incubator-joshua/pull/65
> 
>    This PR should now implement all of the JNI improvements we wanted to make at the MT Marathon.  Before merging would you be able to run a quick test and see what the performance is like?  If the second commit doesn't speed things up we could consider removing it to reduce code complexity.
> 
>    I also noticed that language model ffs all were estimating with the following signature: 
> 
>    `public float estimateCost(Rule rule, Sentence sentence)`
> 
>    Does it actually make sense to pass the sentence object here?  Each rule should be able to be estimated with only its local feature score + weights right?  It shouldn't need any additional context?
> 
> 
> 
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
> with INFRA.
> ---


[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    This PR should now implement all of the JNI improvements we wanted to make at the MT Marathon.  Before merging would you be able to run a quick test and see what the performance is like?  If the second commit doesn't speed things up we could consider removing it to reduce code complexity.
    
    I also noticed that language model ffs all were estimating with the following signature: 
    
    `public float estimateCost(Rule rule, Sentence sentence)`
    
    Does it actually make sense to pass the sentence object here?  Each rule should be able to be estimated with only its local feature score + weights right?  It shouldn't need any additional context?
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland closed the pull request at:

    https://github.com/apache/incubator-joshua/pull/65


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by kpu <gi...@git.apache.org>.
Github user kpu commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    For estimateRuleCost, are we allocating a ChartState on a vector then throwing it away to score grammar rules in isolation?  I thought that was a separate JNI path that didn't want a sentence.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Re: [GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by Matt Post <po...@cs.jhu.edu>.
Sounds good to me.


> On Sep 21, 2016, at 9:14 AM, KellenSunderland <gi...@git.apache.org> wrote:
> 
> Github user KellenSunderland commented on the issue:
> 
>    https://github.com/apache/incubator-joshua/pull/65
> 
>    Well in that case I can open a PR without the DirectBuffer changes included.  They're pretty complicated and if they don't improve performance it's probably better to leave them out.
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
> with INFRA.
> ---


[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    Well in that case I can open a PR without the DirectBuffer changes included.  They're pretty complicated and if they don't improve performance it's probably better to leave them out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua issue #65: This PR is a first attempt to minimize ngram arr...

Posted by mjpost <gi...@git.apache.org>.
Github user mjpost commented on the issue:

    https://github.com/apache/incubator-joshua/pull/65
  
    I did some timing tests on commit 3ba83f84e8258a784fcef509fc9b158d44f15c66, and didn't see any change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland commented on a diff in the pull request:

    https://github.com/apache/incubator-joshua/pull/65#discussion_r79553933
  
    --- Diff: src/main/java/org/apache/joshua/decoder/ff/lm/StateMinimizingLanguageModel.java ---
    @@ -132,30 +138,32 @@ public DPState compute(Rule rule, List<HGNode> tailNodes, int i, int j, SourcePa
        * Maps given array of word/class ids to KenLM ids. For estimating cost and computing,
        * state retrieval differs slightly.
        */
    -  private long[] mapToKenLmIds(int[] ids, List<HGNode> tailNodes, boolean isOnlyEstimate) {
    +  private void writeKenLmIds(int[] ids, List<HGNode> tailNodes, boolean isOnlyEstimate,
    +                             KenLMPool poolWrapper) {
    +
    +    poolWrapper.getNgramBuffer().putLong(0, ids.length);
    +
         // The IDs we will to KenLM
    -    long[] kenIds = new long[ids.length];
         for (int x = 0; x < ids.length; x++) {
           int id = ids[x];
     
           if (isNonterminal(id)) {
     
             if (isOnlyEstimate) {
               // For the estimate, we can just mark negative values
    -          kenIds[x] = -1;
    +          poolWrapper.getNgramBuffer().putLong((x + 1) * LONG_SIZE_IN_BYTES, -1);
             } else {
               // Nonterminal: retrieve the KenLM long that records the state
               int index = -(id + 1);
               final KenLMState state = (KenLMState) tailNodes.get(index).getDPState(stateIndex);
    -          kenIds[x] = -state.getState();
    +          poolWrapper.getNgramBuffer().putLong((x + 1) * LONG_SIZE_IN_BYTES, -state.getState());
    --- End diff --
    
    Good point, I'll see if I can clean it up.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

Posted by KellenSunderland <gi...@git.apache.org>.
Github user KellenSunderland closed the pull request at:

    https://github.com/apache/incubator-joshua/pull/65


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-joshua pull request #65: This PR is a first attempt to minimize ng...

Posted by KellenSunderland <gi...@git.apache.org>.
GitHub user KellenSunderland reopened a pull request:

    https://github.com/apache/incubator-joshua/pull/65

    This PR is a first attempt to minimize ngram array copies by utilizing DirectBuffers.

    This is a work in progress but posted so that @mjpost  and @kpu can review (and do some speed comparisons).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KellenSunderland/incubator-joshua master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-joshua/pull/65.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #65
    
----
commit 9ea7eebf0164d1676f633b441bd952eaa20b0760
Author: Kellen Sunderland <ke...@amazon.com>
Date:   2016-09-15T17:06:04Z

    Convert to a DirectBuffer to transfer ngrams during probRule

commit 3ba83f84e8258a784fcef509fc9b158d44f15c66
Author: Kellen Sunderland <ke...@amazon.com>
Date:   2016-09-15T17:31:21Z

    Converted estimateRule to also make use of DirectBuffer.
    Reduced number of array copies in probRule.
    Removed sentence from estimate method signature (as it was unused).

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---