Posted to dev@joshua.apache.org by "Matt Post (JIRA)" <ji...@apache.org> on 2016/05/25 07:04:13 UTC

[jira] [Created] (JOSHUA-272) Simplify the packing and usage of phrase-based grammars

Matt Post created JOSHUA-272:
--------------------------------

             Summary: Simplify the packing and usage of phrase-based grammars
                 Key: JOSHUA-272
                 URL: https://issues.apache.org/jira/browse/JOSHUA-272
             Project: Joshua
          Issue Type: Improvement
            Reporter: Matt Post
            Assignee: Matt Post
             Fix For: 6.1


For historical reasons, phrase-based grammars add some complexity to decoding. The complete tree under each top-level trie node in a packed grammar has to fit within a single packed grammar slice, which is limited to 2 GB due to constraints on the size of Java byte[] arrays. We used to sort on just the first item in the trie, which was a problem for phrase-based decoding: phrase-based rules are implemented as left-branching hierarchical rules, so they all begin with the same [X,1] and end up under a single top-level node. In order to pack large grammars, we packed them without the leading [X,1] and then added it back when loading the grammars, for both the packed and memory-based grammars. This was a real mess.

This was all fixed a while ago by a commit that packs and reads packed grammars based on the first two symbols on the source side. So we should remove all the complexity associated with phrases; they should just be regular rules. There is also a lot of redundancy across the codebase in parsing rules, converting them between formats, and so on.
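
To illustrate why keying on two symbols helps, here is a minimal sketch. It is not the actual packer code: the class SliceSketch and the methods sliceKey and groupByKey are hypothetical names, and rules are reduced to whitespace-separated source sides. With a prefix length of 1, every phrase-based rule falls into the single [X,1] group; with a prefix length of 2, the rules spread across groups that could each go into their own slice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Joshua codebase.
class SliceSketch {

    // Build a grouping key from the first prefixLength source-side symbols
    // (or from all of them, if the source side is shorter than that).
    static String sliceKey(String sourceSide, int prefixLength) {
        String[] tokens = sourceSide.trim().split("\\s+");
        int n = Math.min(prefixLength, tokens.length);
        return String.join(" ", Arrays.copyOfRange(tokens, 0, n));
    }

    // Group source sides by key; TreeMap keeps the keys in sorted order.
    static Map<String, List<String>> groupByKey(List<String> sourceSides, int prefixLength) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String sourceSide : sourceSides) {
            groups.computeIfAbsent(sliceKey(sourceSide, prefixLength), k -> new ArrayList<>())
                  .add(sourceSide);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up phrase-based source sides, each with the leading [X,1].
        List<String> rules = Arrays.asList(
            "[X,1] the house", "[X,1] the car", "[X,1] a dog");

        System.out.println("groups with 1 symbol:  " + groupByKey(rules, 1).keySet());
        System.out.println("groups with 2 symbols: " + groupByKey(rules, 2).keySet());
    }
}
```

In this toy input the one-symbol key yields one group holding every rule, while the two-symbol key yields two groups, which is the property that lets phrase-based rules be packed like any other rule.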



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)