Posted to dev@joshua.apache.org by "Matt Post (JIRA)" <ji...@apache.org> on 2016/11/14 18:38:58 UTC

[jira] [Resolved] (JOSHUA-315) Thrax keeps all rules

     [ https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Post resolved JOSHUA-315.
------------------------------
    Resolution: Fixed

> Thrax keeps all rules
> ---------------------
>
>                 Key: JOSHUA-315
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-315
>             Project: Joshua
>          Issue Type: Bug
>            Reporter: Matt Post
>             Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* target-side options for each source side. For large bitexts and common source sides (e.g., "de" for Spanish–English), there can be tens of thousands of translations, due to errors in the alignments and phenomena like garbage collection. The decoder throws out all but the top num_translation_options of these (default 20), but before doing so, it has to score all the target-side options with all feature functions, including the language model. This slows down "warming up" of the model and means that the first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter on Thrax's rarity penalty field, but it would be much better if Thrax were to keep only the 100 most frequent translation options for each source side.
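
The pruning proposed above can be sketched roughly as follows. This is an illustration in Python only, not Thrax's or filter-rules.pl's actual code: the function name keep_top_k and the (source, target, count) tuple layout are assumptions made for the example, and real Thrax grammars store rules and features in a different format.

```python
import heapq
from collections import defaultdict


def keep_top_k(rules, k=100):
    """Keep at most k translation options per source side, most frequent first.

    `rules` is an iterable of (source, target, count) tuples. This layout is
    hypothetical and chosen for illustration; it is not the Thrax grammar format.
    """
    # Group all target-side options under their source side.
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((target, count))

    # For each source side, retain only the k options with the highest count.
    return {
        source: heapq.nlargest(k, options, key=lambda opt: opt[1])
        for source, options in by_source.items()
    }


if __name__ == "__main__":
    # Toy counts, invented for the example.
    toy_rules = [
        ("de", "of", 900),
        ("de", "from", 400),
        ("de", "the", 3),
        ("casa", "house", 50),
    ]
    for source, options in keep_top_k(toy_rules, k=2).items():
        print(source, options)
```

Pruning at extraction time in this way would mean the decoder never has to score the long tail of rare options before applying num_translation_options.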



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)