Posted to dev@joshua.apache.org by "Matt Post (JIRA)" <ji...@apache.org> on 2016/11/14 18:38:58 UTC

[jira] [Commented] (JOSHUA-315) Thrax keeps all rules

    [ https://issues.apache.org/jira/browse/JOSHUA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664649#comment-15664649 ] 

Matt Post commented on JOSHUA-315:
----------------------------------

This has been addressed in commit 885389d513b5d0f3f68b59c3b17a776584b3a208. If you add the word "count" to the list of Thrax features in the Thrax config file, a sixth field containing the rule count will be extracted, e.g.,

    [X] ||| de ||| of ||| 0.72572 0.29124 1 0 0.39357 0.17023 ||| 0-0 ||| 2565758
    [X] ||| de ||| to ||| 2.89509 2.10811 1 0 2.87285 2.08282 ||| 0-0 ||| 215020
    [X] ||| de ||| in ||| 3.11663 2.17583 1 0 2.91081 2.34837 ||| 0-0 ||| 207011
    ...

This count is then used by the filter-rules.pl script (with the flag -t 100) to remove all rules except the 100 most frequent for each source side. This has been added to the pipeline. The grammars seem to be about 5% smaller, and the change should have only a positive effect on running time.
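
For illustration only, the per-source-side pruning described above can be sketched in Python. This is not the filter-rules.pl code. The function name top_k_per_source is hypothetical, and the sketch assumes the count is the last "|||"-delimited field and the source side is the second, as in the sample rules above.

```python
from collections import defaultdict
import heapq


def top_k_per_source(lines, k=100):
    """Keep the k most frequent rules for each source side.

    Assumes each rule has the six-field layout shown in the sample:
    LHS ||| source ||| target ||| features ||| alignment ||| count
    """
    by_source = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split("|||")]
        source = fields[1]        # source side of the rule
        count = int(fields[-1])   # rule count (assumed to be the last field)
        by_source[source].append((count, line))

    kept = []
    for rules in by_source.values():
        # nlargest returns the k entries with the highest counts, in descending order
        best = heapq.nlargest(k, rules, key=lambda r: r[0])
        kept.extend(rule for _, rule in best)
    return kept


if __name__ == "__main__":
    sample = [
        "[X] ||| de ||| of ||| 0.72572 0.29124 1 0 0.39357 0.17023 ||| 0-0 ||| 2565758",
        "[X] ||| de ||| to ||| 2.89509 2.10811 1 0 2.87285 2.08282 ||| 0-0 ||| 215020",
        "[X] ||| de ||| in ||| 3.11663 2.17583 1 0 2.91081 2.34837 ||| 0-0 ||| 207011",
    ]
    for rule in top_k_per_source(sample, k=2):
        print(rule)
```

With k=2 on the three sample rules, the "in" translation (the lowest count) is dropped. A real grammar would be streamed from a file rather than held in a list, and the sketch groups by source side only, ignoring the left-hand-side label.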

> Thrax keeps all rules
> ---------------------
>
>                 Key: JOSHUA-315
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-315
>             Project: Joshua
>          Issue Type: Bug
>            Reporter: Matt Post
>             Fix For: 6.2
>
>
> When extracting rules, Thrax keeps *all* target-side options for each source side. For large bitexts and common source sides (e.g., "de" for Spanish–English), there can be tens of thousands of translations, due to errors in the alignments and phenomena like garbage collection. The decoder throws out all but the top num_translation_options of these (default 20), but before doing so, it has to score all the target-side options with all feature functions, including the language model. This slows down "warming up" of the model and means that the first sentences to use these items are very slow to translate.
> I have updated scripts/training/filter-rules.pl to filter using Thrax's rarity penalty field, but it would be much better if Thrax were to keep only the 100 most frequent translation options for each source side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)