
Posted to dev@joshua.apache.org by "Kishani Kandasamy (Jira)" <ji...@apache.org> on 2021/02/24 01:53:03 UTC

[jira] [Commented] (JOSHUA-338) Generate smaller models for LPs

    [ https://issues.apache.org/jira/browse/JOSHUA-338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289560#comment-17289560 ] 

Kishani Kandasamy commented on JOSHUA-338:
------------------------------------------

Hi Tommaso Teofili, thank you for your reply. I'm particularly interested
in completing this issue as my GSoC 2021 project. Currently, I'm reading
about the language models used within Joshua in order to understand the
project scope thoroughly. Thank you.

On Fri, Nov 20, 2020 at 11:19 PM Tommaso Teofili (Jira) <ji...@apache.org>



> Generate smaller models for LPs
> -------------------------------
>
>                 Key: JOSHUA-338
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-338
>             Project: Joshua
>          Issue Type: Task
>          Components: core
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Phrase tables and grammars can get very big when trained on lots of parallel data, which makes it hard to distribute them in Language Packs. A quick way to reduce model size is to reduce the amount of parallel data used to build models by sampling a subset of it. This is the very naive approach used in the construction of the original language packs (November 2016), but there are much better ways. A relatively simple one is the Vocabulary Saturation Filter (VSF), proposed by Will Lewis and Sauleh Eetemadi and described in [1]. It would be wonderful to implement this and use it to do a better job of selecting which sentences to include in our general-purpose language packs.
> It would be ideal to implement this in Java, but Python or Scala would also fit well inside Joshua.
> [1] : http://www.aclweb.org/anthology/W/W13/W13-2235.pdf
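
For readers unfamiliar with the idea, here is a minimal sketch of a vocabulary-saturation-style filter in Python. It is not taken from Joshua or from the paper's reference implementation: the function names, the `threshold` and `max_n` parameters, the whitespace tokenization, the choice to count n-grams on the source side only, and the choice to update counts only for kept sentences are all assumptions made for illustration. The idea shown is that a sentence pair is kept only while it still contributes at least one n-gram seen fewer than `threshold` times so far.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram of length 1..max_n as a tuple of tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vocabulary_saturation_filter(pairs, threshold=1, max_n=1):
    """Yield (source, target) pairs that still add unsaturated n-grams.

    Hypothetical sketch: pairs are processed in the order given, and
    n-grams are counted on the source side only (an assumption).
    """
    counts = Counter()
    for source, target in pairs:
        grams = set(ngrams(source.split(), max_n))
        # Keep the pair if any of its n-grams is still below the threshold.
        if any(counts[g] < threshold for g in grams):
            # Assumption: counts are updated only for sentences we keep.
            for g in grams:
                counts[g] += 1
            yield source, target


corpus = [("a b", "x"), ("a b", "y"), ("a c", "z"), ("b a", "w")]
kept = list(vocabulary_saturation_filter(corpus, threshold=1))
```

Because the filter is order-sensitive, a real pipeline would probably sort or score the corpus before filtering; that step is left out here.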



--
This message was sent by Atlassian Jira
(v8.3.4#803005)