Posted to dev@joshua.apache.org by Matt Post <po...@cs.jhu.edu> on 2016/11/29 03:08:20 UTC

★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

One project I think could be interesting for Joshua's future is sketched here.

- Dynamic phrase tables. Joshua currently lets people add custom phrases to the existing models, which the decoder then uses. There is a research topic here on how to make this better (particularly, how to set the weights of rules that are added at runtime instead of learned from bitext), but it works really well for adding words that are OOV (since it's always cheaper to use the OOV). Here's a demo of how this works (this feature is included in the language packs).

https://github.com/joshua-decoder/joshua/wiki/Joshua's-Dynamic-Phrase-Tables

- Translation memories. There is a large commercial market (billions) for tools called "translation memories": as translators work through a document, each sentence is queried against their past translations and matched in a fuzzy fashion. The big tool on the market for this is SDL Trados <http://www.sdl.com/solution/language/translation-productivity/trados-studio/>. I'm not talking about selling a product, but in a space that big, there have got to be a lot of people who'd rather just run their own system than shell out for an expensive (and ugly) tool. So there is a big niche for an open source tool, and currently nothing really filling it. The "dynamic phrase table" feature above provides the beginnings of a TM competitor, but one that is "seeded" with a regular statistical machine translation model.

- Dynamic re-tuning. One thing that'd be *really* cool is to revamp the tuning infrastructure in Joshua. The use case I imagine is that Joshua could sit on top of a large tuning set across diverse domains (e.g., formal news, informal web logs, spoken dialogue, etc.). You could then add new phrases in sentences as above, which would get automatically aligned, and then everything could be retuned at the user's request (or perhaps at night). This way, when people added new data to their models, Joshua would automatically find the best weights, either immediately or on some schedule. There'd be less worry about bit rot.

- Data collection and sharing. Another cool idea would be to allow people to easily send us data. If we get to a place where people are building custom dynamic phrase tables, a cool ability would be to make it easy for people to upload the data they have added to their private systems, which we could then collect and further distribute. So Joshua could become an easy means for people to crowdsource data used for translation systems. This is obviously just a high-level idea that would require a lot of details to be figured out, but it would be super cool.
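
To make the dynamic phrase table item concrete, here is a minimal sketch in Python. It is not Joshua's actual code or API: the class name, the default cost for user-added rules, and the fixed OOV penalty are all assumptions made for illustration. It only shows the basic idea that a rule added at runtime is preferred over passing an unknown word through untranslated, and it leaves open how the cost of an added rule should really be set.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicPhraseTable:
    """Toy in-memory phrase table that accepts new rules at runtime (hypothetical)."""

    # Assumed default cost for user-added rules; choosing it well is the open question.
    default_cost: float = 1.0
    # Maps a source phrase to a dict of {target phrase: cost}.
    rules: dict = field(default_factory=dict)

    def add_rule(self, source, target, cost=None):
        """Register a source -> target pair, keeping the lowest cost seen so far."""
        cost = self.default_cost if cost is None else cost
        options = self.rules.setdefault(source, {})
        options[target] = min(cost, options.get(target, float("inf")))

    def translate_token(self, token, oov_cost=100.0):
        """Return (translation, cost); unknown tokens pass through at oov_cost."""
        options = self.rules.get(token)
        if not options:
            return token, oov_cost
        target = min(options, key=options.get)
        return target, options[target]


if __name__ == "__main__":
    table = DynamicPhraseTable()
    print(table.translate_token("perro"))  # unknown word, passed through
    table.add_rule("perro", "dog")
    print(table.translate_token("perro"))  # the added rule is now used
```

A real decoder would score added rules with the same feature weights as learned rules, which is exactly where the weight-setting question above comes in.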

matt

Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

Posted by Tommaso Teofili <to...@gmail.com>.

On Thu, Dec 1, 2016 at 15:27, Matt Post <po...@cs.jhu.edu> wrote:

> It wouldn't be hard to add some TMX-like features, no. There are some
> technical challenges, though — for example, the current demo lets you add
> phrases, but that doesn't affect the language model at all.
>

We would probably need to rely on language models that can perform online
learning [1]; but, as you said, the other point is to do that in conjunction
with Joshua's model tuning.


>
> Ideally, we'd also allow people to add whole sentences, and would then run
> John's fast_align implementation (with a saved model) to break down that
> new sentence, and do proper incremental updating.
>
> How do you imagine Lucene fitting into this?
>

I was only thinking of it for the fuzzy search requirement.
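
As a rough illustration of that fuzzy search requirement, the sketch below matches a new sentence against stored translations. It uses Python's standard difflib as a stand-in for a real IR library such as Lucene, so it is not Lucene's API, and the function name and the 0.7 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(query, memory, threshold=0.7):
    """Return (score, source, target) matches from memory, best first.

    memory is a list of (source, target) pairs; threshold is an assumed cutoff.
    """
    query_tokens = query.lower().split()
    matches = []
    for source, target in memory:
        # Token-level similarity in [0, 1]; 1.0 means identical token sequences.
        score = SequenceMatcher(None, query_tokens, source.lower().split()).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, key=lambda m: m[0], reverse=True)


if __name__ == "__main__":
    memory = [
        ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
        ("good morning", "bonjour"),
    ]
    for score, source, target in fuzzy_lookup("the cat sat on a mat", memory):
        print(f"{score:.2f}  {source}  ->  {target}")
```

A linear scan like this only works for small memories; an inverted index is what would make the same lookup practical at scale.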

Regards,
Tommaso


[1] : http://www.cs.cmu.edu/~dyogatam/papers/yogatama+etal.tacl2014.pdf



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

Posted by Matt Post <po...@cs.jhu.edu>.

It wouldn't be hard to add some TMX-like features, no. There are some technical challenges, though — for example, the current demo lets you add phrases, but that doesn't affect the language model at all.

Ideally, we'd also allow people to add whole sentences, and would then run John's fast_align implementation (with a saved model) to break down that new sentence, and do proper incremental updating.
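
A minimal sketch of the bookkeeping that could follow the alignment step, assuming the aligner returns links in the "i-j" (Pharaoh) format that fast_align prints, with i indexing source words and j indexing target words. The function name and the use of a plain counter are assumptions for illustration; running fast_align itself and updating real model scores are out of scope here.

```python
from collections import Counter


def update_counts(counts, source, target, alignment):
    """Add the aligned word pairs of one new sentence pair to running counts.

    alignment is a string of zero-indexed "i-j" links, e.g. "0-0 1-2 2-1".
    Returns the list of (source_word, target_word) pairs that were added.
    """
    src = source.split()
    tgt = target.split()
    pairs = []
    for link in alignment.split():
        i, j = (int(x) for x in link.split("-"))
        pairs.append((src[i], tgt[j]))
    counts.update(pairs)
    return pairs


if __name__ == "__main__":
    counts = Counter()
    print(update_counts(counts, "la casa verde", "the green house", "0-0 1-2 2-1"))
    print(counts.most_common())
```

Keeping raw counts rather than probabilities means a new sentence only increments a few entries, and scores can be recomputed from the counts whenever they are needed.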

How do you imagine Lucene fitting into this?

matt



Re: ★ joshua roadmap feature: dynamic phrase tables, retuning, and data sharing

Posted by Tommaso Teofili <to...@gmail.com>.

Matt,

Really nice list of very useful features, thanks for this!
Just one comment, on the translation memories item: to someone who had
never heard of them before, it doesn't sound too complicated to implement
on top of current Joshua (with an IR library like Apache Lucene). Is my
understanding correct?

My 2 cents,
Tommaso

