Posted to dev@joshua.apache.org by Tommaso Teofili <to...@gmail.com> on 2020/10/19 06:16:35 UTC

NMT survey (was: Roll Call)

Following up on the report topic, I've created an Overleaf doc for everyone
who's interested in working on this [1].

First set of (AL-2 compatible) NMT toolkits I've found:
- Joey NMT [2]
- OpenNMT [3]
- MarianNMT [4]
- Sockeye [5]
- and of course RTG already shared by Thamme [6]

Regards,
Tommaso

[1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
[2] : https://github.com/joeynmt/joeynmt
[3] : https://github.com/OpenNMT
[4] : https://github.com/marian-nmt/marian
[5] : https://github.com/awslabs/sockeye
[6] : https://github.com/isi-nlp/rtg-xt

On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <to...@gmail.com>
wrote:

> Very good idea, Thamme!
> I'd be up for writing such a short survey paper as a result of our
> analysis.
>
> Regards,
> Tommaso
>
>
> On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com> wrote:
>
>> Tommaso and others,
>>
>> > I think we may now go into a research phase to understand what existing
>> toolkit we can more easily integrate with.
>> Agreed.
>> If we can write a (short) report that compares various NMT toolkits of
>> 2020, it would be useful both for making this decision ourselves and for
>> the NMT community.
>> Something like a survey paper on NMT research, but focused on the
>> toolkits and software engineering part.
>>
>>
>>
>> On Fri, Oct 9, 2020 at 11:39 PM Tommaso Teofili <
>> tommaso.teofili@gmail.com> wrote:
>>
>> > Thamme, Jeff,
>> >
>> > Your contributions will be very important for the project and the
>> > community, especially given your NLP backgrounds. Thanks for your
>> > support!
>> >
>> > I agree moving towards NMT is the best thing to do at this point for
>> > Joshua.
>> >
>> > Thamme, thanks for your suggestions!
>> > I think we may now go into a research phase to understand what existing
>> > toolkit we can more easily integrate with.
>> > Of course, if you'd like to integrate your own toolkit, then it'd be
>> > even more interesting to see how it compares to others.
>> >
>> > If that means moving to Python, I think that's not a problem; we can
>> > still work on Java bindings to ship a new Joshua Decoder implementation.
>> >
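
To make the "Java bindings over a Python decoder" idea concrete, here is a minimal sketch of a line-delimited decoder process that a JVM client could wrap via a subprocess. All names here are hypothetical, not an actual Joshua API, and the "translator" is a placeholder:

```python
import io
from typing import Callable, TextIO

def serve(translate: Callable[[str], str], inp: TextIO, out: TextIO) -> int:
    # Reads one source sentence per line and writes one translation per
    # line; a JVM wrapper would spawn this script as a subprocess and
    # speak the same line-delimited protocol over stdin/stdout.
    count = 0
    for line in inp:
        sentence = line.rstrip("\n")
        if not sentence:
            continue
        out.write(translate(sentence) + "\n")
        out.flush()  # flush per line so the Java side never blocks
        count += 1
    return count

# Demo with in-memory streams and a placeholder "translator"; a real build
# would wire sys.stdin/sys.stdout and plug in an actual NMT toolkit.
out = io.StringIO()
n = serve(str.upper, io.StringIO("hello\nworld\n"), out)
```

The line-per-sentence protocol keeps the Java side trivial (ProcessBuilder plus buffered readers/writers) and avoids committing to any particular toolkit's API.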
>> > The pretrained models topic is IMHO something we will have to embrace at
>> > some point, so that others can:
>> > a) just download new LPs (language packs)
>> > b) eventually fine-tune with their own data
>> >
>> > I am looking forward to starting this new phase of research on Joshua.
>> >
>> > Regards,
>> > Tommaso
>> >
>> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jz...@apache.org>
>> wrote:
>> >
>> > > I haven't contributed up to this point, but I would like to see Apache
>> Joshua
>> > > remain an active project so I am volunteering to help. I may not be a
>> lot
>> > > of help with code for a bit but I will help out with documentation,
>> > > releases, etc.
>> > >
>> > > I do agree that NMT is the best path forward but I will leave the
>> choice
>> > of
>> > > integrating an existing library into Joshua versus a new NMT
>> > implementation
>> > > in Joshua to those more familiar with the code and what they think is
>> > best
>> > > for the project.
>> > >
>> > > Jeff
>> > >
>> > >
>> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tg...@gmail.com>
>> wrote:
>> > >
>> > > > Hi Tommaso and others,
>> > > >
>> > > > *1. I support the addition of a neural MT decoder.*
>> > > > The world has moved on, and NMT is clearly the way to go forward.
>> > > > If you don't believe my words, read what Matt Post himself said [1]
>> > > > three years ago!
>> > > >
>> > > > I have spent the past three years focusing on NMT as part of my job
>> > > > and Ph.D. -- I'd be glad to contribute in that direction.
>> > > > There are many NMT toolkits out there today (Fairseq, Sockeye,
>> > > > Tensor2Tensor, ...).
>> > > >
>> > > > The right thing to do, IMHO, is to simply merge one of the NMT
>> > > > toolkits with the Joshua project. We can do that as long as it's
>> > > > Apache License compatible, right?
>> > > > We will now have to move towards Python land, as most toolkits are in
>> > > > Python. On the positive side, we will be losing the ancient Perl
>> > > > scripts that many are not fans of.
>> > > >
>> > > > I have been working on my own NMT toolkit for my work and research
>> > > > -- RTG: https://isi-nlp.github.io/rtg/#conf
>> > > > I had worked on Joshua in the past; mainly, I improved the code
>> > > > quality [2]. So you can tell my new code would be up to Apache's
>> > > > standards ;)
>> > > >
>> > > > *2. Pretrained MT models for lots of languages*
>> > > > I have been working on a lib to retrieve parallel data from many
>> > > > sources -- MTData [3].
>> > > > There is so much parallel data out there for hundreds of languages.
>> > > > My recent estimate is that over a billion lines of parallel sentences
>> > > > covering over 500 languages are freely and publicly available for
>> > > > download using the MTData tool.
>> > > > If we find some sponsors to lend us some resources, we could train
>> > > > better MT models and update the Language Packs section [4].
>> > > > Perhaps even one massively multilingual NMT model that supports many
>> > > > translation directions (I know it's possible with NMT; I tested it
>> > > > recently with RTG).
>> > > >
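
For context, the common recipe for one model serving many translation directions is to prepend a target-language token to each source sentence during preprocessing, so a single model learns all directions. A stdlib-only sketch of just that step follows; the token format and helper names are illustrative, not necessarily what RTG does:

```python
def tag_for_target(src: str, tgt_lang: str) -> str:
    # Prepend a target-language token, e.g. '<2fra>', so a single
    # multilingual model knows which direction to translate.
    return f"<2{tgt_lang}> {src}"

def prepare_corpus(pairs):
    # pairs: iterable of (src_sentence, tgt_sentence, tgt_lang) triples;
    # returns (tagged_source, target) training pairs for one shared model.
    return [(tag_for_target(s, lang), t) for s, t, lang in pairs]

# The same English source can map to several targets in one corpus:
corpus = prepare_corpus([
    ("good morning", "guten Morgen", "deu"),
    ("good morning", "bonjour", "fra"),
])
```

At inference time the same tag selects the output language, which is what makes a single checkpoint usable for many translation directions.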
>> > > > I am interested in hearing what others are thinking.
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> >
>> https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
>> > > > [2] -
>> https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
>> > > > [3] - https://github.com/thammegowda/mtdata
>> > > > [4] -
>> > https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> > > >
>> > > >
>> > > > Cheers,
>> > > > TG
>> > > >
>> > > > --
>> > > > *Thamme Gowda *
>> > > > @thammegowda <https://twitter.com/thammegowda> |
>> https://isi.edu/~tg
>> > > > ~Sent via somebody's Webmail server
>> > > >
>> > > >
>> > > > On Mon, Oct 5, 2020 at 12:16 AM Tommaso Teofili <
>> > > > tommaso.teofili@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > This is a roll call for people interested in contributing to
>> Apache
>> > > > Joshua
>> > > > > going forward.
>> > > > > Contributions need not be just code, but anything that may help
>> > > > > the project or serve the community.
>> > > > >
>> > > > > In case you're interested in helping out, please speak up :-)
>> > > > >
>> > > > > Code-wise, Joshua has not evolved much in the last few months;
>> > > > > there's room both for improvements to the current code (to make a
>> > > > > new minor release) and for new ideas / code branches (e.g. a neural
>> > > > > MT based Joshua Decoder).
>> > > > >
>> > > > > Regards,
>> > > > > Tommaso
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: NMT survey (was: Roll Call)

Posted by Tommaso Teofili <to...@gmail.com>.
Hi everyone,

following up on this topic, how about performing a shared evaluation of the
tools we mentioned so far?
I'd address this by deciding upon a "shared MT task" on a well-known and
not-so-big dataset and then getting an evaluation run for each of those
toolkits.
The evaluation task would require getting:
- an accuracy metric value (BLEU? I know it's questionable; otherwise, what
else?)
- a prediction speed measure (e.g. translations per second), also reporting
the hardware used
- a training speed measure (e.g. seconds/minutes/hours taken to train the
model)
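
Such a run could be scripted with a small harness like the sketch below. The toy single-reference BLEU is for illustration only (a real evaluation should use a standard scorer such as sacreBLEU), and the `translate` callable stands in for whichever toolkit is being tested:

```python
import math
import time
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    # Plain single-reference corpus BLEU with brevity penalty; a real
    # evaluation should use a standard scorer (e.g. sacreBLEU) instead.
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for hyp, ref in zip(hypotheses, references):
            h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
            matched += sum(min(c, r[g]) for g, c in h.items())
            total += sum(h.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(sum(log_precisions) / max_n)

def evaluate(translate, sources, references):
    # Times the decode loop to report translations/second alongside the
    # accuracy metric, as proposed above.
    start = time.perf_counter()
    hypotheses = [translate(s) for s in sources]
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {"bleu": corpus_bleu(hypotheses, references),
            "translations_per_second": len(sources) / elapsed}

# Toy check with an identity "translator": BLEU against itself is 1.0.
sentences = ["the cat sat on the mat", "joshua decodes sentences quickly"]
report = evaluate(lambda s: s, sentences, sentences)
```

Training time would be measured the same way around each toolkit's training entry point, with the hardware noted alongside both numbers.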

What do others think?

Regards,
Tommaso


On Wed, 21 Oct 2020 at 15:57, Tommaso Teofili <to...@gmail.com>
wrote:

> Hi Michael,
>
> nice to hear from you too on the dev@ list! We're looking forward to
> seeing you involved :)
> If I understood Thamme's proposal correctly, the paper is just a way to
> write down our own evaluation of current approaches to NMT; that would help
> us decide on our own way to pursue MT.
> At this stage I am not sure what we'll end up doing; it'd be nice not to
> just be a wrapper for one of those existing NMT tools, but let's see.
>
> Regards,
> Tommaso
>
>
> On Tue, 20 Oct 2020 at 15:37, Michael Wall <mj...@apache.org> wrote:
>
>> Hi,
>>
>> I've been watching Joshua since it was incubating. I finally may have
>> some free time and would like to get involved.
>>
>> The NMT stuff looks interesting.  I don't have an Overleaf account, so
>> maybe my next question is answered there.  What is the end result of
>> the paper?  Will you be choosing a framework to add to Joshua?  And if
>> so, what will make it different from just using said framework on its
>> own?
>>
>> Thanks
>>
>> Mike
>>
>> On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili
>> <to...@gmail.com> wrote:
>> >
>> > I've also added M2M-100 from FB-AI [1].
>> >
>> > Regarding desiderata, here's an unsorted list of the first things that
>> > come to my mind:
>> > - runs also on the JVM
>> > - low resource requirements (e.g. for training)
>> > - can leverage existing / pretrained models
>> > - word- and phrase-translation capabilities
>> > - good effectiveness :)
>> >
>> > Regards,
>> > Tommaso
>> >
>> > [1] :
>> >
>> https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
>> >
>> > On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <
>> tommaso.teofili@gmail.com>
>> > wrote:
>> >
>> > > Thanks a lot Thamme, I stuck to AL-2 compatible ones, but I agree we
>> > > can surely have a look at others with different licensing too.
>> > > In the meantime I've added all of your suggestions to the paper (with
>> > > related references when available).
>> > > We should decide what our desiderata are and establish a first set of
>> > > eval benchmarks just to understand what can work for us, at least
>> > > initially; then we can have a more thorough evaluation with a small
>> > > number of "candidates".
>> > >
>> > > Regards,
>> > > Tommaso
>> > >
>> > > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:
>> > >
>> > >> Tommaso,
>> > >>
>> > >> Awesome! Thanks for the links.
>> > >> I will be happy to join (but I won't be able to contribute to the
>> > >> actual paper until Oct 24).
>> > >>
>> > >> I suggest we consider popular NMT toolkits for the survey regardless
>> > >> of their compatibility with AL-2.
>> > >> We should see all the tricks and features, and know if we are
>> missing out
>> > >> on any useful features after enforcing the AL-2 filter (and create
>> issues
>> > >> for adding those features).
>> > >>
>> > >> Here are some more NMT toolkits to be included in the survey:
>> > >> - Fairseq https://github.com/pytorch/fairseq
>> > >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
>> > >> - Nematus  https://github.com/EdinburghNLP/nematus
>> > >> - xNMT https://github.com/neulab/xnmt
>> > >> - XLM   https://github.com/facebookresearch/XLM/
>> > >>     |-> MASS  https://github.com/microsoft/MASS/  -->
>> > >> https://github.com/thammegowda/unmass  (I took MASS and made it
>> > >> easier to install and use)
>> > >>
>> > >> Some older toolkits which we are definitely not going to use, but
>> > >> worth mentioning in the survey (for the sake of completeness):
>> > >> - https://github.com/google/seq2seq
>> > >> - https://github.com/tensorflow/nmt
>> > >> - https://github.com/isi-nlp/Zoph_RNN
>> > >>
>> > >>
>> > >>
>> > >> Cheers,
>> > >> TG

Re: NMT survey (was: Roll Cal)

Posted by Tommaso Teofili <to...@gmail.com>.
hi Michael,

nice to hear from you too on the dev@ list! We're looking forward to see
you involved :)
If I understood Thamme's proposal correctly, the paper is just a way to
write down our own evaluation of current approaches to NMT; that would help
us decide on our own way to pursue MT.
At this stage I am not sure what we'll end up doing, it'd be nice not to
just be a wrapper for one of those existing NMT tools, but let's see.

Regards,
Tommaso


On Tue, 20 Oct 2020 at 15:37, Michael Wall <mj...@apache.org> wrote:

> Hi,
>
> Been watching Joshua since it was incubating.  Finally may have some
> free time and am would like to get involved.
>
> The NMT stuff looks interesting.  I don't have an overleaf account, so
> maybe my next question is answered there.  What is the end result of
> the paper?  Will you be choosing a framework to add to Joshua.  And if
> so, what will make it different than just using said framework on it's
> own?
>
> Thanks
>
> Mike
>
> On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili
> <to...@gmail.com> wrote:
> >
> > I've also added M2M-100 from FB-AI [1].
> >
> > Regarding desiderata, here's an unsorted list of first things that come
> to
> > my mind:
> > - runs also on jvm
> > - low resource requirements (e.g. for training)
> > - can leverage existing / pretrained models
> > - word and phrase translation capabilities
> > - good effectiveness :)
> >
> > Regards,
> > Tommaso
> >
> > [1] :
> >
> https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
> >
> > On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <tommaso.teofili@gmail.com
> >
> > wrote:
> >
> > > Thanks a lot Thamme, I sticked to AL-2 compatible ones, but I agree we
> can
> > > surely have a look at others having different licensing too.
> > > In the meantime I've added all of your suggestions to the paper (with
> > > related reference when available).
> > > We should decide what our desiderata are and establish a first set of
> eval
> > > benchmark just to understand what can work for us, at least initially,
> then
> > > we can have a more thorough evaluation with a small number of
> "candidates".
> > >
> > > Regards,
> > > Tommaso
> > >
> > > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:
> > >
> > >> Tomaso,
> > >>
> > >> Awesome! Thanks for the links.
> > >> I will be happy to join, (But I wont be able to contribute to the
> actual
> > >> paper untill Oct 24).
> > >>
> > >> I suggest we should consider popular NMT toolkits for the survey
> > >> regardless
> > >> of their compatibility with AL-2.
> > >> We should see all the tricks and features, and know if we are missing
> out
> > >> on any useful features after enforcing the AL-2 filter (and create
> issues
> > >> for adding those features).
> > >>
> > >> here are some more NMT toolkits to be included in the survey.
> > >> - Fairseq https://github.com/pytorch/fairseq
> > >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
> > >> - Nematus  https://github.com/EdinburghNLP/nematus
> > >> - xNMT https://github.com/neulab/xnmt
> > >> - XLM   https://github.com/facebookresearch/XLM/
> > >>     |-> MASS  https://github.com/microsoft/MASS/  -->
> > >> https://github.com/thammegowda/unmass  (took that and made it easier
> to
> > >> install and use)
> > >>
> > >> Some old stuff which we are defnitely not going to use but worth
> > >> mentioning
> > >> in the survey (for the sake of completion)
> > >> - https://github.com/google/seq2seq
> > >> - https://github.com/tensorflow/nmt
> > >> - https://github.com/isi-nlp/Zoph_RNN
> > >>
> > >>
> > >>
> > >> Cheers,
> > >> TG
> > >>
> > >>
> > >> ಭಾನು, ಅಕ್ಟೋ 18, 2020 ರಂದು 11:17 ಅಪರಾಹ್ನ ಸಮಯಕ್ಕೆ ರಂದು Tommaso Teofili <
> > >> tommaso.teofili@gmail.com> ಅವರು ಬರೆದಿದ್ದಾರೆ:
> > >>
> > >> > Following up on the report topic, I've created an overleaf doc for
> > >> everyone
> > >> > who's interested in working on this [1].
> > >> >
> > >> > First set of (AL-2 compatible) NMT toolkits I've found:
> > >> > - Joey NMT [2]
> > >> > - OpenNMT [3]
> > >> > - MarianNMT [4]
> > >> > - Sockeye [5]
> > >> > - and of course RTG already shared by Thamme [6]
> > >> >
> > >> > Regards,
> > >> > Tommaso
> > >> >
> > >> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> > >> > [2] : https://github.com/joeynmt/joeynmt
> > >> > [3] : https://github.com/OpenNMT
> > >> > [4] : https://github.com/marian-nmt/marian
> > >> > [5] : https://github.com/awslabs/sockeye
> > >> > [6] : https://github.com/isi-nlp/rtg-xt
> > >> >
> > >> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <
> > >> tommaso.teofili@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > very good idea Thamme!
> > >> > > I'd be up for writing such a short survey paper as a result of our
> > >> > > analysis.
> > >> > >
> > >> > > Regards,
> > >> > > Tommaso
> > >> > >
> > >> > >
> > >> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com>
> wrote:
> > >> > >
> > >> > >> Tomasso and others,
> > >> > >>
> > >> > >> > I think we may now go into a research phase to understand what
> > >> > existing
> > >> > >> toolkit we can more easily integrate with.
> > >> > >> Agreed.
> > >> > >> if we can write a (short) report that compares various NMT
> toolkits
> > >> of
> > >> > >> 2020, it would be useful for us to make this decision as well as
> to
> > >> the
> > >> > >> NMT
> > >> > >> community.
> > >> > >> Something like a survey paper on NMT research but focus on
> toolkits
> > >> and
> > >> > >> software engineering part.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> ಶುಕ್ರ, ಅಕ್ಟೋ 9, 2020 ರಂದು 11:39 ಅಪರಾಹ್ನ ಸಮಯಕ್ಕೆ ರಂದು Tommaso
> Teofili
> > >> <
> > >> > >> tommaso.teofili@gmail.com> ಅವರು ಬರೆದಿದ್ದಾರೆ:
> > >> > >>
> > >> > >> > Thamme, Jeff,
> > >> > >> >
> > >> > >> > your contributions will be very important for the project and
> the
> > >> > >> > community, especially given your NLP background, thanks for
> your
> > >> > >> support!
> > >> > >> >
> > >> > >> > I agree moving towards NMT is the best thing to do at this
> point
> > >> for
> > >> > >> > Joshua.
> > >> > >> >
> > >> > >> > Thamme, thanks for your suggestions!
> > >> > >> > I think we may now go into a research phase to understand what
> > >> > existing
> > >> > >> > toolkit we can more easily integrate with.
> > >> > >> > Of course if you like to integrate your own toolkit then
> that'd be
> > >> > even
> > >> > >> > more interesting to see how it compares to others.
> > >> > >> >
> > >> > >> > If that means moving to Python I think it's not a problem; we
> can
> > >> > still
> > >> > >> > work on Java bindings to ship a new Joshua Decoder
> implementation.
> > >> > >> >
> > >> > >> > The pretrained models topic is imho something we will have to
> > >> embrace
> > >> > at
> > >> > >> > some point, so that others can:
> > >> > >> > a) just download new LPs
> > >> > >> > b) eventually fine tune with their own data
> > >> > >> >
> > >> > >> > I am looking forward to start this new phase of research on
> Joshua.
> > >> > >> >
> > >> > >> > Regards,
> > >> > >> > Tommaso
> > >> > >> >
> > >> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <
> jzemerick@apache.org>
> > >> > >> wrote:
> > >> > >> >
> > >> > >> > > I haven't contributed to this point but I would like to see
> > >> Apache
> > >> > >> Joshua
> > >> > >> > > remain an active project so I am volunteering to help. I may
> not
> > >> be
> > >> > a
> > >> > >> lot
> > >> > >> > > of help with code for a bit but I will help out with
> > >> documentation,
> > >> > >> > > releases, etc.
> > >> > >> > >
> > >> > >> > > I do agree that NMT is the best path forward but I will
> leave the
> > >> > >> choice
> > >> > >> > of
> > >> > >> > > integrating an existing library into Joshua versus a new NMT
> > >> > >> > implementation
> > >> > >> > > in Joshua to those more familiar with the code and what they
> > >> think
> > >> > is
> > >> > >> > best
> > >> > >> > > for the project.
> > >> > >> > >
> > >> > >> > > Jeff
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <
> tgowdan@gmail.com>
> > >> > >> wrote:
> > >> > >> > >
> > >> > >> > > > Hi Tomasso, and others
> > >> > >> > > >
> > >> > >> > > > *1.  I support the addition of neural MT decoder. *
> > >> > >> > > > The world has moved on, and NMT is clearly the way to go
> > >> forward.
> > >> > >> > > > If you dont believe my words, read what Matt Post himself
> said
> > >> [1]
> > >> > >> > three
> > >> > >> > > > years ago!
> > >> > >> > > >
> > >> > >> > > > I have spent the past three years focusing on NMT  as part
> of
> > >> my
> > >> > job
> > >> > >> > and
> > >> > >> > > > Ph.D -- I'd be glad to contribute in that direction.
> > >> > >> > > > There are many NMT toolkits out there today. (Fairseq,
> sockeye,
> > >> > >> > > > tensor2tensor, ....)
> > >> > >> > > >
> > >> > >> > > > The right thing to do, IMHO, is simply merge one of the NMT
> > >> > toolkits
> > >> > >> > with
> > >> > >> > > > Joshua project.  We can do that as long as it's Apache
> License
> > >> > >> right?
> > >> > >> > > > We will now have to move towards python land as most
> toolkits
> > >> are
> > >> > in
> > >> > >> > > > python. On the positive side, we will be losing the ancient
> > >> perl
> > >> > >> > scripts
> > >> > >> > > > that many are not fan of.
> > >> > >> > > >
> > >> > >> > > > I have been working on my own NMT toolkit for my work and
> > >> research
> > >> > >> --
> > >> > >> > > RTG
> > >> > >> > > > https://isi-nlp.github.io/rtg/#conf
> > >> > >> > > > I had worked on Joshua in the past, mainly, I improved the
> code
> > >> > >> quality
> > >> > >> > > > [2]. So you can tell my new code'd be upto Apache's
> standards
> > >> ;)
> > >> > >> > > >
> > >> > >> > > > *2. Pretrained MT models for lots of languages*
> > >> > >> > > > I have been working on a lib to retrieve parallel data from
> > >> many
> > >> > >> > sources
> > >> > >> > > --
> > >> > >> > > > MTData [3]
> > >> > >> > > > There is so much parallel data out their for hundreds of
> > >> > languages.
> > >> > >> > > > My recent estimate is over a billion lines of parallel
> > >> sentences
> > >> > for
> > >> > >> > over
> > >> > >> > > > 500 languages is freely and publicly available for download
> > >> using
> > >> > >> > MTData
> > >> > >> > > > tool.
> > >> > >> > > > If we find some sponsors to lend us some resources, we
> could
> > >> train
> > >> > >> > better
> > >> > >> > > > MT models and update the Language Packs section [4].
> > >> > >> > > > Perhaps, one massively multilingual NMT model that supports
> > >> many
> > >> > >> > > > translation directions (I know its possible with NMT; I
> tested
> > >> it
> > >> > >> > > recently
> > >> > >> > > > with RTG)
> > >> > >> > > >
> > >> > >> > > > I am interested in hearing what others are thinking.
> > >> > >> > > >
> > >> > >> > > > [1]
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> > >> > >> > > > [2] -
> > >> > >> https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> > >> > >> > > > [3] - https://github.com/thammegowda/mtdata
> > >> > >> > > > [4] -
> > >> > >> >
> https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > Cheers,
> > >> > >> > > > TG
> > >> > >> > > >
> > >> > >> > > > --
> > >> > >> > > > *Thamme Gowda *
> > >> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
> > >> > >> https://isi.edu/~tg
> > >> > >> > > > ~Sent via somebody's Webmail server
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > > > On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <
> > >> > >> > > > tommaso.teofili@gmail.com> wrote:
> > >> > >> > > >
> > >> > >> > > > > Hi all,
> > >> > >> > > > >
> > >> > >> > > > > This is a roll call for people interested in
> contributing to
> > >> > >> Apache
> > >> > >> > > > Joshua
> > >> > >> > > > > going forward.
> > >> > >> > > > > Contributing could be not just code, but anything that
> may
> > >> help
> > >> > >> the
> > >> > >> > > > project
> > >> > >> > > > > or serve the community.
> > >> > >> > > > >
> > >> > >> > > > > In case you're interested in helping out please speak up
> :-)
> > >> > >> > > > >
> > >> > >> > > > > Code-wise Joshua has not evolved much in the latest
> months,
> > >> > >> there's
> > >> > >> > > room
> > >> > >> > > > > for both improvements to the current code (make a new
> minor
> > >> > >> release)
> > >> > >> > > and
> > >> > >> > > > > new ideas / code branches (e.g. neural MT based Joshua
> > >> Decoder).
> > >> > >> > > > >
> > >> > >> > > > > Regards,
> > >> > >> > > > > Tommaso
> > >> > >> > > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> >
> > >>
> > >
>

Re: NMT survey (was: Roll Cal)

Posted by Michael Wall <mj...@apache.org>.
Hi,

I've been watching Joshua since it was incubating.  I may finally have
some free time and would like to get involved.

The NMT stuff looks interesting.  I don't have an Overleaf account, so
maybe my next question is answered there.  What is the end result of
the paper?  Will you be choosing a framework to add to Joshua?  And if
so, what will make it different from just using said framework on its
own?

Thanks

Mike

On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili
<to...@gmail.com> wrote:
>
> I've also added M2M-100 from FB-AI [1].
>
> Regarding desiderata, here's an unsorted list of first things that come to
> my mind:
> - runs also on jvm
> - low resource requirements (e.g. for training)
> - can leverage existing / pretrained models
> - word and phrase translation capabilities
> - good effectiveness :)
>
> Regards,
> Tommaso
>
> [1] :
> https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
>
> On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <to...@gmail.com>
> wrote:
>
> > Thanks a lot Thamme, I sticked to AL-2 compatible ones, but I agree we can
> > surely have a look at others having different licensing too.
> > In the meantime I've added all of your suggestions to the paper (with
> > related reference when available).
> > We should decide what our desiderata are and establish a first set of eval
> > benchmark just to understand what can work for us, at least initially, then
> > we can have a more thorough evaluation with a small number of "candidates".
> >
> > Regards,
> > Tommaso
> >
> > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:
> >
> >> Tomaso,
> >>
> >> Awesome! Thanks for the links.
> >> I will be happy to join, (But I wont be able to contribute to the actual
> >> paper untill Oct 24).
> >>
> >> I suggest we should consider popular NMT toolkits for the survey
> >> regardless
> >> of their compatibility with AL-2.
> >> We should see all the tricks and features, and know if we are missing out
> >> on any useful features after enforcing the AL-2 filter (and create issues
> >> for adding those features).
> >>
> >> here are some more NMT toolkits to be included in the survey.
> >> - Fairseq https://github.com/pytorch/fairseq
> >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
> >> - Nematus  https://github.com/EdinburghNLP/nematus
> >> - xNMT https://github.com/neulab/xnmt
> >> - XLM   https://github.com/facebookresearch/XLM/
> >>     |-> MASS  https://github.com/microsoft/MASS/  -->
> >> https://github.com/thammegowda/unmass  (took that and made it easier to
> >> install and use)
> >>
> >> Some old stuff which we are defnitely not going to use but worth
> >> mentioning
> >> in the survey (for the sake of completion)
> >> - https://github.com/google/seq2seq
> >> - https://github.com/tensorflow/nmt
> >> - https://github.com/isi-nlp/Zoph_RNN
> >>
> >>
> >>
> >> Cheers,
> >> TG
> >>
> >>
> >> On Sun, Oct 18, 2020 at 11:17 PM, Tommaso Teofili <
> >> tommaso.teofili@gmail.com> wrote:
> >>
> >> > Following up on the report topic, I've created an overleaf doc for
> >> everyone
> >> > who's interested in working on this [1].
> >> >
> >> > First set of (AL-2 compatible) NMT toolkits I've found:
> >> > - Joey NMT [2]
> >> > - OpenNMT [3]
> >> > - MarianNMT [4]
> >> > - Sockeye [5]
> >> > - and of course RTG already shared by Thamme [6]
> >> >
> >> > Regards,
> >> > Tommaso
> >> >
> >> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> >> > [2] : https://github.com/joeynmt/joeynmt
> >> > [3] : https://github.com/OpenNMT
> >> > [4] : https://github.com/marian-nmt/marian
> >> > [5] : https://github.com/awslabs/sockeye
> >> > [6] : https://github.com/isi-nlp/rtg-xt
> >> >
> >> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <
> >> tommaso.teofili@gmail.com>
> >> > wrote:
> >> >
> >> > > very good idea Thamme!
> >> > > I'd be up for writing such a short survey paper as a result of our
> >> > > analysis.
> >> > >
> >> > > Regards,
> >> > > Tommaso
> >> > >
> >> > >
> >> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com> wrote:
> >> > >
> >> > >> Tomasso and others,
> >> > >>
> >> > >> > I think we may now go into a research phase to understand what
> >> > existing
> >> > >> toolkit we can more easily integrate with.
> >> > >> Agreed.
> >> > >> if we can write a (short) report that compares various NMT toolkits
> >> of
> >> > >> 2020, it would be useful for us to make this decision as well as to
> >> the
> >> > >> NMT
> >> > >> community.
> >> > >> Something like a survey paper on NMT research but focus on toolkits
> >> and
> >> > >> software engineering part.
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <
> >> > >> tommaso.teofili@gmail.com> wrote:
> >> > >>
> >> > >> > Thamme, Jeff,
> >> > >> >
> >> > >> > your contributions will be very important for the project and the
> >> > >> > community, especially given your NLP background, thanks for your
> >> > >> support!
> >> > >> >
> >> > >> > I agree moving towards NMT is the best thing to do at this point
> >> for
> >> > >> > Joshua.
> >> > >> >
> >> > >> > Thamme, thanks for your suggestions!
> >> > >> > I think we may now go into a research phase to understand what
> >> > existing
> >> > >> > toolkit we can more easily integrate with.
> >> > >> > Of course if you like to integrate your own toolkit then that'd be
> >> > even
> >> > >> > more interesting to see how it compares to others.
> >> > >> >
> >> > >> > If that means moving to Python I think it's not a problem; we can
> >> > still
> >> > >> > work on Java bindings to ship a new Joshua Decoder implementation.
> >> > >> >
> >> > >> > The pretrained models topic is imho something we will have to
> >> embrace
> >> > at
> >> > >> > some point, so that others can:
> >> > >> > a) just download new LPs
> >> > >> > b) eventually fine tune with their own data
> >> > >> >
> >> > >> > I am looking forward to start this new phase of research on Joshua.
> >> > >> >
> >> > >> > Regards,
> >> > >> > Tommaso
> >> > >> >
> >> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jz...@apache.org>
> >> > >> wrote:
> >> > >> >
> >> > >> > > I haven't contributed to this point but I would like to see
> >> Apache
> >> > >> Joshua
> >> > >> > > remain an active project so I am volunteering to help. I may not
> >> be
> >> > a
> >> > >> lot
> >> > >> > > of help with code for a bit but I will help out with
> >> documentation,
> >> > >> > > releases, etc.
> >> > >> > >
> >> > >> > > I do agree that NMT is the best path forward but I will leave the
> >> > >> choice
> >> > >> > of
> >> > >> > > integrating an existing library into Joshua versus a new NMT
> >> > >> > implementation
> >> > >> > > in Joshua to those more familiar with the code and what they
> >> think
> >> > is
> >> > >> > best
> >> > >> > > for the project.
> >> > >> > >
> >> > >> > > Jeff
> >> > >> > >
> >> > >> > >
> >> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tg...@gmail.com>
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > Hi Tomasso, and others
> >> > >> > > >
> >> > >> > > > *1.  I support the addition of neural MT decoder. *
> >> > >> > > > The world has moved on, and NMT is clearly the way to go
> >> forward.
> >> > >> > > > If you dont believe my words, read what Matt Post himself said
> >> [1]
> >> > >> > three
> >> > >> > > > years ago!
> >> > >> > > >
> >> > >> > > > I have spent the past three years focusing on NMT  as part of
> >> my
> >> > job
> >> > >> > and
> >> > >> > > > Ph.D -- I'd be glad to contribute in that direction.
> >> > >> > > > There are many NMT toolkits out there today. (Fairseq, sockeye,
> >> > >> > > > tensor2tensor, ....)
> >> > >> > > >
> >> > >> > > > The right thing to do, IMHO, is simply merge one of the NMT
> >> > toolkits
> >> > >> > with
> >> > >> > > > Joshua project.  We can do that as long as it's Apache License
> >> > >> right?
> >> > >> > > > We will now have to move towards python land as most toolkits
> >> are
> >> > in
> >> > >> > > > python. On the positive side, we will be losing the ancient
> >> perl
> >> > >> > scripts
> >> > >> > > > that many are not fan of.
> >> > >> > > >
> >> > >> > > > I have been working on my own NMT toolkit for my work and
> >> research
> >> > >> --
> >> > >> > > RTG
> >> > >> > > > https://isi-nlp.github.io/rtg/#conf
> >> > >> > > > I had worked on Joshua in the past, mainly, I improved the code
> >> > >> quality
> >> > >> > > > [2]. So you can tell my new code'd be upto Apache's standards
> >> ;)
> >> > >> > > >
> >> > >> > > > *2. Pretrained MT models for lots of languages*
> >> > >> > > > I have been working on a lib to retrieve parallel data from
> >> many
> >> > >> > sources
> >> > >> > > --
> >> > >> > > > MTData [3]
> >> > >> > > > There is so much parallel data out their for hundreds of
> >> > languages.
> >> > >> > > > My recent estimate is over a billion lines of parallel
> >> sentences
> >> > for
> >> > >> > over
> >> > >> > > > 500 languages is freely and publicly available for download
> >> using
> >> > >> > MTData
> >> > >> > > > tool.
> >> > >> > > > If we find some sponsors to lend us some resources, we could
> >> train
> >> > >> > better
> >> > >> > > > MT models and update the Language Packs section [4].
> >> > >> > > > Perhaps, one massively multilingual NMT model that supports
> >> many
> >> > >> > > > translation directions (I know its possible with NMT; I tested
> >> it
> >> > >> > > recently
> >> > >> > > > with RTG)
> >> > >> > > >
> >> > >> > > > I am interested in hearing what others are thinking.
> >> > >> > > >
> >> > >> > > > [1]
> >> > >> > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >> https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> >> > >> > > > [2] -
> >> > >> https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> >> > >> > > > [3] - https://github.com/thammegowda/mtdata
> >> > >> > > > [4] -
> >> > >> > https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > Cheers,
> >> > >> > > > TG
> >> > >> > > >
> >> > >> > > > --
> >> > >> > > > *Thamme Gowda *
> >> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
> >> > >> https://isi.edu/~tg
> >> > >> > > > ~Sent via somebody's Webmail server
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <
> >> > >> > > > tommaso.teofili@gmail.com> wrote:
> >> > >> > > >
> >> > >> > > > > Hi all,
> >> > >> > > > >
> >> > >> > > > > This is a roll call for people interested in contributing to
> >> > >> Apache
> >> > >> > > > Joshua
> >> > >> > > > > going forward.
> >> > >> > > > > Contributing could be not just code, but anything that may
> >> help
> >> > >> the
> >> > >> > > > project
> >> > >> > > > > or serve the community.
> >> > >> > > > >
> >> > >> > > > > In case you're interested in helping out please speak up :-)
> >> > >> > > > >
> >> > >> > > > > Code-wise Joshua has not evolved much in the latest months,
> >> > >> there's
> >> > >> > > room
> >> > >> > > > > for both improvements to the current code (make a new minor
> >> > >> release)
> >> > >> > > and
> >> > >> > > > > new ideas / code branches (e.g. neural MT based Joshua
> >> Decoder).
> >> > >> > > > >
> >> > >> > > > > Regards,
> >> > >> > > > > Tommaso
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
> >

Re: NMT survey (was: Roll Cal)

Posted by Tommaso Teofili <to...@gmail.com>.
I've also added M2M-100 from FB-AI [1].

Regarding desiderata, here's an unsorted list of the first things that come
to mind:
- runs on the JVM as well
- low resource requirements (e.g. for training)
- can leverage existing / pretrained models
- word- and phrase-level translation capabilities
- good effectiveness :)

Regards,
Tommaso

[1] :
https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
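On the "runs on the JVM" point: even if we end up picking a Python toolkit, a
JVM-based Joshua decoder could drive it over a plain subprocess pipe. Here's a
minimal, hypothetical sketch of the Python side; translate() is just a
placeholder standing in for a call into whichever toolkit we choose:

```python
# Hypothetical sketch (not Joshua code): a line-oriented stdin/stdout
# protocol that would let a JVM-based Joshua front end drive a Python
# NMT toolkit as a subprocess.
import sys


def translate(line: str) -> str:
    # Placeholder model: a real bridge would call into the chosen
    # NMT toolkit here instead of upper-casing the input.
    return line.upper()


def serve(stream_in, stream_out) -> None:
    # One source sentence per input line, one translation per output line;
    # flushing after every line keeps the JVM side from blocking on reads.
    for line in stream_in:
        stream_out.write(translate(line.rstrip("\n")) + "\n")
        stream_out.flush()


if __name__ == "__main__":
    serve(sys.stdin, sys.stdout)
```

The Java side would then only need to spawn this script (e.g. via
ProcessBuilder) and read/write lines over the pipe.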

On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <to...@gmail.com>
wrote:

> Thanks a lot Thamme, I sticked to AL-2 compatible ones, but I agree we can
> surely have a look at others having different licensing too.
> In the meantime I've added all of your suggestions to the paper (with
> related reference when available).
> We should decide what our desiderata are and establish a first set of eval
> benchmark just to understand what can work for us, at least initially, then
> we can have a more thorough evaluation with a small number of "candidates".
>
> Regards,
> Tommaso
>
> On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:
>
>> Tomaso,
>>
>> Awesome! Thanks for the links.
>> I will be happy to join, (But I wont be able to contribute to the actual
>> paper untill Oct 24).
>>
>> I suggest we should consider popular NMT toolkits for the survey
>> regardless
>> of their compatibility with AL-2.
>> We should see all the tricks and features, and know if we are missing out
>> on any useful features after enforcing the AL-2 filter (and create issues
>> for adding those features).
>>
>> here are some more NMT toolkits to be included in the survey.
>> - Fairseq https://github.com/pytorch/fairseq
>> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
>> - Nematus  https://github.com/EdinburghNLP/nematus
>> - xNMT https://github.com/neulab/xnmt
>> - XLM   https://github.com/facebookresearch/XLM/
>>     |-> MASS  https://github.com/microsoft/MASS/  -->
>> https://github.com/thammegowda/unmass  (took that and made it easier to
>> install and use)
>>
>> Some old stuff which we are defnitely not going to use but worth
>> mentioning
>> in the survey (for the sake of completion)
>> - https://github.com/google/seq2seq
>> - https://github.com/tensorflow/nmt
>> - https://github.com/isi-nlp/Zoph_RNN
>>
>>
>>
>> Cheers,
>> TG
>>
>>
>> On Sun, Oct 18, 2020 at 11:17 PM, Tommaso Teofili <
>> tommaso.teofili@gmail.com> wrote:
>>
>> > Following up on the report topic, I've created an overleaf doc for
>> everyone
>> > who's interested in working on this [1].
>> >
>> > First set of (AL-2 compatible) NMT toolkits I've found:
>> > - Joey NMT [2]
>> > - OpenNMT [3]
>> > - MarianNMT [4]
>> > - Sockeye [5]
>> > - and of course RTG already shared by Thamme [6]
>> >
>> > Regards,
>> > Tommaso
>> >
>> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
>> > [2] : https://github.com/joeynmt/joeynmt
>> > [3] : https://github.com/OpenNMT
>> > [4] : https://github.com/marian-nmt/marian
>> > [5] : https://github.com/awslabs/sockeye
>> > [6] : https://github.com/isi-nlp/rtg-xt
>> >
>> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <
>> tommaso.teofili@gmail.com>
>> > wrote:
>> >
>> > > very good idea Thamme!
>> > > I'd be up for writing such a short survey paper as a result of our
>> > > analysis.
>> > >
>> > > Regards,
>> > > Tommaso
>> > >
>> > >
>> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com> wrote:
>> > >
>> > >> Tomasso and others,
>> > >>
>> > >> > I think we may now go into a research phase to understand what
>> > existing
>> > >> toolkit we can more easily integrate with.
>> > >> Agreed.
>> > >> if we can write a (short) report that compares various NMT toolkits
>> of
>> > >> 2020, it would be useful for us to make this decision as well as to
>> the
>> > >> NMT
>> > >> community.
>> > >> Something like a survey paper on NMT research but focus on toolkits
>> and
>> > >> software engineering part.
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <
>> > >> tommaso.teofili@gmail.com> wrote:
>> > >>
>> > >> > Thamme, Jeff,
>> > >> >
>> > >> > your contributions will be very important for the project and the
>> > >> > community, especially given your NLP background, thanks for your
>> > >> support!
>> > >> >
>> > >> > I agree moving towards NMT is the best thing to do at this point
>> for
>> > >> > Joshua.
>> > >> >
>> > >> > Thamme, thanks for your suggestions!
>> > >> > I think we may now go into a research phase to understand what
>> > existing
>> > >> > toolkit we can more easily integrate with.
>> > >> > Of course if you like to integrate your own toolkit then that'd be
>> > even
>> > >> > more interesting to see how it compares to others.
>> > >> >
>> > >> > If that means moving to Python I think it's not a problem; we can
>> > still
>> > >> > work on Java bindings to ship a new Joshua Decoder implementation.
>> > >> >
>> > >> > The pretrained models topic is imho something we will have to
>> embrace
>> > at
>> > >> > some point, so that others can:
>> > >> > a) just download new LPs
>> > >> > b) eventually fine tune with their own data
>> > >> >
>> > >> > I am looking forward to start this new phase of research on Joshua.
>> > >> >
>> > >> > Regards,
>> > >> > Tommaso
>> > >> >
>> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jz...@apache.org>
>> > >> wrote:
>> > >> >
>> > >> > > I haven't contributed to this point but I would like to see
>> Apache
>> > >> Joshua
>> > >> > > remain an active project so I am volunteering to help. I may not
>> be
>> > a
>> > >> lot
>> > >> > > of help with code for a bit but I will help out with
>> documentation,
>> > >> > > releases, etc.
>> > >> > >
>> > >> > > I do agree that NMT is the best path forward but I will leave the
>> > >> choice
>> > >> > of
>> > >> > > integrating an existing library into Joshua versus a new NMT
>> > >> > implementation
>> > >> > > in Joshua to those more familiar with the code and what they
>> think
>> > is
>> > >> > best
>> > >> > > for the project.
>> > >> > >
>> > >> > > Jeff
>> > >> > >
>> > >> > >
>> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tg...@gmail.com>
>> > >> wrote:
>> > >> > >
>> > >> > > > Hi Tomasso, and others
>> > >> > > >
>> > >> > > > *1.  I support the addition of neural MT decoder. *
>> > >> > > > The world has moved on, and NMT is clearly the way to go
>> forward.
>> > >> > > > If you dont believe my words, read what Matt Post himself said
>> [1]
>> > >> > three
>> > >> > > > years ago!
>> > >> > > >
>> > >> > > > I have spent the past three years focusing on NMT  as part of
>> my
>> > job
>> > >> > and
>> > >> > > > Ph.D -- I'd be glad to contribute in that direction.
>> > >> > > > There are many NMT toolkits out there today. (Fairseq, sockeye,
>> > >> > > > tensor2tensor, ....)
>> > >> > > >
>> > >> > > > The right thing to do, IMHO, is simply merge one of the NMT
>> > toolkits
>> > >> > with
>> > >> > > > Joshua project.  We can do that as long as it's Apache License
>> > >> right?
>> > >> > > > We will now have to move towards python land as most toolkits
>> are
>> > in
>> > >> > > > python. On the positive side, we will be losing the ancient
>> perl
>> > >> > scripts
>> > >> > > > that many are not fan of.
>> > >> > > >
>> > >> > > > I have been working on my own NMT toolkit for my work and
>> research
>> > >> --
>> > >> > > RTG
>> > >> > > > https://isi-nlp.github.io/rtg/#conf
>> > >> > > > I had worked on Joshua in the past, mainly, I improved the code
>> > >> quality
>> > >> > > > [2]. So you can tell my new code'd be upto Apache's standards
>> ;)
>> > >> > > >
>> > >> > > > *2. Pretrained MT models for lots of languages*
>> > >> > > > I have been working on a lib to retrieve parallel data from
>> many
>> > >> > sources
>> > >> > > --
>> > >> > > > MTData [3]
>> > >> > > > There is so much parallel data out their for hundreds of
>> > languages.
>> > >> > > > My recent estimate is over a billion lines of parallel
>> sentences
>> > for
>> > >> > over
>> > >> > > > 500 languages is freely and publicly available for download
>> using
>> > >> > MTData
>> > >> > > > tool.
>> > >> > > > If we find some sponsors to lend us some resources, we could
>> train
>> > >> > better
>> > >> > > > MT models and update the Language Packs section [4].
>> > >> > > > Perhaps, one massively multilingual NMT model that supports
>> many
>> > >> > > > translation directions (I know its possible with NMT; I tested
>> it
>> > >> > > recently
>> > >> > > > with RTG)
>> > >> > > >
>> > >> > > > I am interested in hearing what others are thinking.
>> > >> > > >
>> > >> > > > [1]
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
>> > >> > > > [2] -
>> > >> https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
>> > >> > > > [3] - https://github.com/thammegowda/mtdata
>> > >> > > > [4] -
>> > >> > https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> > >> > > >
>> > >> > > >
>> > >> > > > Cheers,
>> > >> > > > TG
>> > >> > > >
>> > >> > > > --
>> > >> > > > *Thamme Gowda *
>> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
>> > >> https://isi.edu/~tg
>> > >> > > > ~Sent via somebody's Webmail server
>> > >> > > >
>> > >> > > >
>> > >> > > > On Mon, Oct 5, 2020 at 12:16 AM, Tommaso Teofili <
>> > >> > > > tommaso.teofili@gmail.com> wrote:
>> > >> > > >
>> > >> > > > > Hi all,
>> > >> > > > >
>> > >> > > > > This is a roll call for people interested in contributing to
>> > >> Apache
>> > >> > > > Joshua
>> > >> > > > > going forward.
>> > >> > > > > Contributing could be not just code, but anything that may
>> help
>> > >> the
>> > >> > > > project
>> > >> > > > > or serve the community.
>> > >> > > > >
>> > >> > > > > In case you're interested in helping out please speak up :-)
>> > >> > > > >
>> > >> > > > > Code-wise Joshua has not evolved much in the latest months,
>> > >> there's
>> > >> > > room
>> > >> > > > > for both improvements to the current code (make a new minor
>> > >> release)
>> > >> > > and
>> > >> > > > > new ideas / code branches (e.g. neural MT based Joshua
>> Decoder).
>> > >> > > > >
>> > >> > > > > Regards,
>> > >> > > > > Tommaso
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>

Re: NMT survey (was: Roll Cal)

Posted by Tommaso Teofili <to...@gmail.com>.
Thanks a lot Thamme, I stuck to AL-2-compatible ones, but I agree we can
surely have a look at others with different licensing too.
In the meantime I've added all of your suggestions to the paper (with the
related reference when available).
We should decide what our desiderata are and establish a first set of eval
benchmarks just to understand what can work for us, at least initially; then
we can run a more thorough evaluation on a small number of "candidates".
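To make the eval-benchmark idea concrete, here's a toy sketch of what a first
comparison harness could look like. The metric is a stand-in token-overlap F1
and the toolkit outputs are made-up placeholders; a real benchmark would swap
in BLEU (e.g. via sacrebleu) on a standard test set:

```python
# Toy sketch of a "first eval benchmark" harness for comparing candidate
# toolkits on a shared test set. Metric and data are placeholders only.
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between one hypothesis and one reference."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)


def score_candidate(hypotheses, references) -> float:
    """Average sentence-level score of one toolkit's output on the test set."""
    assert len(hypotheses) == len(references)
    return sum(token_f1(h, r) for h, r in zip(hypotheses, references)) / len(references)


# Hypothetical outputs from two candidate toolkits against shared references.
refs = ["the cat sat on the mat", "hello world"]
candidates = {
    "toolkit-a": ["the cat sat on the mat", "hello there world"],
    "toolkit-b": ["a cat is on a mat", "hi world"],
}
for name, hyps in candidates.items():
    print(name, round(score_candidate(hyps, refs), 3))
```

Once we agree on the candidate shortlist, the same loop could just call each
toolkit's decoder and a proper metric instead of these stand-ins.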

Regards,
Tommaso

On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:

> Tomaso,
>
> Awesome! Thanks for the links.
> I will be happy to join, (But I wont be able to contribute to the actual
> paper untill Oct 24).
>
> I suggest we should consider popular NMT toolkits for the survey regardless
> of their compatibility with AL-2.
> We should see all the tricks and features, and know if we are missing out
> on any useful features after enforcing the AL-2 filter (and create issues
> for adding those features).
>
> here are some more NMT toolkits to be included in the survey.
> - Fairseq https://github.com/pytorch/fairseq
> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
> - Nematus  https://github.com/EdinburghNLP/nematus
> - xNMT https://github.com/neulab/xnmt
> - XLM   https://github.com/facebookresearch/XLM/
>     |-> MASS  https://github.com/microsoft/MASS/  -->
> https://github.com/thammegowda/unmass  (took that and made it easier to
> install and use)
>
> Some old stuff which we are defnitely not going to use but worth mentioning
> in the survey (for the sake of completion)
> - https://github.com/google/seq2seq
> - https://github.com/tensorflow/nmt
> - https://github.com/isi-nlp/Zoph_RNN
>
>
>
> Cheers,
> TG
>
>
> On Sun, Oct 18, 2020 at 11:17 PM, Tommaso Teofili <
> tommaso.teofili@gmail.com> wrote:
>
> > Following up on the report topic, I've created an overleaf doc for
> everyone
> > who's interested in working on this [1].
> >
> > First set of (AL-2 compatible) NMT toolkits I've found:
> > - Joey NMT [2]
> > - OpenNMT [3]
> > - MarianNMT [4]
> > - Sockeye [5]
> > - and of course RTG already shared by Thamme [6]
> >
> > Regards,
> > Tommaso
> >
> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> > [2] : https://github.com/joeynmt/joeynmt
> > [3] : https://github.com/OpenNMT
> > [4] : https://github.com/marian-nmt/marian
> > [5] : https://github.com/awslabs/sockeye
> > [6] : https://github.com/isi-nlp/rtg-xt
> >
> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <tommaso.teofili@gmail.com
> >
> > wrote:
> >
> > > very good idea Thamme!
> > > I'd be up for writing such a short survey paper as a result of our
> > > analysis.
> > >
> > > Regards,
> > > Tommaso
> > >
> > >
> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com> wrote:
> > >
> > >> Tomasso and others,
> > >>
> > >> > I think we may now go into a research phase to understand what
> > existing
> > >> toolkit we can more easily integrate with.
> > >> Agreed.
> > >> if we can write a (short) report that compares various NMT toolkits of
> > >> 2020, it would be useful for us to make this decision as well as to
> the
> > >> NMT
> > >> community.
> > >> Something like a survey paper on NMT research but focus on toolkits
> and
> > >> software engineering part.
> > >>
> > >>
> > >>
> > >> On Fri, Oct 9, 2020 at 11:39 PM, Tommaso Teofili <
> > >> tommaso.teofili@gmail.com> wrote:
> > >>
> > >> > Thamme, Jeff,
> > >> >
> > >> > your contributions will be very important for the project and the
> > >> > community, especially given your NLP background, thanks for your
> > >> support!
> > >> >
> > >> > I agree moving towards NMT is the best thing to do at this point for
> > >> > Joshua.
> > >> >
> > >> > Thamme, thanks for your suggestions!
> > >> > I think we may now go into a research phase to understand what
> > existing
> > >> > toolkit we can more easily integrate with.
> > >> > Of course if you like to integrate your own toolkit then that'd be
> > even
> > >> > more interesting to see how it compares to others.
> > >> >
> > >> > If that means moving to Python I think it's not a problem; we can
> > still
> > >> > work on Java bindings to ship a new Joshua Decoder implementation.
> > >> >
> > >> > The pretrained models topic is imho something we will have to
> embrace
> > at
> > >> > some point, so that others can:
> > >> > a) just download new LPs
> > >> > b) eventually fine tune with their own data
> > >> >
> > >> > I am looking forward to start this new phase of research on Joshua.
> > >> >
> > >> > Regards,
> > >> > Tommaso
> > >> >
> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <jz...@apache.org>
> > >> wrote:
> > >> >
> > >> > > I haven't contributed to this point but I would like to see Apache
> > >> Joshua
> > >> > > remain an active project so I am volunteering to help. I may not
> be
> > a
> > >> lot
> > >> > > of help with code for a bit but I will help out with
> documentation,
> > >> > > releases, etc.
> > >> > >
> > >> > > I do agree that NMT is the best path forward but I will leave the
> > >> choice
> > >> > of
> > >> > > integrating an existing library into Joshua versus a new NMT
> > >> > implementation
> > >> > > in Joshua to those more familiar with the code and what they think
> > is
> > >> > best
> > >> > > for the project.
> > >> > >
> > >> > > Jeff
> > >> > >
> > >> > >
> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <tg...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hi Tomasso, and others
> > >> > > >
> > >> > > > *1.  I support the addition of neural MT decoder. *
> > >> > > > The world has moved on, and NMT is clearly the way to go
> forward.
> > >> > > > If you dont believe my words, read what Matt Post himself said
> [1]
> > >> > three
> > >> > > > years ago!
> > >> > > >
> > >> > > > I have spent the past three years focusing on NMT  as part of my
> > job
> > >> > and
> > >> > > > Ph.D -- I'd be glad to contribute in that direction.
> > >> > > > There are many NMT toolkits out there today. (Fairseq, sockeye,
> > >> > > > tensor2tensor, ....)
> > >> > > >
> > >> > > > The right thing to do, IMHO, is simply merge one of the NMT
> > toolkits
> > >> > with
> > >> > > > Joshua project.  We can do that as long as it's Apache License
> > >> right?
> > >> > > > We will now have to move towards python land as most toolkits
> are
> > in
> > >> > > > python. On the positive side, we will be losing the ancient perl
> > >> > scripts
> > >> > > > that many are not fan of.
> > >> > > >
> > >> > > > I have been working on my own NMT toolkit for my work and
> research
> > >> --
> > >> > > RTG
> > >> > > > https://isi-nlp.github.io/rtg/#conf
> > >> > > > I had worked on Joshua in the past, mainly, I improved the code
> > >> quality
> > >> > > > [2]. So you can tell my new code'd be upto Apache's standards ;)
> > >> > > >
> > >> > > > *2. Pretrained MT models for lots of languages*
> > >> > > > I have been working on a lib to retrieve parallel data from many
> > >> > sources
> > >> > > --
> > >> > > > MTData [3]
> > >> > > > There is so much parallel data out their for hundreds of
> > languages.
> > >> > > > My recent estimate is over a billion lines of parallel sentences
> > for
> > >> > over
> > >> > > > 500 languages is freely and publicly available for download
> using
> > >> > MTData
> > >> > > > tool.
> > >> > > > If we find some sponsors to lend us some resources, we could
> train
> > >> > better
> > >> > > > MT models and update the Language Packs section [4].
> > >> > > > Perhaps, one massively multilingual NMT model that supports many
> > >> > > > translation directions (I know its possible with NMT; I tested
> it
> > >> > > recently
> > >> > > > with RTG)
> > >> > > >
> > >> > > > I am interested in hearing what others are thinking.
> > >> > > >
> > >> > > > [1]
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
> > >> > > > [2] -
> > >> https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
> > >> > > > [3] - https://github.com/thammegowda/mtdata
> > >> > > > [4] -
> > >> > https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
> > >> > > >
> > >> > > >
> > >> > > > Cheers,
> > >> > > > TG
> > >> > > >
> > >> > > > --
> > >> > > > *Thamme Gowda *
> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
> > >> https://isi.edu/~tg
> > >> > > > ~Sent via somebody's Webmail server
> > >> > > >
> > >> > > >
> > >> > > > ಸೋಮ, ಅಕ್ಟೋ 5, 2020 ರಂದು 12:16 ಪೂರ್ವಾಹ್ನ ಸಮಯಕ್ಕೆ ರಂದು Tommaso
> > >> Teofili <
> > >> > > > tommaso.teofili@gmail.com> ಅವರು ಬರೆದಿದ್ದಾರೆ:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > This is a roll call for people interested in contributing to
> > >> Apache
> > >> > > > Joshua
> > >> > > > > going forward.
> > >> > > > > Contributing could be not just code, but anything that may
> help
> > >> the
> > >> > > > project
> > >> > > > > or serve the community.
> > >> > > > >
> > >> > > > > In case you're interested in helping out please speak up :-)
> > >> > > > >
> > >> > > > > Code-wise Joshua has not evolved much in the latest months,
> > >> there's
> > >> > > room
> > >> > > > > for both improvements to the current code (make a new minor
> > >> release)
> > >> > > and
> > >> > > > > new ideas / code branches (e.g. neural MT based Joshua
> Decoder).
> > >> > > > >
> > >> > > > > Regards,
> > >> > > > > Tommaso
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: NMT survey (was: Roll Cal)

Posted by Thamme Gowda <tg...@gmail.com>.
Tommaso,

Awesome! Thanks for the links.
I will be happy to join (but I won't be able to contribute to the actual
paper until Oct 24).

I suggest we consider popular NMT toolkits for the survey regardless of
their compatibility with AL-2.
That way we can see all the tricks and features, and know whether we are
missing out on any useful features after enforcing the AL-2 filter (and
create issues for adding those features).

Here are some more NMT toolkits to be included in the survey:
- Fairseq https://github.com/pytorch/fairseq
- Tensor2tensor https://github.com/tensorflow/tensor2tensor/
- Nematus  https://github.com/EdinburghNLP/nematus
- xNMT https://github.com/neulab/xnmt
- XLM   https://github.com/facebookresearch/XLM/
    |-> MASS  https://github.com/microsoft/MASS/  (I took MASS and made it
easier to install and use: https://github.com/thammegowda/unmass)

Some older toolkits which we are definitely not going to use, but worth
mentioning in the survey for the sake of completeness:
- https://github.com/google/seq2seq
- https://github.com/tensorflow/nmt
- https://github.com/isi-nlp/Zoph_RNN
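
The "survey everything, then apply the AL-2 filter" idea could be tracked
explicitly in the survey itself. A minimal Python sketch of what I mean;
note that the license labels below are my best guesses as of this writing
and must be verified against each repository before we rely on them:

```python
# Candidate NMT toolkits for the survey, tagged with approximate repo
# licenses (verify each one before drawing any conclusions).
TOOLKITS = {
    "joeynmt": "Apache-2.0",
    "opennmt": "MIT",
    "marian": "MIT",
    "sockeye": "Apache-2.0",
    "rtg": "Apache-2.0",
    "fairseq": "MIT",
    "tensor2tensor": "Apache-2.0",
    "nematus": "BSD-3-Clause",
    "xnmt": "Apache-2.0",
    "xlm": "CC-BY-NC-4.0",  # non-commercial clause: not usable under AL-2
}

# Licenses generally considered compatible with inclusion in an ASF project.
AL2_COMPATIBLE = {"Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause"}

def al2_filter(toolkits):
    """Split the survey set into AL-2-usable and survey-only toolkits."""
    usable = {name for name, lic in toolkits.items() if lic in AL2_COMPATIBLE}
    survey_only = set(toolkits) - usable
    return usable, survey_only

usable, survey_only = al2_filter(TOOLKITS)
print("usable:", sorted(usable))            # candidates we could merge
print("survey-only:", sorted(survey_only))  # still compared for features
```

This way the survey compares all toolkits on features, and the filter is
just a final, documented step rather than something we apply up front.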



Cheers,
TG


ಭಾನು, ಅಕ್ಟೋ 18, 2020 ರಂದು 11:17 ಅಪರಾಹ್ನ ಸಮಯಕ್ಕೆ ರಂದು Tommaso Teofili <
tommaso.teofili@gmail.com> ಅವರು ಬರೆದಿದ್ದಾರೆ:

> Following up on the report topic, I've created an overleaf doc for everyone
> who's interested in working on this [1].
>
> First set of (AL-2 compatible) NMT toolkits I've found:
> - Joey NMT [2]
> - OpenNMT [3]
> - MarianNMT [4]
> - Sockeye [5]
> - and of course RTG already shared by Thamme [6]
>
> Regards,
> Tommaso
>
> [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
> [2] : https://github.com/joeynmt/joeynmt
> [3] : https://github.com/OpenNMT
> [4] : https://github.com/marian-nmt/marian
> [5] : https://github.com/awslabs/sockeye
> [6] : https://github.com/isi-nlp/rtg-xt