Posted to dev@joshua.apache.org by Tommaso Teofili <to...@gmail.com> on 2020/11/20 17:52:47 UTC

Re: NMT survey (was: Roll Call)

Hi everyone,

following up on this topic, how about performing a shared evaluation of the
tools we mentioned so far?
I'd address this by deciding upon a "shared MT task" on a well-known and
not-too-big dataset and then getting an evaluation run for each of those
toolkits.
The evaluation task would require getting (see the sketch after this list):
- an accuracy metric value (BLEU? I know it's questionable; otherwise, what
else?)
- a prediction speed measure (e.g. translations per second), reporting also
the hardware used
- a training speed measure (e.g. seconds/minutes/hours taken to train the
model)
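
As a rough sketch of what a single evaluation run could look like (assuming
sacrebleu for BLEU scoring; the translate() function below is a hypothetical
stand-in for whatever API each toolkit exposes):

    # eval_sketch.py -- minimal harness for the proposed shared task.
    # Assumes: pip install sacrebleu; `translate` is hypothetical and
    # stands in for the candidate toolkit's API.
    import time
    import sacrebleu

    def translate(sentences):
        # placeholder: call the toolkit under evaluation here
        return sentences

    def evaluate(source, references):
        start = time.perf_counter()
        hypotheses = translate(source)
        elapsed = time.perf_counter() - start
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        return {
            "bleu": bleu.score,                     # accuracy metric
            "sent_per_sec": len(source) / elapsed,  # prediction speed
        }

    # training time would be logged separately around each toolkit's own
    # training entry point, along with the hardware used for both runs.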

What do others think?

Regards,
Tommaso


On Wed, 21 Oct 2020 at 15:57, Tommaso Teofili <to...@gmail.com>
wrote:

> hi Michael,
>
> nice to hear from you too on the dev@ list! We're looking forward to
> seeing you involved :)
> If I understood Thamme's proposal correctly, the paper is just a way to
> write down our own evaluation of current approaches to NMT; that would help
> us decide on our own way to pursue MT.
> At this stage I am not sure what we'll end up doing; it'd be nice not to
> just be a wrapper for one of the existing NMT tools, but let's see.
>
> Regards,
> Tommaso
>
>
> On Tue, 20 Oct 2020 at 15:37, Michael Wall <mj...@apache.org> wrote:
>
>> Hi,
>>
>> Been watching Joshua since it was incubating.  I may finally have some
>> free time and would like to get involved.
>>
>> The NMT stuff looks interesting.  I don't have an overleaf account, so
>> maybe my next question is answered there.  What is the end result of
>> the paper?  Will you be choosing a framework to add to Joshua?  And if
>> so, what will make it different from just using said framework on its
>> own?
>>
>> Thanks
>>
>> Mike
>>
>> On Tue, Oct 20, 2020 at 5:34 AM Tommaso Teofili
>> <to...@gmail.com> wrote:
>> >
>> > I've also added M2M-100 from FB-AI [1].
>> >
>> > Regarding desiderata, here's an unsorted list of first things that come
>> > to my mind:
>> > - runs also on jvm
>> > - low resource requirements (e.g. for training)
>> > - can leverage existing / pretrained models
>> > - word and phrase translation capabilities
>> > - good effectiveness :)
>> >
>> > Regards,
>> > Tommaso
>> >
>> > [1] :
>> > https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/
>> >
>> > On Mon, 19 Oct 2020 at 14:09, Tommaso Teofili <
>> > tommaso.teofili@gmail.com> wrote:
>> >
>> > > Thanks a lot Thamme, I stuck to AL-2 compatible ones, but I agree we
>> > > can surely have a look at others with different licensing too.
>> > > In the meantime I've added all of your suggestions to the paper (with
>> > > related references when available).
>> > > We should decide what our desiderata are and establish a first set of
>> > > eval benchmarks just to understand what can work for us, at least
>> > > initially; then we can have a more thorough evaluation with a small
>> > > number of "candidates".
>> > >
>> > > Regards,
>> > > Tommaso
>> > >
>> > > On Mon, 19 Oct 2020 at 09:05, Thamme Gowda <tg...@gmail.com> wrote:
>> > >
>> > >> Tommaso,
>> > >>
>> > >> Awesome! Thanks for the links.
>> > >> I will be happy to join (but I won't be able to contribute to the
>> > >> actual paper until Oct 24).
>> > >>
>> > >> I suggest we should consider popular NMT toolkits for the survey
>> > >> regardless of their compatibility with AL-2.
>> > >> We should see all the tricks and features, and know if we are missing
>> > >> out on any useful features after enforcing the AL-2 filter (and
>> > >> create issues for adding those features).
>> > >>
>> > >> Here are some more NMT toolkits to be included in the survey:
>> > >> - Fairseq https://github.com/pytorch/fairseq
>> > >> - Tensor2tensor https://github.com/tensorflow/tensor2tensor/
>> > >> - Nematus  https://github.com/EdinburghNLP/nematus
>> > >> - xNMT https://github.com/neulab/xnmt
>> > >> - XLM   https://github.com/facebookresearch/XLM/
>> > >>     |-> MASS  https://github.com/microsoft/MASS/  -->
>> > >>         https://github.com/thammegowda/unmass  (took that and made it
>> > >>         easier to install and use)
>> > >>
>> > >> Some old stuff which we are definitely not going to use but worth
>> > >> mentioning in the survey (for the sake of completeness):
>> > >> - https://github.com/google/seq2seq
>> > >> - https://github.com/tensorflow/nmt
>> > >> - https://github.com/isi-nlp/Zoph_RNN
>> > >>
>> > >>
>> > >>
>> > >> Cheers,
>> > >> TG
>> > >>
>> > >>
>> > >> On Sun, 18 Oct 2020 at 11:17 PM, Tommaso Teofili <
>> > >> tommaso.teofili@gmail.com> wrote:
>> > >>
>> > >> > Following up on the report topic, I've created an overleaf doc for
>> > >> > everyone who's interested in working on this [1].
>> > >> >
>> > >> > First set of (AL-2 compatible) NMT toolkits I've found:
>> > >> > - Joey NMT [2]
>> > >> > - OpenNMT [3]
>> > >> > - MarianNMT [4]
>> > >> > - Sockeye [5]
>> > >> > - and of course RTG already shared by Thamme [6]
>> > >> >
>> > >> > Regards,
>> > >> > Tommaso
>> > >> >
>> > >> > [1] : https://www.overleaf.com/8617554857qkvtqtpcxxmw
>> > >> > [2] : https://github.com/joeynmt/joeynmt
>> > >> > [3] : https://github.com/OpenNMT
>> > >> > [4] : https://github.com/marian-nmt/marian
>> > >> > [5] : https://github.com/awslabs/sockeye
>> > >> > [6] : https://github.com/isi-nlp/rtg-xt
>> > >> >
>> > >> > On Wed, 14 Oct 2020 at 11:06, Tommaso Teofili <
>> > >> > tommaso.teofili@gmail.com> wrote:
>> > >> >
>> > >> > > very good idea Thamme!
>> > >> > > I'd be up for writing such a short survey paper as a result of
>> > >> > > our analysis.
>> > >> > >
>> > >> > > Regards,
>> > >> > > Tommaso
>> > >> > >
>> > >> > >
>> > >> > > On Wed, 14 Oct 2020 at 05:23, Thamme Gowda <tg...@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > >> Tommaso and others,
>> > >> > >>
>> > >> > >> > I think we may now go into a research phase to understand what
>> > >> > >> > existing toolkit we can more easily integrate with.
>> > >> > >> Agreed.
>> > >> > >> If we can write a (short) report that compares various NMT
>> > >> > >> toolkits of 2020, it would be useful for us in making this
>> > >> > >> decision, as well as to the NMT community.
>> > >> > >> Something like a survey paper on NMT research, but focused on the
>> > >> > >> toolkits and software engineering part.
>> > >> > >>
>> > >> > >>
>> > >> > >>
>> > >> > >> On Fri, 9 Oct 2020 at 11:39 PM, Tommaso Teofili <
>> > >> > >> tommaso.teofili@gmail.com> wrote:
>> > >> > >>
>> > >> > >> > Thamme, Jeff,
>> > >> > >> >
>> > >> > >> > your contributions will be very important for the project and
>> > >> > >> > the community, especially given your NLP background; thanks for
>> > >> > >> > your support!
>> > >> > >> >
>> > >> > >> > I agree moving towards NMT is the best thing to do at this
>> > >> > >> > point for Joshua.
>> > >> > >> >
>> > >> > >> > Thamme, thanks for your suggestions!
>> > >> > >> > I think we may now go into a research phase to understand what
>> > >> > >> > existing toolkit we can more easily integrate with.
>> > >> > >> > Of course, if you'd like to integrate your own toolkit, it'd be
>> > >> > >> > even more interesting to see how it compares to others.
>> > >> > >> >
>> > >> > >> > If that means moving to Python I think it's not a problem; we
>> > >> > >> > can still work on Java bindings to ship a new Joshua Decoder
>> > >> > >> > implementation.
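>> > >> > >> >
>> > >> > >> > To make the bindings idea concrete, here's a minimal sketch of
>> > >> > >> > one possible bridge (purely illustrative; load_model and
>> > >> > >> > translate are made-up names): a tiny Python HTTP service
>> > >> > >> > wrapping whatever toolkit we pick, which a JVM-side Joshua
>> > >> > >> > Decoder could then call over localhost:
>> > >> > >> >
>> > >> > >> >     # bridge_sketch.py -- illustrative only; a stdlib HTTP
>> > >> > >> >     # server wrapping a hypothetical Python NMT toolkit.
>> > >> > >> >     import json
>> > >> > >> >     from http.server import BaseHTTPRequestHandler, HTTPServer
>> > >> > >> >
>> > >> > >> >     def load_model(lang_pair):
>> > >> > >> >         # stand-in for loading a real NMT model; echoes input
>> > >> > >> >         class Echo:
>> > >> > >> >             def translate(self, text):
>> > >> > >> >                 return text
>> > >> > >> >         return Echo()
>> > >> > >> >
>> > >> > >> >     model = load_model("de-en")
>> > >> > >> >
>> > >> > >> >     class TranslateHandler(BaseHTTPRequestHandler):
>> > >> > >> >         def do_POST(self):
>> > >> > >> >             # read the source text and return its translation
>> > >> > >> >             length = int(self.headers["Content-Length"])
>> > >> > >> >             text = self.rfile.read(length).decode("utf-8")
>> > >> > >> >             body = json.dumps(
>> > >> > >> >                 {"translation": model.translate(text)}).encode()
>> > >> > >> >             self.send_response(200)
>> > >> > >> >             self.send_header("Content-Type", "application/json")
>> > >> > >> >             self.end_headers()
>> > >> > >> >             self.wfile.write(body)
>> > >> > >> >
>> > >> > >> >     HTTPServer(("localhost", 8080),
>> > >> > >> >                TranslateHandler).serve_forever()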
>> > >> > >> >
>> > >> > >> > The pretrained models topic is imho something we will have to
>> > >> > >> > embrace at some point, so that others can:
>> > >> > >> > a) just download new LPs
>> > >> > >> > b) eventually fine-tune with their own data
>> > >> > >> >
>> > >> > >> > I am looking forward to starting this new phase of research on
>> > >> > >> > Joshua.
>> > >> > >> >
>> > >> > >> > Regards,
>> > >> > >> > Tommaso
>> > >> > >> >
>> > >> > >> > On Tue, 6 Oct 2020 at 18:30, Jeff Zemerick <
>> > >> > >> > jzemerick@apache.org> wrote:
>> > >> > >> >
>> > >> > >> > > I haven't contributed to this point but I would like to see
>> > >> > >> > > Apache Joshua remain an active project, so I am volunteering
>> > >> > >> > > to help. I may not be a lot of help with code for a bit, but
>> > >> > >> > > I will help out with documentation, releases, etc.
>> > >> > >> > >
>> > >> > >> > > I do agree that NMT is the best path forward, but I will
>> > >> > >> > > leave the choice of integrating an existing library into
>> > >> > >> > > Joshua versus a new NMT implementation in Joshua to those
>> > >> > >> > > more familiar with the code and what they think is best for
>> > >> > >> > > the project.
>> > >> > >> > >
>> > >> > >> > > Jeff
>> > >> > >> > >
>> > >> > >> > >
>> > >> > >> > > On Tue, Oct 6, 2020 at 2:28 AM Thamme Gowda <
>> > >> > >> > > tgowdan@gmail.com> wrote:
>> > >> > >> > >
>> > >> > >> > > > Hi Tommaso, and others,
>> > >> > >> > > >
>> > >> > >> > > > *1. I support the addition of a neural MT decoder.*
>> > >> > >> > > > The world has moved on, and NMT is clearly the way to go
>> > >> > >> > > > forward. If you don't believe my words, read what Matt Post
>> > >> > >> > > > himself said [1] three years ago!
>> > >> > >> > > >
>> > >> > >> > > > I have spent the past three years focusing on NMT as part
>> > >> > >> > > > of my job and Ph.D. -- I'd be glad to contribute in that
>> > >> > >> > > > direction.
>> > >> > >> > > > There are many NMT toolkits out there today (Fairseq,
>> > >> > >> > > > Sockeye, Tensor2Tensor, ...).
>> > >> > >> > > >
>> > >> > >> > > > The right thing to do, IMHO, is simply to merge one of the
>> > >> > >> > > > NMT toolkits with the Joshua project. We can do that as
>> > >> > >> > > > long as it's Apache License, right?
>> > >> > >> > > > We will now have to move towards Python land, as most
>> > >> > >> > > > toolkits are in Python. On the positive side, we will be
>> > >> > >> > > > losing the ancient Perl scripts that many are not fans of.
>> > >> > >> > > >
>> > >> > >> > > > I have been working on my own NMT toolkit for my work and
>> > >> > >> > > > research -- RTG https://isi-nlp.github.io/rtg/#conf
>> > >> > >> > > > I had worked on Joshua in the past; mainly, I improved the
>> > >> > >> > > > code quality [2]. So you can tell my new code'd be up to
>> > >> > >> > > > Apache's standards ;)
>> > >> > >> > > >
>> > >> > >> > > > *2. Pretrained MT models for lots of languages*
>> > >> > >> > > > I have been working on a lib to retrieve parallel data from
>> > >> > >> > > > many sources -- MTData [3].
>> > >> > >> > > > There is so much parallel data out there for hundreds of
>> > >> > >> > > > languages. My recent estimate is that over a billion lines
>> > >> > >> > > > of parallel sentences for over 500 languages are freely and
>> > >> > >> > > > publicly available for download using the MTData tool.
>> > >> > >> > > > If we find some sponsors to lend us some resources, we
>> > >> > >> > > > could train better MT models and update the Language Packs
>> > >> > >> > > > section [4].
>> > >> > >> > > > Perhaps one massively multilingual NMT model that supports
>> > >> > >> > > > many translation directions (I know it's possible with NMT;
>> > >> > >> > > > I tested it recently with RTG).
>> > >> > >> > > >
>> > >> > >> > > > I am interested in hearing what others are thinking.
>> > >> > >> > > >
>> > >> > >> > > > [1] -
>> > >> > >> > > > https://mail-archives.apache.org/mod_mbox/joshua-dev/201709.mbox/%3CA481E867-A845-4BC0-B5AF-5CEAAB3D0B7D%40cs.jhu.edu%3E
>> > >> > >> > > > [2] - https://github.com/apache/joshua/pulls?q=author%3Athammegowda+
>> > >> > >> > > > [3] - https://github.com/thammegowda/mtdata
>> > >> > >> > > > [4] - https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> > >> > >> > > >
>> > >> > >> > > >
>> > >> > >> > > > Cheers,
>> > >> > >> > > > TG
>> > >> > >> > > >
>> > >> > >> > > > --
>> > >> > >> > > > *Thamme Gowda*
>> > >> > >> > > > @thammegowda <https://twitter.com/thammegowda> |
>> > >> > >> > > > https://isi.edu/~tg
>> > >> > >> > > > ~Sent via somebody's Webmail server
>> > >> > >> > > >
>> > >> > >> > > >
>> > >> > >> > > > On Mon, 5 Oct 2020 at 12:16 AM, Tommaso Teofili <
>> > >> > >> > > > tommaso.teofili@gmail.com> wrote:
>> > >> > >> > > >
>> > >> > >> > > > > Hi all,
>> > >> > >> > > > >
>> > >> > >> > > > > This is a roll call for people interested in contributing
>> > >> > >> > > > > to Apache Joshua going forward.
>> > >> > >> > > > > Contributing could be not just code, but anything that
>> > >> > >> > > > > may help the project or serve the community.
>> > >> > >> > > > >
>> > >> > >> > > > > In case you're interested in helping out please speak
>> > >> > >> > > > > up :-)
>> > >> > >> > > > >
>> > >> > >> > > > > Code-wise Joshua has not evolved much in recent months;
>> > >> > >> > > > > there's room both for improvements to the current code
>> > >> > >> > > > > (make a new minor release) and for new ideas / code
>> > >> > >> > > > > branches (e.g. a neural MT based Joshua Decoder).
>> > >> > >> > > > >
>> > >> > >> > > > > Regards,
>> > >> > >> > > > > Tommaso
>> > >> > >> > > > >
>> > >> > >> > > >
>> > >> > >> > >
>> > >> > >> >
>> > >> > >>
>> > >> > >
>> > >> >
>> > >>
>> > >
>>
>