Posted to user@tika.apache.org by "Merrill, Jeremy" <je...@nytimes.com> on 2017/03/20 15:30:07 UTC

machine translation recommendation for use with Tika?

Hi friends,

I've been tasked with figuring out how to machine-translate a large set of
documents from a common European language into English, using a system that
already utilizes Tika.

I know Tika integrates with a handful of machine-translation APIs
<https://tika.apache.org/1.14/api/org/apache/tika/language/translate/package-summary.html>.
Do you all have a sense of which works best, both in terms of translation
quality and ease of integration with Tika?

(We know we're going to have to pay, but the amount of content won't be
huge, so differences in price aren't a big factor.)

Thanks in advance,
Jeremy B. Merrill

Re: machine translation recommendation for use with Tika?

Posted by "Merrill, Jeremy" <je...@nytimes.com>.
Thank you! I'll give 'em a try.

---
Jeremy B. Merrill
The New York Times



Re: machine translation recommendation for use with Tika?

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
I would try 'em all out, honestly. Performance-wise and setup-wise they are kind of different, though
Tika boils each one down to a config file, which is nice. I am working on a paper that compares
all of them, but it's not done yet ;)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
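
A minimal sketch of what "trying 'em all out" could look like in code, assuming
each backend can be wrapped behind one small interface. TrialTranslator below is
a hypothetical stand-in written for this example, loosely modeled on the shape of
Tika's org.apache.tika.language.translate.Translator (a translate call plus an
isAvailable check); it is not Tika's own class. The stub backends, the sample
sentence and the "fr" language code are placeholders too, since the thread does
not name the source language.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a translation backend. Assumption: real backends
// expose something like translate(text, source, target) and isAvailable().
interface TrialTranslator {
    String translate(String text, String sourceLanguage, String targetLanguage) throws Exception;
    boolean isAvailable();
}

class TranslatorTrial {
    // Runs one sample through every backend. A backend that reports itself
    // unavailable (for example, no credentials configured) or that throws
    // is recorded with a null result instead of aborting the whole run.
    static Map<String, String> tryAll(Map<String, TrialTranslator> backends,
                                      String sample, String source, String target) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, TrialTranslator> entry : backends.entrySet()) {
            TrialTranslator translator = entry.getValue();
            String output = null;
            if (translator.isAvailable()) {
                try {
                    output = translator.translate(sample, source, target);
                } catch (Exception e) {
                    // Remote services can fail per call; leave the result as null.
                }
            }
            results.put(entry.getKey(), output);
        }
        return results;
    }

    // Illustrative stub so the sketch runs without any network access or API keys.
    static TrialTranslator stub(boolean available, String prefix) {
        return new TrialTranslator() {
            public String translate(String text, String sourceLanguage, String targetLanguage) {
                return prefix + text;
            }
            public boolean isAvailable() {
                return available;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, TrialTranslator> backends = new LinkedHashMap<>();
        backends.put("configured-stub", stub(true, "[en] "));
        backends.put("unconfigured-stub", stub(false, "[en] "));
        tryAll(backends, "Bonjour le monde", "fr", "en")
            .forEach((name, out) ->
                System.out.println(name + ": " + (out == null ? "skipped" : out)));
    }
}
```

To compare real services, each stub would be replaced by a thin adapter around
the corresponding translator class, and the same handful of representative
documents would be fed through all of them so the outputs can be read side by
side.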






Re: machine translation recommendation for use with Tika?

Posted by "Merrill, Jeremy" <je...@nytimes.com>.
Hi Chris,

Thank you, this is helpful. I think running our own system is out of the
question, just on account of time (news just keeps on happening, though
it certainly would be fun to play with...) and -- presumably -- result
quality.

Do you have thoughts on which of Google, Microsoft and Lingo24 might be
easiest? Or are they all just as easy to use with Tika and I should just
try 'em all out?

Thanks,

---
Jeremy B. Merrill
The New York Times



Re: machine translation recommendation for use with Tika?

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.
Hi Jeremy,

Thanks for reaching out.

So far I have had a really good experience with the Lingo24 translator. It really depends, though,
on what you are trying to do, and the options fall into two families. If you want the widest
coverage and the most heavily trained translation, Google, Microsoft and Lingo24 all fall into the
remote translation API service category. They all have tons of data and training. I also think all
use human curators for quality review of some things. All will eventually cost you. I know that you
get some X million characters of translation a month with these services.

On the other end, you can deploy your own Apache Joshua (incubating) and/or Moses MT system
and then have Tika connect to it as a service. In this case you control the costs and can run it
on your own servers, etc., but you are limited by the quality of your trained models and by your
language pairs.

Does this make sense?

Cheers,
Chris


