You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by Tyler Palsulich <tp...@gmail.com> on 2014/06/04 00:00:12 UTC

Review Request 22219: Add Translation to Tika

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/
-----------------------------------------------------------

Review request for tika and Chris Mattmann.


Repository: tika


Description
-------

This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 

There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!


Diffs
-----

  trunk/tika-core/pom.xml 1599587 
  trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1599587 
  trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
  trunk/tika-core/src/main/resources/org/apache/tika/language/translator.properties PRE-CREATION 
  trunk/tika-core/src/test/java/org/apache/tika/language/TranslatorTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/22219/diff/


Testing
-------

There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).


Thanks,

Tyler Palsulich


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44731
-----------------------------------------------------------



trunk/tika-core/src/main/java/org/apache/tika/Tika.java
<https://reviews.apache.org/r/22219/#comment79218>

    probably should make this an interface.



trunk/tika-core/src/main/java/org/apache/tika/Tika.java
<https://reviews.apache.org/r/22219/#comment79220>

    maybe give an example (e.g., "en" or "fr") here



trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java
<https://reviews.apache.org/r/22219/#comment79221>

    make this an interface and make the MicrosoftTranslator one Impl of it.



trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java
<https://reviews.apache.org/r/22219/#comment79222>

    interface method.



trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java
<https://reviews.apache.org/r/22219/#comment79223>

    interface method.


- Chris Mattmann


On June 3, 2014, 10 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 3, 2014, 10 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/tika-core/pom.xml 1599587 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1599587 
>   trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
>   trunk/tika-core/src/main/resources/org/apache/tika/language/translator.properties PRE-CREATION 
>   trunk/tika-core/src/test/java/org/apache/tika/language/TranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44732
-----------------------------------------------------------


Tyler this is a great first start! I think if you turn Translator into an interface, we can commit this right away.

- Chris Mattmann


On June 3, 2014, 10 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 3, 2014, 10 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/tika-core/pom.xml 1599587 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1599587 
>   trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
>   trunk/tika-core/src/main/resources/org/apache/tika/language/translator.properties PRE-CREATION 
>   trunk/tika-core/src/test/java/org/apache/tika/language/TranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44758
-----------------------------------------------------------



trunk/tika-core/src/main/java/org/apache/tika/Tika.java
<https://reviews.apache.org/r/22219/#comment79299>

    You don't need this - you can do service loading just like we do with the Parser class? Can you check that out? We should just need the interface.



trunk/tika-core/src/main/java/org/apache/tika/Tika.java
<https://reviews.apache.org/r/22219/#comment79301>

    should be dynamically loaded via JavaSPI



trunk/tika-core/src/main/java/org/apache/tika/language/MicrosoftTranslator.java
<https://reviews.apache.org/r/22219/#comment79303>

    use Eclipse or IdeaJ to auto put javadoc in for interfaces?


- Chris Mattmann


On June 4, 2014, 7:17 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 4, 2014, 7:17 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/tika-core/pom.xml 1600418 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600418 
>   trunk/tika-core/src/main/java/org/apache/tika/language/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
>   trunk/tika-core/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-core/src/test/java/org/apache/tika/language/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44841
-----------------------------------------------------------



trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java
<https://reviews.apache.org/r/22219/#comment79395>

    Need Apache license here. I will add it.


- Chris Mattmann


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
>   trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44842
-----------------------------------------------------------

Ship it!


Ship It!

- Chris Mattmann


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
>   trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Paul Ramirez <pr...@jpl.nasa.gov>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44825
-----------------------------------------------------------

Ship it!


Ship It!

- Paul Ramirez


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
>   trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Chris Mattmann <ma...@apache.org>.

> On June 5, 2014, 8:35 p.m., Matthias Krueger wrote:
> > I'm new to Tika and late to this feature so I don't have any background on the targeted use case. Are Tika users looking to machine-translate metadata, parsed content or bot? Do they really benefit by having this in Tika and not use their existing libraries (they need to squeeze them into a Translator interface impl anyway)? Also, org.apache.tika.Tika handles detection and parsing of files/binary streams, it seems a bit awkward to now also have a method to translate Strings from one language to another.
> > 
> > The implementation looks ok. Would it make sense to have more fine grained exception handling for translations? The generic Exception throwing in the Translator interface and the catch(Exception) and rethrow IllegalStateException in Tika#translate seems a bit weird.

Thanks for the comments. One of the goals in Tika has long been to support Language "detection". It's useful of course for text extraction and metadata extraction, but also has a number of other impactful use cases. As does language "translation", which is always something I personally have wanted to support since it has equal possible benefits (imagine adding metadata in parsers in multiple languages by turning on a "switch" in the ParseContext object, etc.) This adds basic support for doing those things, to the core of Tika, and an example language translator. Doesn't have to be the only one and ideally I'd like to have a default language translator that doesn't necessary rely on an external service, but this is a great start. Now we can start to build out functionality on top of this work and I think it will really benefit the project.

That said, if you have some concrete ideas for benefitting the code and its exception handling, please open up an issue and throw up a patch on ReviewBoard, etc., and I'd be happy to comment and review it. This patch is about to be committed and has met the bar IMO to enter into the sources.

Cheers.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44846
-----------------------------------------------------------


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
>   trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Matthias Krueger <co...@mkr.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/#review44846
-----------------------------------------------------------


I'm new to Tika and late to this feature so I don't have any background on the targeted use case. Are Tika users looking to machine-translate metadata, parsed content or bot? Do they really benefit by having this in Tika and not use their existing libraries (they need to squeeze them into a Translator interface impl anyway)? Also, org.apache.tika.Tika handles detection and parsing of files/binary streams, it seems a bit awkward to now also have a method to translate Strings from one language to another.

The implementation looks ok. Would it make sense to have more fine grained exception handling for translations? The generic Exception throwing in the Translator interface and the catch(Exception) and rethrow IllegalStateException in Tika#translate seems a bit weird.

- Matthias Krueger


On June 5, 2014, 4:19 p.m., Tyler Palsulich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22219/
> -----------------------------------------------------------
> 
> (Updated June 5, 2014, 4:19 p.m.)
> 
> 
> Review request for tika and Chris Mattmann.
> 
> 
> Repository: tika
> 
> 
> Description
> -------
> 
> This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 
> 
> There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!
> 
> 
> Diffs
> -----
> 
>   trunk/pom.xml 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
>   trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
>   trunk/tika-translate/pom.xml PRE-CREATION 
>   trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
>   trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
>   trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
>   trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22219/diff/
> 
> 
> Testing
> -------
> 
> There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).
> 
> 
> Thanks,
> 
> Tyler Palsulich
> 
>


Re: Review Request 22219: Add Translation to Tika

Posted by Tyler Palsulich <tp...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/
-----------------------------------------------------------

(Updated June 5, 2014, 4:19 p.m.)


Review request for tika and Chris Mattmann.


Changes
-------

I updated the patch (off of r1600565). I created the tika-translate module. The Translator interface is still in tika-core. MicrosoftTranslator (in tika-translate) is the only implementation of the Translator interface. tika-core/Tika uses SWI to load the Translators from the META-INF/services/org.apache.tika.language.translate.Translator file in tika-translate, so tika-core does not depend on tika-translate. A notable result of this is, I added a Translator field to TikaConfig -- so users can specify a translator in a DOM, get a DefaultTranslator, etc. I updated some of the JavaDoc, too. 


Repository: tika


Description
-------

This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 

There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!


Diffs (updated)
-----

  trunk/pom.xml 1600565 
  trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600565 
  trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java 1600565 
  trunk/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java PRE-CREATION 
  trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java PRE-CREATION 
  trunk/tika-translate/pom.xml PRE-CREATION 
  trunk/tika-translate/src/main/java/org/apache/tika/language/translate/MicrosoftTranslator.java PRE-CREATION 
  trunk/tika-translate/src/main/resources/META-INF/services/org.apache.tika.language.translate.Translator PRE-CREATION 
  trunk/tika-translate/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
  trunk/tika-translate/src/test/java/org/apache/tika/language/translate/MicrosoftTranslatorTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/22219/diff/


Testing
-------

There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).


Thanks,

Tyler Palsulich


Re: Review Request 22219: Add Translation to Tika

Posted by Tyler Palsulich <tp...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/
-----------------------------------------------------------

(Updated June 4, 2014, 7:17 p.m.)


Review request for tika and Chris Mattmann.


Changes
-------

Whoops, forgot to add the properties file to the diff.


Repository: tika


Description
-------

This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 

There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!


Diffs (updated)
-----

  trunk/tika-core/pom.xml 1600418 
  trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600418 
  trunk/tika-core/src/main/java/org/apache/tika/language/MicrosoftTranslator.java PRE-CREATION 
  trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
  trunk/tika-core/src/main/resources/org/apache/tika/language/translator.microsoft.properties PRE-CREATION 
  trunk/tika-core/src/test/java/org/apache/tika/language/MicrosoftTranslatorTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/22219/diff/


Testing
-------

There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).


Thanks,

Tyler Palsulich


Re: Review Request 22219: Add Translation to Tika

Posted by Tyler Palsulich <tp...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22219/
-----------------------------------------------------------

(Updated June 4, 2014, 7:11 p.m.)


Review request for tika and Chris Mattmann.


Changes
-------

Thanks for the input! I updated the patch. Translator is now an interface. MicrosoftTranslator is an implementation of it. I added another method to the interface -- isAvailable. This is useful for users who will not use a translation service, but still want all of the unit tests to pass. Let me know if there is anything else I should update!

Tyler


Repository: tika


Description
-------

This patch adds basic language translation functionality to Tika. Translation is provided by a Microsoft API, but accessed through Apache 2 licensed com.memetix.microsoft-translator-java-api (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants to use the translation feature, they have to add a client id and client secret to the tika-core/src/main/resources/org/apache/tika/language/translator.properties file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added com.memetix as a dependency in tika-core. I put the Translator class in org.apache.tika.language. There is no integration with the server or CLI, yet. Further, only Strings are translated right now -- if you pass in a full document with xml tags, the structure will be mangled. But, I think that would be a cool feature -- translate the body, title, subtitle, etc, but not the structural elements. 

There is still more work to do, but I wanted some more eyes on this to make sure I'm heading in the right direction and this is a desired feature. Let me know what you think!


Diffs (updated)
-----

  trunk/tika-core/pom.xml 1600418 
  trunk/tika-core/src/main/java/org/apache/tika/Tika.java 1600418 
  trunk/tika-core/src/main/java/org/apache/tika/language/MicrosoftTranslator.java PRE-CREATION 
  trunk/tika-core/src/main/java/org/apache/tika/language/Translator.java PRE-CREATION 
  trunk/tika-core/src/test/java/org/apache/tika/language/MicrosoftTranslatorTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/22219/diff/


Testing
-------

There are two simple unit tests for now which translate "hello" to French ("salut"). One for inputting the source and target languages, one for inputing just the target language (and detecting the source language automatically).


Thanks,

Tyler Palsulich