Posted to dev@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2014/06/19 00:02:54 UTC
Review Request 22761: Create a Tika Translator implementation that uses
JoshuaDecoder
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/
-----------------------------------------------------------
Review request for tika.
Bugs: tika-1343
https://issues.apache.org/jira/browse/tika-1343
Repository: tika
Description
-------
The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system hosted on GitHub:
http://joshua-decoder.org/
Joshua takes in corpora and trains models that can then be used for language translation. Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English, and a few others.
https://github.com/joshua-decoder/joshua/
It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this:
* the models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua or something) to release the models, and we'll need to build the models. I just went through the process of building the Spanish->English one; it took over a day, and it still needs to be rebuilt b/c I did it wrong
* there is a configuration for Joshua, and so we need some way of passing that config into the Translator. Not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
Anyhoo, I've got a working patch right now with hard-coded values, and a manual install into my Maven repo, for brave souls out there who want to try it.
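One possible way to approach the configuration question in the list above, as a minimal sketch only: resolve the location of the Joshua config file from a JVM system property first, then from a Properties object (which could be loaded from a file on the classpath), and fall back to a default. Every name here (JoshuaConfigSketch, the joshua.config key, the default file name) is a hypothetical chosen for illustration; none of it is taken from the patch under review or from Joshua itself.

```java
import java.util.Properties;

// Hypothetical sketch: these names are assumptions, not part of the patch.
class JoshuaConfigSketch {

    /** Assumed property key naming the Joshua configuration file. */
    static final String CONFIG_KEY = "joshua.config";

    /** Assumed fallback used when nothing is configured. */
    static final String DEFAULT_CONFIG = "joshua.config";

    /**
     * Resolve the config path. A JVM system property wins, then the
     * supplied Properties, then the default. Blank values are ignored.
     */
    static String resolveConfigPath(Properties props) {
        String fromSystem = System.getProperty(CONFIG_KEY);
        if (fromSystem != null && !fromSystem.trim().isEmpty()) {
            return fromSystem.trim();
        }
        String fromProps = props.getProperty(CONFIG_KEY);
        if (fromProps != null && !fromProps.trim().isEmpty()) {
            return fromProps.trim();
        }
        return DEFAULT_CONFIG;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative path only; point this at a locally built model.
        props.setProperty(CONFIG_KEY, "/opt/models/es-en/joshua.config");
        System.out.println(resolveConfigPath(props));
    }
}
```

A translator could call resolveConfigPath once in its constructor and report itself as unavailable if the resolved file does not exist, which would keep hard-coded paths out of the class.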
Diffs
-----
Diff: https://reviews.apache.org/r/22761/diff/
Testing
-------
Ran through it on my locally built Spanish->English corpus, built using http://joshua-decoder.org/data/fisher-callhome-corpus/
My dataset isn't perfect, but it can do basic translations. I also wrote a unit test, which is part of the patch.
Thanks,
Chris Mattmann
Re: Review Request 22761: Create a Tika Translator implementation
that uses JoshuaDecoder
Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/#review76313
-----------------------------------------------------------
./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/JoshuaTranslator.java
<https://reviews.apache.org/r/22761/#comment123859>
Chris, can you provide a sample configuration here? I am struggling to find what this should look like!
- Lewis McGibbney
Re: Review Request 22761: Create a Tika Translator implementation
that uses JoshuaDecoder
Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/#review76314
-----------------------------------------------------------
./trunk/tika-translate/pom.xml
<https://reviews.apache.org/r/22761/#comment123860>
Hi Chris, did you just build this locally?
- Lewis McGibbney
Re: Review Request 22761: Create a Tika Translator implementation that uses
JoshuaDecoder
Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22761/
-----------------------------------------------------------
(Updated June 18, 2014, 10:04 p.m.)
Review request for tika.
Bugs: tika-1343
https://issues.apache.org/jira/browse/tika-1343
Repository: tika
Description
-------
The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system hosted on GitHub:
http://joshua-decoder.org/
Joshua takes in corpora and trains models that can then be used for language translation. Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English, and a few others.
https://github.com/joshua-decoder/joshua/
It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this:
* the models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua or something) to release the models, and we'll need to build the models. I just went through the process of building the Spanish->English one; it took over a day, and it still needs to be rebuilt b/c I did it wrong
* there is a configuration for Joshua, and so we need some way of passing that config into the Translator. Not sure of the best way to do this.
* Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
Anyhoo, I've got a working patch right now with hard-coded values, and a manual install into my Maven repo, for brave souls out there who want to try it.
Diffs (updated)
-----
./trunk/tika-translate/pom.xml 1603529
./trunk/tika-translate/src/main/java/org/apache/tika/language/translate/JoshuaTranslator.java PRE-CREATION
./trunk/tika-translate/src/test/java/org/apache/tika/language/translate/JoshuaTranslatorTest.java PRE-CREATION
Diff: https://reviews.apache.org/r/22761/diff/
Testing
-------
Ran through it on my locally built Spanish->English corpus, built using http://joshua-decoder.org/data/fisher-callhome-corpus/
My dataset isn't perfect, but it can do basic translations. I also wrote a unit test, which is part of the patch.
Thanks,
Chris Mattmann