Posted to dev@tika.apache.org by Michael Bryant <mb...@cs.brown.edu> on 2011/06/27 15:14:58 UTC

additional language development

I'm using the Tika library to do language detection for a proofreading tool.
We support several languages that are not yet included in Tika, so I created
n-gram profiles from Wikipedia dumps and added them as language profiles. I
think the resulting detection works pretty well (approximately 92% average
accuracy, compared to 94% with Tika's original language set).

Is this additional language support something that Tika developers would be
interested in adding to the package?

The languages I added support for are: Belarusian, Catalan, Esperanto,
Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian. I can
provide more details regarding my methods and tests if desired.

- Michael

Re: additional language development

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Michael,

Yes, thank you! 

Please file an issue at: https://issues.apache.org/jira/browse/TIKA

And attach a patch with your contributed language profiles. Then one of the Tika committers can jump in and help shepherd your patch into the sources.

To create a patch, you would:

1. svn co the latest Tika trunk (e.g., http://svn.apache.org/repos/asf/tika/trunk)
2. add your new language profiles and updates to the checkout
3. svn add your new files
4. svn status (make sure it looks right)
5. svn diff > TIKA-xxx.<your last name>.<yyMMdd>.patch.txt
6. attach the patch file from step 5 to the JIRA issue (TIKA-xxx)
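
The steps above could be scripted roughly as follows. This is only a sketch: the helper names patch_name and make_patch, the tika-trunk directory, and the "bryant" example are illustrative assumptions, and TIKA-xxx stays a placeholder until JIRA assigns a real issue key.

```shell
#!/bin/sh
# Sketch of the patch workflow. patch_name and make_patch are hypothetical
# helpers; none of this is shipped with Tika.

# Build a file name of the form TIKA-xxx.<last name>.<yyMMdd>.patch.txt.
# $1 = JIRA issue key, $2 = your last name, $3 = optional yyMMdd date stamp
# (defaults to today's date).
patch_name() {
    stamp="${3:-$(date +%y%m%d)}"
    printf '%s.%s.%s.patch.txt\n' "$1" "$2" "$stamp"
}

# Steps 4-5: show the working-copy status on stderr for review, then write
# the diff to a patch file in the current directory and print its path.
# $1 = path to the checkout, $2 = JIRA issue key, $3 = your last name
make_patch() {
    out="$(pwd)/$(patch_name "$2" "$3")"
    ( cd "$1" && svn status >&2 && svn diff > "$out" ) && echo "$out"
}

# Steps 1-3 are run by hand before calling make_patch, for example:
#   svn co http://svn.apache.org/repos/asf/tika/trunk tika-trunk
#   (copy your new language profiles into tika-trunk)
#   svn add <each new file>
#   make_patch tika-trunk TIKA-xxx bryant
```

The resulting file is what gets attached to the JIRA issue in step 6.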

Thanks!

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++