Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/07/29 02:30:00 UTC

Bayesian N-Gram Language Detection

FYI the code is ALv2:

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md


I’m going to test this out and see how it compares with our own.
Maybe we need to make the Language Detector pluggable too.
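
For readers unfamiliar with the technique named in the subject line, here is a minimal sketch of the general idea: naive Bayes over character n-grams with add-one smoothing. It is not the code from the linked project. The class name NGramBayesSketch, the trigram size, and the uniform prior are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative naive Bayes classifier over character trigrams (hypothetical, not library code). */
final class NGramBayesSketch {
    private static final int N = 3; // assumed n-gram length

    // language -> (trigram -> count): one profile per language
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of trigrams seen in training
    private final Map<String, Integer> totals = new HashMap<>();
    // every distinct trigram seen in any language, used for smoothing
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the trigrams of the given text to the profile of the given language. */
    void train(String language, String text) {
        Map<String, Integer> profile = counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + N <= s.length(); i++) {
            String gram = s.substring(i, i + N);
            profile.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    /** Returns the trained language with the highest log-likelihood, or "unknown" if untrained. */
    String detect(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        String best = "unknown";
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            Map<String, Integer> profile = e.getValue();
            int total = totals.getOrDefault(e.getKey(), 0);
            double score = 0.0; // uniform prior: every language starts equal
            for (int i = 0; i + N <= s.length(); i++) {
                int c = profile.getOrDefault(s.substring(i, i + N), 0);
                // add-one smoothing so unseen trigrams do not zero out the product
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Calling train once per language with sample text and then detect on new text returns the best-scoring language label. Inputs shorter than three characters carry no evidence in this sketch, so the result for them is arbitrary.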

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




Re: Bayesian N-Gram Language Detection

Posted by "Ramirez, Paul M (398M)" <pa...@jpl.nasa.gov>.
Just so I get this right: is it then a one-to-one mapping between LanguageProfile and training data? The code I'm looking at now allows one to train on multiple languages.

Thanks,
Paul

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Paul Ramirez, M.S.
Technical Group Supervisor
Computer Science for Data Intensive Applications (398M)
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 158-264, Mailstop: 158-242
Email: paul.m.ramirez@jpl.nasa.gov
Office: 818-354-1015
Cell: 818-395-8194
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Aug 3, 2015, at 7:37 PM, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> wrote:

Thanks Oleg

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Oleg Tikhonov <ol...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, July 29, 2015 at 12:01 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: Bayesian N-Gram Language Detection

+1 !!!
My two cents.
Please also add ability to change/retrain/tote language profiles.

Thanks !!!
BR,
Oleg

On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

Cool. Well with this one I found, along with language-detector,
along with Ramirez and the work with Joe Campbell’s group at MIT-LL
and the Julia stuff, I for one am going to take the step to make it
pluggable.

I’ll try and take this on over the next week. I’ll use a ServiceLoader
approach similar to Translators, Detectors, Parsers, etc.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ken Krugler <kk...@transpac.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, July 28, 2015 at 5:39 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: Bayesian N-Gram Language Detection

I think switching to language-detector is a reasonable first step (more
languages, faster, better accuracy), after which we can evaluate the need
to make it pluggable.

There were some code & resource packaging issues with the original
project, but the fork I've been trying out seems much better.

See https://github.com/optimaize/language-detector

Still ALv2, and already in the Maven central repo.

-- Ken

From: Mattmann, Chris A (3980)
Sent: July 28, 2015 5:30:00pm PDT
To: dev@tika.apache.org
Subject: Bayesian N-Gram Language Detection

FYI the code is ALv2:

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md


I’m going to test this out and see how it compares with our own.
Maybe we need to make the Language Detector pluggable too.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr










Re: Bayesian N-Gram Language Detection

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thanks Oleg

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Oleg Tikhonov <ol...@gmail.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Wednesday, July 29, 2015 at 12:01 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: Bayesian N-Gram Language Detection

>+1 !!!
>My two cents.
>Please also add ability to change/retrain/tote language profiles.
>
>Thanks !!!
>BR,
>Oleg
>
>On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) <
>chris.a.mattmann@jpl.nasa.gov> wrote:
>
>> Cool. Well with this one I found, along with language-detector,
>> along with Ramirez and the work with Joe Campbell’s group at MIT-LL
>> and the Julia stuff, I for one am going to take the step to make it
>> pluggable.
>>
>> I’ll try and take this on over the next week. I’ll use a ServiceLoader
>> approach similar to Translators, Detectors, Parsers, etc.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Ken Krugler <kk...@transpac.com>
>> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Date: Tuesday, July 28, 2015 at 5:39 PM
>> To: "dev@tika.apache.org" <de...@tika.apache.org>
>> Subject: RE: Bayesian N-Gram Language Detection
>>
>> >I think switching to language-detector is a reasonable first step (more
>> >languages, faster, better accuracy), after which we can evaluate the need
>> >to make it pluggable.
>> >
>> >There were some code & resource packaging issues with the original
>> >project, but the fork I've been trying out seems much better.
>> >
>> >See https://github.com/optimaize/language-detector
>> >
>> >Still ALv2, and already in the Maven central repo.
>> >
>> >-- Ken
>> >
>> >> From: Mattmann, Chris A (3980)
>> >> Sent: July 28, 2015 5:30:00pm PDT
>> >> To: dev@tika.apache.org
>> >> Subject: Bayesian N-Gram Language Detection
>> >>
>> >> FYI the code is ALv2:
>> >>
>> >> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
>> >>
>> >>
>> >> I’m going to test this out and see how it compares with our own.
>> >> Maybe we need to make the Language Detector pluggable too.
>> >>
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Chris Mattmann, Ph.D.
>> >> Chief Architect
>> >> Instrument Software and Science Data Systems Section (398)
>> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >> Office: 168-519, Mailstop: 168-527
>> >> Email: chris.a.mattmann@nasa.gov
>> >> WWW:  http://sunset.usc.edu/~mattmann/
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> Adjunct Associate Professor, Computer Science Department
>> >> University of Southern California, Los Angeles, CA 90089 USA
>> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>
>> >>
>> >
>> >--------------------------
>> >Ken Krugler
>> >+1 530-210-6378
>> >http://www.scaleunlimited.com
>> >custom big data solutions & training
>> >Hadoop, Cascading, Cassandra & Solr
>> >
>> >
>> >
>> >
>> >
>>
>>


Re: Bayesian N-Gram Language Detection

Posted by Oleg Tikhonov <ol...@gmail.com>.
+1 !!!
My two cents.
Please also add ability to change/retrain/tote language profiles.

Thanks !!!
BR,
Oleg

On Wed, Jul 29, 2015 at 3:59 AM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Cool. Well with this one I found, along with language-detector,
> along with Ramirez and the work with Joe Campbell’s group at MIT-LL
> and the Julia stuff, I for one am going to take the step to make it
> pluggable.
>
> I’ll try and take this on over the next week. I’ll use a ServiceLoader
> approach similar to Translators, Detectors, Parsers, etc.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Ken Krugler <kk...@transpac.com>
> Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
> Date: Tuesday, July 28, 2015 at 5:39 PM
> To: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: RE: Bayesian N-Gram Language Detection
>
> >I think switching to language-detector is a reasonable first step (more
> >languages, faster, better accuracy), after which we can evaluate the need
> >to make it pluggable.
> >
> >There were some code & resource packaging issues with the original
> >project, but the fork I've been trying out seems much better.
> >
> >See https://github.com/optimaize/language-detector
> >
> >Still ALv2, and already in the Maven central repo.
> >
> >-- Ken
> >
> >> From: Mattmann, Chris A (3980)
> >> Sent: July 28, 2015 5:30:00pm PDT
> >> To: dev@tika.apache.org
> >> Subject: Bayesian N-Gram Language Detection
> >>
> >> FYI the code is ALv2:
> >>
> >> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
> >>
> >>
> >> I’m going to test this out and see how it compares with our own.
> >> Maybe we need to make the Language Detector pluggable too.
> >>
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: chris.a.mattmann@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >>
> >
> >--------------------------
> >Ken Krugler
> >+1 530-210-6378
> >http://www.scaleunlimited.com
> >custom big data solutions & training
> >Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
> >
> >
>
>

Re: Bayesian N-Gram Language Detection

Posted by "Ramirez, Paul M (398M)" <pa...@jpl.nasa.gov>.
I haven't entered an issue on the Tika list yet, but in the near future MIT-LL will also have language detection for video and audio streams. Chris, if you're already going to make this pluggable, this may be something to consider.

--Paul

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Paul Ramirez, M.S.
Technical Group Supervisor
Computer Science for Data Intensive Applications (398M)
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 158-264, Mailstop: 158-242
Email: paul.m.ramirez@jpl.nasa.gov
Office: 818-354-1015
Cell: 818-395-8194
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On Jul 28, 2015, at 5:59 PM, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> wrote:

Cool. Well with this one I found, along with language-detector,
along with Ramirez and the work with Joe Campbell’s group at MIT-LL
and the Julia stuff, I for one am going to take the step to make it
pluggable.

I’ll try and take this on over the next week. I’ll use a ServiceLoader
approach similar to Translators, Detectors, Parsers, etc.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ken Krugler <kk...@transpac.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, July 28, 2015 at 5:39 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: Bayesian N-Gram Language Detection

I think switching to language-detector is a reasonable first step (more
languages, faster, better accuracy), after which we can evaluate the need
to make it pluggable.

There were some code & resource packaging issues with the original
project, but the fork I've been trying out seems much better.

See https://github.com/optimaize/language-detector

Still ALv2, and already in the Maven central repo.

-- Ken

From: Mattmann, Chris A (3980)
Sent: July 28, 2015 5:30:00pm PDT
To: dev@tika.apache.org
Subject: Bayesian N-Gram Language Detection

FYI the code is ALv2:

https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md


I’m going to test this out and see how it compares with our own.
Maybe we need to make the Language Detector pluggable too.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr








Re: Bayesian N-Gram Language Detection

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Cool. Well with this one I found, along with language-detector,
along with Ramirez and the work with Joe Campbell’s group at MIT-LL
and the Julia stuff, I for one am going to take the step to make it
pluggable.

I’ll try and take this on over the next week. I’ll use a ServiceLoader
approach similar to Translators, Detectors, Parsers, etc.
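
A rough sketch of what a ServiceLoader-based plug-in point could look like, using only java.util.ServiceLoader from the JDK. The interface LanguageDetector, its detect method, and the helper class LanguageDetectors are hypothetical names chosen for illustration, not existing Tika API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical plug-in contract; the name and method are assumptions, not Tika API. */
interface LanguageDetector {
    /** Returns a language code such as "en", or "unknown" when undecided. */
    String detect(String text);
}

/** Trivial fallback used when no provider is registered on the classpath. */
final class UnknownLanguageDetector implements LanguageDetector {
    @Override
    public String detect(String text) {
        return "unknown";
    }
}

/** Hypothetical helper that discovers detector implementations at runtime. */
final class LanguageDetectors {
    private LanguageDetectors() {}

    /** Collects every implementation registered under META-INF/services. */
    static List<LanguageDetector> loadAll() {
        List<LanguageDetector> found = new ArrayList<>();
        for (LanguageDetector detector : ServiceLoader.load(LanguageDetector.class)) {
            found.add(detector);
        }
        return found;
    }

    /** Returns the first discovered implementation, or the fallback if none exist. */
    static LanguageDetector loadFirstOrDefault() {
        List<LanguageDetector> found = loadAll();
        return found.isEmpty() ? new UnknownLanguageDetector() : found.get(0);
    }
}
```

An implementation jar would register itself by shipping a text file under META-INF/services named after the fully qualified interface name, with one implementation class name per line. Swapping detectors then needs only a classpath change, with no code change in the caller.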

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ken Krugler <kk...@transpac.com>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Tuesday, July 28, 2015 at 5:39 PM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: Bayesian N-Gram Language Detection

>I think switching to language-detector is a reasonable first step (more
>languages, faster, better accuracy), after which we can evaluate the need
>to make it pluggable.
>
>There were some code & resource packaging issues with the original
>project, but the fork I've been trying out seems much better.
>
>See https://github.com/optimaize/language-detector
>
>Still ALv2, and already in the Maven central repo.
>
>-- Ken
>
>> From: Mattmann, Chris A (3980)
>> Sent: July 28, 2015 5:30:00pm PDT
>> To: dev@tika.apache.org
>> Subject: Bayesian N-Gram Language Detection
>> 
>> FYI the code is ALv2:
>> 
>> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
>> 
>> 
>> I’m going to test this out and see how it compares with our own.
>> Maybe we need to make the Language Detector pluggable too.
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>


RE: Bayesian N-Gram Language Detection

Posted by Ken Krugler <kk...@transpac.com>.
I think switching to language-detector is a reasonable first step (more languages, faster, better accuracy), after which we can evaluate the need to make it pluggable.

There were some code & resource packaging issues with the original project, but the fork I've been trying out seems much better.

See https://github.com/optimaize/language-detector

Still ALv2, and already in the Maven central repo.

-- Ken

> From: Mattmann, Chris A (3980)
> Sent: July 28, 2015 5:30:00pm PDT
> To: dev@tika.apache.org
> Subject: Bayesian N-Gram Language Detection
> 
> FYI the code is ALv2:
> 
> https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
> 
> 
> I’m going to test this out and see how it compares with our own.
> Maybe we need to make the Language Detector pluggable too.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr