Posted to dev@ctakes.apache.org by Andy McMurry <mc...@gmail.com> on 2013/04/28 03:24:38 UTC

roadmap for Apache cTakes "big data" processing

I'm writing to gauge community interest and intent for parallel processing with cTakes. 

Apache UIMA is planning "Async Scaleout" as a replacement for CPM. 
http://uima.apache.org/doc-uimaas-what.html

Apache Mahout is likely to become the de facto Apache package for machine learning.
http://mahout.apache.org/

I believe cTakes will embrace both of these in due time.  
Do you agree or do you have a different view?

Re: roadmap for Apache cTakes "big data" processing

Posted by Karthik Sarma <ks...@ksarma.com>.

UIMA-AS compatibility is a good idea, but I suspect there will be a fair
number of problems to solve along the way. I do think it is doable, though.

On Saturday, April 27, 2013, Andy McMurry wrote:

> I'm writing to gauge community interest and intent for parallel processing
> with cTakes.
>
> Apache UIMA is planning "Async Scaleout" as a replacement for CPM.
> http://uima.apache.org/doc-uimaas-what.html
>
> Apache Mahout is likely to become the defacto apache package for machine
> learning.
> http://mahout.apache.org/
>
> I believe cTakes will embrace both of these in due time.
> Do you agree or do you have a different view?

RE: roadmap for Apache cTakes "big data" processing

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

This could be a nice project for GSoC. If you have time to help mentor a student, feel free to create a Jira issue and tag it with gsoc2013.
--Pei

> -----Original Message-----
> From: Andy McMurry [mailto:mcmurry.andy@gmail.com]
> Sent: Sunday, April 28, 2013 9:40 PM
> To: dev@ctakes.apache.org
> Subject: Re: roadmap for Apache cTakes "big data" processing
> 
> Good point Pei.
> 
> We would need to do a spike (short sprint) in the future to see if Mahout
> would be a good fit.
> I'm just wondering because I'm planning out how I will be using cTakes, and
> was wondering how others are planning as well.
> 
> 
> Cheers,
> --ANdy

Re: roadmap for Apache cTakes "big data" processing

Posted by giri vara prasad nambari <gi...@gmail.com>.

It seems it is still tightly tied to Hadoop:

https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart


On Sun, Apr 28, 2013 at 9:39 PM, Andy McMurry <mc...@gmail.com> wrote:

> Good point Pei.
>
> We would need to do a spike (short sprint) in the future to see if Mahout
> would be a good fit.
> I'm just wondering because I'm planning out how I will be using cTakes,
> and was wondering how others are planning as well.
>
>
> Cheers,
> --ANdy

Re: roadmap for Apache cTakes "big data" processing

Posted by Andy McMurry <mc...@gmail.com>.

Good point, Pei.

We would need to do a spike (short sprint) in the future to see if Mahout would be a good fit.
I'm asking because I'm planning how I will be using cTakes, and I'd like to know how others are planning as well.


Cheers,
--Andy


On Apr 28, 2013, at 5:39 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> Has anyone tried Mahout recently?
> Last time I tried, it was still closely tied to the Hadoop file system. 

Re: roadmap for Apache cTakes "big data" processing

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

Has anyone tried Mahout recently?
Last time I tried, it was still closely tied to the Hadoop file system. 


On Apr 28, 2013, at 7:44 PM, "Andy McMurry" <mc...@gmail.com> wrote:

> I encourage committers to checkout Apache Mahout 
> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
> 
> Why Apache Mahout? 
> 1. provides ML classifiers and functions not available through UIMA
> 2. parallel by design, transparently invokes Hadoop  
> 3. Java and Apache license (every other known toolkit is GPL!) 
> 4. likely to become standard ML package for Apache 
> 
> Why would we use mahout in cTakes? 
> cTakes models are "provided", for example PoS tagging. 
> Retraining these models on your own compute cluster would be difficult  (in my opinion). 
> LibSVM is nice, but it is only one classification method. 
> 
> When ? 
> No rush, however, I suggest we dont invest time in porting SINGLE-CPU classifier functions that we will have to parallelize, later. 
> 
> Summary: 
> UIMA + mahout = pipelines + classification 

Re: roadmap for Apache cTakes "big data" processing

Posted by Jörn Kottmann <ko...@gmail.com>.

On 04/29/2013 01:43 AM, Andy McMurry wrote:
> I encourage committers to checkout Apache Mahout
> https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
>
> Why Apache Mahout?
> 1. provides ML classifiers and functions not available through UIMA
> 2. parallel by design, transparently invokes Hadoop
> 3. Java and Apache license (every other known toolkit is GPL!)
> 4. likely to become standard ML package for Apache
>
> Why would we use mahout in cTakes?
> cTakes models are "provided", for example PoS tagging.
> Retraining these models on your own compute cluster would be difficult  (in my opinion).
> LibSVM is nice, but it is only one classification method.
>

The Mahout classifiers will probably soon be integrated into OpenNLP;
here is the Jira issue:
https://issues.apache.org/jira/browse/OPENNLP-574

The idea is to make the ML part of OpenNLP pluggable, so that all kinds of
classification libraries can be supported.

Mahout's clustering and LDA capabilities might also be interesting, since
they can probably be run on the entire document database.

Jörn
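
To make the "pluggable ML part" idea above concrete, here is a minimal sketch in Python. It is only an illustration of the design, not the OpenNLP or Mahout API: every name in it (Classifier, MajorityClassifier, FeatureVoteClassifier, REGISTRY, make_classifier, Tagger) is hypothetical, and the two toy backends stand in for real classification libraries.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict


class Classifier(ABC):
    """Hypothetical plug-in contract; NOT the OpenNLP or Mahout API."""

    @abstractmethod
    def train(self, samples):
        """Fit on an iterable of (features, label) pairs; features is a list of strings."""

    @abstractmethod
    def predict(self, features):
        """Return the predicted label for one list of feature strings."""


class MajorityClassifier(Classifier):
    """Toy backend: always predicts the most frequent training label."""

    def train(self, samples):
        counts = Counter(label for _, label in samples)
        self.label = counts.most_common(1)[0][0]

    def predict(self, features):
        return self.label


class FeatureVoteClassifier(Classifier):
    """Toy backend: each feature votes for the labels it was seen with."""

    def train(self, samples):
        samples = list(samples)
        self.votes = defaultdict(Counter)
        self.fallback = MajorityClassifier()
        self.fallback.train(samples)
        for features, label in samples:
            for feature in features:
                self.votes[feature][label] += 1

    def predict(self, features):
        tally = Counter()
        for feature in features:
            tally.update(self.votes.get(feature, {}))
        # Unseen features carry no votes, so fall back to the majority label.
        if not tally:
            return self.fallback.predict(features)
        return tally.most_common(1)[0][0]


# Hypothetical registry: a new backend is added here without touching Tagger.
REGISTRY = {"majority": MajorityClassifier, "vote": FeatureVoteClassifier}


def make_classifier(name):
    """Instantiate a backend by its registry name."""
    return REGISTRY[name]()


class Tagger:
    """Pipeline component that depends only on the Classifier contract."""

    def __init__(self, classifier):
        self.classifier = classifier

    def tag(self, tokens):
        return [self.classifier.predict(["w=" + token.lower()]) for token in tokens]


if __name__ == "__main__":
    training = [(["w=the"], "DT"), (["w=dog"], "NN"), (["w=cat"], "NN")]
    for name in REGISTRY:
        backend = make_classifier(name)
        backend.train(training)
        print(name, Tagger(backend).tag(["The", "dog", "bird"]))
```

The point of the design is that the tagger never names a concrete library: swapping the backend is a one-word change in the call to make_classifier, which is roughly what a pluggable ML layer would allow for Mahout or any other classification library.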

Re: roadmap for Apache cTakes "big data" processing

Posted by Andy McMurry <mc...@gmail.com>.

I encourage committers to check out Apache Mahout:
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Why Apache Mahout?
1. Provides ML classifiers and functions not available through UIMA
2. Parallel by design; transparently invokes Hadoop
3. Java and Apache license (every other toolkit I know of is GPL!)
4. Likely to become the standard ML package for Apache

Why would we use Mahout in cTakes?
cTakes models are "provided", for example for PoS tagging.
Retraining these models on your own compute cluster would be difficult (in my opinion).
LibSVM is nice, but it is only one classification method.

When?
No rush. However, I suggest we don't invest time in porting SINGLE-CPU classifier functions that we will have to parallelize later.

Summary:
UIMA + Mahout = pipelines + classification

On Apr 28, 2013, at 4:26 PM, "Savova, Guergana" <Gu...@childrens.harvard.edu> wrote:

> +1 
> --guergana

RE: roadmap for Apache cTakes "big data" processing

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.

+1 
--guergana

-----Original Message-----
From: Kaggal, Vinod C. [mailto:Kaggal.Vinod@mayo.edu] 
Sent: Saturday, April 27, 2013 11:21 PM
To: <de...@ctakes.apache.org>
Cc: <de...@ctakes.apache.org>
Subject: Re: roadmap for Apache cTakes "big data" processing

+1



Re: roadmap for Apache cTakes "big data" processing

Posted by "Kaggal, Vinod C." <Ka...@mayo.edu>.

+1


On Apr 27, 2013, at 9:05 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> +1 for UIMA-AS

Re: roadmap for Apache cTakes "big data" processing

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

+1 for UIMA-AS


On Apr 27, 2013, at 9:25 PM, "Andy McMurry" <mc...@gmail.com> wrote:

> I'm writing to gauge community interest and intent for parallel processing with cTakes. 
> 
> Apache UIMA is planning "Async Scaleout" as a replacement for CPM. 
> http://uima.apache.org/doc-uimaas-what.html
> 
> Apache Mahout is likely to become the defacto apache package for machine learning. 
> http://mahout.apache.org/
> 
> I believe cTakes will embrace both of these in due time.  
> Do you agree or do you have a different view? 