Posted to dev@ctakes.apache.org by "Chen, Pei" <Pe...@childrens.harvard.edu> on 2013/04/12 02:55:59 UTC

Next cTAKES release (3.1)?

Hi,
I just wanted to gauge interest in creating the next release of cTAKES (3.1), which is currently marked for May in Jira.

22 of 53 issues [1] have already been marked as fixed or closed. There are plenty of bug fixes and new components, including:
- New CEM Instance Template population
- New Dependency Parser/Semantic Role Labeler
- New optional Clear POSTagger
- New regression testing component

Should we wait for the Temporal component?

[1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


Re: Next cTAKES release (3.1)?

Posted by Steven Bethard <st...@Colorado.EDU>.
On 26 Jun 2013, at 10:51, "Masanz, James J." <Ma...@mayo.edu> wrote:
> I will hold off the build.

Ok, I think relation-extractor should be good to go now.

Steve

> 
> I looked again at CTAKES-190, and after doing a search on the code in trunk for "MedicationEventMention" I still see places outside of relation extractor where it is being used (instead of the newer MedicationMention).
> 
> I'll update those places, probably tomorrow.
> 
> Also, Steve, when you get a chance, if you could look at what I wrote in CTAKES-190 on May 06, and handle or let me know what pieces still need to be looked at in the relation extractor, that would be great.
> 
> I'll post again when I have a next target date for a build. Meanwhile anyone else who has input related to the timing of the build, please don't hesitate to post.
> 
> -- James


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
I will hold off the build.

I looked again at CTAKES-190, and after doing a search on the code in trunk for "MedicationEventMention" I still see places outside of relation extractor where it is being used (instead of the newer MedicationMention).

I'll update those places, probably tomorrow.

Also, Steve, when you get a chance, if you could look at what I wrote in CTAKES-190 on May 06, and handle or let me know what pieces still need to be looked at in the relation extractor, that would be great.

I'll post again when I have a next target date for a build. Meanwhile anyone else who has input related to the timing of the build, please don't hesitate to post.

-- James

-----Original Message-----
From: dev-return-1704-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1704-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Steven Bethard
Sent: Wednesday, June 26, 2013 10:35 AM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

On 26 Jun 2013, at 8:35, "Masanz, James J." <Ma...@mayo.edu> wrote:
> I had originally suggested today, June 26, as the start of the cTAKES 3.1 build.
> 
> Now I am thinking of doing a build (my first) tomorrow, June 27.
> Hopefully that will turn into RC1 of Apache cTAKES 3.1
> 
> Q1) Any reasons not to do that (any reason to delay a bit)?

As far as I know, we still have not yet updated the relation extractor since the CTAKES-190 changes to the dictionary lookup [1]. I will try to make some time for that this week, but I have several other commitments, so I'm not sure exactly when I'll be able to complete it.

[1] https://issues.apache.org/jira/browse/CTAKES-190

Steve

> I will follow [1] and update that page if anything has changed
> 
> I plan to create a branch in SVN and build from the branch. As [1] states, as release manager I will be responsible for getting any fixes into the branch (or perhaps just create a new branch depending on activity), so you can continue to just work from trunk.
> 
> A new version of the Wiki pages for the user and developer guides needs to be created
> Q2) Is there a faster way to duplicate a set of Wiki pages than copying them one by one?
> 
> [1] http://ctakes.apache.org/ctakes-release-guide.html
> 
> 
> -- James


Re: Next cTAKES release (3.1)?

Posted by Steven Bethard <st...@Colorado.EDU>.
On 26 Jun 2013, at 8:35, "Masanz, James J." <Ma...@mayo.edu> wrote:
> I had originally suggested today, June 26, as the start of the cTAKES 3.1 build.
> 
> Now I am thinking of doing a build (my first) tomorrow, June 27.
> Hopefully that will turn into RC1 of Apache cTAKES 3.1
> 
> Q1) Any reasons not to do that (any reason to delay a bit)?

As far as I know, we have not yet updated the relation extractor to account for the CTAKES-190 changes to the dictionary lookup [1]. I will try to make time for that this week, but I have several other commitments, so I'm not sure exactly when I'll be able to complete it.

[1] https://issues.apache.org/jira/browse/CTAKES-190

Steve

> I will follow [1] and update that page if anything has changed
> 
> I plan to create a branch in SVN and build from the branch. As [1] states, as release manager I will be responsible for getting any fixes into the branch (or perhaps just create a new branch depending on activity), so you can continue to just work from trunk.
> 
> A new version of the Wiki pages for the user and developer guides needs to be created
> Q2) Is there a faster way to duplicate a set of Wiki pages than copying them one by one?
> 
> [1] http://ctakes.apache.org/ctakes-release-guide.html
> 
> 
> -- James


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
Hi all,

I had originally suggested today, June 26, as the start of the cTAKES 3.1 build.

Now I am thinking of doing a build (my first) tomorrow, June 27.
Hopefully that will turn into RC1 of Apache cTAKES 3.1.

Q1) Are there any reasons not to do that (any reason to delay a bit)?

I will follow [1] and update that page if anything has changed.

I plan to create a branch in SVN and build from the branch. As [1] states, as release manager I will be responsible for getting any fixes into the branch (or perhaps for creating a new branch, depending on activity), so you can continue to just work from trunk.

A new version of the Wiki pages for the user and developer guides needs to be created.
Q2) Is there a faster way to duplicate a set of Wiki pages than copying them one by one?

[1] http://ctakes.apache.org/ctakes-release-guide.html


-- James



-----Original Message-----
From: dev-return-1654-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1654-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
Sent: Friday, May 31, 2013 2:27 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Yes I think it can be done by then. But even if not, my understanding is 
that the version turned on by default is not cleartk-based and the 
cleartk one is still under development.
Tim



Re: Next cTAKES release (3.1)?

Posted by Tim Miller <ti...@childrens.harvard.edu>.
Yes I think it can be done by then. But even if not, my understanding is 
that the version turned on by default is not cleartk-based and the 
cleartk one is still under development.
Tim

On 05/31/2013 03:25 PM, Masanz, James J. wrote:
> I'll be release manager for 3.1 (unless someone else is anxious to be and just hasn't seen this thread yet)
>
> I'd suggest we target Wed June 26 to have an RC built.
>
> Steve, would that seem reasonable for the relation extractor changes due to [1], plus those that would be needed if we upgrade ClearTK dependency to 1.4.0.
>
> Anyone know enough about assertion code to make an educated guess of whether the "upgrade ClearTK dependency to 1.4.0" could be done by then too?


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
I'll be release manager for 3.1 (unless someone else is anxious to be and just hasn't seen this thread yet)

I'd suggest we target Wed June 26 to have an RC built.  

Steve, would that seem reasonable for the relation extractor changes due to [1], plus those that would be needed if we upgrade the ClearTK dependency to 1.4.0?

Does anyone know enough about the assertion code to make an educated guess as to whether the "upgrade ClearTK dependency to 1.4.0" work could be done by then too?

-----Original Message-----
From: dev-return-1652-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1652-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Steven Bethard
Sent: Friday, May 31, 2013 2:18 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

As a result of the CTAKES-190 changes to the dictionary lookup [1], the relation extractor needs some refactoring and retraining. Probably we won't have a chance to get to that until after NAACL (June 9-15). So it would be best for us to target the 3.1 release towards the end of June.

Steve

[1] https://issues.apache.org/jira/browse/CTAKES-190



Re: Next cTAKES release (3.1)?

Posted by Steven Bethard <st...@Colorado.EDU>.
As a result of the CTAKES-190 changes to the dictionary lookup [1], the relation extractor needs some refactoring and retraining. Probably we won't have a chance to get to that until after NAACL (June 9-15). So it would be best for us to target the 3.1 release towards the end of June.

Steve

[1] https://issues.apache.org/jira/browse/CTAKES-190

On May 31, 2013, at 1:01 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323276#selectedTab=com.atlassian.jira.plugin.system.project%3Aversion-issues-panel
> 25/58 are either closed/resolved; there were a decent number of simple patch fixes I think.
> 
> To spread the knowledge, perhaps another committer could be the release manager (RM) for the next release.  Hint hint *James? ;)
> 
> --Pei


RE: Next cTAKES release (3.1)?

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
https://issues.apache.org/jira/browse/CTAKES/fixforversion/12323276#selectedTab=com.atlassian.jira.plugin.system.project%3Aversion-issues-panel
25 of 58 issues are either closed or resolved; a decent number of those were simple patch fixes, I think.

To spread the knowledge, perhaps another committer could be the release manager (RM) for the next release.  Hint hint *James? ;)

--Pei

> -----Original Message-----
> From: Masanz, James J. [mailto:Masanz.James@mayo.edu]
> Sent: Friday, April 12, 2013 4:41 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Next cTAKES release (3.1)?
> 
> 
> The new CEM Instance Template population is not complete yet, but if 3.1 is
> late May or June, it will be.
> 
> Also, is the GUI close enough to being ready for prime time that it would
> have a chance to be in 3.1?
> 
> -- James
> 
> 
> > -----Original Message-----
> > From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org
> > [mailto:dev- return-1506-Masanz.James=mayo.edu@ctakes.apache.org]
> On
> > Behalf Of Chen, Pei
> > Sent: Thursday, April 11, 2013 7:56 PM
> > To: dev@ctakes.apache.org
> > Subject: Next cTAKES release (3.1)?
> >
> > Hi,
> > I just wanted to gauge the interest of creating the next release of
> > cTAKES
> > (3.1) which is currently marked for May in Jira-
> >
> > There have already been 22/53 issues [1] marked as fixed or closed.
> > Plenty of bug fixes and new components including:
> > - New CEM Instance Template population
> > - New Dependency Parser/Semantic Role Labeler
> > - New optional Clear POSTagger
> > - New regression testing component
> >
> > Should we wait for the Temporal component?
> >
> > [1]
> > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


RE: Next cTAKES release (3.1)?

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Hi Giri,
The idea was to take a GUI [1] that was built largely as a POC/prototype and move it into mainstream trunk.
The original intent was to allow end users to configure a pipeline dynamically and have uimaFIT build and run it... configuration/output is self-contained.

--Pei
[1] http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-gui/

> -----Original Message-----
> From: giri vara prasad nambari [mailto:girinambari@gmail.com]
> Sent: Wednesday, May 08, 2013 10:53 PM
> To: dev@ctakes.apache.org
> Subject: Re: Next cTAKES release (3.1)?
> 
> Hi All,
> 
> Is this code available on public domain?
> 
> Thank you,
> Giri
> 
> 
> On Wed, May 8, 2013 at 3:53 PM, Kannan Thiagarajan
> <ka...@gmail.com>wrote:
> 
> > Hello,
> >
> > Have you guys looked at Twitter Bootstrap -  its based on jQuery and
> > it gives pretty neat set of UI capabilities
> > http://twitter.github.io/bootstrap/
> >
> > BTW, I have used ExtJS in the past and like it very much but recently
> > stumbled upon this and like it very much.
> >
> > Cheers
> >
> >
> > On Wed, May 8, 2013 at 2:19 PM, Chen, Pei
> > <Pei.Chen@childrens.harvard.edu
> > >wrote:
> >
> > > Regarding the GUI-- fyi we may have to rewrite some of the
> > > javascript
> > code
> > > (or use an alternative such as jQuery) as the ASF community
> > > essentially advises to stay away from the Sencha lib for license
> incompatibilities.
> > >
> > > See thread:
> > >
> > >
> > > http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201304.mbox/%3CE306DA35-A3C1-4525-B0F7-81F6DC0450BC%40gmail.com%3E
> > >
> > > --Pei
> > >
> > >
> > > > -----Original Message-----
> > > > From: Chen, Pei [mailto:Pei.Chen@childrens.harvard.edu]
> > > > Sent: Sunday, April 14, 2013 5:08 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: RE: Next cTAKES release (3.1)?
> > > >
> > > > That's a good idea;  I'll see if I can port the web gui over (even
> > > > it's
> > > running the
> > > > pipeline in-process).  Hopefully, it'll be a start of something better.
> > > >
> > > > ________________________________________
> > > > From: Masanz, James J. [Masanz.James@mayo.edu]
> > > > Sent: Friday, April 12, 2013 4:40 PM
> > > > To: 'dev@ctakes.apache.org'
> > > > Subject: RE: Next cTAKES release (3.1)?
> > > >
> > > > The new CEM Instance Template population is not complete yet, but
> > > > if
> > 3.1
> > > is
> > > > late May or June, it will be.
> > > >
> > > > Also, is the GUI close enough to being ready for prime time that
> > > > it
> > would
> > > > have a chance to be in 3.1?
> > > >
> > > > -- James
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: dev-return-1506-
> Masanz.James=mayo.edu@ctakes.apache.org
> > > > > [mailto:dev-
> > > > > return-1506-Masanz.James=mayo.edu@ctakes.apache.org]
> > > > On
> > > > > Behalf Of Chen, Pei
> > > > > Sent: Thursday, April 11, 2013 7:56 PM
> > > > > To: dev@ctakes.apache.org
> > > > > Subject: Next cTAKES release (3.1)?
> > > > >
> > > > > Hi,
> > > > > I just wanted to gauge the interest of creating the next release
> > > > > of cTAKES
> > > > > (3.1) which is currently marked for May in Jira-
> > > > >
> > > > > There have already been 22/53 issues [1] marked as fixed or closed.
> > > > > Plenty of bug fixes and new components including:
> > > > > - New CEM Instance Template population
> > > > > - New Dependency Parser/Semantic Role Labeler
> > > > > - New optional Clear POSTagger
> > > > > - New regression testing component
> > > > >
> > > > > Should we wait for the Temporal component?
> > > > >
> > > > > [1]
> > > > > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Kannan Thiagarajan
> >

Re: Next cTAKES release (3.1)?

Posted by giri vara prasad nambari <gi...@gmail.com>.
Hi All,

Is this code available in the public domain?

Thank you,
Giri


On Wed, May 8, 2013 at 3:53 PM, Kannan Thiagarajan <ka...@gmail.com>wrote:

> Hello,
>
> Have you guys looked at Twitter Bootstrap -  its based on jQuery and it
> gives pretty neat set of UI capabilities
> http://twitter.github.io/bootstrap/
>
> BTW, I have used ExtJS in the past and like it very much but recently
> stumbled upon this and like it very much.
>
> Cheers
>
>
> On Wed, May 8, 2013 at 2:19 PM, Chen, Pei <Pei.Chen@childrens.harvard.edu
> >wrote:
>
> > Regarding the GUI-- fyi we may have to rewrite some of the javascript
> code
> > (or use an alternative such as jQuery) as the ASF community essentially
> > advises to stay away from the Sencha lib for license incompatibilities.
> >
> > See thread:
> >
> >
> http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201304.mbox/%3CE306DA35-A3C1-4525-B0F7-81F6DC0450BC%40gmail.com%3E
> >
> > --Pei
> >
> >
> > > -----Original Message-----
> > > From: Chen, Pei [mailto:Pei.Chen@childrens.harvard.edu]
> > > Sent: Sunday, April 14, 2013 5:08 PM
> > > To: dev@ctakes.apache.org
> > > Subject: RE: Next cTAKES release (3.1)?
> > >
> > > That's a good idea;  I'll see if I can port the web gui over (even it's
> > running the
> > > pipeline in-process).  Hopefully, it'll be a start of something better.
> > >
> > > ________________________________________
> > > From: Masanz, James J. [Masanz.James@mayo.edu]
> > > Sent: Friday, April 12, 2013 4:40 PM
> > > To: 'dev@ctakes.apache.org'
> > > Subject: RE: Next cTAKES release (3.1)?
> > >
> > > The new CEM Instance Template population is not complete yet, but if
> 3.1
> > is
> > > late May or June, it will be.
> > >
> > > Also, is the GUI close enough to being ready for prime time that it
> would
> > > have a chance to be in 3.1?
> > >
> > > -- James
> > >
> > >
> > > > -----Original Message-----
> > > > From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org
> > > > [mailto:dev- return-1506-Masanz.James=mayo.edu@ctakes.apache.org]
> > > On
> > > > Behalf Of Chen, Pei
> > > > Sent: Thursday, April 11, 2013 7:56 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Next cTAKES release (3.1)?
> > > >
> > > > Hi,
> > > > I just wanted to gauge the interest of creating the next release of
> > > > cTAKES
> > > > (3.1) which is currently marked for May in Jira-
> > > >
> > > > There have already been 22/53 issues [1] marked as fixed or closed.
> > > > Plenty of bug fixes and new components including:
> > > > - New CEM Instance Template population
> > > > - New Dependency Parser/Semantic Role Labeler
> > > > - New optional Clear POSTagger
> > > > - New regression testing component
> > > >
> > > > Should we wait for the Temporal component?
> > > >
> > > > [1]
> > > > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >
> >
>
>
> --
> Best Regards
> Kannan Thiagarajan
>

Re: Next cTAKES release (3.1)?

Posted by Kannan Thiagarajan <ka...@gmail.com>.
Hello,

Have you guys looked at Twitter Bootstrap? It's based on jQuery and it
gives a pretty neat set of UI capabilities
http://twitter.github.io/bootstrap/

BTW, I have used ExtJS in the past and like it very much but recently
stumbled upon this and like it very much.

Cheers


On Wed, May 8, 2013 at 2:19 PM, Chen, Pei <Pe...@childrens.harvard.edu>wrote:

> Regarding the GUI-- fyi we may have to rewrite some of the javascript code
> (or use an alternative such as jQuery) as the ASF community essentially
> advises to stay away from the Sencha lib for license incompatibilities.
>
> See thread:
>
> http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201304.mbox/%3CE306DA35-A3C1-4525-B0F7-81F6DC0450BC%40gmail.com%3E
>
> --Pei
>
>
> > -----Original Message-----
> > From: Chen, Pei [mailto:Pei.Chen@childrens.harvard.edu]
> > Sent: Sunday, April 14, 2013 5:08 PM
> > To: dev@ctakes.apache.org
> > Subject: RE: Next cTAKES release (3.1)?
> >
> > That's a good idea;  I'll see if I can port the web gui over (even it's
> running the
> > pipeline in-process).  Hopefully, it'll be a start of something better.
> >
> > ________________________________________
> > From: Masanz, James J. [Masanz.James@mayo.edu]
> > Sent: Friday, April 12, 2013 4:40 PM
> > To: 'dev@ctakes.apache.org'
> > Subject: RE: Next cTAKES release (3.1)?
> >
> > The new CEM Instance Template population is not complete yet, but if 3.1
> is
> > late May or June, it will be.
> >
> > Also, is the GUI close enough to being ready for prime time that it would
> > have a chance to be in 3.1?
> >
> > -- James
> >
> >
> > > -----Original Message-----
> > > From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org
> > > [mailto:dev- return-1506-Masanz.James=mayo.edu@ctakes.apache.org]
> > On
> > > Behalf Of Chen, Pei
> > > Sent: Thursday, April 11, 2013 7:56 PM
> > > To: dev@ctakes.apache.org
> > > Subject: Next cTAKES release (3.1)?
> > >
> > > Hi,
> > > I just wanted to gauge the interest of creating the next release of
> > > cTAKES
> > > (3.1) which is currently marked for May in Jira-
> > >
> > > There have already been 22/53 issues [1] marked as fixed or closed.
> > > Plenty of bug fixes and new components including:
> > > - New CEM Instance Template population
> > > - New Dependency Parser/Semantic Role Labeler
> > > - New optional Clear POSTagger
> > > - New regression testing component
> > >
> > > Should we wait for the Temporal component?
> > >
> > > [1]
> > > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>
>


-- 
Best Regards
Kannan Thiagarajan

RE: Next cTAKES release (3.1)?

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Regarding the GUI -- FYI, we may have to rewrite some of the JavaScript code (or use an alternative such as jQuery), as the ASF community essentially advises staying away from the Sencha library because of license incompatibilities.

See thread:
http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201304.mbox/%3CE306DA35-A3C1-4525-B0F7-81F6DC0450BC%40gmail.com%3E

--Pei


> -----Original Message-----
> From: Chen, Pei [mailto:Pei.Chen@childrens.harvard.edu]
> Sent: Sunday, April 14, 2013 5:08 PM
> To: dev@ctakes.apache.org
> Subject: RE: Next cTAKES release (3.1)?
> 
> That's a good idea;  I'll see if I can port the web gui over (even it's running the
> pipeline in-process).  Hopefully, it'll be a start of something better.
> 
> ________________________________________
> From: Masanz, James J. [Masanz.James@mayo.edu]
> Sent: Friday, April 12, 2013 4:40 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Next cTAKES release (3.1)?
> 
> The new CEM Instance Template population is not complete yet, but if 3.1 is
> late May or June, it will be.
> 
> Also, is the GUI close enough to being ready for prime time that it would
> have a chance to be in 3.1?
> 
> -- James
> 
> 
> > -----Original Message-----
> > From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org
> > [mailto:dev- return-1506-Masanz.James=mayo.edu@ctakes.apache.org]
> On
> > Behalf Of Chen, Pei
> > Sent: Thursday, April 11, 2013 7:56 PM
> > To: dev@ctakes.apache.org
> > Subject: Next cTAKES release (3.1)?
> >
> > Hi,
> > I just wanted to gauge the interest of creating the next release of
> > cTAKES
> > (3.1) which is currently marked for May in Jira-
> >
> > There have already been 22/53 issues [1] marked as fixed or closed.
> > Plenty of bug fixes and new components including:
> > - New CEM Instance Template population
> > - New Dependency Parser/Semantic Role Labeler
> > - New optional Clear POSTagger
> > - New regression testing component
> >
> > Should we wait for the Temporal component?
> >
> > [1]
> > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
I think that would be great. 

-- James


> -----Original Message-----
> From: dev-return-1516-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-
> return-1516-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen,
> Pei
> Sent: Sunday, April 14, 2013 4:08 PM
> To: dev@ctakes.apache.org
> Subject: RE: Next cTAKES release (3.1)?
> 
> That's a good idea;  I'll see if I can port the web gui over (even it's
> running the pipeline in-process).  Hopefully, it'll be a start of
> something better.
> 
> ________________________________________
> From: Masanz, James J. [Masanz.James@mayo.edu]
> Sent: Friday, April 12, 2013 4:40 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: Next cTAKES release (3.1)?
> 
> The new CEM Instance Template population is not complete yet, but if 3.1
> is late May or June, it will be.
> 
> Also, is the GUI close enough to being ready for prime time that it would
> have a chance to be in 3.1?
> 
> -- James
> 
> 
> > -----Original Message-----
> > From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org
> > [mailto:dev- return-1506-Masanz.James=mayo.edu@ctakes.apache.org] On
> > Behalf Of Chen, Pei
> > Sent: Thursday, April 11, 2013 7:56 PM
> > To: dev@ctakes.apache.org
> > Subject: Next cTAKES release (3.1)?
> >
> > Hi,
> > I just wanted to gauge the interest of creating the next release of
> > cTAKES
> > (3.1) which is currently marked for May in Jira-
> >
> > There have already been 22/53 issues [1] marked as fixed or closed.
> > Plenty of bug fixes and new components including:
> > - New CEM Instance Template population
> > - New Dependency Parser/Semantic Role Labeler
> > - New optional Clear POSTagger
> > - New regression testing component
> >
> > Should we wait for the Temporal component?
> >
> > [1]
> > https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


RE: Next cTAKES release (3.1)?

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
That's a good idea; I'll see if I can port the web GUI over (even if it's running the pipeline in-process).  Hopefully, it'll be a start of something better.

________________________________________
From: Masanz, James J. [Masanz.James@mayo.edu]
Sent: Friday, April 12, 2013 4:40 PM
To: 'dev@ctakes.apache.org'
Subject: RE: Next cTAKES release (3.1)?

The new CEM Instance Template population is not complete yet, but if 3.1 is late May or June, it will be.

Also, is the GUI close enough to being ready for prime time that it would have a chance to be in 3.1?

-- James


> -----Original Message-----
> From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-
> return-1506-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen,
> Pei
> Sent: Thursday, April 11, 2013 7:56 PM
> To: dev@ctakes.apache.org
> Subject: Next cTAKES release (3.1)?
>
> Hi,
> I just wanted to gauge the interest of creating the next release of cTAKES
> (3.1) which is currently marked for May in Jira-
>
> There have already been 22/53 issues [1] marked as fixed or closed.
> Plenty of bug fixes and new components including:
> - New CEM Instance Template population
> - New Dependency Parser/Semantic Role Labeler
> - New optional Clear POSTagger
> - New regression testing component
>
> Should we wait for the Temporal component?
>
> [1]
> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
The new CEM Instance Template population is not complete yet, but if 3.1 is late May or June, it will be.

Also, is the GUI close enough to being ready for prime time that it would have a chance to be in 3.1?

-- James


> -----Original Message-----
> From: dev-return-1506-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-
> return-1506-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen,
> Pei
> Sent: Thursday, April 11, 2013 7:56 PM
> To: dev@ctakes.apache.org
> Subject: Next cTAKES release (3.1)?
> 
> Hi,
> I just wanted to gauge the interest of creating the next release of cTAKES
> (3.1) which is currently marked for May in Jira-
> 
> There have already been 22/53 issues [1] marked as fixed or closed.
> Plenty of bug fixes and new components including:
> - New CEM Instance Template population
> - New Dependency Parser/Semantic Role Labeler
> - New optional Clear POSTagger
> - New regression testing component
> 
> Should we wait for the Temporal component?
> 
> [1]
> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
Hi Dr. Green:

Your clinical knowledge is almost certainly greater than that of most or all of the cTAKES developers here.
You could make a valuable contribution!

There are many instances where we programmer types simply don't approach a problem because we don't have the medical experience.
MEDS are a huge challenge for us -- we never went to med school -- so we can't really say if an NER-extracted medication makes sense given the disease and procedure.

Example: 
Consider the following list of extracted (NER) medical concepts:
* Medication: Vioxx
* Procedure: Heart surgery / CABG
* Smoking status: non-smoker
* Diagnoses: acute conditions heart and/or lung? (Acute MI or COPD?)

These kinds of questions are very difficult to answer with NLP and worthy of research.  
I'm assuming the answer should be heart/AMI  and not lung/COPD. But this is the end of the road of my medical understanding. 

Speaking for myself, these insights are typically beyond the reasoning of programmers. 
In my work, I typically rely on existing expert medical ontologies to reason about these things, such as diagnosis and procedure trees:
http://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp
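
To make the idea concrete, here is a minimal sketch in Java of an ontology-backed check for the example above. Everything in it is an assumption made for illustration: the class name, the method name, and the tiny hand-written category map are hypothetical and are not taken from CCS, cTAKES, or any real ontology file. It only shows the shape of the reasoning: look up the category of the extracted procedure, then prefer the candidate diagnosis that falls in the same category.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ConceptCategorySketch {

    // Toy stand-in for an ontology lookup (term -> body-system category).
    // Hand-written for illustration only; NOT real CCS data.
    static final Map<String, String> CATEGORY = Map.of(
            "CABG", "circulatory",
            "Acute MI", "circulatory",
            "COPD", "respiratory");

    // Return the first candidate diagnosis whose category matches the
    // category of the extracted procedure, or empty if nothing matches.
    static Optional<String> pickDiagnosis(String procedure, List<String> candidates) {
        String procedureCategory = CATEGORY.get(procedure);
        if (procedureCategory == null) {
            return Optional.empty(); // unknown procedure: make no guess
        }
        return candidates.stream()
                .filter(d -> procedureCategory.equals(CATEGORY.get(d)))
                .findFirst();
    }

    public static void main(String[] args) {
        Optional<String> choice = pickDiagnosis("CABG", List.of("COPD", "Acute MI"));
        System.out.println(choice.orElse("no match")); // prints "Acute MI"
    }
}
```

A real version would replace the hard-coded map with a lookup against an actual knowledge source, and would need clinical review of whatever rule is used to rank the candidates.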

I wonder if there is some way in which you could guide us in making better use of the medical knowledge sources (ontologies) that are available.

Curious if this stirs any thoughts for you, 
--Andy 


 

On Jul 2, 2013, at 5:19 PM, John Green <jo...@gmail.com> wrote:

> Hi all,
> 
> I've been following this mailing list for a couple of months. I'm a third-year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and continue my own projects. I'm very interested in contributing eventually to cTAKES development. In the meantime, given the current talk of examples, if any domain-specific examples need to be generated, I am domain-knowledgeable enough that I could pound out a few free-text notes made to order.
> 
> Let me know; you all may already have docs on hand willing to do this, but if not...
> 
> John Green
> 
> Sent from my iPhone
> 
> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
> 
>> I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different use cases here and to understand where some of the areas that need improvement are (which we hadn't thought about earlier).
>> I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully they will attract new users, adopters, and perhaps more committers.
>> 
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>> 
>> I think Java developers are accustomed to including a library as a dependency/jar, have an API to pass input, and get the results via pojos;  So the examples could initially shield the complexity of wiring a pipeline together etc.  
>> If we can improve the API's and how it gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
>> 
>> --Pei
>> 
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>> Sent: Friday, June 28, 2013 8:00 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Next cTAKES release (3.1)?
>>> 
>>> Very interesting discussion. I think Giri is right about giving example training
>>> data in the format that our training code can read. While our ultimate goal
>>> would be to build and release models that are completely domain-
>>> independent, in the real world it is almost always better to use some
>>> domain-specific data and we should think more about how to facilitate that.
>>> 
>>> As for making it easier to get started, it is not totally clear to me what this
>>> means/how to do it so it might be useful to get specific about what this
>>> means. I think our biggest hurdle is
>>> 
>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>> 
>>> Since UIMAFit is officially becoming part of UIMA that will be easier, and
>>> hopefully people will just learn the easier (in my opinion) UIMAFit way than
>>> the standard UIMA way of doing things. Is there something we can be doing
>>> to make understanding UIMA easier? Or do we just need to say upfront that
>>> this is a prerequisite and hope that people don't give up due to this thing that
>>> is out of our control?
>>> 
>>> Another hurdle is:
>>> 
>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>> 
>>> So it's not just a matter of hiding complexity -- at some point people have to
>>> understand their problem, understand cTAKES' capabilities, and start coding.
>>> Pei's GUI will help for some common use cases but will not remove the
>>> requirement that someone at the organization knows cTAKES.
>>> I think one part of this problem is the fact that the typesystem is not well
>>> documented. A developer needs to know what the output is (objects from
>>> the typesystem), how to get them (which modules/pipelines), and what
>>> information is in them. So maybe on this end my recommendation would be:
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>>> 
>>> This is quite a bit of developer overhead, so it's worth asking whether you
>>> agree with my "diagnosis" and "treatment" or whether you think there are
>>> different problems/solutions that should be higher priority.
>>> 
>>> Tim
>>> 
>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>> Hi Vijay and Andy,
>>>> 
>>>> Thanks for sharing those examples.
>>>> 
>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>> 
>>>> Agree with this statement and this is very valid concern.
>>>> 
>>>> In "getting started examples", I think we should just have couple of
>>>> entries (5-10 small entries), not more than that (with explicit
>>>> statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I
>>> understand
>>>> handcrafting these may not be easy because we are not medical domain
>>>> experts, but I feel worth time, because it brings in more user community.
>>>> 
>>>> Thank you,
>>>> Giri
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
>>> <mc...@gmail.com>wrote:
>>>> 
>>>>> GREAT !
>>>>> 
>>>>> The i2b2 data though isn't publicly distributable, you still need to
>>>>> request access to it since it is "semi private"
>>>>> 
>>>>> 
>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>> 
>>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>>> and I2B2
>>>>>> 2008 challenges:
>>>>>> 
>>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>> 
>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>>> first version of ARC:
>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>> 
>>>>>> These are both publicly available datasets, and represent real-world
>>>>>> problems (in general I believe when publishing a paper the code
>>>>>> should be reproducible and made publicly available, but that's a different
>>> issue).
>>>>>> 
>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>>> upgrade these samples as well.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> VJ
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
>>>>>> <mcmurry.andy@gmail.com
>>>>>> wrote:
>>>>>> 
>>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>>> +NLP
>>>>>>> datasets.
>>>>>>> 
>>>>>>> I have at least one we can use that was created by our lead
>>>>>>> Pathologist
>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>> We should provide at least one sample for each domain.
>>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>>> and not copy-pasted from EMR systems.
>>>>>>> 
>>>>>>> --Andy
>>>>>>> 
>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
>>>>> girinambari@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> +1 for this observation Andy!
>>>>>>>> 
>>>>>>>> Lowering the time will motivate users to write blogs about features,
>>>>>>>> how-tos, etc., which reduces the core team's workload on documentation.
>>>>>>>> 
>>>>>>>> I have been trying to write a small "how to write standalone
>>>>>>>> client for ctakes" with my experience (I saw at least 4 users
>>>>>>>> posted similar
>>>>>>> question
>>>>>>>> in last 2 months), but not getting enough time because ctakes
>>>>>>>> depends
>>>>> on
>>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,),
>>>>>>>> most
>>>>> of
>>>>>>>> my spare time is being spent on juggling between these frameworks,
>>>>>>> posting
>>>>>>>> and browsing those forums, relating observations to ctakes code. I
>>>>> think
>>>>>>> we
>>>>>>>> need to have some high level documentation about these (with links
>>>>>>>> to corresponding forums).
>>>>>>>> 
>>>>>>>> Above case is for developers (I think this will be more user base
>>>>>>>> as
>>>>>>> ctakes
>>>>>>>> progress), for users I think documentation is lot better though
>>>>>>>> some improvements need to be done.
>>>>>>>> 
>>>>>>>> As a developer I felt tough with lack of sample training data (I
>>>>>>>> am
>>>>> still
>>>>>>>> struggling in this area even though I browsed all relevant code),
>>>>> though
>>>>>>>> training class are there. I understood that there are licensing
>>>>>>>> issues
>>>>>>> with
>>>>>>>> REAL data, but at least some hand made example sentences, which
>>>>>>>> may not
>>>>>>> be
>>>>>>>> real but helps developers in understanding the type/structure of
>>>>>>>> input TRAINING classes expecting. This way people who browse the
>>>>>>>> code can
>>>>>>> reverse
>>>>>>>> engineer and develop their own models. Sorry if you guys feel this
>>>>>>>> as novice issue, but I feel most of the developers will be novice
>>>>>>>> when
>>>>> they
>>>>>>>> adopt a system, and Machine Learning/NLP is an ocean. Some
>>>>>>>> documentation in this area will save a lot of time for us.
>>>>>>>> 
>>>>>>>> I wish there will be some activity in this area from ctakes core team.
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> Giri
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
>>>>>>>> <mcmurry.andy@gmail.com
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> ctakes is at a point where we have a LOT of features but it is
>>>>>>>>> still
>>>>>>> hard
>>>>>>>>> to get started.
>>>>>>>>> 
>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>>>> obvious
>>>>>>>>> and requires hand holding.
>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>> 
>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>>> better
>>>>>>> bug
>>>>>>>>> reports, FAQ, etc.
>>>>>>>>> 
>>>>>>>>> thoughts?
>>>>>>>>> --Andy
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>>>> Pei.Chen@childrens.harvard.edu>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> I just wanted to gauge the interest of creating the next release
>>>>>>>>>> of
>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>> - New regression testing component
>>>>>>>>>> 
>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>> 


Re: Next cTAKES release (3.1)?

Posted by Girivaraprasad Nambari <gi...@gmail.com>.
I think we are close to solving the sample data issue. I could help annotate
text (with help from John and other team members when terminology
clarification is required) if someone can provide a template or some notes
on how to do it.

I think this leaves the core team free to concentrate on fine-tuning the documentation.

Thank you,
Giri



On Jul 3, 2013 7:59 PM, "John Green" <jo...@gmail.com> wrote:

> I see. Its a pretty random collection of formats.
>
> Sent from my iPhone
>
> On Jul 3, 2013, at 18:25, andy mcmurry <mc...@gmail.com> wrote:
>
> > Mtsamples has lots of free public examples already but we aren't using
> them
> > yet.  This is probably because mtsamples don't have the annotations we
> need
> > to use them as training examples.
> > On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <hephaestus.studio@gmail.com
> >
> > wrote:
> >
> >> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, one
> >> more year!
> >>
> >> - Apropos meds or clinical type questions: any developer on here can feel
> >> free to shoot me a quick question via the list anytime; I'd be happy to
> >> confirm that a drug or anything else makes sense given a particular
> >> clinical/note context.
> >>
> >> - "I wonder if there is someway in which you could guide us in making
> >> better use of the medical knowledge sources (ontologies) that are
> >> available." - I'd be happy to brainstorm about using existing resources
> to
> >> help in decision making. We use these all the time in the clinic.
> >>
> >> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into
> the
> >> code, though I hope to over the next year; so, what kind of examples
> would
> >> be most helpful?
> >>    - Any particular disease processes?
> >>    - Are you all familiar with the ubiquitous SOAP style presentation
> >> that doctors use to write free notes? The few examples I clicked
> through in
> >> the repository that Chen pointed me too are very sparse. Would we want
> >> gradations? E.g., a scale for "well done" notes to "very quick
> >> I-dont-care-because-I'm-in-a-rush" notes?
> >>
> >> @ Chen - Thank you for the kind words. It's nice to be welcomed by a
> >> community in which you hope to integrate. And thank you for pointing me
> to
> >> the directory with the current sample notes. This was very helpful in
> >> determining where those are at in there development. I know that each of
> >> your hospitals have a wealth of HIPAA-closed notes, but I'll see what I
> can
> >> do to make some "stereotypical" open-notes for common disease
> >> presentations. Again: maybe a scale, not necessarily just on brevity but
> >> some other metric, whose continuum represented various permutations of
> >> degrees of something, maybe of difficulty in processing? Apropos code,
> >> Chen: I will help where I can but where I want to be is elbow deep in
> the
> >> code :)
> >>
> >> Finally: I haven't had a chance to look into some of the links from
> >> earlier in this thread regarding open access repositories of free text
> >> clinical notes: what do you all feel the quality of these resources are?
> >> Abundant but low quality? Paucity but those that are there are high
> quality?
> >>
> >> Bottom line: no problem either answering contextual questions (can afib
> be
> >> associated with a lower gi bleed??) and no problem writing some notes,
> only
> >> question would be, before I put in any time: what disease/specialty
> domain?
> >> and would we want some system that put them on a continuum of some
> >> variable, say, brevity or "readability"?
> >>
> >> Just thinking before leaping,
> >>
> >> Thanks,
> >> JG
> >>
> >> Sent from my iPhone
> >>
> >> On Jul 2, 2013, at 21:23, "Chen, Pei" <Pe...@childrens.harvard.edu>
> >> wrote:
> >>
> >>> Hi John,
> >>> Welcome!  There are actually many ways to contribute and it's not
> >> limited to just code.  It's always great to hear new ideas and
> suggestions
> >> on how to improve the software.  Therefore even, things like user
> feedback,
> >> documentation, new use cases, essentially anything that will make things
> >> better would be awesome!
> >>>
> >>> To get started, I would suggest subscribing to the email lists.  If you
> >> would like to contribute anything, just create an Jira account (anyone
> >> should be able to do this), and add/review Jira items (add attachments
> if
> >> you like) and we can even help integrate it.
> >>>
> >>> We normally use Jira to keep track of issues:
> >>> [1] https://issues.apache.org/jira/browse/ctakes
> >>>
> >>> Current collection of sample test notes that have been collected over
> >> the years:
> >>
> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
> >>>
> >>> ________________________________________
> >>> From: Tim Miller [timothy.miller@childrens.harvard.edu]
> >>> Sent: Tuesday, July 02, 2013 6:31 PM
> >>> To: dev@ctakes.apache.org
> >>> Subject: Re: Next cTAKES release (3.1)?
> >>>
> >>> Agreed that you could definitely help out, and that would be a great
> way
> >>> to do so. We don't really have "examples" right now, more like just
> >>> short test sentences for showing simple results and verifying that
> >>> nothing has been broken by changes. I think regular length fake but
> >>> realistic notes would be very useful.
> >>> Tim
> >>>
> >>> On 07/02/2013 05:19 PM, John Green wrote:
> >>>> Hi all,
> >>>>
> >>>> Ive been following this mail list for a couple of months. Im a third
> >> year medical student rounding the bend toward my MD. I used to be a
> >> computer programmer, however, and continue my own projects. Im very
> >> interested in contributing eventually to cTakes development. In the
> >> meantime, given the current talk of examples, if any domain specific
> >> examples needed generated I am domain knowledgable enough that I could
> >> pound out a few free text notes made to order.
> >>>>
> >>>> Let me know, you all may already have docs on hand willing todo this,
> >> but if not...
> >>>>
> >>>> John Green
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pei.Chen@childrens.harvard.edu
> >
> >> wrote:
> >>>>
> >>>>> I completely agree with making cTAKES easier use.  I think it is
> >> exciting to hear the different use cases here and understanding where
> some
> >> of the areas that need improvements are (which we haven't thought about
> >> earlier).
> >>>>> I think Tim's suggestions and the 3 concrete actionable items makes a
> >> lot of sense.  Hopefully it should attract new users, adopters, and
> perhaps
> >> more committers.
> >>>>>
> >>>>>> i) Make the typesystem forefront in documentation -- generate
> >> javadocs and
> >>>>>> have as a link on the ctakes frontpage/sidebar
> >>>>>> ii) Similar to the way that we are aiming to have tests in every
> >> module, also
> >>>>>> have clearly labeled examples in every module that set up a
> pipeline,
> >> run on
> >>>>>> sample notes (could be the same sample notes from the tests), and do
> >>>>>> something with the results.
> >>>>>> iii) Follow Giri's recommendation to have example training data for
> >> people
> >>>>>> who want to take the next step and train their own models
> >>>>> I think Java developers are accustomed to including a library as a
> >> dependency/jar, have an API to pass input, and get the results via
> pojos;
> >> So the examples could initially shield the complexity of wiring a
> pipeline
> >> together etc.
> >>>>> If we can improve the API's and how it gets integrated with other
> >> apps, we can add any GUI/CLI tools on top of this afterwards.
> >>>>>
> >>>>> --Pei
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> >>>>>> Sent: Friday, June 28, 2013 8:00 AM
> >>>>>> To: dev@ctakes.apache.org
> >>>>>> Subject: Re: Next cTAKES release (3.1)?
> >>>>>>
> >>>>>> Very interesting discussion. I think Giri is right about giving
> >> example training
> >>>>>> data in the format that our training code can read. While our
> >> ultimate goal
> >>>>>> would be to build and release models that are completely domain-
> >>>>>> independent, in the real world it is almost always better to use
> some
> >>>>>> domain-specific data and we should think more about how to
> facilitate
> >> that.
> >>>>>>
> >>>>>> As for making it easier to get started, it is not totally clear to
> me
> >> what this
> >>>>>> means/how to do it so it might be useful to get specific about what
> >> this
> >>>>>> means. I think our biggest hurdle is
> >>>>>>
> >>>>>> 1) Prerequisite of understanding UIMA/UIMAFit
> >>>>>>
> >>>>>> Since UIMAFit is officially becoming part of UIMA that will be
> >> easier, and
> >>>>>> hopefully people will just learn the easier (in my opinion) UIMAFit
> >> way than
> >>>>>> the standard UIMA way of doing things. Is there something we can be
> >> doing
> >>>>>> to make understanding UIMA easier? Or do we just need to say upfront
> >> that
> >>>>>> this is a prerequisite and hope that people don't give up due to
> this
> >> thing that
> >>>>>> is out of our control?
> >>>>>>
> >>>>>> Another hurdle is:
> >>>>>>
> >>>>>> 2) cTAKES is a multi-purpose developer-aimed tool
> >>>>>>
> >>>>>> So it's not just a matter of hiding complexity -- at some point
> >> people have to
> >>>>>> understand their problem, understand cTAKES' capabilities, and start
> >> coding.
> >>>>>> Pei's GUI will help for some common use cases but will not remove
> the
> >>>>>> requirement that someone at the organization knows cTAKES.
> >>>>>> I think one part of this problem is the fact that the typesystem is
> >> not well
> >>>>>> documented. A developer needs to know what the output is (objects
> from
> >>>>>> the typesystem), how to get them (which modules/pipelines), and what
> >>>>>> information is in them. So maybe on this end my recommendation would
> >> be:
> >>>>>> i) Make the typesystem forefront in documentation -- generate
> >> javadocs and
> >>>>>> have as a link on the ctakes frontpage/sidebar
> >>>>>> ii) Similar to the way that we are aiming to have tests in every
> >> module, also
> >>>>>> have clearly labeled examples in every module that set up a
> pipeline,
> >> run on
> >>>>>> sample notes (could be the same sample notes from the tests), and do
> >>>>>> something with the results.
> >>>>>> iii) Follow Giri's recommendation to have example training data for
> >> people
> >>>>>> who want to take the next step and train their own models
> >>>>>>
> >>>>>> This is quite a bit of developer overhead, so it's worth asking
> >> whether you
> >>>>>> agree with my "diagnosis" and "treatment" or whether you think there
> >> are
> >>>>>> different problems/solutions that should be higher priority.
> >>>>>>
> >>>>>> Tim
> >>>>>>
> >>>>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> >>>>>>> Hi Vijay and Andy,
> >>>>>>>
> >>>>>>> Thanks for sharing those examples.
> >>>>>>>
> >>>>>>> "Trouble is, privacy requires that these examples be made up by
> hand"
> >>>>>>>
> >>>>>>> Agree with this statement and this is very valid concern.
> >>>>>>>
> >>>>>>> In "getting started examples", I think we should just have couple
> of
> >>>>>>> entries (5-10 small entries), not more than that (with explicit
> >>>>>>> statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I
> >>>>>> understand
> >>>>>>> handcrafting these may not be easy because we are not medical
> domain
> >>>>>>> experts, but I feel worth time, because it brings in more user
> >> community.
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Giri
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
> >>>>>> <mc...@gmail.com>wrote:
> >>>>>>>> GREAT !
> >>>>>>>>
> >>>>>>>> The i2b2 data though isn't publicly distributable, you still need
> to
> >>>>>>>> request access to it since it is "semi private"
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>>> We released code on using cTAKES to annotate clinical text and
> SVMs
> >>>>>>>>> that use the annotations to classify clinical text from the CMC
> >> 2007
> >>>>>>>>> and I2B2
> >>>>>>>>> 2008 challenges:
> >>>>>>>>>
> >>>>>>>>> We did the cmd 2007 with cTAKES 2.5:
> >>>>>>
> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Repr
> >>>>>> o
> >>>>>>>> ducing_results_on_CMC_2007_challenge
> >>>>>>>> <https://code.google.com/p/ytex/downloads/list>
> >>>>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
> >>>>>>>>> first version of ARC:
> >>>>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>>>>>>>
> >>>>>>>>> These are both publicly available datasets, and represent
> >> real-world
> >>>>>>>>> problems (in general I believe when publishing a paper the code
> >>>>>>>>> should be reproducible and made publicly available, but that's a
> >> different
> >>>>>> issue).
> >>>>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like
> >> to
> >>>>>>>>> upgrade these samples as well.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> VJ
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
> >>>>>>>>> <mcmurry.andy@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> +1 suggestion for documenting many examples of "getting started"
> >>>>>>>>>> +NLP
> >>>>>>>>>> datasets.
> >>>>>>>>>>
> >>>>>>>>>> I have at least one we can use that was created by our lead
> >>>>>>>>>> Pathologist
> >>>>>>
> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cas
> >>>>>>>> es/train/traincase.xml
> >>>>>>>>>> We should provide at least one sample for each domain.
> >>>>>>>>>> Trouble is, privacy requires that these examples be made up by
> >> hand
> >>>>>>>>>> and not copy-pasted from EMR systems.
> >>>>>>>>>>
> >>>>>>>>>> --Andy
> >>>>>>>>>>
> >>>>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
> >>>>>>>> girinambari@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> +1 for this observation Andy!
> >>>>>>>>>>>
> >>>>>>>>>>> Lowering time will motive users in writing blogs about
> features,
> >>>>>>>>>>> how
> >>>>>>>> to,
> >>>>>>>>>>> etc., which reduces core team work load on documentation.
> >>>>>>>>>>>
> >>>>>>>>>>> I have been trying to write a small "how to write standalone
> >>>>>>>>>>> client for ctakes" with my experience (I saw at least 4 users
> >>>>>>>>>>> posted similar
> >>>>>>>>>> question
> >>>>>>>>>>> in last 2 months), but not getting enough time because ctakes
> >>>>>>>>>>> depends
> >>>>>>>> on
> >>>>>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework
> etc.,),
> >>>>>>>>>>> most
> >>>>>>>> of
> >>>>>>>>>>> my spare time is being spent on juggling between these
> >> frameworks,
> >>>>>>>>>> posting
> >>>>>>>>>>> and browsing those forums, relating observations to ctakes
> code.
> >> I
> >>>>>>>> think
> >>>>>>>>>> we
> >>>>>>>>>>> need to have some high level documentation about these (with
> >> links
> >>>>>>>>>>> to corresponding forums).
> >>>>>>>>>>>
> >>>>>>>>>>> Above case is for developers (I think this will be more user
> base
> >>>>>>>>>>> as
> >>>>>>>>>> ctakes
> >>>>>>>>>>> progress), for users I think documentation is lot better though
> >>>>>>>>>>> some improvements need to be done.
> >>>>>>>>>>>
> >>>>>>>>>>> As a developer I felt tough with lack of sample training data
> (I
> >>>>>>>>>>> am
> >>>>>>>> still
> >>>>>>>>>>> struggling in this area even though I browsed all relevant
> code),
> >>>>>>>> though
> >>>>>>>>>>> training class are there. I understood that there are licensing
> >>>>>>>>>>> issues
> >>>>>>>>>> with
> >>>>>>>>>>> REAL data, but at least some hand made example sentences, which
> >>>>>>>>>>> may not
> >>>>>>>>>> be
> >>>>>>>>>>> real but helps developers in understanding the type/structure
> of
> >>>>>>>>>>> input TRAINING classes expecting. This way people who browse
> the
> >>>>>>>>>>> code can
> >>>>>>>>>> reverse
> >>>>>>>>>>> engineer and develop their own models. Sorry if you guys feel
> >> this
> >>>>>>>>>>> as novice issue, but I feel most of the developers will be
> novice
> >>>>>>>>>>> when
> >>>>>>>> they
> >>>>>>>>>>> adopt a system and Machine Learning/NLP is ocean. Some
> >>>>>>>>>>> documentation in this area will same lot of time for us.
> >>>>>>>>>>>
> >>>>>>>>>>> I wish there will be some activity in this area from ctakes
> core
> >> team.
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you,
> >>>>>>>>>>> Giri
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
> >>>>>>>>>>> <mcmurry.andy@gmail.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> ctakes is at a point where we have a LOT of features but it is
> >>>>>>>>>>>> still
> >>>>>>>>>> hard
> >>>>>>>>>>>> to get started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is
> not
> >>>>>>>> obvious
> >>>>>>>>>>>> and requires hand holding.
> >>>>>>>>>>>> This is very typical in early FOSS projects.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Lowering the time to get invested in ctakes gets more users
> AND
> >>>>>>>>>>>> better
> >>>>>>>>>> bug
> >>>>>>>>>>>> reports, FAQ, etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> thoughts?
> >>>>>>>>>>>> --Andy
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
> >>>>>>>>>> Pei.Chen@childrens.harvard.edu>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>> I just wanted to gauge the interest of creating the next
> >> release
> >>>>>>>>>>>>> of
> >>>>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
> >>>>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or
> >> closed.
> >>>>>>>>>>>> Plenty of bug fixes and new components including:
> >>>>>>>>>>>>> - New CEM Instance Template population
> >>>>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>>>>>>>> - New optional Clear POSTagger
> >>>>>>>>>>>>> - New regression testing component
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Should we wait for the Temporal component?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]
> >>>>>>
> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%
> >>>>>>>> 22%20AND%20project%20%3D%20CTAKES
> >>
>

Re: Next cTAKES release (3.1)?

Posted by John Green <jo...@gmail.com>.
I see. It's a pretty random collection of formats.

Sent from my iPhone

On Jul 3, 2013, at 18:25, andy mcmurry <mc...@gmail.com> wrote:

> Mtsamples has lots of free public examples already but we aren't using them
> yet.  This is probably because mtsamples don't have the annotations we need
> to use them as training examples.
> On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <he...@gmail.com>
> wrote:
> 
>> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, one
>> more year!
>> 
>> - Apropos meds or clinical type questions: any developer on here can feel
>> free to shoot me a quick question via the list anytime; I'd be happy to
>> confirm that a drug or anything else makes sense given a particular
>> clinical/note context.
>> 
>> - "I wonder if there is someway in which you could guide us in making
>> better use of the medical knowledge sources (ontologies) that are
>> available." - I'd be happy to brainstorm about using existing resources to
>> help in decision making. We use these all the time in the clinic.
>> 
>> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into the
>> code, though I hope to over the next year; so, what kind of examples would
>> be most helpful?
>>    - Any particular disease processes?
>>    - Are you all familiar with the ubiquitous SOAP style presentation
>> that doctors use to write free notes? The few examples I clicked through in
>> the repository that Chen pointed me to are very sparse. Would we want
>> gradations? E.g., a scale from "well done" notes to "very quick
>> I-don't-care-because-I'm-in-a-rush" notes?
>> 
>> @ Chen - Thank you for the kind words. It's nice to be welcomed by a
>> community in which you hope to integrate. And thank you for pointing me to
>> the directory with the current sample notes. This was very helpful in
>> determining where those are in their development. I know that each of
>> your hospitals has a wealth of HIPAA-closed notes, but I'll see what I can
>> do to make some "stereotypical" open-notes for common disease
>> presentations. Again: maybe a scale, not necessarily just on brevity but
>> some other metric, whose continuum represented various permutations of
>> degrees of something, maybe of difficulty in processing? Apropos code,
>> Chen: I will help where I can but where I want to be is elbow deep in the
>> code :)
>> 
>> Finally: I haven't had a chance to look into some of the links from
>> earlier in this thread regarding open access repositories of free text
>> clinical notes: what do you all feel the quality of these resources is?
>> Abundant but low quality? Scarce, but high quality where they exist?
>> 
>> Bottom line: no problem answering contextual questions (can afib be
>> associated with a lower GI bleed?) and no problem writing some notes. My
>> only question, before I put in any time: what disease/specialty domain?
>> And would we want some system that puts them on a continuum of some
>> variable, say, brevity or "readability"?
>> 
>> Just thinking before leaping,
>> 
>> Thanks,
>> JG
>> 
>> Sent from my iPhone
>> 
>> On Jul 2, 2013, at 21:23, "Chen, Pei" <Pe...@childrens.harvard.edu>
>> wrote:
>> 
>>> Hi John,
>>> Welcome!  There are actually many ways to contribute, and it's not
>>> limited to just code.  It's always great to hear new ideas and suggestions
>>> on how to improve the software.  So even things like user feedback,
>>> documentation, new use cases, essentially anything that will make things
>>> better, would be awesome!
>>> 
>>> To get started, I would suggest subscribing to the email lists.  If you
>>> would like to contribute anything, just create a Jira account (anyone
>>> should be able to do this) and add/review Jira items (add attachments if
>>> you like), and we can even help integrate it.
>>> 
>>> We normally use Jira to keep track of issues:
>>> [1] https://issues.apache.org/jira/browse/ctakes
>>> 
>>> Current collection of sample test notes that have been collected over
>>> the years:
>>> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
>>> 
>>> ________________________________________
>>> From: Tim Miller [timothy.miller@childrens.harvard.edu]
>>> Sent: Tuesday, July 02, 2013 6:31 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Next cTAKES release (3.1)?
>>> 
>>> Agreed that you could definitely help out, and that would be a great way
>>> to do so. We don't really have "examples" right now, more like just
>>> short test sentences for showing simple results and verifying that
>>> nothing has been broken by changes. I think regular length fake but
>>> realistic notes would be very useful.
>>> Tim
>>> 
>>> On 07/02/2013 05:19 PM, John Green wrote:
>>>> Hi all,
>>>> 
>>>> I've been following this mailing list for a couple of months. I'm a
>>>> third-year medical student rounding the bend toward my MD. I used to be
>>>> a computer programmer, however, and continue my own projects. I'm very
>>>> interested in eventually contributing to cTAKES development. In the
>>>> meantime, given the current talk of examples, if any domain-specific
>>>> examples need to be generated, I am domain-knowledgeable enough that I
>>>> could pound out a few free-text notes made to order.
>>>> 
>>>> Let me know. You all may already have docs on hand willing to do this,
>>>> but if not...
>>>> 
>>>> John Green
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
>>>> 
>>>>> I completely agree with making cTAKES easier to use.  I think it is
>>>>> exciting to hear the different use cases here and to understand where
>>>>> some of the areas that need improvement are (which we hadn't thought
>>>>> about earlier).
>>>>> I think Tim's suggestions and the 3 concrete actionable items make a
>>>>> lot of sense.  Hopefully they will attract new users, adopters, and
>>>>> perhaps more committers.
>>>>> 
>>>>>> i) Make the typesystem forefront in documentation -- generate
>>>>>> javadocs and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every
>>>>>> module, also have clearly labeled examples in every module that set
>>>>>> up a pipeline, run on sample notes (could be the same sample notes
>>>>>> from the tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for
>>>>>> people who want to take the next step and train their own models
>>>>> I think Java developers are accustomed to including a library as a
>>>>> dependency/jar, having an API to pass input to, and getting the results
>>>>> back as POJOs, so the examples could initially shield users from the
>>>>> complexity of wiring a pipeline together, etc.
>>>>> If we can improve the APIs and how they get integrated with other
>>>>> apps, we can add any GUI/CLI tools on top of this afterwards.
>>>>> 
>>>>> --Pei
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>>>>> Sent: Friday, June 28, 2013 8:00 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: Next cTAKES release (3.1)?
>>>>>> 
>>>>>> Very interesting discussion. I think Giri is right about giving
>>>>>> example training data in the format that our training code can read.
>>>>>> While our ultimate goal would be to build and release models that are
>>>>>> completely domain-independent, in the real world it is almost always
>>>>>> better to use some domain-specific data, and we should think more
>>>>>> about how to facilitate that.
>>>>>> 
>>>>>> As for making it easier to get started, it is not totally clear to me
>>>>>> what this means or how to do it, so it might be useful to get
>>>>>> specific. I think our biggest hurdle is:
>>>>>> 
>>>>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>>>> 
>>>>>> Since UIMAFit is officially becoming part of UIMA, that will be
>>>>>> easier, and hopefully people will just learn the easier (in my
>>>>>> opinion) UIMAFit way rather than the standard UIMA way of doing
>>>>>> things. Is there something we can be doing to make understanding UIMA
>>>>>> easier? Or do we just need to say upfront that this is a prerequisite
>>>>>> and hope that people don't give up due to this thing that is out of
>>>>>> our control?
>>>>>> 
>>>>>> Another hurdle is:
>>>>>> 
>>>>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>>>> 
>>>>>> So it's not just a matter of hiding complexity -- at some point
>>>>>> people have to understand their problem, understand cTAKES'
>>>>>> capabilities, and start coding. Pei's GUI will help for some common
>>>>>> use cases but will not remove the requirement that someone at the
>>>>>> organization knows cTAKES.
>>>>>> I think one part of this problem is the fact that the typesystem is
>>>>>> not well documented. A developer needs to know what the output is
>>>>>> (objects from the typesystem), how to get them (which
>>>>>> modules/pipelines), and what information is in them. So maybe on this
>>>>>> end my recommendation would be:
>>>>>> i) Make the typesystem forefront in documentation -- generate
>>>>>> javadocs and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every
>>>>>> module, also have clearly labeled examples in every module that set
>>>>>> up a pipeline, run on sample notes (could be the same sample notes
>>>>>> from the tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for
>>>>>> people who want to take the next step and train their own models
>>>>>> 
>>>>>> This is quite a bit of developer overhead, so it's worth asking
>>>>>> whether you agree with my "diagnosis" and "treatment" or whether you
>>>>>> think there are different problems/solutions that should be higher
>>>>>> priority.
>>>>>> 
>>>>>> Tim
>>>>>> 
>>>>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>>>>> Hi Vijay and Andy,
>>>>>>> 
>>>>>>> Thanks for sharing those examples.
>>>>>>> 
>>>>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>>>> 
>>>>>>> Agree with this statement, and this is a very valid concern.
>>>>>>> 
>>>>>>> In "getting started examples", I think we should have just a couple
>>>>>>> of entries (5-10 small entries), not more than that (with an explicit
>>>>>>> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
>>>>>>> handcrafting these may not be easy because we are not medical domain
>>>>>>> experts, but I feel it is worth the time, because it brings in more
>>>>>>> of a user community.
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Giri
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com> wrote:
>>>>>>>> GREAT !
>>>>>>>> 
>>>>>>>> The i2b2 data though isn't publicly distributable, you still need to
>>>>>>>> request access to it since it is "semi private"
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>>>>>> that use the annotations to classify clinical text from the CMC
>>>>>>>>> 2007 and I2B2 2008 challenges:
>>>>>>>>> 
>>>>>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>>>>>> first version of ARC:
>>>>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>>>> 
>>>>>>>>> These are both publicly available datasets, and represent
>>>>>>>>> real-world problems (in general I believe when publishing a paper
>>>>>>>>> the code should be reproducible and made publicly available, but
>>>>>>>>> that's a different issue).
>>>>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like
>>>>>>>>> to upgrade these samples as well.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> VJ
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>>>>>> +NLP datasets.
>>>>>>>>>> 
>>>>>>>>>> I have at least one we can use that was created by our lead
>>>>>>>>>> Pathologist:
>>>>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>>>>> We should provide at least one sample for each domain.
>>>>>>>>>> Trouble is, privacy requires that these examples be made up by
>>>>>>>>>> hand and not copy-pasted from EMR systems.
>>>>>>>>>> 
>>>>>>>>>> --Andy
>>>>>>>>>> 
>>>>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> +1 for this observation Andy!
>>>>>>>>>>> 
>>>>>>>>>>> Lowering the time to get started will motivate users to write blogs
>>>>>>>>>>> about features, how-tos, etc., which reduces the core team's
>>>>>>>>>>> documentation workload.
>>>>>>>>>>> 
>>>>>>>>>>> I have been trying to write a small "how to write a standalone
>>>>>>>>>>> client for ctakes" from my experience (I saw at least 4 users post
>>>>>>>>>>> similar questions in the last 2 months), but I am not getting
>>>>>>>>>>> enough time because ctakes depends on a lot of other frameworks
>>>>>>>>>>> (UimaFit, cleartk, UIMA Framework, etc.). Most of my spare time is
>>>>>>>>>>> spent juggling these frameworks, posting to and browsing their
>>>>>>>>>>> forums, and relating observations to the ctakes code. I think we
>>>>>>>>>>> need to have some high-level documentation about these (with links
>>>>>>>>>>> to the corresponding forums).
>>>>>>>>>>> 
>>>>>>>>>>> The above case is for developers (I think this will be the larger
>>>>>>>>>>> user base as ctakes progresses); for users I think the
>>>>>>>>>>> documentation is a lot better, though some improvements are still
>>>>>>>>>>> needed.
>>>>>>>>>>> 
>>>>>>>>>>> As a developer I found the lack of sample training data tough (I
>>>>>>>>>>> am still struggling in this area even though I browsed all the
>>>>>>>>>>> relevant code), though the training classes are there. I
>>>>>>>>>>> understand that there are licensing issues with REAL data, but at
>>>>>>>>>>> least some handmade example sentences, which may not be real,
>>>>>>>>>>> would help developers understand the type/structure of input the
>>>>>>>>>>> TRAINING classes expect. This way people who browse the code can
>>>>>>>>>>> reverse engineer it and develop their own models. Sorry if you
>>>>>>>>>>> guys feel this is a novice issue, but I feel most developers will
>>>>>>>>>>> be novices when they adopt a system, and Machine Learning/NLP is
>>>>>>>>>>> an ocean. Some documentation in this area will save a lot of time
>>>>>>>>>>> for us.
>>>>>>>>>>> 
>>>>>>>>>>> I hope there will be some activity in this area from the ctakes
>>>>>>>>>>> core team.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Giri
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
>>>>>>>>>>> <mcmurry.andy@gmail.com
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> ctakes is at a point where we have a LOT of features but it is
>>>>>>>>>>>> still
>>>>>>>>>> hard
>>>>>>>>>>>> to get started.
>>>>>>>>>>>> 
>>>>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>>>>>>> obvious
>>>>>>>>>>>> and requires hand holding.
>>>>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>>>> 
>>>>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>>>>>> better
>>>>>>>>>> bug
>>>>>>>>>>>> reports, FAQ, etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> thoughts?
>>>>>>>>>>>> --Andy
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>>>>>>> Pei.Chen@childrens.harvard.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I just wanted to gauge the interest of creating the next
>> release
>>>>>>>>>>>>> of
>>>>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or
>> closed.
>>>>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>>>>> - New regression testing component
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>> 

RE: Next cTAKES release (3.1)?

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
Actually, MTsamples is what iDASH downloaded for their notes repository.
--Guergana

-----Original Message-----
From: andy mcmurry [mailto:mcmurry.andy@gmail.com] 
Sent: Wednesday, July 03, 2013 7:26 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Mtsamples has lots of free public examples already but we aren't using them yet.  This is probably because mtsamples don't have the annotations we need to use them as training examples.
On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <he...@gmail.com>
wrote:

> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, 
> one more year!
>
> - Apropos meds or clinical type questions: any developer on here can 
> feel free to shoot me a quick question via the list anytime, I'd be
> happy to confirm that a drug or anything else makes sense given a
> particular clinical/note context.
>
> - "I wonder if there is someway in which you could guide us in making 
> better use of the medical knowledge sources (ontologies) that are 
> available." - I'd be happy to brainstorm about using existing 
> resources to help in decision making. We use these all the time in the clinic.
>
> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into 
> the code, though I hope to over the next year; so, what kind of 
> examples would be most helpful?
>     - Any particular disease processes?
>     - Are you all familiar with the ubiquitous SOAP style presentation 
> that doctors use to write free notes? The few examples I clicked 
> through in the repository that Chen pointed me to are very sparse.
> Would we want gradations? E.g., a scale for "well done" notes to "very 
> quick I-dont-care-because-I'm-in-a-rush" notes?
>
> @ Chen - Thank you for the kind words. It's nice to be welcomed by a 
> community in which you hope to integrate. And thank you for pointing 
> me to the directory with the current sample notes. This was very 
> helpful in determining where those are in their development. I know
> that each of your hospitals have a wealth of HIPAA-closed notes, but 
> I'll see what I can do to make some "stereotypical" open-notes for 
> common disease presentations. Again: maybe a scale, not necessarily 
> just on brevity but some other metric, whose continuum represented 
> various permutations of degrees of something, maybe of difficulty in 
> processing? Apropos code,
> Chen: I will help where I can but where I want to be is elbow deep in 
> the code :)
>
> Finally: I haven't had a chance to look into some of the links from 
> earlier in this thread regarding open access repositories of free text 
> clinical notes: what do you all feel the quality of these resources are?
> Abundant but low quality? Paucity but those that are there are high quality?
>
> Bottom line: no problem either answering contextual questions (can 
> afib be associated with a lower gi bleed??) and no problem writing 
> some notes, only question would be, before I put in any time: what disease/specialty domain?
> and would we want some system that put them on a continuum of some 
> variable, say, brevity or "readability"?
>
> Just thinking before leaping,
>
> Thanks,
> JG
>
> Sent from my iPhone
>
> On Jul 2, 2013, at 21:23, "Chen, Pei" <Pe...@childrens.harvard.edu>
> wrote:
>
> > Hi John,
> > Welcome!  There are actually many ways to contribute and it's not
> limited to just code.  It's always great to hear new ideas and 
> suggestions on how to improve the software.  Therefore, even things
> like user feedback, documentation, new use cases, essentially anything 
> that will make things better would be awesome!
> >
> > To get started, I would suggest subscribing to the email lists.  If 
> > you
> would like to contribute anything, just create a Jira account (anyone
> should be able to do this), and add/review Jira items (add attachments 
> if you like) and we can even help integrate it.
> >
> > We normally use Jira to keep track of issues:
> > [1] https://issues.apache.org/jira/browse/ctakes
> >
> > Current collection of sample test notes that have been collected 
> > over
> the years:
> >
> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
> >
> > ________________________________________
> > From: Tim Miller [timothy.miller@childrens.harvard.edu]
> > Sent: Tuesday, July 02, 2013 6:31 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Next cTAKES release (3.1)?
> >
> > Agreed that you could definitely help out, and that would be a great 
> > way to do so. We don't really have "examples" right now, more like 
> > just short test sentences for showing simple results and verifying 
> > that nothing has been broken by changes. I think regular length fake 
> > but realistic notes would be very useful.
> > Tim
> >
> > On 07/02/2013 05:19 PM, John Green wrote:
> >> Hi all,
> >>
> >> I've been following this mail list for a couple of months. I'm a
> >> third-year medical student rounding the bend toward my MD. I used to be a
> >> computer programmer, however, and continue my own projects. I'm very
> >> interested in contributing eventually to cTAKES development. In the
> >> meantime, given the current talk of examples, if any domain-specific
> >> examples need to be generated, I am domain-knowledgeable enough that I
> >> could pound out a few free-text notes made to order.
> >>
> >> Let me know; you all may already have docs on hand willing to do this,
> >> but if not...
> >>
> >> John Green
> >>
> >> Sent from my iPhone
> >>
> >> On Jun 28, 2013, at 8:59, "Chen, Pei" 
> >> <Pe...@childrens.harvard.edu>
> wrote:
> >>
> >>> I completely agree with making cTAKES easier to use.  I think it is
> >>> exciting to hear the different use cases here and to understand where
> >>> some of the areas that need improvement are (which we haven't thought
> >>> about earlier).
> >>> I think Tim's suggestions and the 3 concrete actionable items 
> >>> makes a
> lot of sense.  Hopefully it should attract new users, adopters, and 
> perhaps more committers.
> >>>
> >>>> i) Make the typesystem forefront in documentation -- generate
> javadocs and
> >>>> have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> module, also
> >>>> have clearly labeled examples in every module that set up a 
> >>>> pipeline,
> run on
> >>>> sample notes (could be the same sample notes from the tests), and 
> >>>> do something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data 
> >>>> for
> people
> >>>> who want to take the next step and train their own models
> >>> I think Java developers are accustomed to including a library as a
> dependency/jar, have an API to pass input, and get the results via 
> pojos;  So the examples could initially shield the complexity of 
> wiring a pipeline together etc.
> >>> If we can improve the API's and how it gets integrated with other
> apps, we can add any GUI/CLI tools on top of this afterwards.
> >>>
> >>> --Pei
> >>>
> >>>> -----Original Message-----
> >>>> From: Miller, Timothy 
> >>>> [mailto:Timothy.Miller@childrens.harvard.edu]
> >>>> Sent: Friday, June 28, 2013 8:00 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Re: Next cTAKES release (3.1)?
> >>>>
> >>>> Very interesting discussion. I think Giri is right about giving
> example training
> >>>> data in the format that our training code can read. While our
> ultimate goal
> >>>> would be to build and release models that are completely domain- 
> >>>> independent, in the real world it is almost always better to use 
> >>>> some domain-specific data and we should think more about how to 
> >>>> facilitate
> that.
> >>>>
> >>>> As for making it easier to get started, it is not totally clear 
> >>>> to me
> what this
> >>>> means/how to do it so it might be useful to get specific about 
> >>>> what
> this
> >>>> means. I think our biggest hurdle is
> >>>>
> >>>> 1) Prerequisite of understanding UIMA/UIMAFit
> >>>>
> >>>> Since UIMAFit is officially becoming part of UIMA that will be
> easier, and
> >>>> hopefully people will just learn the easier (in my opinion) 
> >>>> UIMAFit
> way than
> >>>> the standard UIMA way of doing things. Is there something we can 
> >>>> be
> doing
> >>>> to make understanding UIMA easier? Or do we just need to say 
> >>>> upfront
> that
> >>>> this is a prerequisite and hope that people don't give up due to 
> >>>> this
> thing that
> >>>> is out of our control?
> >>>>
> >>>> Another hurdle is:
> >>>>
> >>>> 2) cTAKES is a multi-purpose developer-aimed tool
> >>>>
> >>>> So it's not just a matter of hiding complexity -- at some point
> people have to
> >>>> understand their problem, understand cTAKES' capabilities, and 
> >>>> start
> coding.
> >>>> Pei's GUI will help for some common use cases but will not remove 
> >>>> the requirement that someone at the organization knows cTAKES.
> >>>> I think one part of this problem is the fact that the typesystem 
> >>>> is
> not well
> >>>> documented. A developer needs to know what the output is (objects 
> >>>> from the typesystem), how to get them (which modules/pipelines), 
> >>>> and what information is in them. So maybe on this end my 
> >>>> recommendation would
> be:
> >>>> i) Make the typesystem forefront in documentation -- generate
> javadocs and
> >>>> have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> module, also
> >>>> have clearly labeled examples in every module that set up a 
> >>>> pipeline,
> run on
> >>>> sample notes (could be the same sample notes from the tests), and 
> >>>> do something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data 
> >>>> for
> people
> >>>> who want to take the next step and train their own models
> >>>>
> >>>> This is quite a bit of developer overhead, so it's worth asking
> whether you
> >>>> agree with my "diagnosis" and "treatment" or whether you think 
> >>>> there
> are
> >>>> different problems/solutions that should be higher priority.
> >>>>
> >>>> Tim
> >>>>
> >>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> >>>>> Hi Vijay and Andy,
> >>>>>
> >>>>> Thanks for sharing those examples.
> >>>>>
> >>>>> "Trouble is, privacy requires that these examples be made up by hand"
> >>>>>
> >>>>> Agree with this statement and this is very valid concern.
> >>>>>
> >>>>> In "getting started examples", I think we should just have 
> >>>>> couple of entries (5-10 small entries), not more than that (with 
> >>>>> explicit statement like "ONLY EXAMPLE", NOT GOOD FOR REAL 
> >>>>> USAGE). I
> >>>> understand
> >>>>> handcrafting these may not be easy because we are not medical 
> >>>>> domain experts, but I feel worth time, because it brings in more 
> >>>>> user
> community.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
> >>>> <mc...@gmail.com>wrote:
> >>>>>> GREAT !
> >>>>>>
> >>>>>> The i2b2 data though isn't publicly distributable, you still 
> >>>>>> need to request access to it since it is "semi private"
> >>>>>>
> >>>>>>
> >>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
> >>>>>>
> >>>>>>> We released code on using cTAKES to annotate clinical text and 
> >>>>>>> SVMs that use the annotations to classify clinical text from 
> >>>>>>> the CMC
> 2007
> >>>>>>> and I2B2
> >>>>>>> 2008 challenges:
> >>>>>>>
> >>>>>>> We did the CMC 2007 with cTAKES 2.5:
> >>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >>>>>> <https://code.google.com/p/ytex/downloads/list>
> >>>>>>> And the i2b2 2008 with the version of cTAKES distributed with 
> >>>>>>> the first version of ARC:
> >>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>>>>>
> >>>>>>> These are both publicly available datasets, and represent
> real-world
> >>>>>>> problems (in general I believe when publishing a paper the 
> >>>>>>> code should be reproducible and made publicly available, but 
> >>>>>>> that's a
> different
> >>>> issue).
> >>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would 
> >>>>>>> like
> to
> >>>>>>> upgrade these samples as well.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> VJ
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry 
> >>>>>>> <mcmurry.andy@gmail.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> +1 suggestion for documenting many examples of "getting started"
> >>>>>>>> +NLP
> >>>>>>>> datasets.
> >>>>>>>>
> >>>>>>>> I have at least one we can use that was created by our lead 
> >>>>>>>> Pathologist
> >>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>>>>>> We should provide at least one sample for each domain.
> >>>>>>>> Trouble is, privacy requires that these examples be made up 
> >>>>>>>> by
> hand
> >>>>>>>> and not copy-pasted from EMR systems.
> >>>>>>>>
> >>>>>>>> --Andy
> >>>>>>>>
> >>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
> >>>>>> girinambari@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> +1 for this observation Andy!
> >>>>>>>>>
> >>>>>>>>> Lowering time will motivate users to write blogs about
> >>>>>>>>> features, how
> >>>>>> to,
> >>>>>>>>> etc., which reduces core team work load on documentation.
> >>>>>>>>>
> >>>>>>>>> I have been trying to write a small "how to write standalone 
> >>>>>>>>> client for ctakes" with my experience (I saw at least 4 
> >>>>>>>>> users posted similar
> >>>>>>>> question
> >>>>>>>>> in last 2 months), but not getting enough time because 
> >>>>>>>>> ctakes depends
> >>>>>> on
> >>>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework 
> >>>>>>>>> etc.,), most
> >>>>>> of
> >>>>>>>>> my spare time is being spent on juggling between these
> frameworks,
> >>>>>>>> posting
> >>>>>>>>> and browsing those forums, relating observations to ctakes code.
> I
> >>>>>> think
> >>>>>>>> we
> >>>>>>>>> need to have some high level documentation about these (with
> links
> >>>>>>>>> to corresponding forums).
> >>>>>>>>>
> >>>>>>>>> Above case is for developers (I think this will be more user 
> >>>>>>>>> base as
> >>>>>>>> ctakes
> >>>>>>>>> progress), for users I think documentation is lot better 
> >>>>>>>>> though some improvements need to be done.
> >>>>>>>>>
> >>>>>>>>> As a developer I felt tough with lack of sample training 
> >>>>>>>>> data (I am
> >>>>>> still
> >>>>>>>>> struggling in this area even though I browsed all relevant 
> >>>>>>>>> code),
> >>>>>> though
> >>>>>>>>> training class are there. I understood that there are 
> >>>>>>>>> licensing issues
> >>>>>>>> with
> >>>>>>>>> REAL data, but at least some hand made example sentences, 
> >>>>>>>>> which may not
> >>>>>>>> be
> >>>>>>>>> real but helps developers in understanding the 
> >>>>>>>>> type/structure of input TRAINING classes expecting. This way 
> >>>>>>>>> people who browse the code can
> >>>>>>>> reverse
> >>>>>>>>> engineer and develop their own models. Sorry if you guys 
> >>>>>>>>> feel
> this
> >>>>>>>>> as novice issue, but I feel most of the developers will be 
> >>>>>>>>> novice when
> >>>>>> they
> >>>>>>>>> adopt a system and Machine Learning/NLP is ocean. Some 
> >>>>>>>>> documentation in this area will save a lot of time for us.
> >>>>>>>>>
> >>>>>>>>> I wish there will be some activity in this area from ctakes 
> >>>>>>>>> core
> team.
> >>>>>>>>>
> >>>>>>>>> Thank you,
> >>>>>>>>> Giri
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry 
> >>>>>>>>> <mcmurry.andy@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> ctakes is at a point where we have a LOT of features but it 
> >>>>>>>>>> is still
> >>>>>>>> hard
> >>>>>>>>>> to get started.
> >>>>>>>>>>
> >>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is 
> >>>>>>>>>> not
> >>>>>> obvious
> >>>>>>>>>> and requires hand holding.
> >>>>>>>>>> This is very typical in early FOSS projects.
> >>>>>>>>>>
> >>>>>>>>>> Lowering the time to get invested in ctakes gets more users 
> >>>>>>>>>> AND better
> >>>>>>>> bug
> >>>>>>>>>> reports, FAQ, etc.
> >>>>>>>>>>
> >>>>>>>>>> thoughts?
> >>>>>>>>>> --Andy
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
> >>>>>>>> Pei.Chen@childrens.harvard.edu>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> I just wanted to gauge the interest of creating the next
> release
> >>>>>>>>>>> of
> >>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
> >>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed 
> >>>>>>>>>>> or
> closed.
> >>>>>>>>>> Plenty of bug fixes and new components including:
> >>>>>>>>>>> - New CEM Instance Template population
> >>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>>>>>> - New optional Clear POSTagger
> >>>>>>>>>>> - New regression testing component
> >>>>>>>>>>>
> >>>>>>>>>>> Should we wait for the Temporal component?
> >>>>>>>>>>>
> >>>>>>>>>>> [1]
> >>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >
>

Re: Next cTAKES release (3.1)?

Posted by andy mcmurry <mc...@gmail.com>.
Mtsamples has lots of free public examples already but we aren't using them
yet.  This is probably because mtsamples don't have the annotations we need
to use them as training examples.
On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <he...@gmail.com>
wrote:

> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, one
> more year!
>
> - Apropos meds or clinical type questions: any developer on here can feel
> free to shoot me a quick question via the list anytime, I'd be happy to
> confirm that a drug or anything else makes sense given a particular
> clinical/note context.
>
> - "I wonder if there is someway in which you could guide us in making
> better use of the medical knowledge sources (ontologies) that are
> available." - I'd be happy to brainstorm about using existing resources to
> help in decision making. We use these all the time in the clinic.
>
> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into the
> code, though I hope to over the next year; so, what kind of examples would
> be most helpful?
>     - Any particular disease processes?
>     - Are you all familiar with the ubiquitous SOAP style presentation
> that doctors use to write free notes? The few examples I clicked through in
> the repository that Chen pointed me to are very sparse. Would we want
> gradations? E.g., a scale for "well done" notes to "very quick
> I-dont-care-because-I'm-in-a-rush" notes?
>
> @ Chen - Thank you for the kind words. It's nice to be welcomed by a
> community in which you hope to integrate. And thank you for pointing me to
> the directory with the current sample notes. This was very helpful in
> determining where those are in their development. I know that each of
> your hospitals have a wealth of HIPAA-closed notes, but I'll see what I can
> do to make some "stereotypical" open-notes for common disease
> presentations. Again: maybe a scale, not necessarily just on brevity but
> some other metric, whose continuum represented various permutations of
> degrees of something, maybe of difficulty in processing? Apropos code,
> Chen: I will help where I can but where I want to be is elbow deep in the
> code :)
>
> Finally: I haven't had a chance to look into some of the links from
> earlier in this thread regarding open access repositories of free text
> clinical notes: what do you all feel the quality of these resources are?
> Abundant but low quality? Paucity but those that are there are high quality?
>
> Bottom line: no problem either answering contextual questions (can afib be
> associated with a lower gi bleed??) and no problem writing some notes, only
> question would be, before I put in any time: what disease/specialty domain?
> and would we want some system that put them on a continuum of some
> variable, say, brevity or "readability"?
>
> Just thinking before leaping,
>
> Thanks,
> JG
>
> Sent from my iPhone
>
> On Jul 2, 2013, at 21:23, "Chen, Pei" <Pe...@childrens.harvard.edu>
> wrote:
>
> > Hi John,
> > Welcome!  There are actually many ways to contribute and it's not
> limited to just code.  It's always great to hear new ideas and suggestions
> on how to improve the software.  Therefore, even things like user feedback,
> documentation, new use cases, essentially anything that will make things
> better would be awesome!
> >
> > To get started, I would suggest subscribing to the email lists.  If you
> would like to contribute anything, just create a Jira account (anyone
> should be able to do this), and add/review Jira items (add attachments if
> you like) and we can even help integrate it.
> >
> > We normally use Jira to keep track of issues:
> > [1] https://issues.apache.org/jira/browse/ctakes
> >
> > Current collection of sample test notes that have been collected over
> the years:
> >
> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
> >
> > ________________________________________
> > From: Tim Miller [timothy.miller@childrens.harvard.edu]
> > Sent: Tuesday, July 02, 2013 6:31 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Next cTAKES release (3.1)?
> >
> > Agreed that you could definitely help out, and that would be a great way
> > to do so. We don't really have "examples" right now, more like just
> > short test sentences for showing simple results and verifying that
> > nothing has been broken by changes. I think regular length fake but
> > realistic notes would be very useful.
> > Tim
> >
> > On 07/02/2013 05:19 PM, John Green wrote:
> >> Hi all,
> >>
> >> I've been following this mail list for a couple of months. I'm a
> >> third-year medical student rounding the bend toward my MD. I used to be a
> >> computer programmer, however, and continue my own projects. I'm very
> >> interested in contributing eventually to cTAKES development. In the
> >> meantime, given the current talk of examples, if any domain-specific
> >> examples need to be generated, I am domain-knowledgeable enough that I
> >> could pound out a few free-text notes made to order.
> >>
> >> Let me know; you all may already have docs on hand willing to do this,
> >> but if not...
> >>
> >> John Green
> >>
> >> Sent from my iPhone
> >>
> >> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu>
> wrote:
> >>
> >>> I completely agree with making cTAKES easier to use.  I think it is
> >>> exciting to hear the different use cases here and to understand where
> >>> some of the areas that need improvement are (which we haven't thought
> >>> about earlier).
> >>> I think Tim's suggestions and the 3 concrete actionable items makes a
> lot of sense.  Hopefully it should attract new users, adopters, and perhaps
> more committers.
> >>>
> >>>> i) Make the typesystem forefront in documentation -- generate
> javadocs and
> >>>> have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> module, also
> >>>> have clearly labeled examples in every module that set up a pipeline,
> run on
> >>>> sample notes (could be the same sample notes from the tests), and do
> >>>> something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data for
> people
> >>>> who want to take the next step and train their own models
> >>> I think Java developers are accustomed to including a library as a
> dependency/jar, have an API to pass input, and get the results via pojos;
>  So the examples could initially shield the complexity of wiring a pipeline
> together etc.
> >>> If we can improve the API's and how it gets integrated with other
> apps, we can add any GUI/CLI tools on top of this afterwards.
> >>>
> >>> --Pei
> >>>
> >>>> -----Original Message-----
> >>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> >>>> Sent: Friday, June 28, 2013 8:00 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Re: Next cTAKES release (3.1)?
> >>>>
> >>>> Very interesting discussion. I think Giri is right about giving
> example training
> >>>> data in the format that our training code can read. While our
> ultimate goal
> >>>> would be to build and release models that are completely domain-
> >>>> independent, in the real world it is almost always better to use some
> >>>> domain-specific data and we should think more about how to facilitate
> that.
> >>>>
> >>>> As for making it easier to get started, it is not totally clear to me
> what this
> >>>> means/how to do it so it might be useful to get specific about what
> this
> >>>> means. I think our biggest hurdle is
> >>>>
> >>>> 1) Prerequisite of understanding UIMA/UIMAFit
> >>>>
> >>>> Since UIMAFit is officially becoming part of UIMA that will be
> easier, and
> >>>> hopefully people will just learn the easier (in my opinion) UIMAFit
> way than
> >>>> the standard UIMA way of doing things. Is there something we can be
> doing
> >>>> to make understanding UIMA easier? Or do we just need to say upfront
> that
> >>>> this is a prerequisite and hope that people don't give up due to this
> thing that
> >>>> is out of our control?
> >>>>
> >>>> Another hurdle is:
> >>>>
> >>>> 2) cTAKES is a multi-purpose developer-aimed tool
> >>>>
> >>>> So it's not just a matter of hiding complexity -- at some point
> people have to
> >>>> understand their problem, understand cTAKES' capabilities, and start
> coding.
> >>>> Pei's GUI will help for some common use cases but will not remove the
> >>>> requirement that someone at the organization knows cTAKES.
> >>>> I think one part of this problem is the fact that the typesystem is
> not well
> >>>> documented. A developer needs to know what the output is (objects from
> >>>> the typesystem), how to get them (which modules/pipelines), and what
> >>>> information is in them. So maybe on this end my recommendation would
> be:
> >>>> i) Make the typesystem forefront in documentation -- generate
> javadocs and
> >>>> have as a link on the ctakes frontpage/sidebar
> >>>> ii) Similar to the way that we are aiming to have tests in every
> module, also
> >>>> have clearly labeled examples in every module that set up a pipeline,
> run on
> >>>> sample notes (could be the same sample notes from the tests), and do
> >>>> something with the results.
> >>>> iii) Follow Giri's recommendation to have example training data for
> people
> >>>> who want to take the next step and train their own models
> >>>>
> >>>> This is quite a bit of developer overhead, so it's worth asking
> whether you
> >>>> agree with my "diagnosis" and "treatment" or whether you think there
> are
> >>>> different problems/solutions that should be higher priority.
> >>>>
> >>>> Tim
> >>>>
> >>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> >>>>> Hi Vijay and Andy,
> >>>>>
> >>>>> Thanks for sharing those examples.
> >>>>>
> >>>>> "Trouble is, privacy requires that these examples be made up by hand"
> >>>>>
> >>>>> Agree with this statement and this is very valid concern.
> >>>>>
> >>>>> In "getting started examples", I think we should just have couple of
> >>>>> entries (5-10 small entries), not more than that (with explicit
> >>>>> statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I
> >>>> understand
> >>>>> handcrafting these may not be easy because we are not medical domain
> >>>>> experts, but I feel worth time, because it brings in more user
> community.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
> >>>> <mc...@gmail.com>wrote:
> >>>>>> GREAT !
> >>>>>>
> >>>>>> The i2b2 data though isn't publicly distributable, you still need to
> >>>>>> request access to it since it is "semi private"
> >>>>>>
> >>>>>>
> >>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
> >>>>>>
> >>>>>>> We released code on using cTAKES to annotate clinical text and SVMs
> >>>>>>> that use the annotations to classify clinical text from the CMC
> 2007
> >>>>>>> and I2B2
> >>>>>>> 2008 challenges:
> >>>>>>>
> >>>>>>> We did the CMC 2007 with cTAKES 2.5:
> >>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >>>>>> <https://code.google.com/p/ytex/downloads/list>
> >>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
> >>>>>>> first version of ARC:
> >>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>>>>>
> >>>>>>> These are both publicly available datasets, and represent
> real-world
> >>>>>>> problems (in general I believe when publishing a paper the code
> >>>>>>> should be reproducible and made publicly available, but that's a
> different
> >>>> issue).
> >>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like
> to
> >>>>>>> upgrade these samples as well.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> VJ
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
> >>>>>>> <mcmurry.andy@gmail.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> +1 suggestion for documenting many examples of "getting started"
> >>>>>>>> +NLP
> >>>>>>>> datasets.
> >>>>>>>>
> >>>>>>>> I have at least one we can use that was created by our lead
> >>>>>>>> Pathologist
> >>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>>>>>> We should provide at least one sample for each domain.
> >>>>>>>> Trouble is, privacy requires that these examples be made up by
> hand
> >>>>>>>> and not copy-pasted from EMR systems.
> >>>>>>>>
> >>>>>>>> --Andy
> >>>>>>>>
> >>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
> >>>>>> girinambari@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> +1 for this observation Andy!
> >>>>>>>>>
> >>>>>>>>> Lowering time will motivate users to write blogs about features,
> >>>>>>>>> how
> >>>>>> to,
> >>>>>>>>> etc., which reduces core team work load on documentation.
> >>>>>>>>>
> >>>>>>>>> I have been trying to write a small "how to write standalone
> >>>>>>>>> client for ctakes" with my experience (I saw at least 4 users
> >>>>>>>>> posted similar
> >>>>>>>> question
> >>>>>>>>> in last 2 months), but not getting enough time because ctakes
> >>>>>>>>> depends
> >>>>>> on
> >>>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,),
> >>>>>>>>> most
> >>>>>> of
> >>>>>>>>> my spare time is being spent on juggling between these
> frameworks,
> >>>>>>>> posting
> >>>>>>>>> and browsing those forums, relating observations to ctakes code.
> I
> >>>>>> think
> >>>>>>>> we
> >>>>>>>>> need to have some high level documentation about these (with
> links
> >>>>>>>>> to corresponding forums).
> >>>>>>>>>
> >>>>>>>>> Above case is for developers (I think this will be more user base
> >>>>>>>>> as
> >>>>>>>> ctakes
> >>>>>>>>> progress), for users I think documentation is lot better though
> >>>>>>>>> some improvements need to be done.
> >>>>>>>>>
> >>>>>>>>> As a developer I felt tough with lack of sample training data (I
> >>>>>>>>> am
> >>>>>> still
> >>>>>>>>> struggling in this area even though I browsed all relevant code),
> >>>>>> though
> >>>>>>>>> training class are there. I understood that there are licensing
> >>>>>>>>> issues
> >>>>>>>> with
> >>>>>>>>> REAL data, but at least some hand made example sentences, which
> >>>>>>>>> may not
> >>>>>>>> be
> >>>>>>>>> real but helps developers in understanding the type/structure of
> >>>>>>>>> input TRAINING classes expecting. This way people who browse the
> >>>>>>>>> code can
> >>>>>>>> reverse
> >>>>>>>>> engineer and develop their own models. Sorry if you guys feel
> this
> >>>>>>>>> as novice issue, but I feel most of the developers will be novice
> >>>>>>>>> when
> >>>>>> they
> >>>>>>>>> adopt a system and Machine Learning/NLP is ocean. Some
> >>>>>>>>> documentation in this area will same lot of time for us.
> >>>>>>>>>
> >>>>>>>>> I wish there will be some activity in this area from ctakes core
> team.
> >>>>>>>>>
> >>>>>>>>> Thank you,
> >>>>>>>>> Giri
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
> >>>>>>>>> <mcmurry.andy@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> ctakes is at a point where we have a LOT of features but it is
> >>>>>>>>>> still
> >>>>>>>> hard
> >>>>>>>>>> to get started.
> >>>>>>>>>>
> >>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
> >>>>>> obvious
> >>>>>>>>>> and requires hand holding.
> >>>>>>>>>> This is very typical in early FOSS projects.
> >>>>>>>>>>
> >>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
> >>>>>>>>>> better
> >>>>>>>> bug
> >>>>>>>>>> reports, FAQ, etc.
> >>>>>>>>>>
> >>>>>>>>>> thoughts?
> >>>>>>>>>> --Andy
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
> >>>>>>>> Pei.Chen@childrens.harvard.edu>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> I just wanted to gauge the interest of creating the next
> release
> >>>>>>>>>>> of
> >>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
> >>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or
> closed.
> >>>>>>>>>> Plenty of bug fixes and new components including:
> >>>>>>>>>>> - New CEM Instance Template population
> >>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>>>>>> - New optional Clear POSTagger
> >>>>>>>>>>> - New regression testing component
> >>>>>>>>>>>
> >>>>>>>>>>> Should we wait for the Temporal component?
> >>>>>>>>>>>
> >>>>>>>>>>> [1]
> >>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%
> >>>>>> 22%20AND%20project%20%3D%20CTAKES
> >
>

Re: Next cTAKES release (3.1)?

Posted by Hephaestus Studio <he...@gmail.com>.
@Andy - Not a doctor yet, but soon! Thanks for the promotion though, one more year! 

- Apropos meds or clinical type questions: any developer on here can feel free to shoot me a quick question via the list anytime; I'd be happy to confirm that a drug or anything else makes sense given a particular clinical/note context.

- "I wonder if there is someway in which you could guide us in making better use of the medical knowledge sources (ontologies) that are available." - I'd be happy to brainstorm about using existing resources to help in decision making. We use these all the time in the clinic.

@ Tim+Andy+Chen - I haven't had a chance to really start chewing into the code, though I hope to over the next year; so, what kind of examples would be most helpful?
    - Any particular disease processes? 
    - Are you all familiar with the ubiquitous SOAP style presentation that doctors use to write free notes? The few examples I clicked through in the repository that Chen pointed me to are very sparse. Would we want gradations? E.g., a scale from "well done" notes to "very quick I-don't-care-because-I'm-in-a-rush" notes?

@ Chen - Thank you for the kind words. It's nice to be welcomed by a community in which you hope to integrate. And thank you for pointing me to the directory with the current sample notes. This was very helpful in determining where those are in their development. I know that each of your hospitals has a wealth of HIPAA-closed notes, but I'll see what I can do to make some "stereotypical" open-notes for common disease presentations. Again: maybe a scale, not necessarily just on brevity but on some other metric, whose continuum represents varying degrees of something, maybe of difficulty in processing? Apropos code, Chen: I will help where I can but where I want to be is elbow deep in the code :)

Finally: I haven't had a chance to look into some of the links from earlier in this thread regarding open access repositories of free text clinical notes. How do you all feel about the quality of these resources? Abundant but low quality? Scarce, but high quality where they exist?

Bottom line: no problem answering contextual questions (can afib be associated with a lower GI bleed?) and no problem writing some notes. The only questions, before I put in any time: what disease/specialty domain? And would we want some system that puts them on a continuum of some variable, say, brevity or "readability"?

Just thinking before leaping,

Thanks,
JG

Sent from my iPhone

On Jul 2, 2013, at 21:23, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> Hi John,
> Welcome!  There are actually many ways to contribute and it's not limited to just code.  It's always great to hear new ideas and suggestions on how to improve the software.  So even things like user feedback, documentation, and new use cases, essentially anything that will make things better, would be awesome!
> 
> To get started, I would suggest subscribing to the email lists.  If you would like to contribute anything, just create a Jira account (anyone should be able to do this), and add/review Jira items (add attachments if you like) and we can even help integrate it.
> 
> We normally use Jira to keep track of issues:
> [1] https://issues.apache.org/jira/browse/ctakes
> 
> Current collection of sample test notes that have been collected over the years:
> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/

RE: Next cTAKES release (3.1)?

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
Hi John,
Welcome!  There are actually many ways to contribute and it's not limited to just code.  It's always great to hear new ideas and suggestions on how to improve the software.  So even things like user feedback, documentation, and new use cases, essentially anything that will make things better, would be awesome!

To get started, I would suggest subscribing to the email lists.  If you would like to contribute anything, just create a Jira account (anyone should be able to do this), and add/review Jira items (add attachments if you like) and we can even help integrate it.

We normally use Jira to keep track of issues:
[1] https://issues.apache.org/jira/browse/ctakes

Current collection of sample test notes that have been collected over the years:
https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/

________________________________________
From: Tim Miller [timothy.miller@childrens.harvard.edu]
Sent: Tuesday, July 02, 2013 6:31 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Agreed that you could definitely help out, and that would be a great way
to do so. We don't really have "examples" right now, more like just
short test sentences for showing simple results and verifying that
nothing has been broken by changes. I think regular length fake but
realistic notes would be very useful.
Tim

On 07/02/2013 05:19 PM, John Green wrote:
> Hi all,
>
> I've been following this mailing list for a couple of months. I'm a third-year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and continue my own projects. I'm very interested in eventually contributing to cTAKES development. In the meantime, given the current talk of examples, if any domain-specific examples need to be generated, I am knowledgeable enough in the domain that I could pound out a few free-text notes made to order.
>
> Let me know. You all may already have docs on hand willing to do this, but if not...
>
> John Green
>
> Sent from my iPhone
>
> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
>
>> I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different use cases here and to understand which areas need improvement (including some we hadn't thought about earlier).
>> I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully they will attract new users, adopters, and perhaps more committers.
>>
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>> I think Java developers are accustomed to including a library as a dependency/jar, calling an API to pass input, and getting the results back via POJOs, so the examples could initially shield users from the complexity of wiring a pipeline together, etc.
>> If we can improve the APIs and how they integrate with other apps, we can add any GUI/CLI tools on top of this afterwards.
>>
>> --Pei
>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>> Sent: Friday, June 28, 2013 8:00 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Next cTAKES release (3.1)?
>>>
>>> Very interesting discussion. I think Giri is right about giving example training
>>> data in the format that our training code can read. While our ultimate goal
>>> would be to build and release models that are completely domain-
>>> independent, in the real world it is almost always better to use some
>>> domain-specific data and we should think more about how to facilitate that.
>>>
>>> As for making it easier to get started, it is not totally clear to me what this
>>> means/how to do it so it might be useful to get specific about what this
>>> means. I think our biggest hurdle is
>>>
>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>
>>> Since UIMAFit is officially becoming part of UIMA that will be easier, and
>>> hopefully people will just learn the easier (in my opinion) UIMAFit way than
>>> the standard UIMA way of doing things. Is there something we can be doing
>>> to make understanding UIMA easier? Or do we just need to say upfront that
>>> this is a prerequisite and hope that people don't give up due to this thing that
>>> is out of our control?
>>>
>>> Another hurdle is:
>>>
>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>
>>> So it's not just a matter of hiding complexity -- at some point people have to
>>> understand their problem, understand cTAKES' capabilities, and start coding.
>>> Pei's GUI will help for some common use cases but will not remove the
>>> requirement that someone at the organization knows cTAKES.
>>> I think one part of this problem is the fact that the typesystem is not well
>>> documented. A developer needs to know what the output is (objects from
>>> the typesystem), how to get them (which modules/pipelines), and what
>>> information is in them. So maybe on this end my recommendation would be:
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>>>
>>> This is quite a bit of developer overhead, so it's worth asking whether you
>>> agree with my "diagnosis" and "treatment" or whether you think there are
>>> different problems/solutions that should be higher priority.
>>>
>>> Tim
>>>
>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>> Hi Vijay and Andy,
>>>>
>>>> Thanks for sharing those examples.
>>>>
>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>
>>>> Agree with this statement, and this is a very valid concern.
>>>>
>>>> In "getting started examples", I think we should just have a couple of
>>>> entries (5-10 small entries), not more than that (with an explicit
>>>> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
>>>> handcrafting these may not be easy because we are not medical domain
>>>> experts, but I feel it is worth the time, because it brings in a larger
>>>> user community.
>>>>
>>>> Thank you,
>>>> Giri
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com> wrote:
>>>>> GREAT !
>>>>>
>>>>> The i2b2 data though isn't publicly distributable, you still need to
>>>>> request access to it since it is "semi private"
>>>>>
>>>>>
>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>>
>>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>>> and I2B2 2008 challenges:
>>>>>>
>>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>>> first version of ARC:
>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>
>>>>>> These are both publicly available datasets, and represent real-world
>>>>>> problems (in general I believe when publishing a paper the code
>>>>>> should be reproducible and made publicly available, but that's a
>>>>>> different issue).
>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>>> upgrade these samples as well.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> VJ
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>>>
>>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>>> +NLP datasets.
>>>>>>>
>>>>>>> I have at least one we can use that was created by our lead
>>>>>>> Pathologist:
>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>> We should provide at least one sample for each domain.
>>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>>> and not copy-pasted from EMR systems.
>>>>>>>
>>>>>>> --Andy
>>>>>>>
>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for this observation Andy!
>>>>>>>>
>>>>>>>> Lowering that time will motivate users to write blogs about features,
>>>>>>>> how-tos, etc., which reduces the core team's workload on documentation.
>>>>>>>>
>>>>>>>> I have been trying to write a small "how to write a standalone
>>>>>>>> client for ctakes" from my experience (I saw at least 4 users
>>>>>>>> post a similar question in the last 2 months), but I am not finding
>>>>>>>> enough time: because ctakes depends on a lot of other frameworks
>>>>>>>> (UimaFit, cleartk, UIMA Framework, etc.), most of my spare time is
>>>>>>>> spent juggling these frameworks, posting on and browsing their
>>>>>>>> forums, and relating observations back to the ctakes code. I think
>>>>>>>> we need some high-level documentation about these (with links to
>>>>>>>> the corresponding forums).
>>>>>>>>
>>>>>>>> The above case is for developers (I think this will be the larger user
>>>>>>>> base as ctakes progresses); for users I think the documentation is a
>>>>>>>> lot better, though some improvements are still needed.
>>>>>>>>
>>>>>>>> As a developer I found the lack of sample training data tough (I am
>>>>>>>> still struggling in this area even though I browsed all the relevant
>>>>>>>> code), though the training classes are there. I understand that there
>>>>>>>> are licensing issues with REAL data, but at least some handmade
>>>>>>>> example sentences, which may not be real, would help developers
>>>>>>>> understand the type/structure of input the TRAINING classes expect.
>>>>>>>> This way people who browse the code can reverse engineer it and
>>>>>>>> develop their own models. Sorry if you guys feel this is a novice
>>>>>>>> issue, but I feel most developers will be novices when they adopt a
>>>>>>>> system, and Machine Learning/NLP is an ocean. Some documentation in
>>>>>>>> this area will save a lot of time for us.
>>>>>>>>
>>>>>>>> I hope there will be some activity in this area from the ctakes core team.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Giri
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> ctakes is at a point where we have a LOT of features but it is
>>>>>>>>> still hard to get started.
>>>>>>>>>
>>>>>>>>> Judging from the mailing lists, a lot of how cTakes works is not
>>>>>>>>> obvious and requires hand holding.
>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>
>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>>> better bug reports, FAQ, etc.
>>>>>>>>>
>>>>>>>>> thoughts?
>>>>>>>>> --Andy
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>>>> Pei.Chen@childrens.harvard.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I just wanted to gauge the interest of creating the next release
>>>>>>>>>> of
>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>> - New regression testing component
>>>>>>>>>>
>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


Re: Next cTAKES release (3.1)?

Posted by John Green <he...@gmail.com>.
@Andy - Not a doctor yet, but soon! Thanks for the promotion though, one
more year!

- Apropos meds or clinical-type questions: any developer on here can feel
free to shoot me a quick question via the list anytime; I'd be happy to
confirm that a drug or anything else makes sense given a particular
clinical/note context.

- "I wonder if there is someway in which you could guide us in making
better use of the medical knowledge sources (ontologies) that are
available." - I'd be happy to brainstorm about using existing resources to
help in decision making. We use these all the time in the clinic.

@ Tim+Andy+Chen - I haven't had a chance to really start chewing into the
code, though I hope to over the next year; so, what kind of examples would
be most helpful?
    - Any particular disease processes?
    - Are you all familiar with the ubiquitous SOAP-style presentation that
doctors use to write free-text notes? The few examples I clicked through in
the repository that Chen pointed me to are very sparse. Would we want
gradations? E.g., a scale from "well done" notes to "very quick
I-don't-care-because-I'm-in-a-rush" notes?

@ Chen - Thank you for the kind words. It's nice to be welcomed by a
community in which you hope to integrate. And thank you for pointing me to
the directory with the current sample notes. This was very helpful in
determining where those are in their development. I know that each of
your hospitals has a wealth of HIPAA-closed notes, but I'll see what I can
do to make some "stereotypical" open notes for common disease
presentations. Again: maybe a scale, not necessarily just on brevity but
on some other metric, whose continuum represents varying degrees of
something, maybe of difficulty in processing? Apropos code,
Chen: I will help where I can but where I want to be is elbow deep in the
code :)

Finally: I haven't had a chance to look into some of the links from earlier
in this thread regarding open-access repositories of free-text clinical
notes. How do you all feel about the quality of these resources? Abundant
but low quality? Scarce, but what is there is high quality?

Bottom line: no problem either answering contextual questions (can afib be
associated with a lower GI bleed?) or writing some notes. My only question,
before I put in any time: what disease/specialty domain? And would we want
some system that puts them on a continuum of some variable, say, brevity
or "readability"?

Just thinking before leaping,

Thanks,
JG


On Tue, Jul 2, 2013 at 6:30 PM, Tim Miller <
timothy.miller@childrens.harvard.edu> wrote:

> Agreed that you could definitely help out, and that would be a great way
> to do so. We don't really have "examples" right now, more like just short
> test sentences for showing simple results and verifying that nothing has
> been broken by changes. I think regular length fake but realistic notes
> would be very useful.
> Tim
>
>
> On 07/02/2013 05:19 PM, John Green wrote:
>
>> Hi all,
>>
>> Ive been following this mail list for a couple of months. Im a third year
>> medical student rounding the bend toward my MD. I used to be a computer
>> programmer, however, and continue my own projects. Im very interested in
>> contributing eventually to cTakes development. In the meantime, given the
>> current talk of examples, if any domain specific examples needed generated
>> I am domain knowledgable enough that I could pound out a few free text
>> notes made to order.
>>
>> Let me know, you all may already have docs on hand willing todo this, but
>> if not...
>>
>> John Green
>>
>> Sent from my iPhone
>>
>> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
>>
>>  I completely agree with making cTAKES easier use.  I think it is
>>> exciting to hear the different use cases here and understanding where some
>>> of the areas that need improvements are (which we haven't thought about
>>> earlier).
>>> I think Tim's suggestions and the 3 concrete actionable items makes a
>>> lot of sense.  Hopefully it should attract new users, adopters, and perhaps
>>> more committers.
>>>
>>>  i) Make the typesystem forefront in documentation -- generate javadocs
>>>> and
>>>> have as a link on the ctakes frontpage/sidebar
>>>> ii) Similar to the way that we are aiming to have tests in every
>>>> module, also
>>>> have clearly labeled examples in every module that set up a pipeline,
>>>> run on
>>>> sample notes (could be the same sample notes from the tests), and do
>>>> something with the results.
>>>> iii) Follow Giri's recommendation to have example training data for
>>>> people
>>>> who want to take the next step and train their own models
>>>>
>>> I think Java developers are accustomed to including a library as a
>>> dependency/jar, have an API to pass input, and get the results via pojos;
>>>  So the examples could initially shield the complexity of wiring a pipeline
>>> together etc.
>>> If we can improve the API's and how it gets integrated with other apps,
>>> we can add any GUI/CLI tools on top of this afterwards.
>>>
>>> --Pei
>>>
>>>  -----Original Message-----
>>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>>> Sent: Friday, June 28, 2013 8:00 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: Next cTAKES release (3.1)?
>>>>
>>>> Very interesting discussion. I think Giri is right about giving example
>>>> training
>>>> data in the format that our training code can read. While our ultimate
>>>> goal
>>>> would be to build and release models that are completely domain-
>>>> independent, in the real world it is almost always better to use some
>>>> domain-specific data and we should think more about how to facilitate
>>>> that.
>>>>
>>>> As for making it easier to get started, it is not totally clear to me
>>>> what this
>>>> means/how to do it so it might be useful to get specific about what this
>>>> means. I think our biggest hurdle is
>>>>
>>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>>
>>>> Since UIMAFit is officially becoming part of UIMA that will be easier,
>>>> and
>>>> hopefully people will just learn the easier (in my opinion) UIMAFit way
>>>> than
>>>> the standard UIMA way of doing things. Is there something we can be
>>>> doing
>>>> to make understanding UIMA easier? Or do we just need to say upfront
>>>> that
>>>> this is a prerequisite and hope that people don't give up due to this
>>>> thing that
>>>> is out of our control?
>>>>
>>>> Another hurdle is:
>>>>
>>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>>
>>>> So it's not just a matter of hiding complexity -- at some point people
>>>> have to
>>>> understand their problem, understand cTAKES' capabilities, and start
>>>> coding.
>>>> Pei's GUI will help for some common use cases but will not remove the
>>>> requirement that someone at the organization knows cTAKES.
>>>> I think one part of this problem is the fact that the typesystem is not
>>>> well
>>>> documented. A developer needs to know what the output is (objects from
>>>> the typesystem), how to get them (which modules/pipelines), and what
>>>> information is in them. So maybe on this end my recommendation would be:
>>>> i) Make the typesystem forefront in documentation -- generate javadocs
>>>> and
>>>> have as a link on the ctakes frontpage/sidebar
>>>> ii) Similar to the way that we are aiming to have tests in every
>>>> module, also
>>>> have clearly labeled examples in every module that set up a pipeline,
>>>> run on
>>>> sample notes (could be the same sample notes from the tests), and do
>>>> something with the results.
>>>> iii) Follow Giri's recommendation to have example training data for
>>>> people
>>>> who want to take the next step and train their own models
>>>>
>>>> This is quite a bit of developer overhead, so it's worth asking whether
>>>> you
>>>> agree with my "diagnosis" and "treatment" or whether you think there are
>>>> different problems/solutions that should be higher priority.
>>>>
>>>> Tim
>>>>
>>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>>
>>>>> Hi Vijay and Andy,
>>>>>
>>>>> Thanks for sharing those examples.
>>>>>
>>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>>
>>>>> Agree with this statement and this is very valid concern.
>>>>>
>>>>> In "getting started examples", I think we should just have couple of
>>>>> entries (5-10 small entries), not more than that (with explicit
>>>>> statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I
>>>>>
>>>> understand
>>>>
>>>>> handcrafting these may not be easy because we are not medical domain
>>>>> experts, but I feel worth time, because it brings in more user
>>>>> community.
>>>>>
>>>>> Thank you,
>>>>> Giri
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
>>>>>
>>>> <mc...@gmail.com>wrote:
>>>>
>>>>> GREAT !
>>>>>>
>>>>>> The i2b2 data though isn't publicly distributable, you still need to
>>>>>> request access to it since it is "semi private"
>>>>>>
>>>>>>
>>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>>>
>>>>>>  We released code on using cTAKES to annotate clinical text and SVMs
>>>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>>>> and I2B2
>>>>>>> 2008 challenges:
>>>>>>>
>>>>>>> We did the cmd 2007 with cTAKES 2.5:
>>>>>>>
>>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>>
>>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>>>> first version of ARC:
>>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>>
>>>>>>> These are both publicly available datasets, and represent real-world
>>>>>>> problems (in general I believe when publishing a paper the code
>>>>>>> should be reproducible and made publicly available, but that's a
>>>>>>> different
>>>>>>>
>>>>>> issue).
>>>>
>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>>>> upgrade these samples as well.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> VJ
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
>>>>>>> <mcmurry.andy@gmail.com
>>>>>>> wrote:
>>>>>>>
>>>>>>>  +1 suggestion for documenting many examples of "getting started"
>>>>>>>> +NLP
>>>>>>>> datasets.
>>>>>>>>
>>>>>>>> I have at least one we can use that was created by our lead
>>>>>>>> Pathologist
>>>>>>>>
>>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>
>>>>>>> We should provide at least one sample for each domain.
>>>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>>>> and not copy-pasted from EMR systems.
>>>>>>>>
>>>>>>>> --Andy
>>>>>>>>
>>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
>>>>>>>>
>>>>>>> girinambari@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  +1 for this observation Andy!
>>>>>>>>>
>>>>>>>>> Lowering time will motive users in writing blogs about features,
>>>>>>>>> how
>>>>>>>>>
>>>>>>>> to,
>>>>>>
>>>>>>> etc., which reduces core team work load on documentation.
>>>>>>>>>
>>>>>>>>> I have been trying to write a small "how to write standalone
>>>>>>>>> client for ctakes" with my experience (I saw at least 4 users
>>>>>>>>> posted similar
>>>>>>>>>
>>>>>>>> question
>>>>>>>>
>>>>>>>>> in last 2 months), but not getting enough time because ctakes
>>>>>>>>> depends
>>>>>>>>>
>>>>>>>> on
>>>>>>
>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,),
>>>>>>>>> most
>>>>>>>>>
>>>>>>>> of
>>>>>>
>>>>>>> my spare time is being spent on juggling between these frameworks,
>>>>>>>>>
>>>>>>>> posting
>>>>>>>>
>>>>>>>>> and browsing those forums, relating observations to ctakes code. I
>>>>>>>>>
>>>>>>>> think
>>>>>>
>>>>>>> we
>>>>>>>>
>>>>>>>>> need to have some high level documentation about these (with links
>>>>>>>>> to corresponding forums).
>>>>>>>>>
>>>>>>>>> Above case is for developers (I think this will be more user base
>>>>>>>>> as
>>>>>>>>>
>>>>>>>> ctakes
>>>>>>>>
>>>>>>>>> progress), for users I think documentation is lot better though
>>>>>>>>> some improvements need to be done.
>>>>>>>>>
>>>>>>>>> As a developer I felt tough with lack of sample training data (I
>>>>>>>>> am
>>>>>>>>>
>>>>>>>> still
>>>>>>
>>>>>>> struggling in this area even though I browsed all relevant code),
>>>>>>>>>
>>>>>>>> though
>>>>>>
>>>>>>> training class are there. I understood that there are licensing
>>>>>>>>> issues
>>>>>>>>>
>>>>>>>> with
>>>>>>>>
>>>>>>>>> REAL data, but at least some hand made example sentences, which
>>>>>>>>> may not
>>>>>>>>>
>>>>>>>> be
>>>>>>>>
>>>>>>>>> real but helps developers in understanding the type/structure of
>>>>>>>>> input TRAINING classes expecting. This way people who browse the
>>>>>>>>> code can
>>>>>>>>>
>>>>>>>> reverse
>>>>>>>>
>>>>>>>>> engineer and develop their own models. Sorry if you guys feel this
>>>>>>>>> as novice issue, but I feel most of the developers will be novice
>>>>>>>>> when
>>>>>>>>>
>>>>>>>> they
>>>>>>
>>>>>>> adopt a system and Machine Learning/NLP is ocean. Some
>>>>>>>>> documentation in this area will same lot of time for us.
>>>>>>>>>
>>>>>>>>> I wish there will be some activity in this area from ctakes core
>>>>>>>>> team.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Giri
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
>>>>>>>>> <mcmurry.andy@gmail.com
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>  ctakes is at a point where we have a LOT of features but it is
>>>>>>>>>> still
>>>>>>>>>>
>>>>>>>>> hard
>>>>>>>>
>>>>>>>>> to get started.
>>>>>>>>>>
>>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>>>>>>>>>
>>>>>>>>> obvious
>>>>>>
>>>>>>> and requires hand holding.
>>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>>
>>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>>>> better
>>>>>>>>>>
>>>>>>>>> bug
>>>>>>>>
>>>>>>>>> reports, FAQ, etc.
>>>>>>>>>>
>>>>>>>>>> thoughts?
>>>>>>>>>> --Andy
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pei.Chen@childrens.harvard.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>> I just wanted to gauge the interest of creating the next release
>>>>>>>>>>> of
>>>>>>>>>>>
>>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>>
>>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or
>>>>>>>>>>> closed.
>>>>>>>>>>>
>>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>>
>>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>>> - New regression testing component
>>>>>>>>>>>
>>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>>
>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>>>>>
>>>>>
>

RE: Next cTAKES release (3.1)?

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
+1 for Dr. Green generating fake but realistic-looking notes.

Dr. Green,
If you can generate a few notes that could go in the 3.1 release, that would be wonderful! Thank you!
--Guergana

-----Original Message-----
From: Tim Miller [mailto:timothy.miller@childrens.harvard.edu] 
Sent: Tuesday, July 02, 2013 6:31 PM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

Agreed that you could definitely help out, and that would be a great way to do so. We don't really have "examples" right now, more like just short test sentences for showing simple results and verifying that nothing has been broken by changes. I think regular length fake but realistic notes would be very useful.
Tim

On 07/02/2013 05:19 PM, John Green wrote:
> Hi all,
>
> Ive been following this mail list for a couple of months. Im a third year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and continue my own projects. Im very interested in contributing eventually to cTakes development. In the meantime, given the current talk of examples, if any domain specific examples needed generated I am domain knowledgable enough that I could pound out a few free text notes made to order.
>
> Let me know, you all may already have docs on hand willing todo this, but if not...
>
> John Green
>
> Sent from my iPhone
>
> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
>
>> I completely agree with making cTAKES easier use.  I think it is exciting to hear the different use cases here and understanding where some of the areas that need improvements are (which we haven't thought about earlier).
>> I think Tim's suggestions and the 3 concrete actionable items makes a lot of sense.  Hopefully it should attract new users, adopters, and perhaps more committers.
>>
>>> i) Make the typesystem forefront in documentation -- generate 
>>> javadocs and have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every 
>>> module, also have clearly labeled examples in every module that set 
>>> up a pipeline, run on sample notes (could be the same sample notes 
>>> from the tests), and do something with the results.
>>> iii) Follow Giri's recommendation to have example training data for 
>>> people who want to take the next step and train their own models
>> I think Java developers are accustomed to including a library as a dependency/jar, have an API to pass input, and get the results via pojos;  So the examples could initially shield the complexity of wiring a pipeline together etc.
>> If we can improve the API's and how it gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
>>
>> --Pei
>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>> Sent: Friday, June 28, 2013 8:00 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Next cTAKES release (3.1)?
>>>
>>> Very interesting discussion. I think Giri is right about giving 
>>> example training data in the format that our training code can read. 
>>> While our ultimate goal would be to build and release models that 
>>> are completely domain- independent, in the real world it is almost 
>>> always better to use some domain-specific data and we should think more about how to facilitate that.
>>>
>>> As for making it easier to get started, it is not totally clear to 
>>> me what this means/how to do it so it might be useful to get 
>>> specific about what this means. I think our biggest hurdle is
>>>
>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>
>>> Since UIMAFit is officially becoming part of UIMA that will be 
>>> easier, and hopefully people will just learn the easier (in my 
>>> opinion) UIMAFit way than the standard UIMA way of doing things. Is 
>>> there something we can be doing to make understanding UIMA easier? 
>>> Or do we just need to say upfront that this is a prerequisite and 
>>> hope that people don't give up due to this thing that is out of our control?
>>>
>>> Another hurdle is:
>>>
>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>
>>> So it's not just a matter of hiding complexity -- at some point 
>>> people have to understand their problem, understand cTAKES' capabilities, and start coding.
>>> Pei's GUI will help for some common use cases but will not remove 
>>> the requirement that someone at the organization knows cTAKES.
>>> I think one part of this problem is the fact that the typesystem is 
>>> not well documented. A developer needs to know what the output is 
>>> (objects from the typesystem), how to get them (which 
>>> modules/pipelines), and what information is in them. So maybe on this end my recommendation would be:
>>> i) Make the typesystem forefront in documentation -- generate 
>>> javadocs and have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every 
>>> module, also have clearly labeled examples in every module that set 
>>> up a pipeline, run on sample notes (could be the same sample notes 
>>> from the tests), and do something with the results.
>>> iii) Follow Giri's recommendation to have example training data for 
>>> people who want to take the next step and train their own models
>>>
>>> This is quite a bit of developer overhead, so it's worth asking 
>>> whether you agree with my "diagnosis" and "treatment" or whether you 
>>> think there are different problems/solutions that should be higher priority.
>>>
>>> Tim
>>>
>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>> Hi Vijay and Andy,
>>>>
>>>> Thanks for sharing those examples.
>>>>
>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>
>>>> Agree with this statement and this is very valid concern.
>>>>
>>>> In "getting started examples", I think we should just have couple 
>>>> of entries (5-10 small entries), not more than that (with explicit 
>>>> statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I
>>> understand
>>>> handcrafting these may not be easy because we are not medical 
>>>> domain experts, but I feel worth time, because it brings in more user community.
>>>>
>>>> Thank you,
>>>> Giri
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
>>> <mc...@gmail.com>wrote:
>>>>> GREAT !
>>>>>
>>>>> The i2b2 data though isn't publicly distributable, you still need 
>>>>> to request access to it since it is "semi private"
>>>>>
>>>>>
>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>>
>>>>>> We released code on using cTAKES to annotate clinical text and 
>>>>>> SVMs that use the annotations to classify clinical text from the 
>>>>>> CMC 2007 and I2B2
>>>>>> 2008 challenges:
>>>>>>
>>>>>> We did the cmd 2007 with cTAKES 2.5:
>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the 
>>>>>> first version of ARC:
>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>
>>>>>> These are both publicly available datasets, and represent 
>>>>>> real-world problems (in general I believe when publishing a paper 
>>>>>> the code should be reproducible and made publicly available, but 
>>>>>> that's a different
>>> issue).
>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like 
>>>>>> to upgrade these samples as well.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> VJ
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry 
>>>>>> <mcmurry.andy@gmail.com
>>>>>> wrote:
>>>>>>
>>>>>>> +1 suggestion for documenting many examples of "getting started"
>>>>>>> +NLP
>>>>>>> datasets.
>>>>>>>
>>>>>>> I have at least one we can use that was created by our lead 
>>>>>>> Pathologist
>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>> We should provide at least one sample for each domain.
>>>>>>> Trouble is, privacy requires that these examples be made up by 
>>>>>>> hand and not copy-pasted from EMR systems.
>>>>>>>
>>>>>>> --Andy
>>>>>>>
>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
>>>>> girinambari@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 for this observation Andy!
>>>>>>>>
>>>>>>>> Lowering time will motive users in writing blogs about 
>>>>>>>> features, how
>>>>> to,
>>>>>>>> etc., which reduces core team work load on documentation.
>>>>>>>>
>>>>>>>> I have been trying to write a small "how to write standalone 
>>>>>>>> client for ctakes" with my experience (I saw at least 4 users 
>>>>>>>> posted similar
>>>>>>> question
>>>>>>>> in last 2 months), but not getting enough time because ctakes 
>>>>>>>> depends
>>>>> on
>>>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework 
>>>>>>>> etc.,), most
>>>>> of
>>>>>>>> my spare time is being spent on juggling between these 
>>>>>>>> frameworks,
>>>>>>> posting
>>>>>>>> and browsing those forums, relating observations to ctakes 
>>>>>>>> code. I
>>>>> think
>>>>>>> we
>>>>>>>> need to have some high level documentation about these (with 
>>>>>>>> links to corresponding forums).
>>>>>>>>
>>>>>>>> Above case is for developers (I think this will be more user 
>>>>>>>> base as
>>>>>>> ctakes
>>>>>>>> progress), for users I think documentation is lot better though 
>>>>>>>> some improvements need to be done.
>>>>>>>>
>>>>>>>> As a developer I felt tough with lack of sample training data 
>>>>>>>> (I am
>>>>> still
>>>>>>>> struggling in this area even though I browsed all relevant 
>>>>>>>> code),
>>>>> though
>>>>>>>> training class are there. I understood that there are licensing 
>>>>>>>> issues
>>>>>>> with
>>>>>>>> REAL data, but at least some hand made example sentences, which 
>>>>>>>> may not
>>>>>>> be
>>>>>>>> real but helps developers in understanding the type/structure 
>>>>>>>> of input TRAINING classes expecting. This way people who browse 
>>>>>>>> the code can
>>>>>>> reverse
>>>>>>>> engineer and develop their own models. Sorry if you guys feel 
>>>>>>>> this as novice issue, but I feel most of the developers will be 
>>>>>>>> novice when
>>>>> they
>>>>>>>> adopt a system and Machine Learning/NLP is ocean. Some 
>>>>>>>> documentation in this area will same lot of time for us.
>>>>>>>>
>>>>>>>> I wish there will be some activity in this area from ctakes core team.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Giri
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry 
>>>>>>>> <mcmurry.andy@gmail.com
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> ctakes is at a point where we have a LOT of features but it is 
>>>>>>>>> still
>>>>>>> hard
>>>>>>>>> to get started.
>>>>>>>>>
>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is 
>>>>>>>>> not
>>>>> obvious
>>>>>>>>> and requires hand holding.
>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>
>>>>>>>>> Lowering the time to get invested in ctakes gets more users 
>>>>>>>>> AND better
>>>>>>> bug
>>>>>>>>> reports, FAQ, etc.
>>>>>>>>>
>>>>>>>>> thoughts?
>>>>>>>>> --Andy
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>>>> Pei.Chen@childrens.harvard.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I just wanted to gauge the interest of creating the next 
>>>>>>>>>> release of
>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>> - New regression testing component
>>>>>>>>>>
>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


Re: Next cTAKES release (3.1)?

Posted by Tim Miller <ti...@childrens.harvard.edu>.
Agreed that you could definitely help out, and that would be a great way 
to do so. We don't really have "examples" right now, more like just 
short test sentences for showing simple results and verifying that 
nothing has been broken by changes. I think regular length fake but 
realistic notes would be very useful.
Tim

On 07/02/2013 05:19 PM, John Green wrote:
> Hi all,
>
> Ive been following this mail list for a couple of months. Im a third year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and continue my own projects. Im very interested in contributing eventually to cTakes development. In the meantime, given the current talk of examples, if any domain specific examples needed generated I am domain knowledgable enough that I could pound out a few free text notes made to order.
>
> Let me know, you all may already have docs on hand willing todo this, but if not...
>
> John Green
>
> Sent from my iPhone
>
> On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:
>
>> I completely agree with making cTAKES easier use.  I think it is exciting to hear the different use cases here and understanding where some of the areas that need improvements are (which we haven't thought about earlier).
>> I think Tim's suggestions and the 3 concrete actionable items makes a lot of sense.  Hopefully it should attract new users, adopters, and perhaps more committers.
>>
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>> I think Java developers are accustomed to including a library as a dependency/jar, have an API to pass input, and get the results via pojos;  So the examples could initially shield the complexity of wiring a pipeline together etc.
>> If we can improve the API's and how it gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
>>
>> --Pei
>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
>>> Sent: Friday, June 28, 2013 8:00 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Next cTAKES release (3.1)?
>>>
>>> Very interesting discussion. I think Giri is right about giving example training
>>> data in the format that our training code can read. While our ultimate goal
>>> would be to build and release models that are completely domain-
>>> independent, in the real world it is almost always better to use some
>>> domain-specific data and we should think more about how to facilitate that.
>>>
>>> As for making it easier to get started, it is not totally clear to me what this
>>> means/how to do it so it might be useful to get specific about what this
>>> means. I think our biggest hurdle is
>>>
>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>
>>> Since UIMAFit is officially becoming part of UIMA that will be easier, and
>>> hopefully people will just learn the easier (in my opinion) UIMAFit way than
>>> the standard UIMA way of doing things. Is there something we can be doing
>>> to make understanding UIMA easier? Or do we just need to say upfront that
>>> this is a prerequisite and hope that people don't give up due to this thing that
>>> is out of our control?
>>>
>>> Another hurdle is:
>>>
>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>
>>> So it's not just a matter of hiding complexity -- at some point people have to
>>> understand their problem, understand cTAKES' capabilities, and start coding.
>>> Pei's GUI will help for some common use cases but will not remove the
>>> requirement that someone at the organization knows cTAKES.
>>> I think one part of this problem is the fact that the typesystem is not well
>>> documented. A developer needs to know what the output is (objects from
>>> the typesystem), how to get them (which modules/pipelines), and what
>>> information is in them. So maybe on this end my recommendation would be:
>>> i) Make the typesystem forefront in documentation -- generate javadocs and
>>> have as a link on the ctakes frontpage/sidebar
>>> ii) Similar to the way that we are aiming to have tests in every module, also
>>> have clearly labeled examples in every module that set up a pipeline, run on
>>> sample notes (could be the same sample notes from the tests), and do
>>> something with the results.
>>> iii) Follow Giri's recommendation to have example training data for people
>>> who want to take the next step and train their own models
>>>
>>> This is quite a bit of developer overhead, so it's worth asking whether you
>>> agree with my "diagnosis" and "treatment" or whether you think there are
>>> different problems/solutions that should be higher priority.
>>>
>>> Tim
>>>
>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>> Hi Vijay and Andy,
>>>>
>>>> Thanks for sharing those examples.
>>>>
>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>
>>>> I agree with this statement; it is a very valid concern.
>>>>
>>>> In the "getting started" examples, I think we should have just a handful
>>>> of small entries (5-10), no more than that, with an explicit statement
>>>> like "EXAMPLE ONLY, NOT GOOD FOR REAL USAGE". I understand that
>>>> handcrafting these may not be easy because we are not medical domain
>>>> experts, but I feel it is worth the time because it brings in a larger
>>>> user community.
>>>>
>>>> Thank you,
>>>> Giri
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com> wrote:
>>>>> GREAT!
>>>>>
>>>>> The i2b2 data isn't publicly distributable, though; you still need to
>>>>> request access to it since it is "semi-private".
>>>>>
>>>>>
>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>>>>
>>>>>> We released code that uses cTAKES to annotate clinical text, along with
>>>>>> SVMs that use the annotations to classify clinical text from the CMC 2007
>>>>>> and i2b2 2008 challenges.
>>>>>>
>>>>>> We did the CMC 2007 challenge with cTAKES 2.5:
>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>>
>>>>>> And the i2b2 2008 challenge with the version of cTAKES distributed with
>>>>>> the first version of ARC:
>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>
>>>>>> These are both publicly available datasets and represent real-world
>>>>>> problems (in general, I believe that when a paper is published the results
>>>>>> should be reproducible and the code made publicly available, but that's a
>>>>>> different issue).
>>>>>>
>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>>> upgrade these samples as well.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> VJ
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>>>
>>>>>>> +1 to the suggestion of documenting many examples of "getting started"
>>>>>>> NLP datasets.
>>>>>>>
>>>>>>> I have at least one we can use that was created by our lead pathologist:
>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>>
>>>>>>> We should provide at least one sample for each domain.
>>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>>> and not copy-pasted from EMR systems.
>>>>>>>
>>>>>>> --Andy
>>>>>>>
>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <girinambari@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for this observation, Andy!
>>>>>>>>
>>>>>>>> Lowering the time to get started will motivate users to write blogs about
>>>>>>>> features, how-tos, etc., which reduces the core team's documentation
>>>>>>>> workload.
>>>>>>>>
>>>>>>>> I have been trying to write a small "how to write a standalone client for
>>>>>>>> cTAKES" guide from my experience (I saw at least 4 users post similar
>>>>>>>> questions in the last 2 months), but I am not finding enough time. Because
>>>>>>>> cTAKES depends on a lot of other frameworks (UimaFit, cleartk, UIMA
>>>>>>>> Framework, etc.), most of my spare time is spent juggling these
>>>>>>>> frameworks, posting to and browsing their forums, and relating what I
>>>>>>>> find to the cTAKES code. I think we need some high-level documentation
>>>>>>>> about these (with links to the corresponding forums).
>>>>>>>>
>>>>>>>> The above is the developer's case (I think developers will make up more
>>>>>>>> of the user base as cTAKES progresses). For users, I think the
>>>>>>>> documentation is a lot better, though some improvements are still needed.
>>>>>>>>
>>>>>>>> As a developer, I found the lack of sample training data tough (I am
>>>>>>>> still struggling in this area even though I browsed all the relevant
>>>>>>>> code), even though the training classes are there. I understand that
>>>>>>>> there are licensing issues with REAL data, but at least some handmade
>>>>>>>> example sentences, which may not be real, would help developers
>>>>>>>> understand the type/structure of input the TRAINING classes expect. That
>>>>>>>> way people who browse the code can reverse engineer it and develop their
>>>>>>>> own models. Sorry if you feel this is a novice issue, but I feel most
>>>>>>>> developers are novices when they adopt a system, and machine
>>>>>>>> learning/NLP is an ocean. Some documentation in this area would save us
>>>>>>>> a lot of time.
>>>>>>>>
>>>>>>>> I hope there will be some activity in this area from the cTAKES core team.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Giri
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> cTAKES is at a point where we have a LOT of features, but it is still
>>>>>>>>> hard to get started.
>>>>>>>>>
>>>>>>>>> Judging from the mailing lists, a lot of how cTAKES works is not obvious
>>>>>>>>> and requires hand-holding.
>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>
>>>>>>>>> Lowering the time it takes to get invested in cTAKES gets us more users
>>>>>>>>> AND better bug reports, FAQs, etc.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>> --Andy
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pei.Chen@childrens.harvard.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I just wanted to gauge the interest of creating the next release of
>>>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>>>>
>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>> - New regression testing component
>>>>>>>>>>
>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES


Re: Next cTAKES release (3.1)?

Posted by John Green <jo...@gmail.com>.
Hi all,

I've been following this mailing list for a couple of months. I'm a third-year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and I continue to work on my own projects. I'm very interested in eventually contributing to cTAKES development. In the meantime, given the current talk of examples, if any domain-specific examples need to be generated, I am knowledgeable enough in the domain that I could pound out a few free-text notes made to order.

Let me know. You may already have docs on hand who are willing to do this, but if not...

John Green

Sent from my iPhone

On Jun 28, 2013, at 8:59, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different use cases here and to understand which areas need improvement (including some we hadn't thought about earlier).
> I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully they will attract new users, adopters, and perhaps more committers.
> 
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
> 
> I think Java developers are accustomed to including a library as a dependency/jar, having an API to pass input to, and getting the results back as POJOs, so the examples could initially shield them from the complexity of wiring a pipeline together.
> If we can improve the APIs and how cTAKES gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
> 
> --Pei
> 

Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
+1 to Tim's suggestion.

On Jul 2, 2013, at 10:13 AM, "Masanz, James J." <Ma...@mayo.edu> wrote:

> I agree with Tim's diagnosis and treatment plan.
> 
> -----Original Message-----
> From: dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen, Pei
> Sent: Friday, June 28, 2013 9:00 AM
> To: dev@ctakes.apache.org
> Subject: RE: Next cTAKES release (3.1)?
> 
> I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different use cases here and to understand which areas need improvement (including some we hadn't thought about earlier).
> I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully they will attract new users, adopters, and perhaps more committers.
> 
>> i) Make the typesystem forefront in documentation -- generate javadocs and
>> have as a link on the ctakes frontpage/sidebar
>> ii) Similar to the way that we are aiming to have tests in every module, also
>> have clearly labeled examples in every module that set up a pipeline, run on
>> sample notes (could be the same sample notes from the tests), and do
>> something with the results.
>> iii) Follow Giri's recommendation to have example training data for people
>> who want to take the next step and train their own models
> 
> I think Java developers are accustomed to including a library as a dependency/jar, having an API to pass input to, and getting the results back as POJOs, so the examples could initially shield them from the complexity of wiring a pipeline together.
> If we can improve the APIs and how cTAKES gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
> 
> --Pei
> 


RE: Next cTAKES release (3.1)?

Posted by "Masanz, James J." <Ma...@mayo.edu>.
I agree with Tim's diagnosis and treatment plan.

-----Original Message-----
From: dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1714-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Chen, Pei
Sent: Friday, June 28, 2013 9:00 AM
To: dev@ctakes.apache.org
Subject: RE: Next cTAKES release (3.1)?

I completely agree with making cTAKES easier to use.  I think it is exciting to hear the different use cases here and to understand which areas need improvement (including some we hadn't thought about earlier).
I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.  Hopefully they will attract new users, adopters, and perhaps more committers.

> i) Make the typesystem forefront in documentation -- generate javadocs and
> have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also
> have clearly labeled examples in every module that set up a pipeline, run on
> sample notes (could be the same sample notes from the tests), and do
> something with the results.
> iii) Follow Giri's recommendation to have example training data for people
> who want to take the next step and train their own models

I think Java developers are accustomed to including a library as a dependency/jar, having an API to pass input to, and getting the results back as POJOs, so the examples could initially shield them from the complexity of wiring a pipeline together.
If we can improve the APIs and how cTAKES gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.

--Pei

> -----Original Message-----
> From: Miller, Timothy [mailto:Timothy.Miller@childrens.harvard.edu]
> Sent: Friday, June 28, 2013 8:00 AM
> To: dev@ctakes.apache.org
> Subject: Re: Next cTAKES release (3.1)?
> 
> Very interesting discussion. I think Giri is right about giving example training
> data in the format that our training code can read. While our ultimate goal
> would be to build and release models that are completely domain-
> independent, in the real world it is almost always better to use some
> domain-specific data and we should think more about how to facilitate that.
> 
> As for making it easier to get started, it is not totally clear to me what this
> means/how to do it so it might be useful to get specific about what this
> means. I think our biggest hurdle is
> 
> 1) Prerequisite of understanding UIMA/UIMAFit
> 
> Since UIMAFit is officially becoming part of UIMA that will be easier, and
> hopefully people will just learn the easier (in my opinion) UIMAFit way than
> the standard UIMA way of doing things. Is there something we can be doing
> to make understanding UIMA easier? Or do we just need to say upfront that
> this is a prerequisite and hope that people don't give up due to this thing that
> is out of our control?
> 
> Another hurdle is:
> 
> 2) cTAKES is a multi-purpose developer-aimed tool
> 
> So it's not just a matter of hiding complexity -- at some point people have to
> understand their problem, understand cTAKES' capabilities, and start coding.
> Pei's GUI will help for some common use cases but will not remove the
> requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not well
> documented. A developer needs to know what the output is (objects from
> the typesystem), how to get them (which modules/pipelines), and what
> information is in them. So maybe on this end my recommendation would be:
> i) Make the typesystem forefront in documentation -- generate javadocs and
> have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module, also
> have clearly labeled examples in every module that set up a pipeline, run on
> sample notes (could be the same sample notes from the tests), and do
> something with the results.
> iii) Follow Giri's recommendation to have example training data for people
> who want to take the next step and train their own models
> 
> This is quite a bit of developer overhead, so it's worth asking whether you
> agree with my "diagnosis" and "treatment" or whether you think there are
> different problems/solutions that should be higher priority.
> 
> Tim
> 
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> > Hi Vijay and Andy,
> >
> > Thanks for sharing those examples.
> >
> > "Trouble is, privacy requires that these examples be made up by hand"
> >
> > Agree with this statement and this is very valid concern.
> >
> > In "getting started examples", I think we should just have couple of
> > entries (5-10 small entries), not more than that (with explicit
> > statement like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I understand
> > handcrafting these may not be easy because we are not medical domain
> > experts, but I feel worth time, because it brings in more user community.
> >
> > Thank you,
> > Giri
> >
> >
> >
> >
> >
> > On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com> wrote:
> >
> >> GREAT !
> >>
> >> The i2b2 data though isn't publicly distributable, you still need to
> >> request access to it since it is "semi private"
> >>
> >>
> >> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
> >>
> >>> We released code on using cTAKES to annotate clinical text and SVMs
> >>> that use the annotations to classify clinical text from the CMC 2007
> >>> and I2B2
> >>> 2008 challenges:
> >>>
> >>> We did the cmd 2007 with cTAKES 2.5:
> >>>
> >> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
> >> <https://code.google.com/p/ytex/downloads/list>
> >>>
> >>> And the i2b2 2008 with the version of cTAKES distributed with the
> >>> first version of ARC:
> >>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> >>>
> >>> These are both publicly available datasets, and represent real-world
> >>> problems (in general I believe when publishing a paper the code
> >>> should be reproducible and made publicly available, but that's a
> >>> different issue).
> >>>
> >>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
> >>> upgrade these samples as well.
> >>>
> >>> Best,
> >>>
> >>> VJ
> >>>
> >>>
> >>>
> >>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
> >>> <mcmurry.andy@gmail.com
> >>> wrote:
> >>>
> >>>> +1 suggestion for documenting many examples of "getting started"
> >>>> +NLP
> >>>> datasets.
> >>>>
> >>>> I have at least one we can use that was created by our lead
> >>>> Pathologist
> >>>>
> >>>>
> >> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
> >>>> We should provide at least one sample for each domain.
> >>>> Trouble is, privacy requires that these examples be made up by hand
> >>>> and not copy-pasted from EMR systems.
> >>>>
> >>>> --Andy
> >>>>
> >>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
> >> girinambari@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> +1 for this observation Andy!
> >>>>>
> >>>>> Lowering time will motive users in writing blogs about features,
> >>>>> how
> >> to,
> >>>>> etc., which reduces core team work load on documentation.
> >>>>>
> >>>>> I have been trying to write a small "how to write standalone
> >>>>> client for ctakes" with my experience (I saw at least 4 users
> >>>>> posted similar
> >>>> question
> >>>>> in last 2 months), but not getting enough time because ctakes
> >>>>> depends
> >> on
> >>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,),
> >>>>> most
> >> of
> >>>>> my spare time is being spent on juggling between these frameworks,
> >>>> posting
> >>>>> and browsing those forums, relating observations to ctakes code. I
> >> think
> >>>> we
> >>>>> need to have some high level documentation about these (with links
> >>>>> to corresponding forums).
> >>>>>
> >>>>> Above case is for developers (I think this will be more user base
> >>>>> as
> >>>> ctakes
> >>>>> progress), for users I think documentation is lot better though
> >>>>> some improvements need to be done.
> >>>>>
> >>>>> As a developer I felt tough with lack of sample training data (I
> >>>>> am
> >> still
> >>>>> struggling in this area even though I browsed all relevant code),
> >> though
> >>>>> training class are there. I understood that there are licensing
> >>>>> issues
> >>>> with
> >>>>> REAL data, but at least some hand made example sentences, which
> >>>>> may not
> >>>> be
> >>>>> real but helps developers in understanding the type/structure of
> >>>>> input TRAINING classes expecting. This way people who browse the
> >>>>> code can
> >>>> reverse
> >>>>> engineer and develop their own models. Sorry if you guys feel this
> >>>>> as novice issue, but I feel most of the developers will be novice
> >>>>> when
> >> they
> >>>>> adopt a system and Machine Learning/NLP is ocean. Some
> >>>>> documentation in this area will same lot of time for us.
> >>>>>
> >>>>> I wish there will be some activity in this area from ctakes core team.
> >>>>>
> >>>>> Thank you,
> >>>>> Giri
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
> >>>>> <mcmurry.andy@gmail.com
> >>>>> wrote:
> >>>>>
> >>>>>> ctakes is at a point where we have a LOT of features but it is
> >>>>>> still
> >>>> hard
> >>>>>> to get started.
> >>>>>>
> >>>>>> Judging from the mailing lists a lot of how cTakes works is not
> >> obvious
> >>>>>> and requires hand holding.
> >>>>>> This is very typical in early FOSS projects.
> >>>>>>
> >>>>>> Lowering the time to get invested in ctakes gets more users AND
> >>>>>> better
> >>>> bug
> >>>>>> reports, FAQ, etc.
> >>>>>>
> >>>>>> thoughts?
> >>>>>> --Andy
> >>>>>>
> >>>>>>
> >>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
> >>>> Pei.Chen@childrens.harvard.edu>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I just wanted to gauge the interest of creating the next release
> >>>>>>> of
> >>>>>> cTAKES (3.1) which is currently marked for May in Jira-
> >>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
> >>>>>> Plenty of bug fixes and new components including:
> >>>>>>> - New CEM Instance Template population
> >>>>>>> - New Dependency Parser/Semantic Role Labeler
> >>>>>>> - New optional Clear POSTagger
> >>>>>>> - New regression testing component
> >>>>>>>
> >>>>>>> Should we wait for the Temporal component?
> >>>>>>>
> >>>>>>> [1]
> >> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >>>>>>
> >>>>
> >>


Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
+1 ctakes IS domain specific 
+1 UIMAFit should become a part of UIMA and not the focus of ctakes-dev 

At first glance, people should think of cTakes as the "UIMA medical text library". 

Here are examples that I know users are interested in. 

Suggestions: 

1. ctakes DRUG PROFILE 
http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular&Sample=775-H%26P+-+Cardio+(Angina)

2. ctakes NER : 
http://www.mtsamples.com/site/pages/sample.asp?Type=77-rheumatology&Sample=790-Rheumatoid+Arthritis+-+H%26P

3. ctakes SMOKING: 
http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular%20/%20Pulmonary&Sample=571-Trouble%20breathing

4. ctakes Lexical features (PoS, sentence boundaries, etc) 
http://www.medicaltranscriptionsamples.com/diabetes-mellitus-followup/







> Very interesting discussion. I think Giri is right about giving example
> training data in the format that our training code can read. While our
> ultimate goal would be to build and release models that are completely
> domain-independent, in the real world it is almost always better to use
> some domain-specific data and we should think more about how to
> facilitate that.



> 
> As for making it easier to get started, it is not totally clear to me
> what this means/how to do it so it might be useful to get specific about
> what this means. I think our biggest hurdle is
> 
> 1) Prerequisite of understanding UIMA/UIMAFit
> 
> Since UIMAFit is officially becoming part of UIMA that will be easier,
> and hopefully people will just learn the easier (in my opinion) UIMAFit
> way than the standard UIMA way of doing things. Is there something we
> can be doing to make understanding UIMA easier? Or do we just need to
> say upfront that this is a prerequisite and hope that people don't give
> up due to this thing that is out of our control?
> 
> Another hurdle is:
> 
> 2) cTAKES is a multi-purpose developer-aimed tool
> 
> So it's not just a matter of hiding complexity -- at some point people
> have to understand their problem, understand cTAKES' capabilities, and
> start coding. Pei's GUI will help for some common use cases but will not
> remove the requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not
> well documented. A developer needs to know what the output is (objects
> from the typesystem), how to get them (which modules/pipelines), and
> what information is in them. So maybe on this end my recommendation
> would be:
> i) Make the typesystem forefront in documentation -- generate javadocs
> and have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module,
> also have clearly labeled examples in every module that set up a
> pipeline, run on sample notes (could be the same sample notes from the
> tests), and do something with the results.
> iii) Follow Giri's recommendation to have example training data for
> people who want to take the next step and train their own models
> 
> This is quite a bit of developer overhead, so it's worth asking whether
> you agree with my "diagnosis" and "treatment" or whether you think there
> are different problems/solutions that should be higher priority.
> 
> Tim
> 
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>> Hi Vijay and Andy,
>> 
>> Thanks for sharing those examples.
>> 
>> "Trouble is, privacy requires that these examples be made up by hand"
>> 
>> Agree with this statement and this is very valid concern.
>> 
>> In "getting started examples", I think we should just have couple of
>> entries (5-10 small entries), not more than that (with explicit statement
>> like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I understand handcrafting
>> these may not be easy because we are not medical domain experts, but I feel
>> worth time, because it brings in more user community.
>> 
>> Thank you,
>> Giri
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com>wrote:
>> 
>>> GREAT !
>>> 
>>> The i2b2 data though isn't publicly distributable, you still need to
>>> request access to it since it is "semi private"
>>> 
>>> 
>>> On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:
>>> 
>>>> We released code on using cTAKES to annotate clinical text and SVMs that
>>>> use the annotations to classify clinical text from the CMC 2007 and I2B2
>>>> 2008 challenges:
>>>> 
>>>> We did the cmd 2007 with cTAKES 2.5:
>>>> 
>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>> <https://code.google.com/p/ytex/downloads/list>
>>>> 
>>>> And the i2b2 2008 with the version of cTAKES distributed with the first
>>>> version of ARC:
>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>> 
>>>> These are both publicly available datasets, and represent real-world
>>>> problems (in general I believe when publishing a paper the code should be
>>>> reproducible and made publicly available, but that's a different issue).
>>>> 
>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>> upgrade these samples as well.
>>>> 
>>>> Best,
>>>> 
>>>> VJ
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mcmurry.andy@gmail.com
>>>> wrote:
>>>> 
>>>>> +1 suggestion for documenting many examples of "getting started" NLP
>>>>> datasets.
>>>>> 
>>>>> I have at least one we can use that was created by our lead Pathologist
>>>>> 
>>>>> 
>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>> We should provide at least one sample for each domain.
>>>>> Trouble is, privacy requires that these examples be made up by hand and
>>>>> not copy-pasted from EMR systems.
>>>>> 
>>>>> --Andy
>>>>> 
>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
>>> girinambari@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> +1 for this observation Andy!
>>>>>> 
>>>>>> Lowering time will motive users in writing blogs about features, how
>>> to,
>>>>>> etc., which reduces core team work load on documentation.
>>>>>> 
>>>>>> I have been trying to write a small "how to write standalone client for
>>>>>> ctakes" with my experience (I saw at least 4 users posted similar
>>>>> question
>>>>>> in last 2 months), but not getting enough time because ctakes depends
>>> on
>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,), most
>>> of
>>>>>> my spare time is being spent on juggling between these frameworks,
>>>>> posting
>>>>>> and browsing those forums, relating observations to ctakes code. I
>>> think
>>>>> we
>>>>>> need to have some high level documentation about these (with links to
>>>>>> corresponding forums).
>>>>>> 
>>>>>> Above case is for developers (I think this will be more user base as
>>>>> ctakes
>>>>>> progress), for users I think documentation is lot better though some
>>>>>> improvements need to be done.
>>>>>> 
>>>>>> As a developer I felt tough with lack of sample training data (I am
>>> still
>>>>>> struggling in this area even though I browsed all relevant code),
>>> though
>>>>>> training class are there. I understood that there are licensing issues
>>>>> with
>>>>>> REAL data, but at least some hand made example sentences, which may not
>>>>> be
>>>>>> real but helps developers in understanding the type/structure of input
>>>>>> TRAINING classes expecting. This way people who browse the code can
>>>>> reverse
>>>>>> engineer and develop their own models. Sorry if you guys feel this as
>>>>>> novice issue, but I feel most of the developers will be novice when
>>> they
>>>>>> adopt a system and Machine Learning/NLP is ocean. Some documentation in
>>>>>> this area will same lot of time for us.
>>>>>> 
>>>>>> I wish there will be some activity in this area from ctakes core team.
>>>>>> 
>>>>>> Thank you,
>>>>>> Giri
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com
>>>>>> wrote:
>>>>>> 
>>>>>>> ctakes is at a point where we have a LOT of features but it is still
>>>>> hard
>>>>>>> to get started.
>>>>>>> 
>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>> obvious
>>>>>>> and requires hand holding.
>>>>>>> This is very typical in early FOSS projects.
>>>>>>> 
>>>>>>> Lowering the time to get invested in ctakes gets more users AND better
>>>>> bug
>>>>>>> reports, FAQ, etc.
>>>>>>> 
>>>>>>> thoughts?
>>>>>>> --Andy
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>> Pei.Chen@childrens.harvard.edu>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I just wanted to gauge the interest of creating the next release of
>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>> - New CEM Instance Template population
>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>> - New optional Clear POSTagger
>>>>>>>> - New regression testing component
>>>>>>>> 
>>>>>>>> Should we wait for the Temporal component?
>>>>>>>> 
>>>>>>>> [1]
>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>>>>>> 
>>>>> 
>>> 
> 


Re: Next cTAKES release (3.1)?

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Very interesting discussion. I think Giri is right about giving example
training data in the format that our training code can read. While our
ultimate goal would be to build and release models that are completely
domain-independent, in the real world it is almost always better to use
some domain-specific data and we should think more about how to
facilitate that.

As for making it easier to get started, it is not totally clear to me
what this means or how to do it, so it might be useful to get specific.
I think our biggest hurdle is:

1) Prerequisite of understanding UIMA/UIMAFit

Since UIMAFit is officially becoming part of UIMA, that will get easier,
and hopefully people will just learn the easier (in my opinion) UIMAFit
way rather than the standard UIMA way of doing things. Is there something we
can be doing to make understanding UIMA easier? Or do we just need to
say upfront that this is a prerequisite and hope that people don't give
up due to this thing that is out of our control?

Another hurdle is:

2) cTAKES is a multi-purpose developer-aimed tool

So it's not just a matter of hiding complexity -- at some point people
have to understand their problem, understand cTAKES' capabilities, and
start coding. Pei's GUI will help for some common use cases but will not
remove the requirement that someone at the organization knows cTAKES.
I think one part of this problem is the fact that the typesystem is not
well documented. A developer needs to know what the output is (objects
from the typesystem), how to get them (which modules/pipelines), and
what information is in them. So maybe on this end my recommendation
would be:
i) Make the typesystem forefront in documentation -- generate javadocs
and have as a link on the ctakes frontpage/sidebar
ii) Similar to the way that we are aiming to have tests in every module,
also have clearly labeled examples in every module that set up a
pipeline, run on sample notes (could be the same sample notes from the
tests), and do something with the results.
iii) Follow Giri's recommendation to have example training data for
people who want to take the next step and train their own models

This is quite a bit of developer overhead, so it's worth asking whether
you agree with my "diagnosis" and "treatment" or whether you think there
are different problems/solutions that should be higher priority.

Tim

On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
> Hi Vijay and Andy,
>
> Thanks for sharing those examples.
>
> "Trouble is, privacy requires that these examples be made up by hand"
>
> Agree with this statement and this is very valid concern.
>
> In "getting started examples", I think we should just have couple of
> entries (5-10 small entries), not more than that (with explicit statement
> like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I understand handcrafting
> these may not be easy because we are not medical domain experts, but I feel
> worth time, because it brings in more user community.
>
> Thank you,
> Giri
>
>
>
>
>


RE: Next cTAKES release (3.1)?

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
We have 5-6 clinical notes that we got from the web (i.e., publicly available to anyone). We can include them as samples in the 3.1 release; we have been using these notes for demo purposes.
--Guergana

-----Original Message-----
From: Andy McMurry [mailto:mcmurry.andy@gmail.com] 
Sent: Friday, June 28, 2013 10:15 AM
To: dev@ctakes.apache.org
Subject: Re: Next cTAKES release (3.1)?

iDash and others have medical NLP datasets that could be used for cTAKES "Getting Started" examples
http://idash.ucsd.edu/nlp-and-data-modeling
http://idash.ucsd.edu/nlp/umls-vm

the GOOD: iDash already includes cTAKES
the BAD: iDash references old versions of cTAKES and points to caBIG (which is now defunct)

Recommendation: we should talk to iDash, create "hello medical world" training examples, and request that iDash point to the cTAKES Apache home page.

Disclaimer: I'm not involved with iDash 



Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
iDash and others have medical NLP datasets that could be used for ctakes "Getting Started" examples 
http://idash.ucsd.edu/nlp-and-data-modeling
http://idash.ucsd.edu/nlp/umls-vm

the GOOD: iDash already includes cTAKES
the BAD: iDash references old versions of cTAKES and points to caBIG (which is now defunct)

Recommendation: we should talk to iDash, create "hello medical world" training examples, and request that iDash point to the cTAKES Apache home page.

Disclaimer: I'm not involved with iDash 

On Jun 27, 2013, at 10:58 PM, Girivaraprasad Nambari <gi...@gmail.com> wrote:

> Hi Vijay and Andy,
> 
> Thanks for sharing those examples.
> 
> "Trouble is, privacy requires that these examples be made up by hand"
> 
> I agree with this statement; it is a very valid concern.
> 
> For the "getting started" examples, I think we should have just a couple of
> entries (5-10 small entries), not more than that (with an explicit statement
> like "EXAMPLE ONLY, NOT GOOD FOR REAL USAGE"). I understand handcrafting
> these may not be easy because we are not medical domain experts, but I feel
> it is worth the time, because it brings in more of the user community.
> 
> Thank you,
> Giri
> 
> 
> 
> 
> 


Re: Next cTAKES release (3.1)?

Posted by Girivaraprasad Nambari <gi...@gmail.com>.
Hi Vijay and Andy,

Thanks for sharing those examples.

"Trouble is, privacy requires that these examples be made up by hand"

I agree with this statement; it is a very valid concern.

For the "getting started" examples, I think we should have just a couple of
entries (5-10 small entries), not more than that (with an explicit statement
like "EXAMPLE ONLY, NOT GOOD FOR REAL USAGE"). I understand handcrafting
these may not be easy because we are not medical domain experts, but I feel
it is worth the time, because it brings in more of the user community.

Thank you,
Giri





On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <mc...@gmail.com> wrote:

> GREAT!
>
> The i2b2 data, though, isn't publicly distributable; you still need to
> request access to it, since it is "semi-private".
>
>

Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
GREAT!

The i2b2 data, though, isn't publicly distributable; you still need to request access to it, since it is "semi-private".


On Jun 27, 2013, at 9:52 PM, vijay garla <vn...@gmail.com> wrote:

> We released code that uses cTAKES to annotate clinical text, along with SVMs
> that use the annotations to classify clinical text from the CMC 2007 and i2b2
> 2008 challenges:
> 
> We did the CMC 2007 challenge with cTAKES 2.5:
> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge <https://code.google.com/p/ytex/downloads/list>
> 
> 
> And the i2b2 2008 with the version of cTAKES distributed with the first
> version of ARC:
> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
> 
> These are both publicly available datasets, and represent real-world
> problems (in general I believe when publishing a paper the code should be
> reproducible and made publicly available, but that's a different issue).
> 
> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
> upgrade these samples as well.
> 
> Best,
> 
> VJ
> 
> 
> 


Re: Next cTAKES release (3.1)?

Posted by vijay garla <vn...@gmail.com>.
We released code that uses cTAKES to annotate clinical text, along with SVMs
that use the annotations to classify clinical text from the CMC 2007 and i2b2
2008 challenges:

We did the CMC 2007 challenge with cTAKES 2.5:
https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge <https://code.google.com/p/ytex/downloads/list>


And the i2b2 2008 with the version of cTAKES distributed with the first
version of ARC:
https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008

These are both publicly available datasets, and represent real-world
problems (in general I believe when publishing a paper the code should be
reproducible and made publicly available, but that's a different issue).

When we get around to upgrading YTEX to cTAKES 3.1, we would like to
upgrade these samples as well.

Best,

VJ



On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <mc...@gmail.com> wrote:

> +1 suggestion for documenting many examples of "getting started" NLP
> datasets.
>
> I have at least one we can use that was created by our lead Pathologist
>
> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>
> We should provide at least one sample for each domain.
> Trouble is, privacy requires that these examples be made up by hand and
> not copy-pasted from EMR systems.
>
> --Andy
>
> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <gi...@gmail.com>
> wrote:
>
> > +1 for this observation Andy!
> >
> > Lowering time will motive users in writing blogs about features, how to,
> > etc., which reduces core team work load on documentation.
> >
> > I have been trying to write a small "how to write standalone client for
> > ctakes" with my experience (I saw at least 4 users posted similar
> question
> > in last 2 months), but not getting enough time because ctakes depends on
> > lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,), most of
> > my spare time is being spent on juggling between these frameworks,
> posting
> > and browsing those forums, relating observations to ctakes code. I think
> we
> > need to have some high level documentation about these (with links to
> > corresponding forums).
> >
> > Above case is for developers (I think this will be more user base as
> ctakes
> > progress), for users I think documentation is lot better though some
> > improvements need to be done.
> >
> > As a developer I felt tough with lack of sample training data (I am still
> > struggling in this area even though I browsed all relevant code), though
> > training class are there. I understood that there are licensing issues
> with
> > REAL data, but at least some hand made example sentences, which may not
> be
> > real but helps developers in understanding the type/structure of input
> > TRAINING classes expecting. This way people who browse the code can
> reverse
> > engineer and develop their own models. Sorry if you guys feel this as
> > novice issue, but I feel most of the developers will be novice when they
> > adopt a system and Machine Learning/NLP is ocean. Some documentation in
> > this area will same lot of time for us.
> >
> > I wish there will be some activity in this area from ctakes core team.
> >
> > Thank you,
> > Giri
> >
> >
> >
> > On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mcmurry.andy@gmail.com
> >wrote:
> >
> >> ctakes is at a point where we have a LOT of features but it is still
> >> hard to get started.
> >>
> >> Judging from the mailing lists a lot of how cTakes works is not obvious
> >> and requires hand holding.
> >> This is very typical in early FOSS projects.
> >>
> >> Lowering the time to get invested in ctakes gets more users AND better
> >> bug reports, FAQ, etc.
> >>
> >> thoughts?
> >> --Andy
> >>
> >>
> >> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pei.Chen@childrens.harvard.edu>
> >> wrote:
> >>
> >>> Hi,
> >>> I just wanted to gauge the interest of creating the next release of
> >>> cTAKES (3.1) which is currently marked for May in Jira-
> >>>
> >>> There have already been 22/53 issues [1] marked as fixed or closed.
> >>> Plenty of bug fixes and new components including:
> >>> - New CEM Instance Template population
> >>> - New Dependency Parser/Semantic Role Labeler
> >>> - New optional Clear POSTagger
> >>> - New regression testing component
> >>>
> >>> Should we wait for the Temporal component?
> >>>
> >>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >>>
> >>
> >>
>
>

Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
+1 suggestion for documenting many examples of "getting started" NLP datasets. 

I have at least one we can use that was created by our lead Pathologist 
https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml

We should provide at least one sample for each domain. 
Trouble is, privacy requires that these examples be made up by hand and not copy-pasted from EMR systems. 

--Andy 

On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <gi...@gmail.com> wrote:

> +1 for this observation Andy!
> 
> Lowering the time to get started will motivate users to write blogs about
> features, how-tos, etc., which reduces the core team's documentation
> workload.
> 
> I have been trying to write a small "how to write a standalone client for
> cTAKES" guide from my own experience (I saw at least 4 users post similar
> questions in the last 2 months), but I have not had enough time. Because
> cTAKES depends on a lot of other frameworks (uimaFIT, ClearTK, the UIMA
> framework, etc.), most of my spare time is spent juggling these
> frameworks, posting on and browsing their forums, and relating what I
> find to the cTAKES code. I think we need some high-level documentation
> about these dependencies (with links to the corresponding forums).
> 
> The above is the developer's case (I think developers will make up more
> of the user base as cTAKES progresses). For users, I think the
> documentation is a lot better, though some improvements are still needed.
> 
> As a developer, I found the lack of sample training data tough (I am
> still struggling in this area even though I browsed all the relevant
> code), even though the training classes are there. I understand that
> there are licensing issues with REAL data, but at least some handmade
> example sentences, which may not be real, would help developers
> understand the type/structure of input the TRAINING classes expect. That
> way, people who browse the code can reverse engineer it and develop their
> own models. Sorry if you guys feel this is a novice issue, but I feel
> most developers will be novices when they adopt a system, and Machine
> Learning/NLP is an ocean. Some documentation in this area will save us a
> lot of time.
> 
> I hope there will be some activity in this area from the cTAKES core team.
> 
> Thank you,
> Giri
> 
> 
> 
> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mc...@gmail.com> wrote:
> 
>> ctakes is at a point where we have a LOT of features but it is still hard
>> to get started.
>> 
>> Judging from the mailing lists a lot of how cTakes works is not obvious
>> and requires hand holding.
>> This is very typical in early FOSS projects.
>> 
>> Lowering the time to get invested in ctakes gets more users AND better bug
>> reports, FAQ, etc.
>> 
>> thoughts?
>> --Andy
>> 
>> 
>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pe...@childrens.harvard.edu>
>> wrote:
>> 
>>> Hi,
>>> I just wanted to gauge the interest of creating the next release of
>>> cTAKES (3.1) which is currently marked for May in Jira-
>>> 
>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>> Plenty of bug fixes and new components including:
>>> - New CEM Instance Template population
>>> - New Dependency Parser/Semantic Role Labeler
>>> - New optional Clear POSTagger
>>> - New regression testing component
>>> 
>>> Should we wait for the Temporal component?
>>> 
>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>> 
>> 
>> 


Re: Next cTAKES release (3.1)?

Posted by Girivaraprasad Nambari <gi...@gmail.com>.
+1 for this observation Andy!

Lowering the time to get started will motivate users to write blogs about
features, how-tos, etc., which reduces the core team's documentation
workload.

I have been trying to write a small "how to write a standalone client for
cTAKES" guide from my own experience (I saw at least 4 users post similar
questions in the last 2 months), but I have not had enough time. Because
cTAKES depends on a lot of other frameworks (uimaFIT, ClearTK, the UIMA
framework, etc.), most of my spare time is spent juggling these
frameworks, posting on and browsing their forums, and relating what I find
to the cTAKES code. I think we need some high-level documentation about
these dependencies (with links to the corresponding forums).

The above is the developer's case (I think developers will make up more of
the user base as cTAKES progresses). For users, I think the documentation
is a lot better, though some improvements are still needed.

As a developer, I found the lack of sample training data tough (I am still
struggling in this area even though I browsed all the relevant code), even
though the training classes are there. I understand that there are
licensing issues with REAL data, but at least some handmade example
sentences, which may not be real, would help developers understand the
type/structure of input the TRAINING classes expect. That way, people who
browse the code can reverse engineer it and develop their own models.
Sorry if you guys feel this is a novice issue, but I feel most developers
will be novices when they adopt a system, and Machine Learning/NLP is an
ocean. Some documentation in this area will save us a lot of time.

I hope there will be some activity in this area from the cTAKES core team.

Thank you,
Giri



On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <mc...@gmail.com> wrote:

> ctakes is at a point where we have a LOT of features but it is still hard
> to get started.
>
> Judging from the mailing lists a lot of how cTakes works is not obvious
> and requires hand holding.
> This is very typical in early FOSS projects.
>
> Lowering the time to get invested in ctakes gets more users AND better bug
> reports, FAQ, etc.
>
> thoughts?
> --Andy
>
>
> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pe...@childrens.harvard.edu>
> wrote:
>
> > Hi,
> > I just wanted to gauge the interest of creating the next release of
> > cTAKES (3.1) which is currently marked for May in Jira-
> >
> > There have already been 22/53 issues [1] marked as fixed or closed.
> > Plenty of bug fixes and new components including:
> > - New CEM Instance Template population
> > - New Dependency Parser/Semantic Role Labeler
> > - New optional Clear POSTagger
> > - New regression testing component
> >
> > Should we wait for the Temporal component?
> >
> > [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
> >
>
>

Re: Next cTAKES release (3.1)?

Posted by Andy McMurry <mc...@gmail.com>.
ctakes is at a point where we have a LOT of features but it is still hard to get started. 

Judging from the mailing lists a lot of how cTakes works is not obvious and requires hand holding. 
This is very typical in early FOSS projects. 

Lowering the time to get invested in ctakes gets more users AND better bug reports, FAQ, etc. 

thoughts? 
--Andy 


On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <Pe...@childrens.harvard.edu> wrote:

> Hi,
> I just wanted to gauge the interest of creating the next release of cTAKES (3.1) which is currently marked for May in Jira-
> 
> There have already been 22/53 issues [1] marked as fixed or closed.  Plenty of bug fixes and new components including:
> - New CEM Instance Template population
> - New Dependency Parser/Semantic Role Labeler
> - New optional Clear POSTagger
> - New regression testing component
> 
> Should we wait for the Temporal component?
> 
> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>