Posted to dev@ctakes.apache.org by Hadrian Zbarcea <hz...@gmail.com> on 2017/06/22 19:14:12 UTC

Proposed improvements

Last week I presented at the OSEHRA Summit about ActiveMQ (and a few 
other projects) and the ASF in general.

I was surprised that most people didn't know much about the ASF and, 
more importantly, that nobody knew about cTakes, the only (directly) 
healthcare-related project at the ASF. There was no cTakes talk at 
ApacheCon in Miami, but at OSEHRA, which is all about healthcare, we 
should have had a presence. I will probably submit a talk for next year, 
but until then, because I think I created a bit of interest in cTakes, I 
went to build cTakes myself and try a few things.

Some of my findings are:
* test failures with openjdk; granted, the docs mention oracle jdk as a 
prerequisite, but I think it would be easy to support openjdk
* use of svn vs git; this is a debatable topic, but by now everybody and 
their uncle is on git, so moving to git (which I'd recommend) would 
probably foster adoption (yes, I know about the github mirror)
* no support for OSGi; many large players use it
* improvements in logging could go a long way, starting with moving to slf4j

Suggesting improvements implies that I volunteer to do a good chunk of 
the work, but before that I'm more interested in how much the community 
would welcome such improvements. I am curious which of these are 
considered low-hanging fruit; the more controversial topics we could 
take to [discuss] threads. Because every community has its own culture 
and I am not that familiar with the cTakes one, although I went through 
the mail archives, I thought a prudent first step would be to start with this.

Feedback appreciated,
Hadrian

Re: Proposed improvements [EXTERNAL]

Posted by Peter Szolovits <ps...@mit.edu>.
I’m not sure I have a comprehensive picture of how lvg is being used by various cTakes modules, but my experience is that the most common usage is the norm program (a specific set of flags on lvg), which normalizes phrases to a format that supports matching to strings in MRCONSO.  This allows one to use indexes such as mrxns_eng and mrxnw_eng in the UMLS distribution to match strings to CUIs efficiently.  Are there aspects of cTakes that depend on lvg for anything other than norm?

Would it be reasonable to try two approaches to simplifying this:

1. Rewrite norm as an equivalent but faster and more compact implementation by eliminating the other options in lvg?  This would preserve the ability to use the indexes pre-computed by NLM that are part of the UMLS distribution.  However, even small discrepancies between the existing lvg-based implementation and a new one would create changes in matching.

2. Create a new normalizer that is not necessarily equivalent to the existing norm, and then require people who want to use it to run a process over a newly-installed UMLS that creates new index tables based on this normalizer.  Matching performance using this method would clearly change from the current one, but one could imagine that it might be better.  For example, instead of using only Specialist lexical tools, one could incorporate recent advances in vector space embedding and other neural net methods.
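
As a rough illustration of option 2, the sketch below pairs a deliberately simple normalizer with an index built from it. Everything in it is an assumption made for illustration: the class name SimpleNormIndex, the normalization rules (lowercase, replace punctuation with spaces, sort the words), and the made-up CUI used in the usage note. It is not equivalent to NLM's norm, which as far as I know also uninflects words and drops stop words, so an index built this way would not be interchangeable with the pre-computed mrxns_eng table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: a simplified normalizer plus an index keyed by normalized strings. */
final class SimpleNormIndex {

    // normalized string -> concept identifiers (CUIs) whose terms normalize to it
    private final Map<String, List<String>> index = new HashMap<>();

    /** Lowercase, replace runs of non-alphanumerics with a space, then sort the words. */
    static String normalize(String phrase) {
        String cleaned = phrase.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return "";
        }
        String[] words = cleaned.split(" ");
        Arrays.sort(words);
        return String.join(" ", words);
    }

    /** Index one term under its normalized form (the step that would run over a new UMLS install). */
    void add(String term, String cui) {
        index.computeIfAbsent(normalize(term), k -> new ArrayList<>()).add(cui);
    }

    /** Look up candidate CUIs for free text by normalizing it the same way as the indexed terms. */
    List<String> lookup(String text) {
        return index.getOrDefault(normalize(text), Collections.<String>emptyList());
    }
}
```

With this sketch, a term added as "Lung Cancer" under a placeholder CUI such as "C0000001" would be found by a lookup for "cancer, lung", because both normalize to the same key. The point is only that the index and the lookup must share one normalizer; which normalizer gives the best matching is exactly the open question.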

—Peter Szolovits





RE: Proposed improvements [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Hadrian, all,

> .. lvg ..
James Masanz has done a lot of work with keeping pace with nlm's lvg version(s) and ctakes-lvg changes and enhancements.  He is currently out-of-office but he may be one person who can work with you on this.  Of course I encourage everybody else out there to help as they can.  I am sure that some ctaker has expert knowledge of lvg.

> * I looked at the code, and I will refrain from any comment now,
Much appreciated!  Both for looking at the code and not making any comments ...

> no way this code will work in OSGi.
My thoughts exactly.

> biggest bang for the buck would come from cleaning up the architecture and dependency structure
True, true.  I created a jira item for dependency cleanup:
https://issues.apache.org/jira/browse/CTAKES-448
Please feel free to create jiras on specific items that you identify.
I agree that the overall architecture could be much better.  I hope that your emails restart some old discussions on possibilities.
  
> I'd be happy to work with (and learn from) you on it.
The collective community has knowledge on a lot of systems and architectures.  Since architecture changes are likely to be extensive, intensive and far-reaching, we need input from everybody before we embark on any long voyages.

Cheers,
Sean


Re: Proposed improvements [EXTERNAL]

Posted by Hadrian Zbarcea <hz...@gmail.com>.
Ok, so:

* I tracked down the source code for the 2016 version [1] of lvg we use. 
The source is included in the (almost) 1G .tgz; I didn't check the lite 
version.
* There is a newer 2017 version [2]. I don't know if the community wants 
to upgrade.
* I looked at the code, and I will refrain from any comment now, but one 
thing is clear: no way this code will work in OSGi.

Sean, I vastly underestimated the work required to achieve what I hoped. 
I will not back off, but there's a lot of work to be done and I am not 
even sure where to start yet. To your comment re: OSGi, the issue is 
that there are too many constraints embedded in the code: dependency on 
the file system, an embedded database, etc. In my opinion the biggest 
bang for the buck would come from cleaning up the architecture and 
dependency structure, making it more loosely coupled. I'd be happy to 
work with (and learn from) you on it.
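
To make "more loosely coupled" a bit more concrete, here is a minimal sketch of one way a file system dependency could be hidden behind an interface. All names in it (ResourceSource, DirectoryResourceSource, InMemoryResourceSource, WordListLoader) are hypothetical and invented for this illustration; none of them comes from the cTakes or lvg code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction: callers ask for a named resource, not a file path. */
interface ResourceSource {
    InputStream open(String name) throws IOException;
}

/** File-system backed implementation, for a plain JVM deployment. */
final class DirectoryResourceSource implements ResourceSource {
    private final Path root;

    DirectoryResourceSource(Path root) {
        this.root = root;
    }

    @Override
    public InputStream open(String name) throws IOException {
        return Files.newInputStream(root.resolve(name));
    }
}

/** In-memory implementation, e.g. for tests or a container that supplies the data itself. */
final class InMemoryResourceSource implements ResourceSource {
    private final Map<String, byte[]> entries = new HashMap<>();

    void put(String name, String content) {
        entries.put(name, content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public InputStream open(String name) throws IOException {
        byte[] data = entries.get(name);
        if (data == null) {
            throw new IOException("No such resource: " + name);
        }
        return new ByteArrayInputStream(data);
    }
}

/** A component that depends only on the interface, not on where the data lives. */
final class WordListLoader {
    static int countLines(ResourceSource source, String name) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.open(name), StandardCharsets.UTF_8))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

The design choice is that WordListLoader never sees a path, so the same code could run against a directory, a test fixture, or whatever a container provides. Whether this shape fits the existing modules is something I can't judge yet.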

Cheers,
Hadrian


[1] 
https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2016/web/download.html
[2] 
https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/web/release/index.html


On 06/27/2017 07:36 PM, Hadrian Zbarcea wrote:
> Speaking of lvg. Does anybody know where the source code for 
> lvgdist-2016.0.jar is?
> 
> Thanks,
> Hadrian
> 
> 
> 
> On 06/26/2017 11:04 AM, Finan, Sean wrote:
>> Hi Andrey,
>>
>> Thank you for the input.  Thank you also Hadrian.
>>
>> With regard to a smaller ctakes, I know that a couple of people 
>> (including yours truly) are currently working on trimming some fat.  A 
>> few areas have been targeted, with the old/huge umls dictionary being 
>> at the top of the list.  It is deprecated and only used in a few 
>> tests.  Lvg is also used in a few test configurations, but I am unsure 
>> of its necessity.
>>
>> As far as a "ctakes core" ... I have been trying to figure out a smart 
>> way to separate the default clinical pipeline modules from others, 
>> making the others optional.  I already have a pom for clinical that 
>> does not include relation, temporal, coref, very importantly ytex ... 
>> as those are not part of the default clinical pipeline.  One thing 
>> that has me halted is figuring out how and where to make a simple 
>> mechanism for people to grab the more advanced modules.  A while ago I 
>> put a project pom in sandbox under "ctakes the api" or something to 
>> that effect.  It is basically a pom with advanced modules commented 
>> out.  A developer could start with that pom as their project main, 
>> then uncomment modules as needed.  It was a first ten-minute attempt 
>> at something simple and, while worth a try, not an ideal solution.
>>
>> Another idea that I have been tossing around is separating tests into 
>> separate modules.  Also possibly "training" into separate modules.  It 
>> is standard practice to keep parallel src/ and test/ directories in a 
>> repository and this kind of follows that thinking.  Many of the tests 
>> (such as mentioned above) require/use modules and resources that are 
>> not actually required to build the source.  The same goes for possible 
>> examples.  I think that the same could be true for training - if not 
>> now, perhaps in the future.   Again, I am held up on the best way to 
>> actually do this, keeping things simple wrt maven and a lack of excess 
>> complexity.  The last thing that I want to do is make ctakes more 
>> difficult to use.
>>
>> Maybe osgi can help the above, but I'm honestly not sure how.  If 
>> anybody else thinks that it can then I am going to let them handle 
>> it.  Perhaps I am just jaded.  Years ago my previous company had great 
>> hopes for osgi and invested a lot of time (=money) into applying it to 
>> our applications.  Over a million dollars later, the consensus was 
>> that osgi couldn't apply to our applications without completely 
>> rewriting infrastructure - which was an absolute no-go - and that even 
>> if it could just be slapped on overnight, it would do nothing for us 
>> or our customers.
>>
>> With regard to better logging, I think that James added some more 
>> detailed logging for the 4.0 release, and I think that he has a few 
>> more areas slated.  There are more logging statements that exist at 
>> finer levels than "info" and can be seen by changing the log4j 
>> configuration.  As for changing the entire codebase to slf4j, I may be 
>> alone but I'm not sure how that alone will make ctakes any more 
>> transparent.
>>
>> With regard to ctakes setup having some quirks ... yup.  Known issue 
>> to a lot of us.  Documentation was improved for the 4.0 release, but 
>> "run anywhere" documentation is difficult to both create and 
>> maintain.  Several ideas have been tossed around including 
>> installation scripts, an "environment/setup 
>> investigation/confirmation" gui or something like a running faq/blog 
>> on nothing but installation problems and solutions.
>>
>> Sean
>>
>> -----Original Message-----
>> From: Andrey Kurdumov [mailto:kant2002@googlemail.com]
>> Sent: Sunday, June 25, 2017 1:52 AM
>> To: cTakes developers list
>> Subject: Re: Proposed improvements [EXTERNAL]
>>
>> Just want to note that the ASF PMC wants to make GitHub the primary 
>> repository and the Apache servers secondary soon.
>>
>> Regarding improvements:
>> I personally want better support for embedding. Right now the cTakes 
>> distribution comes with LVG and the UMLS dictionary, and the size of 
>> cTakes thus becomes very large.
>> I would like to have (and work on) a much leaner distribution, let's 
>> name it cTakes Core, which will just provide the cTakes executable 
>> without the need for data.
>> Right now I constantly have to rip out that data after the cTakes 
>> build, which slows down my build significantly.
>>
>> Personally I support Hadrian's initiative to have better logging, since 
>> cTakes setup has some quirks which could be resolved faster with better 
>> logging.
>>
>>
>> 2017-06-23 17:38 GMT+06:00 Miller, Timothy <
>> Timothy.Miller@childrens.harvard.edu>:
>>
>>> Thanks Hadrian, I hadn't heard of OSEHRA but it looks interesting and
>>> like something where we should be making people aware of cTAKES!
>>>
>>> svn vs. git -- I'm with you on preferring git, but not by so much that
>>> it's worth spending time on an argument if it turns into an argument
>>> :). As far as I know we've never really had a discussion about it.
>>> It's probably getting to the point where new developers have _only_
>>> used git and would find it a complete roadblock to use svn but for me
>>> it's just a mild annoyance.
>>>
>>> All others you mentioned -- if you are willing to contribute a patch
>>> we are happy to accept one-off contributions, and we are also
>>> interested in growing the developer community with people who are
>>> interested in contributing regularly over time.
>>>
>>> Tim
>>>

Re: Proposed improvements [EXTERNAL]

Posted by Hadrian Zbarcea <hz...@gmail.com>.
Speaking of lvg. Does anybody know where the source code for 
lvgdist-2016.0.jar is?

Thanks,
Hadrian



On 06/26/2017 11:04 AM, Finan, Sean wrote:
> Hi Andrey,
> 
> Thank you for the input.  Thank you also Hadrian.
> 
> With regard to a smaller ctakes, I know that a couple of people (including yours truly) are currently working on trimming some fat.  A few areas have been targeted, with the old/huge umls dictionary being at the top of the list.  It is deprecated and only used in a few tests.  Lvg is also used in a few test configurations, but I am unsure of its necessity.
> 
> As far as a "ctakes core" ... I have been trying to figure out a smart way to separate the default clinical pipeline modules from others, making the others optional.  I already have a pom for clinical that does not include relation, temporal, coref, very importantly ytex ... as those are not part of the default clinical pipeline.  One thing that has me halted is figuring out how and where to make a simple mechanism for people to grab the more advanced modules.  A while ago I put a project pom in sandbox under "ctakes the api" or something to that effect.  It is basically a pom with advanced modules commented out.  A developer could start with that pom as their project main, then uncomment modules as needed.  It was a first ten-minute attempt at something simple and, while worth a try, not an ideal solution.
> 
> Another idea that I have been tossing around is separating tests into separate modules.  Also possibly "training" into separate modules.  It is standard practice to keep parallel src/ and test/ directories in a repository and this kind of follows that thinking.  Many of the tests (such as those mentioned above) require/use modules and resources that are not actually required to build the source.  The same goes for possible examples.  I think that the same could be true for training - if not now, perhaps in the future.   Again, I am held up on the best way to actually do this while keeping things simple wrt maven and avoiding excess complexity.  The last thing that I want to do is make ctakes more difficult to use.
> 
> Maybe osgi can help the above, but I'm honestly not sure how.  If anybody else thinks that it can then I am going to let them handle it.  Perhaps I am just jaded.  Years ago my previous company had great hopes for osgi and invested a lot of time (=money) into applying it to our applications.  Over a million dollars later, the consensus was that osgi couldn't apply to our applications without completely rewriting infrastructure - which was an absolute no-go - and even if it could just be slapped on overnight, it would have done nothing for us or our customers.
> 
> With regard to better logging, I think that James added some more detailed logging for the 4.0 release, and I think that he has a few more areas slated.  There are more logging statements that exist at finer levels than "info" and can be seen by changing the log4j configuration.  As for changing the entire codebase to slf4j, I may be alone but I'm not sure how that alone will make ctakes any more transparent.
> 
> With regard to ctakes setup having some quirks ... yup.  Known issue to a lot of us.  Documentation was improved for the 4.0 release, but "run anywhere" documentation is difficult to both create and maintain.  Several ideas have been tossed around including installation scripts, an "environment/setup investigation/confirmation" gui or something like a running faq/blog on nothing but installation problems and solutions.
> 
> Sean
> 

RE: Proposed improvements [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Andrey, 

Thank you for the input.  Thank you also Hadrian.

With regard to a smaller ctakes, I know that a couple of people (including yours truly) are currently working on trimming some fat.  A few areas have been targeted, with the old/huge umls dictionary being at the top of the list.  It is deprecated and only used in a few tests.  Lvg is also used in a few test configurations, but I am unsure of its necessity.

As far as a "ctakes core" ... I have been trying to figure out a smart way to separate the default clinical pipeline modules from others, making the others optional.  I already have a pom for clinical that does not include relation, temporal, coref, very importantly ytex ... as those are not part of the default clinical pipeline.  One thing that has me halted is figuring out how and where to make a simple mechanism for people to grab the more advanced modules.  A while ago I put a project pom in sandbox under "ctakes the api" or something to that effect.  It is basically a pom with advanced modules commented out.  A developer could start with that pom as their project main, then uncomment modules as needed.  It was a first ten-minute attempt at something simple and, while worth a try, not an ideal solution.

Another idea that I have been tossing around is separating tests into separate modules.  Also possibly "training" into separate modules.  It is standard practice to keep parallel src/ and test/ directories in a repository and this kind of follows that thinking.  Many of the tests (such as those mentioned above) require/use modules and resources that are not actually required to build the source.  The same goes for possible examples.  I think that the same could be true for training - if not now, perhaps in the future.   Again, I am held up on the best way to actually do this while keeping things simple wrt maven and avoiding excess complexity.  The last thing that I want to do is make ctakes more difficult to use.

Maybe osgi can help the above, but I'm honestly not sure how.  If anybody else thinks that it can then I am going to let them handle it.  Perhaps I am just jaded.  Years ago my previous company had great hopes for osgi and invested a lot of time (=money) into applying it to our applications.  Over a million dollars later, the consensus was that osgi couldn't apply to our applications without completely rewriting infrastructure - which was an absolute no-go - and even if it could just be slapped on overnight, it would have done nothing for us or our customers.

With regard to better logging, I think that James added some more detailed logging for the 4.0 release, and I think that he has a few more areas slated.  There are more logging statements that exist at finer levels than "info" and can be seen by changing the log4j configuration.  As for changing the entire codebase to slf4j, I may be alone but I'm not sure how that alone will make ctakes any more transparent.
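
To illustrate the point about statements that exist at finer levels than "info", here is a minimal sketch. It deliberately uses java.util.logging from the JDK as a stand-in for log4j so that it runs without extra jars; the logger name "demo.pipeline" and both messages are made up for the example and are not taken from the ctakes codebase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogLevelDemo {

    // Runs the same two log statements under the given threshold and
    // returns the messages that actually got through.
    public static List<String> capture(Level threshold) {
        // Hypothetical logger name, not a real ctakes logger.
        Logger logger = Logger.getLogger("demo.pipeline");
        logger.setUseParentHandlers(false); // do not echo to the console
        logger.setLevel(threshold);

        List<String> seen = new ArrayList<>();
        Handler handler = new Handler() {
            @Override public void publish(LogRecord record) {
                if (isLoggable(record)) {
                    seen.add(record.getLevel() + " " + record.getMessage());
                }
            }
            @Override public void flush() { }
            @Override public void close() { }
        };
        handler.setLevel(Level.ALL); // let the logger's level decide
        logger.addHandler(handler);
        try {
            logger.info("pipeline started");          // visible at INFO
            logger.fine("lookup window = 3 tokens");  // hidden at INFO
        } finally {
            logger.removeHandler(handler);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(capture(Level.INFO));
        System.out.println(capture(Level.FINE));
    }
}
```

With the threshold at INFO only the first message is collected; lowering it to FINE lets both through without touching the code that logs. Changing the level in the log4j configuration has the same kind of effect.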

With regard to ctakes setup having some quirks ... yup.  Known issue to a lot of us.  Documentation was improved for the 4.0 release, but "run anywhere" documentation is difficult to both create and maintain.  Several ideas have been tossed around including installation scripts, an "environment/setup investigation/confirmation" gui or something like a running faq/blog on nothing but installation problems and solutions.

Sean 

-----Original Message-----
From: Andrey Kurdumov [mailto:kant2002@googlemail.com] 
Sent: Sunday, June 25, 2017 1:52 AM
To: cTakes developers list
Subject: Re: Proposed improvements [EXTERNAL]

Just want to note that the ASF PMC wants to make GitHub the primary repository and the Apache servers secondary soon.

Regarding improvements:
I personally want better support for embedding. Right now the cTakes distribution comes with LVG and the UMLS dictionary, so the size of cTakes becomes very large.
I would like to have (and work on) a much leaner distribution, let's name it cTakes Core, which would provide just the cTakes executable without the need for data.
Right now I constantly have to rip that data out after the cTakes build, which slows down my build significantly.

Personally I support Hadrian's initiative to have better logging, since the cTakes setup has some quirks that could be resolved faster with better logging.


2017-06-23 17:38 GMT+06:00 Miller, Timothy <
Timothy.Miller@childrens.harvard.edu>:

> Thanks Hadrian, I hadn't heard of OSEHRA but it looks interesting and 
> like something where we should be making people aware of cTAKES!
>
> svn vs. git -- I'm with you on preferring git, but not by so much that 
> it's worth spending time on an argument if it turns into an argument 
> :). As far as I know we've never really had a discussion about it. 
> It's probably getting to the point where new developers have _only_ 
> used git and would find it a complete roadblock to use svn but for me 
> it's just a mild annoyance.
>
> All others you mentioned -- if you are willing to contribute a patch 
> we are happy to accept one-off contributions, and we are also 
> interested in growing the developer community with people who are 
> interested in contributing regularly over time.
>
> Tim
>
> ________________________________________
> From: Hadrian Zbarcea <hz...@gmail.com>
> Sent: Thursday, June 22, 2017 9:14 PM
> To: dev@ctakes.apache.org
> Subject: Proposed improvements [EXTERNAL]
>
> Last week I presented at the OSEHRA Summit about ActiveMQ (and a few 
> other projects) and the ASF in general.
>
> I was surprised that most didn't know much about the ASF and, more 
> importantly, that nobody knew about cTakes, the only (directly) 
> healthcare-related project at the ASF. There was no cTakes talk at 
> ApacheCon in Miami, but at OSEHRA, which is all about healthcare, we 
> should have had a presence. I will probably submit a talk for next 
> year, but until then, because I think I created a bit of interest in 
> cTakes, I went to build cTakes myself and try a few things.
>
> Some of my findings are:
> * test failures with OpenJDK; granted, the docs mention the Oracle JDK as a 
> prerequisite, but I think it would be easy to support OpenJDK
> * use of svn vs git; this is a debatable topic, but by now everybody 
> and their uncle is on git, so moving to git (which I'd recommend) 
> would probably foster adoption (yes, I know about the GitHub mirror)
> * no support for OSGi, which many large players use
> * improvements in logging could go a long way, starting with moving to 
> slf4j
>
> Suggesting improvements implies that I volunteer to do a good chunk of 
> the work, but before that I'm more interested in how much the 
> community would welcome such improvements. I am curious which are 
> considered the lower-hanging fruit; the more controversial topics 
> we could take to [discuss] threads. Because every community has 
> its own culture and I am not that familiar with the cTakes one, 
> although I went through the mail archives, I thought a prudent first step would be to start with this.
>
> Feedback appreciated,
> Hadrian
>

Re: Proposed improvements [EXTERNAL]

Posted by Hadrian Zbarcea <hz...@gmail.com>.
Thanks for the encouraging replies. Let's see where this will go.

Cheers,
Hadrian


Re: Proposed improvements [EXTERNAL] [SUSPICIOUS]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Yeah, actually, I have no idea why that's there. All the actual default parser models are in their own directories (dependency, srl, etc.). This almost looks like just a collection of additional models, which the average user would have no idea how to use and which take up a lot of space.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Tuesday, June 27, 2017 10:07 PM
To: dev@ctakes.apache.org
Subject: RE: Proposed improvements [EXTERNAL] [SUSPICIOUS]

Hi all,

> I would like to have (and work on) a much leaner distribution
One bigfoot is the clearparser_models.jar in ctakes-dependency-parser-res.  As far as I know this is not used by default or in any checked-in non-default configuration.  As it is 1/4 GB, I would like to move it to its own module to keep it out of projects that use ctakes "as a library".  I hunted the net to see if a duplicate is available elsewhere for alternative inclusion methods but couldn't find one.

Thoughts?

Thanks,
Sean


RE: Proposed improvements [EXTERNAL] [SUSPICIOUS]

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
Good dependency parsers are hard to find; moreover, good dependency parsers trained on clinical data are impossible to find. I don't think there is another dependency parser trained on clinical data other than cTAKES's. In general, state-of-the-art dependency parsing is associated with resource-intensive computing, and the models are also of a fair size.
--
Guergana Savova, PhD, FACMI
Associate Professor
PI Natural Language Processing Lab
Boston Children's Hospital and Harvard Medical School
300 Longwood Avenue
Mailstop: BCH3092
Enders 144.1
Boston, MA 02115
Tel: (617) 919-2972
Fax: (617) 730-0817
Guergana.Savova@childrens.harvard.edu
Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
http://ctakes.apache.org  
http://thyme.healthnlp.org 
http://cancer.healthnlp.org 
http://share.healthnlp.org
http://center.healthnlp.org  

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Tuesday, June 27, 2017 4:07 PM
To: dev@ctakes.apache.org
Subject: RE: Proposed improvements [EXTERNAL] [SUSPICIOUS]

Hi all,

> I would like to have (and work on) a much leaner distribution
One bigfoot is the clearparser_models.jar in ctakes-dependency-parser-res.  As far as I know this is not used by default or in any checked-in non-default configuration.  As it is 1/4 GB, I would like to move it to its own module to keep it out of projects that use ctakes "as a library".  I hunted the net to see if a duplicate is available elsewhere for alternative inclusion methods but couldn't find one.

Thoughts?

Thanks,
Sean


RE: Proposed improvements [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi all,

> I would like to have (and work on) a much leaner distribution
One bigfoot is the clearparser_models.jar in ctakes-dependency-parser-res.  As far as I know this is not used by default or in any checked-in non-default configuration.  As it is 1/4 GB, I would like to move it to its own module to keep it out of projects that use ctakes "as a library".  I hunted the net to see if a duplicate is available elsewhere for alternative inclusion methods but couldn't find one.

Thoughts?

Thanks,
Sean

-----Original Message-----
From: Andrey Kurdumov [mailto:kant2002@googlemail.com] 
Sent: Sunday, June 25, 2017 1:52 AM
To: cTakes developers list
Subject: Re: Proposed improvements [EXTERNAL]

Just want to note that the ASF PMC wants to make GitHub the primary repository and the Apache servers secondary soon.

Regarding improvements:
I personally want better support for embedding. Right now the cTakes distribution comes with LVG and the UMLS dictionary, so the size of cTakes becomes very large.
I would like to have (and work on) a much leaner distribution, let's name it cTakes Core, which would provide just the cTakes executable without the need for data.
Right now I constantly have to rip that data out after the cTakes build, which slows down my build significantly.

Personally I support Hadrian's initiative to have better logging, since the cTakes setup has some quirks that could be resolved faster with better logging.


2017-06-23 17:38 GMT+06:00 Miller, Timothy <
Timothy.Miller@childrens.harvard.edu>:

> Thanks Hadrian, I hadn't heard of OSEHRA but it looks interesting and 
> like something where we should be making people aware of cTAKES!
>
> svn vs. git -- I'm with you on preferring git, but not by so much that 
> it's worth spending time on an argument if it turns into an argument 
> :). As far as I know we've never really had a discussion about it. 
> It's probably getting to the point where new developers have _only_ 
> used git and would find it a complete roadblock to use svn but for me 
> it's just a mild annoyance.
>
> All others you mentioned -- if you are willing to contribute a patch 
> we are happy to accept one-off contributions, and we are also 
> interested in growing the developer community with people who are 
> interested in contributing regularly over time.
>
> Tim
>
> ________________________________________
> From: Hadrian Zbarcea <hz...@gmail.com>
> Sent: Thursday, June 22, 2017 9:14 PM
> To: dev@ctakes.apache.org
> Subject: Proposed improvements [EXTERNAL]
>
> Last week I presented at the OSEHRA Summit about ActiveMQ (and a few 
> other projects) and the ASF in general.
>
> I was surprised that most didn't know much about the ASF and more 
> importantly that nobody knew about cTakes, the only (directly) 
> healthcare related project at the ASF. There was no cTakes talk at 
> ApacheCon in Miami, but at OSEHRA, which is all about healthcare we 
> should have had a presence. I will probably submit a talk for next 
> year, but until then, because I think I created a bit of interest in 
> cTakes I went to build cTakes myself and try a few things.
>
> Some of my findings are:
> * test failures with openjdk; granted the docs mention oracle jdk as a 
> prerequisite, but think it's easy to support openjdk
> * use of svn vs git; this is a debatable topic, but by now everybody 
> and their uncles are on git so moving to git (which I'd recommend) 
> would probably forster adoption (yes, I know about the github mirror)
> * no support for OSGi, many large players use it
> * improvements in logging could go a long way, starting with moving to 
> slf4j
>
> Suggesting improvements imply that I volunteer to do a good chunk of 
> the work, but before that I'm interested more in how much the 
> community would welcome such improvements. I am curious what are 
> considered more low hanging fruits, for the more controversial topics 
> we could take them to [discuss] threads. Because every community has 
> its own culture and I am not that familiar with the cTakes one, 
> although I went through the mail archives, I thought a prudent first step would be to start with this.
>
> Feedback appreciated,
> Hadrian
>

Re: Proposed improvements [EXTERNAL]

Posted by Andrey Kurdumov <ka...@googlemail.com>.
Just want to note that the ASF PMC wants to make GitHub the primary
repository and the Apache servers secondary soon.

Regarding improvements:
I personally want better support for embedding. Right now the cTakes
distribution comes with LVG and the UMLS dictionary, so the size of cTakes
becomes very large.
I would like to have (and work on) a much leaner distribution; let's name
it cTakes Core, which would provide just the cTakes executable without the
need for data.
Right now I constantly have to strip out that data after the cTakes build,
which slows down my build significantly.

Personally, I support Hadrian's initiative for better logging, since cTakes
setup has some quirks that could be resolved faster with better logging.



Re: Proposed improvements [EXTERNAL]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Thanks Hadrian, I hadn't heard of OSEHRA but it looks interesting and like something where we should be making people aware of cTAKES!

svn vs. git -- I'm with you on preferring git, but not by so much that it's worth spending time on an argument if it turns into an argument :). As far as I know we've never really had a discussion about it. It's probably getting to the point where new developers have _only_ used git and would find it a complete roadblock to use svn but for me it's just a mild annoyance.

All others you mentioned -- if you are willing to contribute a patch we are happy to accept one-off contributions, and we are also interested in growing the developer community with people who are interested in contributing regularly over time.

Tim
