Posted to dev@ctakes.apache.org by Bruce Tietjen <br...@perfectsearchcorp.com> on 2014/10/06 22:59:54 UTC

cTakes output predictability

Since I started working with cTakes some time ago, I have found it
difficult to compare the output between subsequent runs on the same files
because annotations are often assigned different IDs, are listed in
different order, etc.

One area that seems to be a cause for at least some of these differences is
the common use of HashMap where enumerating the contents is not guaranteed
to return items in the same order they were added.

I would like to work towards addressing this issue by changing those areas
of the code where it matters to use a LinkedHashMap instead.
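
As a rough illustration of the difference, here is a minimal sketch. The Key class and the labels are made up for the example and are not taken from cTakes. It also assumes the variation comes from keys that do not override hashCode(), which is only one possible cause. HashMap makes no promise about iteration order, while LinkedHashMap iterates in insertion order by default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapOrderSketch {

    // Hypothetical key type. It does not override hashCode(), so a HashMap
    // places it by identity hash, which can change from one JVM run to the next.
    static final class Key {
        final String label;

        Key(String label) {
            this.label = label;
        }
    }

    // Collects the labels of a map's keys in the map's own iteration order.
    static List<String> labels(Map<Key, Integer> map) {
        List<String> out = new ArrayList<>();
        for (Key key : map.keySet()) {
            out.add(key.label);
        }
        return out;
    }

    // Adds one key per label to the given map, in the order given.
    static Map<Key, Integer> fill(Map<Key, Integer> map, List<String> labels) {
        for (String label : labels) {
            map.put(new Key(label), label.length());
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> added = Arrays.asList("sentence", "token", "chunk", "entity");
        // Iteration order here is unspecified and may differ between runs.
        System.out.println("HashMap:       " + labels(fill(new HashMap<>(), added)));
        // Iteration order here matches the order of insertion.
        System.out.println("LinkedHashMap: " + labels(fill(new LinkedHashMap<>(), added)));
    }
}
```

Running the sketch several times should show the LinkedHashMap line staying fixed, while the HashMap line is free to change.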

Is this something the community would be interested in and find helpful?

Thanks,

Bruce Tietjen
Perfect Search Corp.

RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Kim,

>In our testing we've found several cases where running with the same configuration outputs different data under different moons

This is a known behavior and I understand that this may be the case that started the thread.  

>>> I spent some time writing a script for diff-ing CASes
>> I urge anyone interested in comparing cTakes CASes / output to use this type of approach.  

I still stand by my original email.

> Having output that is in a predictable order makes checking to see if there are differences much cheaper when you are dealing with larger data sets.

Britt said:
> The option Sean mentioned of writing your own custom consumer (without the UIMA id that is causing your issues) should meet these needs I believe.

I agree.

Sean

-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 11:30 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Hi Sean,

Well of course that makes plenty of sense. Testing different cTakes configurations you would expect different output. In our testing we've found several cases where running with the same configuration outputs different data under different moons. Having consistent results helps us know if we've made improvements to our quality or not.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 08:50 AM, Finan, Sean wrote:
> Hi Kim,
>
> One might want to compare the Sentence detector that uses end of line characters as sentence splitters with one that does not.  Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows.
>
> Another might want to compare a note with "skin cancer" vs. one in which you replace "skin cancer" with "melanoma" just to see what the CUI differences might be.  There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs.
>
> Of course, if you are just running notes on a new moon and then again on a full moon ...
>
> Sean
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
> Sent: Tuesday, October 07, 2014 10:41 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> Sean,
>
> "...being different because of a possibly intentional difference."
>
> I would like you to elaborate a bit on what would be intentionally different between the processing of the same document multiple times. It would help my understanding of cTakes.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>> Steve Bethard wrote:
>>> I spent some time writing a script for diff-ing CASes
>> I urge anyone interested in comparing cTakes CASes / output to use this type of approach.  Comparison of program output is a post-process task, and unless absolutely necessary, code to juggle data and metadata belongs there.  Attempting to force every module past, present and future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved.  I won't get into problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.
>>
>> In addition to or instead of creating a post-processing script, one could write a new "cas-consumer" that writes output in a desired format - but this should not require changes to engines.
>>
>> "If it ain't broke, don't fix it"
>>
>> Sean
>>
>>
>> -----Original Message-----
>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>> Sent: Monday, October 06, 2014 11:23 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes output predictability
>>
>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>> <br...@perfectsearchcorp.com> wrote:
>>> Since I started working with cTakes some time ago, I have found it 
>>> difficult to compare the output between subsequent runs on the same 
>>> files because annotations are often assigned different IDs, are 
>>> listed in different order, etc.
>> At one point, I spent some time writing a script for diff-ing CASes 
>> that intended to address some of these kinds of issues. It's still 
>> here in cTAKES:
>>
>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>
>> You might see if you could use or adapt that to your needs.
>>
>> Steve


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
I think it would be helpful actually, as digging deeper into the issue
has highlighted to me a few places in the code that actually cause
inconsistent results to be returned when running the same document
through multiple times. I think having the code base be predictable will
make it easier to debug.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:58 AM, Masanz, James J. wrote:
> FWIW, I agree with Sean that comparing should be a post-processing step and trying to get UIMA internal IDs to match on subsequent runs is not worth opening the code for.
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 10:56 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear
> that there are any consequences with moving forward with changing the
> code, we would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without
>> the UIMA id that is causing your issues) should meet these needs I
>> believe. 
>>
>>   	  	  	 
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> Britt.Fitch@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>> <kim.ebert@perfectsearchcorp.com
>> <ma...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. Testing different cTakes
>>> configurations you would expect different output. In our testing we've
>>> found several cases where running with the same configuration outputs
>>> different data under different moons. Having consistent results helps us
>>> know if we've made improvements to our quality or not. Having output
>>> that is in a predictable order makes checking to see if there are
>>> differences much cheaper when you are dealing with larger data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not.  Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>> this type of approach.  Comparison of program output is a
>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>> data and metadata belongs there.  Attempts to force every module
>>>>> past, present and Future to abide by fixed orderings, enumerations
>>>>> etc. is not as simple a task as one might initially think -
>>>>> especially if third-party libraries are involved.  I won't get into
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one
>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>> files because annotations are often assigned different IDs, are
>>>>>> listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>> that intended to address some of these kinds of issues. It's still
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>>> /CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
> .
>


RE: cTakes output predictability

Posted by "Masanz, James J." <Ma...@mayo.edu>.
FWIW, I agree with Sean that comparing should be a post-processing step and trying to get UIMA internal IDs to match on subsequent runs is not worth opening the code for.
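
To make the post-processing idea concrete, here is a minimal sketch of comparing two runs without relying on IDs or listing order. The Row class is a made-up flat view of an annotation, not a cTakes or UIMA type, and how rows would be read out of a CAS is left open. Each row is rendered without its run-specific id and the rendered lines are sorted before comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CanonicalDiffSketch {

    // Hypothetical flat view of one annotation; not an actual cTakes or UIMA type.
    static final class Row {
        final int id;       // run-specific identifier, ignored when comparing
        final String type;
        final int begin;
        final int end;

        Row(int id, String type, int begin, int end) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Renders each row without its id, then sorts the lines, so that neither
    // the id nor the original listing order affects the result.
    static List<String> canonicalize(List<Row> rows) {
        List<String> lines = new ArrayList<>();
        for (Row row : rows) {
            // Zero-padded offsets make the lexical sort agree with numeric order.
            lines.add(String.format("%08d-%08d %s", row.begin, row.end, row.type));
        }
        Collections.sort(lines);
        return lines;
    }

    // Two runs count as the same if their canonical line lists are equal.
    static boolean sameContent(List<Row> runA, List<Row> runB) {
        return canonicalize(runA).equals(canonicalize(runB));
    }

    public static void main(String[] args) {
        List<Row> runA = Arrays.asList(new Row(12, "Sentence", 0, 40), new Row(13, "Token", 0, 4));
        List<Row> runB = Arrays.asList(new Row(97, "Token", 0, 4), new Row(95, "Sentence", 0, 40));
        System.out.println("same content: " + sameContent(runA, runB));
    }
}
```

The same canonical lines could also be written to a file per run and compared with an ordinary diff tool.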

-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:56 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear
that there are any consequences with moving forward with changing the
code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without
> the UIMA id that is causing your issues) should meet these needs I
> believe. 
>
>   	  	  	 
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert
> <kim.ebert@perfectsearchcorp.com
> <ma...@perfectsearchcorp.com>> wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes
>> configurations you would expect different output. In our testing we've
>> found several cases where running with the same configuration outputs
>> different data under different moons. Having consistent results helps us
>> know if we've made improvements to our quality or not. Having output
>> that is in a predictable order makes checking to see if there are
>> differences much cheaper when you are dealing with larger data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>> One might want to compare the Sentence detector that uses end of line
>>> characters as sentence splitters with one that does not.  Such a
>>> change in sentence splitting would not only affect the sentence type
>>> discoveries but also practically every type that follows.
>>>
>>> Another might want to compare a note with "skin cancer" vs. one in
>>> which you replace "skin cancer" with "melanoma" just to see what the
>>> CUI differences might be.  There are changes in two words vs. one,
>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>> in CUIs.
>>>
>>> Of course, if you are just running notes on a new moon and then
>>> again on a full moon ...
>>>
>>> Sean
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> Sean,
>>>
>>> "...being different because of a possibly intentional difference."
>>>
>>> I would like you to elaborate a bit on what would be
>>> intentionally different between the processing of the same document
>>> multiple times. It would help my understanding of cTakes.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>> Steve Bethard wrote:
>>>>> I spent some time writing a script for diff-ing CASes
>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>> this type of approach.  Comparison of program output is a
>>>> post-process task, and unless absolutely necessary code to juggle
>>>> data and metadata belongs there.  Attempts to force every module
>>>> past, present and Future to abide by fixed orderings, enumerations
>>>> etc. is not as simple a task as one might initially think -
>>>> especially if third-party libraries are involved.  I won't get into
>>>> problems associated with why one is comparing output (swapped
>>>> module?) and IDs, orders etc. being different because of a possibly
>>>> intentional difference.
>>>>
>>>> In addition to or instead of creating a post-processing script, one
>>>> could write a new "cas-consumer" that writes output in a desired
>>>> format - but this should not require changes to engines.
>>>>
>>>> "If it ain't broke, don't fix it"
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>> <br...@perfectsearchcorp.com> wrote:
>>>>> Since I started working with cTakes some time ago, I have found it
>>>>> difficult to compare the output between subsequent runs on the same
>>>>> files because annotations are often assigned different IDs, are
>>>>> listed in different order, etc.
>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>> that intended to address some of these kinds of issues. It's still
>>>> here in cTAKES:
>>>>
>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>> /CompareFeatureStructures.java
>>>>
>>>> You might see if you could use or adapt that to your needs.
>>>>
>>>> Steve
>>
>


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
It concerns me a bit that making the code return consistent results would
be so concerning. This should be the default mode of operation.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:59 AM, britt fitch wrote:
> I think changing the code raises at least some concerns of affecting
> others, while adding a custom consumer raises zero. Given how easy it
> is to write a custom consumer, that is my vote. 
>
>   	  	  	 
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:56 AM, Kim Ebert
> <kim.ebert@perfectsearchcorp.com
> <ma...@perfectsearchcorp.com>> wrote:
>
>> I think we may really prefer the first method. Since it doesn't appear
>> that there are any consequences with moving forward with changing the
>> code, we would really like to move forward with this approach.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>> The option Sean mentioned of writing your own custom consumer (without
>>> the UIMA id that is causing your issues) should meet these needs I
>>> believe. 
>>>
>>>       
>>>
>>> Britt Fitch
>>> Wired Informatics
>>> 265 Franklin St Ste 1702
>>> Boston, MA 02110
>>> http://wiredinformatics.com
>>> Britt.Fitch@wiredinformatics.com
>>>
>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>>> <kim.ebert@perfectsearchcorp.com
>>> <ma...@perfectsearchcorp.com>> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Well of course that makes plenty of sense. Testing different cTakes
>>>> configurations you would expect different output. In our testing we've
>>>> found several cases where running with the same configuration outputs
>>>> different data under different moons. Having consistent results
>>>> helps us
>>>> know if we've made improvements to our quality or not. Having output
>>>> that is in a predictable order makes checking to see if there are
>>>> differences much cheaper when you are dealing with larger data sets.
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>> Hi Kim,
>>>>>
>>>>> One might want to compare the Sentence detector that uses end of line
>>>>> characters as sentence splitters with one that does not.  Such a
>>>>> change in sentence splitting would not only affect the sentence type
>>>>> discoveries but also practically every type that follows.
>>>>>
>>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>>> CUI differences might be.  There are changes in two words vs. one,
>>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>>> in CUIs.
>>>>>
>>>>> Of course, if you are just running notes on a new moon and then
>>>>> again on a full moon ...
>>>>>
>>>>> Sean
>>>>>
>>>>> -----Original Message-----
>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> Sean,
>>>>>
>>>>> "...being different because of a possibly intentional difference."
>>>>>
>>>>> I would like you to elaborate a bit on what would be
>>>>> intentionally different between the processing of the same document
>>>>> multiple times. It would help my understanding of cTakes.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>> Steve Bethard wrote:
>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>>> this type of approach.  Comparison of program output is a
>>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>>> data and metadata belongs there.  Attempts to force every module
>>>>>> past, present and Future to abide by fixed orderings, enumerations
>>>>>> etc. is not as simple a task as one might initially think -
>>>>>> especially if third-party libraries are involved.  I won't get into
>>>>>> problems associated with why one is comparing output (swapped
>>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>>> intentional difference.
>>>>>>
>>>>>> In addition to or instead of creating a post-processing script, one
>>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>>> format - but this should not require changes to engines.
>>>>>>
>>>>>> "If it ain't broke, don't fix it"
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>>> files because annotations are often assigned different IDs, are
>>>>>>> listed in different order, etc.
>>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>>> that intended to address some of these kinds of issues. It's still
>>>>>> here in cTAKES:
>>>>>>
>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>>>> /CompareFeatureStructures.java
>>>>>>
>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>
>>>>>> Steve
>


Re: cTakes output predictability

Posted by britt fitch <br...@wiredinformatics.com>.
I think changing the code raises at least some concerns of affecting others, while adding a custom consumer raises zero. Given how easy it is to write a custom consumer, that is my vote. 

 	 	 	 
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

On Oct 7, 2014, at 11:56 AM, Kim Ebert <ki...@perfectsearchcorp.com> wrote:

> I think we may really prefer the first method. Since it doesn't appear
> that there are any consequences with moving forward with changing the
> code, we would really like to move forward with this approach.
> 
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
> 
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without
>> the UIMA id that is causing your issues) should meet these needs I
>> believe. 
>> 
>>  	  	  	 
>> 
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> Britt.Fitch@wiredinformatics.com
>> 
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>> <kim.ebert@perfectsearchcorp.com
>> <ma...@perfectsearchcorp.com>> wrote:
>> 
>>> Hi Sean,
>>> 
>>> Well of course that makes plenty of sense. Testing different cTakes
>>> configurations you would expect different output. In our testing we've
>>> found several cases where running with the same configuration outputs
>>> different data under different moons. Having consistent results helps us
>>> know if we've made improvements to our quality or not. Having output
>>> that is in a predictable order makes checking to see if there are
>>> differences much cheaper when you are dealing with larger data sets.
>>> 
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>> 
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>> 
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not.  Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>> 
>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>> in CUIs.
>>>> 
>>>> Of course, if you are just running notes on a new moon and then
>>>> again on a full moon ...
>>>> 
>>>> Sean
>>>> 
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>> 
>>>> Sean,
>>>> 
>>>> "...being different because of a possibly intentional difference."
>>>> 
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document
>>>> multiple times. It would help my understanding of cTakes.
>>>> 
>>>> Thanks,
>>>> 
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>> 
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>> this type of approach.  Comparison of program output is a
>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>> data and metadata belongs there.  Attempts to force every module
>>>>> past, present and Future to abide by fixed orderings, enumerations
>>>>> etc. is not as simple a task as one might initially think -
>>>>> especially if third-party libraries are involved.  I won't get into
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>> intentional difference.
>>>>> 
>>>>> In addition to or instead of creating a post-processing script, one
>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>> format - but this should not require changes to engines.
>>>>> 
>>>>> "If it ain't broke, don't fix it"
>>>>> 
>>>>> Sean
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>> 
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>> difficult to compare the output between subsequent runs on the same
>>>>>> files because annotations are often assigned different IDs, are
>>>>>> listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>> that intended to address some of these kinds of issues. It's still
>>>>> here in cTAKES:
>>>>> 
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>>> /CompareFeatureStructures.java
>>>>> 
>>>>> You might see if you could use or adapt that to your needs.
>>>>> 
>>>>> Steve


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi Sean,

Yes, I mean actual type values not matching.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 10:46 AM, Finan, Sean wrote:
> Hi Kim,
>
>> It concerns me a bit that making the code return consistent results would be so concerning.
> Could you please clarify what you mean by "consistent results"?  Do you mean ordering and IDs or are you talking about actual type values not matching?
>
>> This should be the default mode of operation.
> Depending upon what you meant above, I may agree or disagree.
>
>> Since it doesn't appear that there are any consequences with moving forward with changing the code
> Why do you say this?  
>
> I think that there may be more required changes than you realize.  Every insertion into the CAS must be of ordered data.  This means that, for instance, named entities discovered by dictionary will need to be inserted in some predictable order, such as by alphabetized cui per every alphabetized tui (and other code) per ordered text span.  You will need to check and recheck every point at which the CAS is modified by every module.  Right now there are at least three or four places in two cTakes dictionary modules where a change would be required - and that doesn't include YTEX lookup.
>
> If you really feel strongly about this and are going to change cTakes code, then I suggest (at the risk of sounding like a complete jerk) that you also consider the following:
> 1.  Don't check anything into trunk until all is well with your changes and tests
> Just in case you abandon the effort
> 2.  Write unit tests for every change   
> True, switching a HashMap to a LinkedHashMap shouldn't break anything, but the tests are good to have, and may prevent others in the future from switching back to a non-linked map or any unordered collection (set not list, etc.).  A test also makes a better place for explanation in Javadoc than inline comments above the code.
> 3.  Run memory requirement tests before all of your changes and then again after your changes
> I'm actually curious about how much memory might be eaten with linkages everywhere
> 4.  Run performance (speed) tests before and after
> On a large corpus to ensure that garbage collection is involved
> 5.  Do the above with every combination possible in current workflows: every combination of available sentence detector, pos tagger, smoking status detector, dictionary lookup, cas consumer, etc.
> As soon as somebody says "all output is consistently ordered between runs" it had better be so for every possible workflow
> 6.  Write system tests to ensure ordered/predicted outputs with each combination
> Otherwise somebody may break it
> 7.  Document the what, how, and why for future development
> Otherwise somebody won't know to stick to the new rules
> 8.  Assist anybody as needed that in the future breaks one of these unit or system tests with a fix or new feature
> By mandating such a rule you are assuming responsibility for it
> 9.  Assist anybody as needed that in the future adds a new module or workflow to cTakes to abide by the ordering requirement
> By mandating such a rule you are assuming responsibility for it
> 10.  Assist anybody as needed that in the future adds a new module or workflow to add system tests to ensure maintenance of the ordering requirement
> By mandating such a rule you are assuming responsibility for it
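
The insertion ordering described in the message above (text span first, then TUI, then CUI) could be sketched roughly as follows. The Hit class and the TUI/CUI strings are made-up placeholders rather than cTakes types or real codes, and the step that actually adds entries to the CAS is omitted. The sketch only shows one way to fix the order before insertion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

final class InsertionOrderSketch {

    // Hypothetical stand-in for one dictionary hit; not a cTakes class.
    static final class Hit {
        final int begin;
        final int end;
        final String tui;
        final String cui;

        Hit(int begin, int end, String tui, String cui) {
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }
    }

    // Text span first (begin, then end), then TUI, then CUI.
    static final Comparator<Hit> ORDER =
            Comparator.comparingInt((Hit h) -> h.begin)
                    .thenComparingInt(h -> h.end)
                    .thenComparing(h -> h.tui)
                    .thenComparing(h -> h.cui);

    // Returns a sorted copy; a caller would then insert the hits in this order.
    static List<Hit> ordered(Collection<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(ORDER);
        return sorted;
    }

    // Convenience view used for printing and for checking the order.
    static List<String> cuis(List<Hit> hits) {
        List<String> out = new ArrayList<>();
        for (Hit hit : hits) {
            out.add(hit.cui);
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder codes, listed in an arbitrary discovery order.
        List<Hit> found = Arrays.asList(
                new Hit(10, 20, "T002", "C0000003"),
                new Hit(0, 5, "T001", "C0000002"),
                new Hit(10, 20, "T001", "C0000004"),
                new Hit(10, 20, "T001", "C0000001"));
        System.out.println(cuis(ordered(found)));
    }
}
```

A unit test along the lines of item 2 could feed the same hits in several shuffled orders and check that ordered() always returns the same sequence.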
>
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 11:57 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without 
>> the UIMA id that is causing your issues) should meet these needs I 
>> believe.
>>
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> Britt.Fitch@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>> <kim.ebert@perfectsearchcorp.com 
>> <ma...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. Testing different cTakes 
>>> configurations you would expect different output. In our testing 
>>> we've found several cases where running with the same configuration 
>>> outputs different data under different moons. Having consistent 
>>> results helps us know if we've made improvements to our quality or 
>>> not. Having output that is in a predictable order makes checking to 
>>> see if there are differences much cheaper when you are dealing with larger data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line 
>>>> characters as sentence splitters with one that does not.  Such a 
>>>> change in sentence splitting would not only affect the sentence type 
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in 
>>>> which you replace "skin cancer" with "melanoma" just to see what the 
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes 
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then 
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be 
>>>> intentionally different between the processing of the same document 
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use 
>>>>> this type of approach.  Comparison of program output is a 
>>>>> post-process task, and unless absolutely necessary code to juggle 
>>>>> data and metadata belongs there.  Attempting to force every module, 
>>>>> past, present and future, to abide by fixed orderings, enumerations 
>>>>> etc. is not as simple a task as one might initially think - 
>>>>> especially if third-party libraries are involved.  I won't get into 
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly 
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one 
>>>>> could write a new "cas-consumer" that writes output in a desired 
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it 
>>>>>> difficult to compare the output between subsequent runs on the 
>>>>>> same files because annotations are often assigned different IDs, 
>>>>>> are listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes 
>>>>> that intended to address some of these kinds of issues. It's still 
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
>


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi Bruce,

Could you send the record over that you are seeing this on?

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 11:20 AM, Bruce Tietjen wrote:
> I did not intend to step on anyone's toes.
>
> One of the reasons I proposed the changes was to try to make it extremely
> obvious when there are significant differences in output from the cTakes
> pipeline when running the same document again, and once identified, make it
> easier to identify the source of the difference.
>
> Because of the huge number of differences between the output using the
> FileWriterCasConsumer.xml, first detecting that there are significant
> differences and then identifying them for a large set of documents is a
> daunting task.
>
> The following is an example of some significant differences that I have
> detected between two subsequent runs on the same document using the current
> release of cTakes. (There are actually quite a few documents that exhibit
> this kind of behavior. This is only one example.)
>
>
> Snippet from first run:
>
>     <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
> _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
> _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
> _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
> _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>
>
> Snippet from subsequent run:
>
>     <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
> _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
> _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
> _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>
>
> Note that in the first instance, there were two MedicationMentions, but in
> the second, there is only one.
>
> Yes, everyone could write their own custom compare code, but wouldn't it be
> more valuable to the community to make that task easier?
>
> Thanks,
>
> Bruce Tietjen
>
>
>
>  [image: IMAT Solutions] <http://imatsolutions.com>
>  Bruce Tietjen
> Senior Software Engineer
> [image: Mobile:] 801.634.1547
> bruce.tietjen@imatsolutions.com
>
> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <ki...@perfectsearchcorp.com>
> wrote:
>
>> Hi Sean,
>>
>> No, you're not a jerk. These are things worth considering, and I
>> understand your concerns with touching various points of the codebase.
>>
>> I'll talk with our group over here and see where we want to go. We are
>> really interested in cTakes behaving well, so we are usually pretty
>> careful in testing our changes before committing anything.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>>> It concerns me a bit that making the code return consistent results would
>> be so concerning.
>>> Could you please clarify what you mean by "consistent results"?  Do you
>> mean ordering and IDs or are you talking about actual type values not
>> matching?
>>>> This should be the default mode of operation.
>>> Depending upon what you meant above, I may agree or disagree.
>>>
>>>> Since it doesn't appear that there are any consequences with moving
>> forward with changing the code
>>> Why do you say this?
>>>
>>> I think that there may be more required changes than you realize.  Every
>> insertion into the CAS must be of ordered data.  This means that, for
>> instance, named entities discovered by dictionary will need to be inserted
>> in some predictable order, such as by alphabetized cui per every
>> alphabetized tui (and other code) per ordered text span.  You will need to
>> check and recheck every point at which the CAS is modified by every
>> module.  Right now there are at least three or four places in two cTakes
>> dictionary modules where a change would be required - and that doesn't
>> include YTEX lookup.
>>> If you really feel strongly about this and are going to change cTakes
>> code, then I suggest (at the risk of sounding like a complete jerk) that
>> you also consider the following:
>>> 1.  Don't check anything into trunk until all is well with your changes
>> and tests
>>> Just in case you abandon the effort
>>> 2.  Write unit tests for every change
>>> True, HashMap to LinkedHashMap shouldn't break anything, but unit tests are good to
>> have, and may prevent others in the future from switching back to a
>> non-linked map or any unordered collection (set not list, etc.).  It also
>> makes a better place for explanation in Javadoc than inlines above the code.
>>> 3.  Run memory requirement tests before all of your changes and then
>> again after your changes
>>> I'm actually curious about how much memory might be eaten with linkages
>> everywhere
>>> 4.  Run performance (speed) tests before and after
>>> On a large corpus to ensure that garbage collection is involved
>>> 5.  Do the above with every combination possible in current workflows:
>> every combination of available sentence detector, pos tagger, smoking
>> status detector, dictionary lookup, cas consumer, etc.
>>> As soon as somebody says "all output is consistently ordered between
>> runs" it had better be so for every possible workflow
>>> 6.  Write system tests to ensure ordered/predicted outputs with each
>> combination
>>> Otherwise somebody may break it
>>> 7.  Document the what, how, and why for future development
>>> Otherwise somebody won't know to stick to the new rules
>>> 8.  Assist anybody as needed that in the future breaks one of these unit
>> or system tests with a fix or new feature
>>> By mandating such a rule you are assuming responsibility for it
>>> 9.  Assist anybody as needed that in the future adds a new module or
>> workflow to cTakes to abide by the ordering requirement
>>> By mandating such a rule you are assuming responsibility for it
>>> 10.  Assist anybody as needed that in the future adds a new module or
>> workflow to add system tests to ensure maintenance of the ordering
>> requirement
>>> By mandating such a rule you are assuming responsibility for it
>>>
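The ordering difference behind item 2 can be illustrated with a minimal sketch. The class and method names below (InsertionOrderDemo, keysInIterationOrder) are hypothetical and are not part of cTakes; the keys are just annotation type names borrowed from the snippets in this thread. LinkedHashMap is documented to iterate in insertion order, while HashMap makes no ordering promise.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of cTakes.
final class InsertionOrderDemo {
    // Collects the keys in whatever order the given map iterates them.
    static List<String> keysInIterationOrder(Map<String, Integer> map) {
        return new ArrayList<String>(map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order; swapping in a plain HashMap
        // here would make the printed order unspecified.
        Map<String, Integer> linked = new LinkedHashMap<String, Integer>();
        linked.put("MedicationMention", 95);
        linked.put("DiseaseDisorderMention", 0);
        linked.put("ProcedureMention", 125);
        System.out.println(keysInIterationOrder(linked));
    }
}
```

The trade-off raised in item 3 is real but bounded: each LinkedHashMap entry carries two extra references for the linked list, so the cost grows linearly with the number of entries. Whether that matters for a cTakes pipeline would need the before/after measurements suggested above.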
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> I think we may really prefer the first method. Since it doesn't appear
>> that there are any consequences with moving forward with changing the code,
>> we would really like to move forward with this approach.
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>> The option Sean mentioned of writing your own custom consumer (without
>>>> the UIMA id that is causing your issues) should meet these needs I
>>>> believe.
>>>>
>>>>
>>>>
>>>> Britt Fitch
>>>> Wired Informatics
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> http://wiredinformatics.com
>>>> Britt.Fitch@wiredinformatics.com
>>>>
>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert
>>>> <kim.ebert@perfectsearchcorp.com
>>>> <ma...@perfectsearchcorp.com>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Well of course that makes plenty of sense. Testing different cTakes
>>>>> configurations you would expect different output. In our testing
>>>>> we've found several cases where running with the same configuration
>>>>> outputs different data under different moons. Having consistent
>>>>> results helps us know if we've made improvements to our quality or
>>>>> not. Having output that is in a predictable order makes checking to
>>>>> see if there are differences much cheaper when you are dealing with
>> larger data sets.
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>> Hi Kim,
>>>>>>
>>>>>> One might want to compare the Sentence detector that uses end of line
>>>>>> characters as sentence splitters with one that does not.  Such a
>>>>>> change in sentence splitting would not only affect the sentence type
>>>>>> discoveries but also practically every type that follows.
>>>>>>
>>>>>> Another might want to compare a note with "skin cancer" vs. one in
>>>>>> which you replace "skin cancer" with "melanoma" just to see what the
>>>>>> CUI differences might be.  There are changes in two words vs. one,
>>>>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>>>>> in CUIs.
>>>>>>
>>>>>> Of course, if you are just running notes on a new moon and then
>>>>>> again on a full moon ...
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> "...being different because of a possibly intentional difference."
>>>>>>
>>>>>> I would like you to elaborate a bit on what would be
>>>>>> intentionally different between the processing of the same document
>>>>>> multiple times. It would help my understanding of cTakes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>> Steve Bethard wrote:
>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>>>>> this type of approach.  Comparison of program output is a
>>>>>>> post-process task, and unless absolutely necessary code to juggle
>>>>>>> data and metadata belongs there.  Attempting to force every module,
>>>>>>> past, present and future, to abide by fixed orderings, enumerations
>>>>>>> etc. is not as simple a task as one might initially think -
>>>>>>> especially if third-party libraries are involved.  I won't get into
>>>>>>> problems associated with why one is comparing output (swapped
>>>>>>> module?) and IDs, orders etc. being different because of a possibly
>>>>>>> intentional difference.
>>>>>>>
>>>>>>> In addition to or instead of creating a post-processing script, one
>>>>>>> could write a new "cas-consumer" that writes output in a desired
>>>>>>> format - but this should not require changes to engines.
>>>>>>>
>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>> To: dev@ctakes.apache.org
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>>>> Since I started working with cTakes some time ago, I have found it
>>>>>>>> difficult to compare the output between subsequent runs on the
>>>>>>>> same files because annotations are often assigned different IDs,
>>>>>>>> are listed in different order, etc.
>>>>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>>>>> that intended to address some of these kinds of issues. It's still
>>>>>>> here in cTAKES:
>>>>>>>
>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>
>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>
>>>>>>> Steve
>>


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi Sean,

Alright, it seems that rather than doing the sorted approach, we want to
manage these individually. I'll create tickets on all of the items we
have found so far. This is just one example. Then maybe we can move our
discussion of how to solve each one to discussions around that ticket
instead of this really long email thread.

I just wanted to check which way we wanted to go on these.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 03:07 PM, Finan, Sean wrote:
> Hi Kim,
>
> Great Catch!
>
> I think that by now this thread may be discarded by most as spam.  So, I'm back (apologies - I know that you are tired of me by now).
>
> I checked the code that you pointed to ...  I really dislike looking at older cTakes code because I'm filled with an overwhelming urge to refactor.
>
> If I understand the code correctly (it could use some doc), it runs negation engines and then if any negation exists it creates a single hit signifying negation.  Like a heavyweight Boolean.   Unfortunately, as you know, because Collection "s"  is a Set and it throws in the first token to come along ...  
>
> An isolated change here would probably be better than going through the entire code base and switching to LinkedHashMaps, Lists, etc. - plus it would fix your problem.
>
> You could (for reuse by others, assuming that one doesn't already exist) create a singleton BaseTokenComparator implements Comparator<BaseToken>  with something like:
>    public int compare( final BaseToken textSpan1, final BaseToken textSpan2 ) {
>       if ( textSpan1.getStartOffset() != textSpan2.getStartOffset() ) {
>          return textSpan1.getStartOffset() - textSpan2.getStartOffset();
>       }
>       return textSpan1.getEndOffset() - textSpan2.getEndOffset();
>    }
>
> And in NegationContextAnalyzer line ~48
> final List<NegationIndicator> negatorsList = new ArrayList<NegationIndicator>( _negIndicatorFSM.execute(fsmTokenList) );
> if ( !negatorsList.isEmpty() ) {
> 	Collections.sort( negatorsList, BaseTokenComparator.getInstance() );
> 	return new ContextHit( negatorsList.get(0).getStartOffset(), negatorsList.get(0).getEndOffset() );
> }
>
> Or you could write a (faster) method to use in place of the List and Sort like:
> BaseToken getFirstTextSpan( final Iterable<? extends BaseToken> tokens ) {
> 	BaseToken firstToken = null;
> 	for ( BaseToken token : tokens ) {
> 		if ( firstToken == null || token.getStartOffset() < firstToken.getStartOffset() ) {
> 			firstToken = token;
> 			continue;
> 		}
> 		if ( token.getStartOffset() == firstToken.getStartOffset() && token.getEndOffset() < firstToken.getEndOffset() ) {
> 			firstToken = token;
> 		}
> 	}
> 	return firstToken;
> }
>
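The idea above, picking one hit from an unordered Set by a fixed rule instead of taking whatever the iterator returns first, can be tried in isolation with the following sketch. It is only an illustration under stated assumptions: Span and FirstSpanDemo are hypothetical stand-ins, not cTakes classes, and the three spans are the begin/end pairs reported elsewhere in this thread for "I do not see any".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a negation hit; not a cTakes class.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical demo class; not part of cTakes.
final class FirstSpanDemo {
    // Order by start offset, then by end offset, so ties break the same way on every run.
    static final Comparator<Span> BY_OFFSETS = new Comparator<Span>() {
        @Override
        public int compare(Span a, Span b) {
            if (a.begin != b.begin) {
                return Integer.compare(a.begin, b.begin);
            }
            return Integer.compare(a.end, b.end);
        }
    };

    // Returns the earliest (then shortest) span, or null for an empty set.
    static Span firstSpan(Set<Span> spans) {
        return spans.isEmpty() ? null : Collections.min(spans, BY_OFFSETS);
    }

    public static void main(String[] args) {
        // Span does not override hashCode, so HashSet iteration order may vary between runs.
        Set<Span> hits = new HashSet<Span>(Arrays.asList(
                new Span(13, 16), new Span(5, 16), new Span(5, 8)));
        Span first = firstSpan(hits);
        System.out.println(first.begin + "-" + first.end); // prints 5-8
    }
}
```

Collections.min does a single pass, so it matches the "faster" variant rather than the sort-then-take-first variant. Swapping the comparator is all it would take to prefer the most-encompassing span instead, should that turn out to be the better choice.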
> Of course, a perfectly reasonable question to pose to the community is something like "Is the best stored negation context the first or largest or ???"  Perhaps the first negator span isn't the most wanted for later use - perhaps it is the most-encompassing span so that multiple words can be reused.  You could throw that out under a new thread title and perhaps the original authors or current users would speak up as to what might be best.  Personally I have no idea.
>
> Anyway, great catch!
>
> Sean
>
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 3:11 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> Hi all,
>
> I'm not sure these should be classified as bugs. They look like design decisions made at some point, but they do have an impact on the consistency of the results. Whether they are right or not might be something to debate later down the road, but it would be nice to be consistent in the output.
>
> For example, I have the following text.
>
> "I do not see any"
>
> Can result in the following ContextAnnotations:
>
> <org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
> _indexed="1" _id="130" _ref_sofa="1" begin="*13*" end="*16*" id="0"
> typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
> uncertainty="0" conditional="false" generic="false" historyOf="0"
> FocusText="I" Scope="RIGHT"/>
>
> or
>
> <org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
> _indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*16*" id="0"
> typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
> uncertainty="0" conditional="false" generic="false" historyOf="0"
> FocusText="I" Scope="RIGHT"/>
>
> or
>
> <org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
> _indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*8*" id="0"
> typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
> uncertainty="0" conditional="false" generic="false" historyOf="0"
> FocusText="I" Scope="RIGHT"/>
>
> Well, after doing some digging it turns out that org.apache.ctakes.necontexts.negation.NegationContextAnalyzer is to blame.
>
> The code looks like the following:
>
>     public ContextHit analyzeContext(List<? extends Annotation> contextTokens, int scopeOrientation)
>             throws AnalysisEngineProcessException {
>         List<TextToken> fsmTokenList = wrapAsFsmTokens(contextTokens);
>
>         try {
>             Set<NegationIndicator> s =
> _negIndicatorFSM.execute(fsmTokenList);
>
>             *if (s.size() > 0) {*
>                 NegationIndicator neg = s.iterator().next();
>                *return new ContextHit(neg.getStartOffset(),
> neg.getEndOffset());*
>             } else {
>                 return null;
>             }
>         } catch (Exception e) {
>             throw new AnalysisEngineProcessException(e);
>         }
>     }
>
> This will return at most one item from the Set. Since the set is an unordered hash, any one of the three options may be returned.
> Is this a bug or a design decision? Which one is right? Which one is wrong? It may be that this is a design decision, but it would be nice if we were consistently right or consistently wrong. Many other instances of this pattern result in similar issues.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 12:43 PM, Finan, Sean wrote:
>> I'm just about sapped on this topic.  What comes below is my final writing.
>>
>> Kim wrote:
>>> Yes, I mean actual type values not matching.
>> Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  Reordering or changing ID assignment, while possibly producing repeatable output, will not necessarily fix the actual bug.  Please write a Jira for each item, and (imo) we should think about withholding any non-bug-fix release until they have been dealt with.
>>
>> Bruce wrote:
>>> I did not intend to step on anyone's toes.
>> No worries - I don't think that any toes have been stepped upon. It is good that questions and concerns are shared with the group.  
>>
>>> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
>> Assuming that the second drug mention doesn't appear elsewhere in output2 then this needs to be addressed.  Please log a tar.  Relating this to the order/id issue, which number of mentions is correct (2)?  If you reorder will that consistently output two medications instead of one or one medication instead of two?  This is most likely a bug in the identification and/or storage and/or retrieval code and needs to be fixed there.
>>
>>> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
>> I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations could be started and people could add to it as needed.  I would also hope that a reusable post-process comparison utility could be started and improved/maintained.
>>
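A post-process comparison of the kind described above could start from something like the sketch below. It is not the existing CompareFeatureStructures utility and uses no UIMA API. XmlLineDiff and its methods are hypothetical names, and it assumes each annotation element has already been joined onto a single line. It strips the run-specific `_id` and `_ref_*` attributes, then reports elements present in one run but missing from the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper; not part of cTakes or UIMA.
final class XmlLineDiff {
    // Drops attributes whose values are run-specific: _id and any _ref_* pointer.
    static String normalize(String element) {
        return element.replaceAll("\\s+_(id|ref_\\w+)=\"[^\"]*\"", "").trim();
    }

    // Returns the normalized elements found in 'left' but not in 'right', sorted.
    // Set semantics: repeated identical elements are counted once.
    static List<String> onlyInLeft(List<String> left, List<String> right) {
        TreeSet<String> result = new TreeSet<String>();
        for (String element : left) {
            result.add(normalize(element));
        }
        for (String element : right) {
            result.remove(normalize(element));
        }
        return new ArrayList<String>(result);
    }

    public static void main(String[] args) {
        // Abbreviated forms of the MedicationMention elements quoted in this thread.
        List<String> run1 = Arrays.asList(
                "<MedicationMention _indexed=\"1\" _id=\"9895\" _ref_sofa=\"3\" begin=\"2075\" end=\"2081\"/>",
                "<MedicationMention _indexed=\"1\" _id=\"9937\" _ref_sofa=\"3\" begin=\"2312\" end=\"2322\"/>");
        List<String> run2 = Arrays.asList(
                "<MedicationMention _indexed=\"1\" _id=\"15928\" _ref_sofa=\"3\" begin=\"2075\" end=\"2081\"/>");
        System.out.println(onlyInLeft(run1, run2));
    }
}
```

Dropping the `_ref_*` attributes also discards the link to the referenced ontology concepts, so this only detects missing or changed annotations by type, span and plain attributes. Comparing the referenced structures as well would require following those references before normalizing.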
>> Sean
>>
>>
>> -----Original Message-----
>> From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com]
>> Sent: Tuesday, October 07, 2014 1:21 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes output predictability
>>
>> I did not intend to step on anyone's toes.
>>
>> One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of the difference.
>>
>> Because of the huge number of differences between the output using the FileWriterCasConsumer.xml, first detecting that there are significant differences and then identifying them for a large set of documents is a daunting task.
>>
>> The following is an example of some significant differences that I 
>> have detected between two subsequent runs on the same document using 
>> the current release of cTakes. (There are actually quite a few 
>> documents that exhibit this kind of behavior. This is only one 
>> example.)
>>
>>
>> Snippet from first run:
>>
>>     <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
>> _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
>>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>> _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
>> _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
>> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>> conditional="false" generic="true" subject="patient" historyOf="0"/>
>>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>> _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
>> _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
>> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>> conditional="false" generic="false" subject="patient" historyOf="0"/>
>>     <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
>> _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
>> _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
>> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
>> conditional="false" generic="false" subject="patient" historyOf="0"/>
>>
>>
>> Snippet from subsequent run:
>>
>>     <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
>> _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
>> _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
>> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
>> conditional="false" generic="false" subject="patient" historyOf="0"/>
>>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
>> _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
>> _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
>> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
>> conditional="false" generic="true" subject="patient" historyOf="0"/>
>>     <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
>> _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>>
>>
>> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
>>
>> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
>>
>> Thanks,
>>
>> Bruce Tietjen
>>
>>
>>
>>  [image: IMAT Solutions] <http://imatsolutions.com>  Bruce Tietjen 
>> Senior Software Engineer
>> [image: Mobile:] 801.634.1547
>> bruce.tietjen@imatsolutions.com
>>
>> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert 
>> <ki...@perfectsearchcorp.com>
>> wrote:
>>
>>> Hi Sean,
>>>
>>> No, you're not a jerk. These are things worth considering, and I 
>>> understand your concerns with touching various points of the codebase.
>>>
>>> I'll talk with our group over here and see where we want to go. We 
>>> are really interested in cTakes behaving well, so we are usually 
>>> pretty careful in testing our changes before committing anything.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>>> It concerns me a bit that making the code return consistent results 
>>>>> would be so concerning.
>>>> Could you please clarify what you mean by "consistent results"?  Do 
>>>> you
>>> mean ordering and IDs or are you talking about actual type values not 
>>> matching?
>>>>> This should be the default mode of operation.
>>>> Depending upon what you meant above, I may agree or disagree.
>>>>
>>>>> Since it doesn't appear that there are any consequences with moving
>>> forward with changing the code
>>>> Why do you say this?
>>>>
>>>> I think that there may be more required changes than you realize.  
>>>> Every
>>> insertion into the CAS must be of ordered data.  This means that, for 
>>> instance, named entities discovered by dictionary will need to be 
>>> inserted in some predictable order, such as by alphabetized cui per 
>>> every alphabetized tui (and other code) per ordered text span.  You 
>>> will need to check and recheck every point at which the CAS is 
>>> modified by every module.  Right now there are at least three or four 
>>> places in two cTakes dictionary modules where a change would be 
>>> required - and that doesn't include YTEX lookup.
>>>> If you really feel strongly about this and are going to change cTakes
>>>> code, then I suggest (at the risk of sounding like a complete jerk)
>>>> that you also consider the following:
>>>> 1.  Don't check anything into trunk until all is well with your
>>>> changes and tests
>>>> Just in case you abandon the effort
>>>> 2.  Write unit tests for every change
>>>> True, HashMap to LinkedHashMap shouldn't break anything, but unit tests
>>>> are good to have, and may prevent others in the future from switching
>>>> back to a non-linked map or any unordered collection (set not list,
>>>> etc.).  It also makes a better place for explanation in Javadoc than
>>>> inlines above the code.
>>>> 3.  Run memory requirement tests before all of your changes and then
>>>> again after your changes
>>>> I'm actually curious about how much memory might be eaten with
>>>> linkages everywhere
>>>> 4.  Run performance (speed) tests before and after
>>>> On a large corpus to ensure that garbage collection is involved
>>>> 5.  Do the above with every combination possible in current workflows:
>>>> every combination of available sentence detector, pos tagger, smoking
>>>> status detector, dictionary lookup, cas consumer, etc.
>>>> As soon as somebody says "all output is consistently ordered between
>>>> runs" it had better be so for every possible workflow
>>>> 6.  Write system tests to ensure ordered/predicted outputs with each
>>>> combination
>>>> Otherwise somebody may break it
>>>> 7.  Document the what, how, and why for future development
>>>> Otherwise somebody won't know to stick to the new rules
>>>> 8.  Assist anybody as needed that in the future breaks one of these
>>>> unit or system tests with a fix or new feature
>>>> By mandating such a rule you are assuming responsibility for it
>>>> 9.  Assist anybody as needed that in the future adds a new module or
>>>> workflow to cTakes to abide by the ordering requirement
>>>> By mandating such a rule you are assuming responsibility for it
>>>> 10.  Assist anybody as needed that in the future adds a new module or
>>>> workflow to add system tests to ensure maintenance of the ordering
>>>> requirement
>>>> By mandating such a rule you are assuming responsibility for it
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> I think we may really prefer the first method. Since it doesn't
>>>> appear that there are any consequences with moving forward with
>>>> changing the code, we would really like to move forward with this
>>>> approach.
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>>> The option Sean mentioned of writing your own custom consumer 
>>>>> (without the UIMA id that is causing your issues) should meet these 
>>>>> needs I believe.
>>>>>
>>>>>
>>>>>
>>>>> Britt Fitch
>>>>> Wired Informatics
>>>>> 265 Franklin St Ste 1702
>>>>> Boston, MA 02110
>>>>> http://wiredinformatics.com
>>>>> Britt.Fitch@wiredinformatics.com
>>>>>
>>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>>>>> <kim.ebert@perfectsearchcorp.com 
>>>>> <ma...@perfectsearchcorp.com>> wrote:
>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Well of course that makes plenty of sense. Testing different 
>>>>>> cTakes configurations you would expect different output. In our 
>>>>>> testing we've found several cases where running with the same 
>>>>>> configuration outputs different data under different moons. Having 
>>>>>> consistent results helps us know if we've made improvements to our 
>>>>>> quality or not. Having output that is in a predictable order makes 
>>>>>> checking to see if there are differences much cheaper when you are 
>>>>>> dealing with larger data sets.
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>>> Hi Kim,
>>>>>>>
>>>>>>> One might want to compare the Sentence detector that uses end of
>>>>>>> line characters as sentence splitters with one that does not.
>>>>>>> Such a change in sentence splitting would not only affect the
>>>>>>> sentence type discoveries but also practically every type that
>>>>>>> follows.
>>>>>>>
>>>>>>> Another might want to compare a note with "skin cancer" vs. one
>>>>>>> in which you replace "skin cancer" with "melanoma" just to see
>>>>>>> what the CUI differences might be.  There are changes in two
>>>>>>> words vs. one, 11 characters vs. 8, a removed adjective(?), and
>>>>>>> of course changes in CUIs.
>>>>>>>
>>>>>>> Of course, if you are just running notes on a new moon and then 
>>>>>>> again on a full moon ...
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>>> To: dev@ctakes.apache.org
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> Sean,
>>>>>>>
>>>>>>> "...being different because of a possibly intentional difference."
>>>>>>>
>>>>>>> I would like you to elaborate a bit on what would be
>>>>>>> intentionally different between the processing of the same
>>>>>>> document multiple times. It would help my understanding of cTakes.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Kim Ebert
>>>>>>> 1.801.669.7342
>>>>>>> Perfect Search Corp
>>>>>>> http://www.perfectsearchcorp.com/
>>>>>>>
>>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>>> Steve Bethard wrote:
>>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>>> I urge anyone interested in comparing cTakes CASes / output to
>>>>>>>> use this type of approach.  Comparison of program output is a
>>>>>>>> post-process task, and unless absolutely necessary, code to
>>>>>>>> juggle data and metadata belongs there.  Attempting to force
>>>>>>>> every module past, present and future to abide by fixed
>>>>>>>> orderings, enumerations etc. is not as simple a task as one might
>>>>>>>> initially think - especially if third-party libraries are
>>>>>>>> involved.  I won't get into problems associated with why one is
>>>>>>>> comparing output (swapped module?) and IDs, orders etc. being
>>>>>>>> different because of a possibly intentional difference.
>>>>>>>>
>>>>>>>> In addition to or instead of creating a post-processing script, 
>>>>>>>> one could write a new "cas-consumer" that writes output in a 
>>>>>>>> desired format - but this should not require changes to engines.
>>>>>>>>
>>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>>> To: dev@ctakes.apache.org
>>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>>
>>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>>>>> Since I started working with cTakes some time ago, I have found 
>>>>>>>>> it difficult to compare the output between subsequent runs on 
>>>>>>>>> the same files because annotations are often assigned different 
>>>>>>>>> IDs, are listed in different order, etc.
>>>>>>>> At one point, I spent some time writing a script for diff-ing 
>>>>>>>> CASes that intended to address some of these kinds of issues.
>>>>>>>> It's still here in cTAKES:
>>>>>>>>
>>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>>
>>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>>
>>>>>>>> Steve


RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Kim,

Great Catch!

I think that by now this thread may be discarded by most as spam.  So, I'm back (apologies - I know that you are tired of me by now).

I checked the code that you pointed to ...  I really dislike looking at older cTakes code because I'm filled with an overwhelming urge to refactor.

If I understand the code correctly (it could use some doc), it runs negation engines and then if any negation exists it creates a single hit signifying negation.  Like a heavyweight Boolean.   Unfortunately, as you know, because Collection "s"  is a Set and it throws in the first token to come along ...  

An isolated change here would probably be better than going through the entire code base and switching to LinkedHashMaps, Lists, etc. - plus it would fix your problem.

You could (for reuse by others, assuming that one doesn't already exist) create a singleton BaseTokenComparator implements Comparator<BaseToken>  with something like:
   public int compare( final BaseToken textSpan1, final BaseToken textSpan2 ) {
      if ( textSpan1.getStartOffset() != textSpan2.getStartOffset() ) {
         return textSpan1.getStartOffset() - textSpan2.getStartOffset();
      }
      return textSpan1.getEndOffset() - textSpan2.getEndOffset();
   }

And in NegationContextAnalyzer line ~48
final List<NegationIndicator> negatorsList = new ArrayList<NegationIndicator>( _negIndicatorFSM.execute(fsmTokenList) );
if ( !negatorsList.isEmpty() ) {
	// sort by offsets so that the same indicator is chosen on every run
	Collections.sort( negatorsList, BaseTokenComparator.getInstance() );
	return new ContextHit( negatorsList.get(0).getStartOffset(), negatorsList.get(0).getEndOffset() );
}

Or you could write a (faster) method to use in place of the List and Sort like:
BaseToken getFirstTextSpan( final Iterable<? extends BaseToken> tokens ) {
	BaseToken firstToken = null;
	for ( BaseToken token : tokens ) {
		// earliest start offset wins
		if ( firstToken == null || token.getStartOffset() < firstToken.getStartOffset() ) {
			firstToken = token;
			continue;
		}
		// same start offset: the earlier end offset wins
		if ( token.getStartOffset() == firstToken.getStartOffset() && token.getEndOffset() < firstToken.getEndOffset() ) {
			firstToken = token;
		}
	}
	return firstToken;
}

Of course, a perfectly reasonable question to pose to the community is something like "Is the best stored negation context the first or largest or ???"  Perhaps the first negator span isn't the most wanted for later use - perhaps it is the most-encompassing span so that multiple words can be reused.  You could throw that out under a new thread title and perhaps the original authors or current users would speak up as to what might be best.  Personally I have no idea.
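
To make the "first or largest" question concrete, here is a minimal, self-contained sketch of both policies. It is not cTakes code: Span and SpanChooser are hypothetical stand-ins for the negation indicator types, and the offsets are borrowed from the "I do not see any" example quoted below. Either policy gives the same answer no matter how the Set happens to iterate.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;

// Hypothetical stand-in for a negation indicator: just a pair of offsets.
final class Span {
   final int begin;
   final int end;

   Span( final int begin, final int end ) {
      this.begin = begin;
      this.end = end;
   }

   int length() {
      return end - begin;
   }

   @Override
   public String toString() {
      return begin + "-" + end;
   }
}

// Hypothetical helper showing two deterministic ways to pick one span.
final class SpanChooser {
   private SpanChooser() {
   }

   // Orders spans by start offset, then by end offset.
   static final Comparator<Span> BY_POSITION = new Comparator<Span>() {
      @Override
      public int compare( final Span a, final Span b ) {
         final int byBegin = Integer.compare( a.begin, b.begin );
         return byBegin != 0 ? byBegin : Integer.compare( a.end, b.end );
      }
   };

   // Policy 1: the span that starts first (ties go to the earlier end).
   static Span first( final Collection<Span> spans ) {
      return spans.isEmpty() ? null : Collections.min( spans, BY_POSITION );
   }

   // Policy 2: the most-encompassing (longest) span; ties are broken by
   // position so that the result never depends on iteration order.
   static Span widest( final Collection<Span> spans ) {
      Span best = null;
      for ( Span span : spans ) {
         if ( best == null
              || span.length() > best.length()
              || ( span.length() == best.length() && BY_POSITION.compare( span, best ) < 0 ) ) {
            best = span;
         }
      }
      return best;
   }
}
```

With the three spans 13-16, 5-16 and 5-8 held in a HashSet, first(...) picks 5-8 and widest(...) picks 5-16, whichever order the set iterates in.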

Anyway, great catch!

Sean


-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 3:11 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Hi all,

I'm not sure these should be classified as bugs. They look like design decisions made at some point, but they do have an impact on the consistency of the results. Whether they are right or not might be something to debate later down the road, but it would be nice to be consistent in the output.

For example, I have the following text.

"I do not see any"

Can result in the following ContextAnnotations:

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*13*" end="*16*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

or

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*16*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

or

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*8*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

Well, after doing some digging it turns out that org.apache.ctakes.necontexts.negation.NegationContextAnalyzer is to blame.

The code looks like the following:

    public ContextHit analyzeContext(List<? extends Annotation> contextTokens, int scopeOrientation)
            throws AnalysisEngineProcessException {
        List<TextToken> fsmTokenList = wrapAsFsmTokens(contextTokens);

        try {
            Set<NegationIndicator> s = _negIndicatorFSM.execute(fsmTokenList);

            if (s.size() > 0) {
                // returns whichever element the Set happens to iterate first
                NegationIndicator neg = s.iterator().next();
                return new ContextHit(neg.getStartOffset(), neg.getEndOffset());
            } else {
                return null;
            }
        } catch (Exception e) {
            throw new AnalysisEngineProcessException(e);
        }
    }

This returns at most one item from the Set. Since the set is an unordered hash set, any one of the three options may be returned.
Is this a bug or a design decision? Which one is right? Which one is wrong? It may be that this is a design decision, but it would be nice if we were consistently right or consistently wrong. Many other instances of this pattern result in similar issues.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 12:43 PM, Finan, Sean wrote:
> I'm just about sapped on this topic.  What comes below is my final writing.
>
> Kim wrote:
>> Yes, I mean actual type values not matching.
> Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  Reordering or changing ID assignment, while possibly producing repeatable output, will not necessarily fix the actual bug.  Please write a Jira for each item, and (imo) we should think about withholding any non-bug-fix release until they have been dealt with.
>
> Bruce wrote:
>> I did not intend to step on anyone's toes.
> No worries - I don't think that any toes have been stepped upon. It is good that questions and concerns are shared with the group.  
>
>> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
> Assuming that the second drug mention doesn't appear elsewhere in output2 then this needs to be addressed.  Please log a tar.  Relating this to the order/id issue, which number of mentions is correct (2)?  If you reorder will that consistently output two medications instead of one or one medication instead of two?  This is most likely a bug in the identification and/or storage and/or retrieval code and needs to be fixed there.
>
>> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
> I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations could be started and people could add to it as needed.  I would also hope that a reusable post-process comparison utility could be started and improved/maintained.
>
> Sean
>
>
> -----Original Message-----
> From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com]
> Sent: Tuesday, October 07, 2014 1:21 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I did not intend to step on anyone's toes.
>
> One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of the difference.
>
> Because of the huge number of differences in the output produced with FileWriterCasConsumer.xml, first detecting that there are significant differences and then identifying them for a large set of documents is a daunting task.
>
> The following is an example of some significant differences that I 
> have detected between two subsequent runs on the same document using 
> the current release of cTakes. (There are actually quite a few 
> documents that exhibit this kind of behavior. This is only one 
> example.)
>
>
> Snippet from first run:
>
>     <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
> _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
> _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
> _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
> _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>
>
> Snippet from subsequent run:
>
>     <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
> _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
> _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
> _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>
>
> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
>
> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
>
> Thanks,
>
> Bruce Tietjen
>
>
>
>  IMAT Solutions <http://imatsolutions.com>  Bruce Tietjen
> Senior Software Engineer
> Mobile: 801.634.1547
> bruce.tietjen@imatsolutions.com
>
> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert 
> <ki...@perfectsearchcorp.com>
> wrote:
>
>> Hi Sean,
>>
>> No, you're not a jerk. These are things worth considering, and I 
>> understand your concerns with touching various points of the codebase.
>>
>> I'll talk with our group over here and see where we want to go. We 
>> are really interested in cTakes behaving well, so we are usually 
>> pretty careful in testing our changes before committing anything.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>>> It concerns me a bit by making the code return consistent results
>>>> would be so concerning.
>>> Could you please clarify what you mean by "consistent results"?  Do
>>> you mean ordering and IDs or are you talking about actual type values
>>> not matching?
>>>> This should be the default mode of operation.
>>> Depending upon what you meant above, I may agree or disagree.
>>>
>>>> Since it doesn't appear that there are any consequences with moving
>>>> forward with changing the code
>>> Why do you say this?
>>>
>>> I think that there may be more required changes than you realize.
>>> Every insertion into the CAS must be of ordered data.  This means
>>> that, for instance, named entities discovered by dictionary will need
>>> to be inserted in some predictable order, such as by alphabetized cui
>>> per every alphabetized tui (and other code) per ordered text span.
>>> You will need to check and recheck every point at which the CAS is
>>> modified by every module.  Right now there are at least three or four
>>> places in two cTakes dictionary modules where a change would be
>>> required - and that doesn't include YTEX lookup.
>>>
>>> If you really feel strongly about this and are going to change cTakes
>>> code, then I suggest (at the risk of sounding like a complete jerk)
>>> that you also consider the following:
>>> 1.  Don't check anything into trunk until all is well with your
>>>     changes and tests
>>>     Just in case you abandon the effort
>>> 2.  Write unit tests for every change
>>>     True, Map to LinkedMap shouldn't break anything, but they are
>>>     good to have, and may prevent others in the future from switching
>>>     back to a non-linked map or any unordered collection (set not
>>>     list, etc.).  It also makes a better place for explanation in
>>>     Javadoc than inlines above the code.
>>> 3.  Run memory requirement tests before all of your changes and then
>>>     again after your changes
>>>     I'm actually curious about how much memory might be eaten with
>>>     linkages everywhere
>>> 4.  Run performance (speed) tests before and after
>>>     On a large corpus to ensure that garbage collection is involved
>>> 5.  Do the above with every combination possible in current
>>>     workflows: every combination of available sentence detector, pos
>>>     tagger, smoking status detector, dictionary lookup, cas consumer,
>>>     etc.
>>>     As soon as somebody says "all output is consistently ordered
>>>     between runs" it had better be so for every possible workflow
>>> 6.  Write system tests to ensure ordered/predicted outputs with each
>>>     combination
>>>     Otherwise somebody may break it
>>> 7.  Document the what, how, and why for future development
>>>     Otherwise somebody won't know to stick to the new rules
>>> 8.  Assist anybody as needed that in the future breaks one of these
>>>     unit or system tests with a fix or new feature
>>>     By mandating such a rule you are assuming responsibility for it
>>> 9.  Assist anybody as needed that in the future adds a new module or
>>>     workflow to cTakes to abide by the ordering requirement
>>>     By mandating such a rule you are assuming responsibility for it
>>> 10. Assist anybody as needed that in the future adds a new module or
>>>     workflow to add system tests to ensure maintenance of the
>>>     ordering requirement
>>>     By mandating such a rule you are assuming responsibility for it
>>>
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> I think we may really prefer the first method. Since it doesn't
>>> appear that there are any consequences with moving forward with
>>> changing the code, we would really like to move forward with this
>>> approach.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>> The option Sean mentioned of writing your own custom consumer 
>>>> (without the UIMA id that is causing your issues) should meet these 
>>>> needs I believe.
>>>>
>>>>
>>>>
>>>> Britt Fitch
>>>> Wired Informatics
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> http://wiredinformatics.com
>>>> Britt.Fitch@wiredinformatics.com
>>>>
>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>>>> <kim.ebert@perfectsearchcorp.com 
>>>> <ma...@perfectsearchcorp.com>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Well of course that makes plenty of sense. Testing different 
>>>>> cTakes configurations you would expect different output. In our 
>>>>> testing we've found several cases where running with the same 
>>>>> configuration outputs different data under different moons. Having 
>>>>> consistent results helps us know if we've made improvements to our 
>>>>> quality or not. Having output that is in a predictable order makes 
>>>>> checking to see if there are differences much cheaper when you are 
>>>>> dealing with larger data sets.
>>>>>
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>> Hi Kim,
>>>>>>
>>>>>> One might want to compare the Sentence detector that uses end of
>>>>>> line characters as sentence splitters with one that does not.
>>>>>> Such a change in sentence splitting would not only affect the
>>>>>> sentence type discoveries but also practically every type that
>>>>>> follows.
>>>>>>
>>>>>> Another might want to compare a note with "skin cancer" vs. one
>>>>>> in which you replace "skin cancer" with "melanoma" just to see
>>>>>> what the CUI differences might be.  There are changes in two
>>>>>> words vs. one, 11 characters vs. 8, a removed adjective(?), and
>>>>>> of course changes in CUIs.
>>>>>>
>>>>>> Of course, if you are just running notes on a new moon and then 
>>>>>> again on a full moon ...
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> "...being different because of a possibly intentional difference."
>>>>>>
>>>>>> I would like you to elaborate a bit on what would be
>>>>>> intentionally different between the processing of the same
>>>>>> document multiple times. It would help my understanding of cTakes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>> Steve Bethard wrote:
>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>> I urge anyone interested in comparing cTakes CASes / output to
>>>>>>> use this type of approach.  Comparison of program output is a
>>>>>>> post-process task, and unless absolutely necessary, code to
>>>>>>> juggle data and metadata belongs there.  Attempting to force
>>>>>>> every module past, present and future to abide by fixed
>>>>>>> orderings, enumerations etc. is not as simple a task as one might
>>>>>>> initially think - especially if third-party libraries are
>>>>>>> involved.  I won't get into problems associated with why one is
>>>>>>> comparing output (swapped module?) and IDs, orders etc. being
>>>>>>> different because of a possibly intentional difference.
>>>>>>>
>>>>>>> In addition to or instead of creating a post-processing script, 
>>>>>>> one could write a new "cas-consumer" that writes output in a 
>>>>>>> desired format - but this should not require changes to engines.
>>>>>>>
>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>> To: dev@ctakes.apache.org
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>>>> Since I started working with cTakes some time ago, I have found 
>>>>>>>> it difficult to compare the output between subsequent runs on 
>>>>>>>> the same files because annotations are often assigned different 
>>>>>>>> IDs, are listed in different order, etc.
>>>>>>> At one point, I spent some time writing a script for diff-ing 
>>>>>>> CASes that intended to address some of these kinds of issues.
>>>>>>> It's still here in cTAKES:
>>>>>>>
>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>
>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>
>>>>>>> Steve
>>


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi all,

I'm not sure these should be classified as bugs. They look like design
decisions made at some point, but they do have an impact on the
consistency of the results. Whether they are right or not might be
something to debate later down the road, but it would be nice to be
consistent in the output.

For example, I have the following text.

"I do not see any"

Can result in the following ContextAnnotations:

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*13*" end="*16*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

or

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*16*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

or

<org.apache.ctakes.typesystem.type.textsem.ContextAnnotation
_indexed="1" _id="130" _ref_sofa="1" begin="*5*" end="*8*" id="0"
typeID="0" discoveryTechnique="0" confidence="0.0" polarity="0"
uncertainty="0" conditional="false" generic="false" historyOf="0"
FocusText="I" Scope="RIGHT"/>

Well, after doing some digging it turns out that
org.apache.ctakes.necontexts.negation.NegationContextAnalyzer is to blame.

The code looks like the following:

    public ContextHit analyzeContext(List<? extends Annotation> contextTokens, int scopeOrientation)
            throws AnalysisEngineProcessException {
        List<TextToken> fsmTokenList = wrapAsFsmTokens(contextTokens);

        try {
            Set<NegationIndicator> s = _negIndicatorFSM.execute(fsmTokenList);

            if (s.size() > 0) {
                // returns whichever element the Set happens to iterate first
                NegationIndicator neg = s.iterator().next();
                return new ContextHit(neg.getStartOffset(), neg.getEndOffset());
            } else {
                return null;
            }
        } catch (Exception e) {
            throw new AnalysisEngineProcessException(e);
        }
    }

This returns at most one item from the Set. Since the set is an
unordered hash set, any one of the three options may be returned.
Is this a bug or a design decision? Which one is right? Which one is
wrong? It may be that this is a design decision, but it would be nice
if we were consistently right or consistently wrong. Many other
instances of this pattern result in similar issues.
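
The behavior described above can be illustrated outside cTakes with a minimal sketch. TokenSpan and StableOrder are hypothetical names, not cTakes classes, and the sketch assumes the set elements rely on the default Object.hashCode(), which is not guaranteed to be the same from one JVM run to the next. Copying the set into a TreeSet with an explicit comparator makes iteration follow the offsets instead of the hash buckets.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical token with offsets. It deliberately does not override
// hashCode(), so a HashSet places it by identity hash.
final class TokenSpan {
   final int begin;
   final int end;

   TokenSpan( final int begin, final int end ) {
      this.begin = begin;
      this.end = end;
   }

   @Override
   public String toString() {
      return begin + "-" + end;
   }
}

// Hypothetical helper that imposes an offset-based order on any set.
final class StableOrder {
   private StableOrder() {
   }

   // Orders tokens by start offset, then by end offset.
   static final Comparator<TokenSpan> BY_OFFSETS = new Comparator<TokenSpan>() {
      @Override
      public int compare( final TokenSpan a, final TokenSpan b ) {
         final int byBegin = Integer.compare( a.begin, b.begin );
         return byBegin != 0 ? byBegin : Integer.compare( a.end, b.end );
      }
   };

   // Copies the elements into a TreeSet so that iteration order is fixed
   // by the comparator rather than by hash bucket placement.
   static SortedSet<TokenSpan> sorted( final Set<TokenSpan> unordered ) {
      final SortedSet<TokenSpan> ordered = new TreeSet<TokenSpan>( BY_OFFSETS );
      ordered.addAll( unordered );
      return ordered;
   }
}
```

For the three spans in the example above, the sorted view always iterates as 5-8, 5-16, 13-16, so taking the first element gives the same hit on every run. One caveat of this design: a TreeSet treats elements that compare as equal as duplicates, so two tokens with identical offsets would collapse into one.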

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 12:43 PM, Finan, Sean wrote:
> I'm just about sapped on this topic.  What comes below is my final writing.
>
> Kim wrote:
>> Yes, I mean actual type values not matching.
> Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  Reordering or changing ID assignment, while possibly producing repeatable output, will not necessarily fix the actual bug.  Please write a Jira for each item, and (imo) we should think about withholding any non-bug-fix release until they have been dealt with.
>
> Bruce wrote:
>> I did not intend to step on anyone's toes.
> No worries - I don't think that any toes have been stepped upon. It is good that questions and concerns are shared with the group.  
>
>> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
> Assuming that the second drug mention doesn't appear elsewhere in output2 then this needs to be addressed.  Please log a tar.  Relating this to the order/id issue, which number of mentions is correct (2)?  If you reorder will that consistently output two medications instead of one or one medication instead of two?  This is most likely a bug in the identification and/or storage and/or retrieval code and needs to be fixed there.
>
>> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
> I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations could be started and people could add to it as needed.  I would also hope that a reusable post-process comparison utility could be started and improved/maintained.
>
> Sean
>
>
> -----Original Message-----
> From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 1:21 PM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I did not intend to step on anyone's toes.
>
> One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of the difference.
>
> Because of the huge number of differences in the output produced with FileWriterCasConsumer.xml, first detecting that there are significant differences and then identifying them for a large set of documents is a daunting task.
>
> The following is an example of some significant differences that I have detected between two subsequent runs on the same document using the current release of cTakes. (There are actually quite a few documents that exhibit this kind of behavior. This is only one example.)
>
>
> Snippet from first run:
>
>     <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
> _indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
> _ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
> _indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
> _ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>
>
> Snippet from subsequent run:
>
>     <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
> _indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
> _ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
> conditional="false" generic="false" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.textsem.MedicationMention
> _indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
> _ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
> discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
> conditional="false" generic="true" subject="patient" historyOf="0"/>
>     <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
> _indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>
>
>
> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
>
> Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?
>
> Thanks,
>
> Bruce Tietjen
>
>
>
> Bruce Tietjen
> Senior Software Engineer
> IMAT Solutions <http://imatsolutions.com>
> Mobile: 801.634.1547
> bruce.tietjen@imatsolutions.com
>
> On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <ki...@perfectsearchcorp.com>
> wrote:
>
>> Hi Sean,
>>
>> No, you're not a jerk. These are things worth considering, and I 
>> understand your concerns with touching various points of the codebase.
>>
>> I'll talk with our group over here and see where we want to go. We are 
>> really interested in cTakes behaving well, so we are usually pretty 
>> careful in testing our changes before committing anything.
>>
>> Thanks,
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 10:46 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>>> It concerns me a bit that making the code return consistent results
>>>> would be so concerning.
>>> Could you please clarify what you mean by "consistent results"?  Do
>>> you mean ordering and IDs or are you talking about actual type values
>>> not matching?
>>>
>>>> This should be the default mode of operation.
>>> Depending upon what you meant above, I may agree or disagree.
>>>
>>>> Since it doesn't appear that there are any consequences with moving
>>>> forward with changing the code
>>> Why do you say this?
>>>
>>> I think that there may be more required changes than you realize.
>>> Every insertion into the CAS must be of ordered data.  This means
>>> that, for instance, named entities discovered by dictionary will need
>>> to be inserted in some predictable order, such as by alphabetized cui
>>> per every alphabetized tui (and other code) per ordered text span.
>>> You will need to check and recheck every point at which the CAS is
>>> modified by every module.  Right now there are at least three or four
>>> places in two cTakes dictionary modules where a change would be
>>> required - and that doesn't include YTEX lookup.
>>>
>>> If you really feel strongly about this and are going to change cTakes
>>> code, then I suggest (at the risk of sounding like a complete jerk)
>>> that you also consider the following:
>>> 1.  Don't check anything into trunk until all is well with your
>>> changes and tests
>>> Just in case you abandon the effort
>>> 2.  Write unit tests for every change
>>> True, Map to LinkedMap shouldn't break anything, but they are good to
>>> have, and may prevent others in the future from switching back to a
>>> non-linked map or any unordered collection (set not list, etc.).  It
>>> also makes a better place for explanation in Javadoc than inlines
>>> above the code.
>>> 3.  Run memory requirement tests before all of your changes and then
>>> again after your changes
>>> I'm actually curious about how much memory might be eaten with
>>> linkages everywhere
>>> 4.  Run performance (speed) tests before and after
>>> On a large corpus to ensure that garbage collection is involved
>>> 5.  Do the above with every combination possible in current workflows:
>>> every combination of available sentence detector, pos tagger, smoking
>>> status detector, dictionary lookup, cas consumer, etc.
>>> As soon as somebody says "all output is consistently ordered between
>>> runs" it had better be so for every possible workflow
>>> 6.  Write system tests to ensure ordered/predicted outputs with each
>>> combination
>>> Otherwise somebody may break it
>>> 7.  Document the what, how, and why for future development
>>> Otherwise somebody won't know to stick to the new rules
>>> 8.  Assist anybody as needed that in the future breaks one of these
>>> unit or system tests with a fix or new feature
>>> By mandating such a rule you are assuming responsibility for it
>>> 9.  Assist anybody as needed that in the future adds a new module or
>>> workflow to cTakes to abide by the ordering requirement
>>> By mandating such a rule you are assuming responsibility for it
>>> 10.  Assist anybody as needed that in the future adds a new module or
>>> workflow to add system tests to ensure maintenance of the ordering
>>> requirement
>>> By mandating such a rule you are assuming responsibility for it
>>>
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 11:57 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> I think we may really prefer the first method. Since it doesn't 
>>> appear
>> that there are any consequences with moving forward with changing the 
>> code, we would really like to move forward with this approach.
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 09:35 AM, britt fitch wrote:
>>>> The option Sean mentioned of writing your own custom consumer 
>>>> (without the UIMA id that is causing your issues) should meet these 
>>>> needs I believe.
>>>>
>>>>
>>>>
>>>> Britt Fitch
>>>> Wired Informatics
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> http://wiredinformatics.com
>>>> Britt.Fitch@wiredinformatics.com
>>>>
>>>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>>>> <kim.ebert@perfectsearchcorp.com 
>>>> <ma...@perfectsearchcorp.com>> wrote:
>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Well of course that makes plenty of sense. Testing different 
>>>>> cTakes configurations you would expect different output. In our 
>>>>> testing we've found several cases where running with the same 
>>>>> configuration outputs different data under different moons. Having 
>>>>> consistent results helps us know if we've made improvements to our 
>>>>> quality or not. Having output that is in a predictable order makes 
>>>>> checking to see if there are differences much cheaper when you are 
>>>>> dealing with
>> larger data sets.
>>>>> Kim Ebert
>>>>> 1.801.669.7342
>>>>> Perfect Search Corp
>>>>> http://www.perfectsearchcorp.com/
>>>>>
>>>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>>>> Hi Kim,
>>>>>>
>>>>>> One might want to compare the Sentence detector that uses end of 
>>>>>> line characters as sentence splitters with one that does not.  
>>>>>> Such a change in sentence splitting would not only affect the 
>>>>>> sentence type discoveries but also practically every type that follows.
>>>>>>
>>>>>> Another might want to compare a note with "skin cancer" vs. one 
>>>>>> in which you replace "skin cancer" with "melanoma" just to see 
>>>>>> what the CUI differences might be.  There are changes in two 
>>>>>> words vs. one,
>>>>>> 11 characters vs. 8, a removed adjective(?), and of course 
>>>>>> changes in CUIs.
>>>>>>
>>>>>> Of course, if you are just running notes on a new moon and then 
>>>>>> again on a full moon ...
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: cTakes output predictability
>>>>>>
>>>>>> Sean,
>>>>>>
>>>>>> "...being different because of a possibly intentional difference."
>>>>>>
>>>>>> I would like you to elaborate a bit on the what would be 
>>>>>> intentionally different between the processing of the same 
>>>>>> document multiple times. It would help my understanding of cTakes.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Kim Ebert
>>>>>> 1.801.669.7342
>>>>>> Perfect Search Corp
>>>>>> http://www.perfectsearchcorp.com/
>>>>>>
>>>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>>>> Steve Bethard wrote:
>>>>>>>> I spent some time writing a script for diff-ing CASes
>>>>>>> I urge anyone interested in comparing cTakes CASes / output to 
>>>>>>> use this type of approach.  Comparison of program output is a 
>>>>>>> post-process task, and unless absolutely necessary code to 
>>>>>>> juggle data and metadata belongs there.  Attempting to force every 
>>>>>>> module past, present and future to abide by fixed orderings, 
>>>>>>> enumerations etc. is not as simple a task as one might initially 
>>>>>>> think - especially if third-party libraries are involved.  I 
>>>>>>> won't get into problems associated with why one is comparing 
>>>>>>> output (swapped
>>>>>>> module?) and IDs, orders etc. being different because of a 
>>>>>>> possibly intentional difference.
>>>>>>>
>>>>>>> In addition to or instead of creating a post-processing script, 
>>>>>>> one could write a new "cas-consumer" that writes output in a 
>>>>>>> desired format - but this should not require changes to engines.
>>>>>>>
>>>>>>> "If it ain't broke, don't fix it"
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>>>> To: dev@ctakes.apache.org
>>>>>>> Subject: Re: cTakes output predictability
>>>>>>>
>>>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>>>> Since I started working with cTakes some time ago, I have found 
>>>>>>>> it difficult to compare the output between subsequent runs on 
>>>>>>>> the same files because annotations are often assigned different 
>>>>>>>> IDs, are listed in different order, etc.
>>>>>>> At one point, I spent some time writing a script for diff-ing 
>>>>>>> CASes that intended to address some of these kinds of issues. 
>>>>>>> It's still here in cTAKES:
>>>>>>>
>>>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
>>>>>>>
>>>>>>> You might see if you could use or adapt that to your needs.
>>>>>>>
>>>>>>> Steve
>>


RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
I'm just about sapped on this topic.  What comes below is my final writing.

Kim wrote:
>Yes, I mean actual type values not matching.

Ok, this is a very serious problem and should have nothing to do with ordering and/or IDs.  I repeat: this should have nothing to do with ordering or ids.  Reordering or changing ID assignment, while possibly producing repeatable output, will not necessarily fix the actual bug.  Please write a Jira for each item, and (imo) we should think about withholding any non-bug-fix release until they have been dealt with.

Bruce wrote:
> I did not intend to step on anyone's toes.
No worries - I don't think that any toes have been stepped upon. It is good that questions and concerns are shared with the group.  

> Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.
Assuming that the second drug mention doesn't appear elsewhere in output2 then this needs to be addressed.  Please log a tar.  Relating this to the order/id issue, which number of mentions is correct (2)?  If you reorder will that consistently output two medications instead of one or one medication instead of two?  This is most likely a bug in the identification and/or storage and/or retrieval code and needs to be fixed there.

>Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?

I would hope that a reusable Cas-Consumer that sorts and re-IDs annotations could be started and people could add to it as needed.  I would also hope that a reusable post-process comparison utility could be started and improved/maintained.
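
As a rough illustration of the sort-and-re-ID idea, here is a minimal Java sketch. It does not use the UIMA or cTAKES APIs: AnnotationRecord, CanonicalOrder and sortAndReId are hypothetical names invented for this example, and the sort key (begin, end, type) is an assumption. A real Cas-Consumer would read its annotations from the CAS rather than from a plain list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, simplified stand-in for an annotation; not a UIMA or cTAKES type. */
final class AnnotationRecord {
    final String type; // e.g. "MedicationMention"
    final int begin;   // start offset in the document text
    final int end;     // end offset in the document text
    final int id;      // run-specific identifier, ignored when sorting

    AnnotationRecord(String type, int begin, int end, int id) {
        this.type = type;
        this.begin = begin;
        this.end = end;
        this.id = id;
    }
}

final class CanonicalOrder {
    /** Returns a new list sorted by (begin, end, type) with IDs reassigned 0..n-1. */
    static List<AnnotationRecord> sortAndReId(List<AnnotationRecord> input) {
        List<AnnotationRecord> sorted = new ArrayList<>(input);
        // Assumed sort key: text span first, then type name as a tie-breaker.
        sorted.sort(Comparator
                .comparingInt((AnnotationRecord a) -> a.begin)
                .thenComparingInt(a -> a.end)
                .thenComparing(a -> a.type));
        List<AnnotationRecord> result = new ArrayList<>(sorted.size());
        for (int i = 0; i < sorted.size(); i++) {
            AnnotationRecord a = sorted.get(i);
            // The new ID is simply the position in the canonical order.
            result.add(new AnnotationRecord(a.type, a.begin, a.end, i));
        }
        return result;
    }
}
```

Output written this way would be repeatable only as far as the annotations themselves are the same; a mention missing from one run would still be missing, which is the separate bug discussed above.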

Sean


-----Original Message-----
From: Bruce Tietjen [mailto:bruce.tietjen@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 1:21 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I did not intend to step on anyone's toes.

One of the reasons I proposed the changes was to try to make it extremely obvious when there are significant differences in output from the cTakes pipeline when running the same document again, and once identified, make it easier to identify the source of the difference.

Because of the huge number of differences between the output using the FileWriterCasConsumer.xml, first detecting that there are significant differences and identifying them for a large set of documents is a daunting task.

The following is an example of some significant differences that I have detected between two subsequent runs on the same document using the current release of cTakes. (There are actually quite a few documents that exhibit this kind of behavior. This is only one example.)


Snippet from first run:

    <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
_indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
_ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="true" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
_ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="false" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
_indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
_ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
conditional="false" generic="false" subject="patient" historyOf="0"/>


Snippet from subsequent run:

    <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
_indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
_ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
conditional="false" generic="false" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
_ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="true" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
_indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>


Note that in the first instance, there were two MedicationMentions, but in the second, there is only one.

Yes, everyone could write their own custom compare code, but wouldn't it be more valuable to the community to make that task easier?

Thanks,

Bruce Tietjen



Bruce Tietjen
Senior Software Engineer
IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tietjen@imatsolutions.com

On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <ki...@perfectsearchcorp.com>
wrote:

> Hi Sean,
>
> No, you're not a jerk. These are things worth considering, and I 
> understand your concerns with touching various points of the codebase.
>
> I'll talk with our group over here and see where we want to go. We are 
> really interested in cTakes behaving well, so we are usually pretty 
> careful in testing our changes before committing anything.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 10:46 AM, Finan, Sean wrote:
> > Hi Kim,
> >
> >> It concerns me a bit that making the code return consistent results
> >> would be so concerning.
> > Could you please clarify what you mean by "consistent results"?  Do
> > you mean ordering and IDs or are you talking about actual type values
> > not matching?
> >
> >> This should be the default mode of operation.
> > Depending upon what you meant above, I may agree or disagree.
> >
> >> Since it doesn't appear that there are any consequences with moving
> >> forward with changing the code
> > Why do you say this?
> >
> > I think that there may be more required changes than you realize.
> > Every insertion into the CAS must be of ordered data.  This means
> > that, for instance, named entities discovered by dictionary will need
> > to be inserted in some predictable order, such as by alphabetized cui
> > per every alphabetized tui (and other code) per ordered text span.
> > You will need to check and recheck every point at which the CAS is
> > modified by every module.  Right now there are at least three or four
> > places in two cTakes dictionary modules where a change would be
> > required - and that doesn't include YTEX lookup.
> >
> > If you really feel strongly about this and are going to change cTakes
> > code, then I suggest (at the risk of sounding like a complete jerk)
> > that you also consider the following:
> > 1.  Don't check anything into trunk until all is well with your
> > changes and tests
> > Just in case you abandon the effort
> > 2.  Write unit tests for every change
> > True, Map to LinkedMap shouldn't break anything, but they are good to
> > have, and may prevent others in the future from switching back to a
> > non-linked map or any unordered collection (set not list, etc.).  It
> > also makes a better place for explanation in Javadoc than inlines
> > above the code.
> > 3.  Run memory requirement tests before all of your changes and then
> > again after your changes
> > I'm actually curious about how much memory might be eaten with
> > linkages everywhere
> > 4.  Run performance (speed) tests before and after
> > On a large corpus to ensure that garbage collection is involved
> > 5.  Do the above with every combination possible in current workflows:
> > every combination of available sentence detector, pos tagger, smoking
> > status detector, dictionary lookup, cas consumer, etc.
> > As soon as somebody says "all output is consistently ordered between
> > runs" it had better be so for every possible workflow
> > 6.  Write system tests to ensure ordered/predicted outputs with each
> > combination
> > Otherwise somebody may break it
> > 7.  Document the what, how, and why for future development
> > Otherwise somebody won't know to stick to the new rules
> > 8.  Assist anybody as needed that in the future breaks one of these
> > unit or system tests with a fix or new feature
> > By mandating such a rule you are assuming responsibility for it
> > 9.  Assist anybody as needed that in the future adds a new module or
> > workflow to cTakes to abide by the ordering requirement
> > By mandating such a rule you are assuming responsibility for it
> > 10.  Assist anybody as needed that in the future adds a new module or
> > workflow to add system tests to ensure maintenance of the ordering
> > requirement
> > By mandating such a rule you are assuming responsibility for it
> >
> >
> > -----Original Message-----
> > From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
> > Sent: Tuesday, October 07, 2014 11:57 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: cTakes output predictability
> >
> > I think we may really prefer the first method. Since it doesn't 
> > appear
> that there are any consequences with moving forward with changing the 
> code, we would really like to move forward with this approach.
> >
> > Kim Ebert
> > 1.801.669.7342
> > Perfect Search Corp
> > http://www.perfectsearchcorp.com/
> >
> > On 10/07/2014 09:35 AM, britt fitch wrote:
> >> The option Sean mentioned of writing your own custom consumer 
> >> (without the UIMA id that is causing your issues) should meet these 
> >> needs I believe.
> >>
> >>
> >>
> >> Britt Fitch
> >> Wired Informatics
> >> 265 Franklin St Ste 1702
> >> Boston, MA 02110
> >> http://wiredinformatics.com
> >> Britt.Fitch@wiredinformatics.com
> >>
> >> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
> >> <kim.ebert@perfectsearchcorp.com 
> >> <ma...@perfectsearchcorp.com>> wrote:
> >>
> >>> Hi Sean,
> >>>
> >>> Well of course that makes plenty of sense. Testing different 
> >>> cTakes configurations you would expect different output. In our 
> >>> testing we've found several cases where running with the same 
> >>> configuration outputs different data under different moons. Having 
> >>> consistent results helps us know if we've made improvements to our 
> >>> quality or not. Having output that is in a predictable order makes 
> >>> checking to see if there are differences much cheaper when you are 
> >>> dealing with
> larger data sets.
> >>>
> >>> Kim Ebert
> >>> 1.801.669.7342
> >>> Perfect Search Corp
> >>> http://www.perfectsearchcorp.com/
> >>>
> >>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
> >>>> Hi Kim,
> >>>>
> >>>> One might want to compare the Sentence detector that uses end of 
> >>>> line characters as sentence splitters with one that does not.  
> >>>> Such a change in sentence splitting would not only affect the 
> >>>> sentence type discoveries but also practically every type that follows.
> >>>>
> >>>> Another might want to compare a note with "skin cancer" vs. one 
> >>>> in which you replace "skin cancer" with "melanoma" just to see 
> >>>> what the CUI differences might be.  There are changes in two 
> >>>> words vs. one,
> >>>> 11 characters vs. 8, a removed adjective(?), and of course 
> >>>> changes in CUIs.
> >>>>
> >>>> Of course, if you are just running notes on a new moon and then 
> >>>> again on a full moon ...
> >>>>
> >>>> Sean
> >>>>
> >>>> -----Original Message-----
> >>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
> >>>> Sent: Tuesday, October 07, 2014 10:41 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Re: cTakes output predictability
> >>>>
> >>>> Sean,
> >>>>
> >>>> "...being different because of a possibly intentional difference."
> >>>>
> >>>> I would like you to elaborate a bit on the what would be 
> >>>> intentionally different between the processing of the same 
> >>>> document multiple times. It would help my understanding of cTakes.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Kim Ebert
> >>>> 1.801.669.7342
> >>>> Perfect Search Corp
> >>>> http://www.perfectsearchcorp.com/
> >>>>
> >>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
> >>>>> Steve Bethard wrote:
> >>>>>> I spent some time writing a script for diff-ing CASes
> >>>>> I urge anyone interested in comparing cTakes CASes / output to 
> >>>>> use this type of approach.  Comparison of program output is a 
> >>>>> post-process task, and unless absolutely necessary code to 
> >>>>> juggle data and metadata belongs there.  Attempting to force every 
> >>>>> module past, present and future to abide by fixed orderings, 
> >>>>> enumerations etc. is not as simple a task as one might initially 
> >>>>> think - especially if third-party libraries are involved.  I 
> >>>>> won't get into problems associated with why one is comparing 
> >>>>> output (swapped
> >>>>> module?) and IDs, orders etc. being different because of a 
> >>>>> possibly intentional difference.
> >>>>>
> >>>>> In addition to or instead of creating a post-processing script, 
> >>>>> one could write a new "cas-consumer" that writes output in a 
> >>>>> desired format - but this should not require changes to engines.
> >>>>>
> >>>>> "If it ain't broke, don't fix it"
> >>>>>
> >>>>> Sean
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
> >>>>> Sent: Monday, October 06, 2014 11:23 PM
> >>>>> To: dev@ctakes.apache.org
> >>>>> Subject: Re: cTakes output predictability
> >>>>>
> >>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
> >>>>> <br...@perfectsearchcorp.com> wrote:
> >>>>>> Since I started working with cTakes some time ago, I have found 
> >>>>>> it difficult to compare the output between subsequent runs on 
> >>>>>> the same files because annotations are often assigned different 
> >>>>>> IDs, are listed in different order, etc.
> >>>>> At one point, I spent some time writing a script for diff-ing 
> >>>>> CASes that intended to address some of these kinds of issues. 
> >>>>> It's still here in cTAKES:
> >>>>>
> >>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java
> >>>>>
> >>>>> You might see if you could use or adapt that to your needs.
> >>>>>
> >>>>> Steve
> >
>
>

Re: cTakes output predictability

Posted by Bruce Tietjen <br...@perfectsearchcorp.com>.
I did not intend to step on anyone's toes.

One of the reasons I proposed the changes was to try to make it extremely
obvious when there are significant differences in output from the cTakes
pipeline when running the same document again, and once identified, make it
easier to identify the source of the difference.

Because of the huge number of differences between the output using the
FileWriterCasConsumer.xml, first detecting that there are significant
differences and identifying them for a large set of documents is a daunting
task.

The following is an example of some significant differences that I have
detected between two subsequent runs on the same document using the current
release of cTakes. (There are actually quite a few documents that exhibit
this kind of behavior. This is only one example.)


Snippet from first run:

    <org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
_indexed="1" _id="9869" _ref_sofa="3" begin="3039" end="3047"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="9895" _ref_sofa="3" begin="2075" end="2081" id="95"
_ref_ontologyConceptArr="9891" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="true" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="9937" _ref_sofa="3" begin="2312" end="2322" id="110"
_ref_ontologyConceptArr="9934" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="false" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention
_indexed="1" _id="9979" _ref_sofa="3" begin="0" end="4" id="0"
_ref_ontologyConceptArr="9976" typeID="2" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
conditional="false" generic="false" subject="patient" historyOf="0"/>


Snippet from subsequent run:

    <org.apache.ctakes.typesystem.type.textsem.ProcedureMention
_indexed="1" _id="15773" _ref_sofa="3" begin="2929" end="2933" id="125"
_ref_ontologyConceptArr="15770" typeID="5" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="0"
conditional="false" generic="false" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.textsem.MedicationMention
_indexed="1" _id="15928" _ref_sofa="3" begin="2075" end="2081" id="95"
_ref_ontologyConceptArr="15924" typeID="1" segmentID="SIMPLE_SEGMENT"
discoveryTechnique="1" confidence="1.0" polarity="1" uncertainty="1"
conditional="false" generic="true" subject="patient" historyOf="0"/>
    <org.apache.ctakes.typesystem.type.syntax.ConllDependencyNode
_indexed="1" _id="15958" _ref_sofa="3" begin="0" end="5" id="0"/>


Note that in the first instance, there were two MedicationMentions, but in
the second, there is only one.

Yes, everyone could write their own custom compare code, but wouldn't it be
more valuable to the community to make that task easier?
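
One way such a shared comparison aid might look, sketched in Java under stated assumptions. OutputDiff, normalize, countNormalized and diff are hypothetical names, not part of cTAKES or UIMA. The sketch assumes each XML element has already been read into one string, and that the _id and _ref_* attributes are the only run-specific values. Dropping _ref_ontologyConceptArr also hides any difference in the referenced concepts, so this is a first-pass filter rather than a full comparison.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OutputDiff {
    /** Removes attributes assumed to be run-specific (_id, _ref_*) and collapses whitespace. */
    static String normalize(String element) {
        return element
                .replaceAll("\\s+_(id|ref_\\w+)=\"[^\"]*\"", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Counts how often each normalized element occurs, independent of input order. */
    static Map<String, Integer> countNormalized(List<String> elements) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String e : elements) {
            counts.merge(normalize(e), 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns each normalized element whose count differs between the runs,
     * mapped to (count in first run) minus (count in second run).
     */
    static Map<String, Integer> diff(List<String> first, List<String> second) {
        Map<String, Integer> delta = new TreeMap<>(countNormalized(first));
        for (Map.Entry<String, Integer> e : countNormalized(second).entrySet()) {
            delta.merge(e.getKey(), -e.getValue(), Integer::sum);
        }
        delta.values().removeIf(v -> v == 0); // keep only real differences
        return delta;
    }
}
```

Given the two MedicationMention elements from the first snippet and the single one from the second, diff would return one entry: the mention at begin="2312", with a count of 1. The differing _id values alone would not show up.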

Thanks,

Bruce Tietjen



Bruce Tietjen
Senior Software Engineer
IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
bruce.tietjen@imatsolutions.com

On Tue, Oct 7, 2014 at 11:01 AM, Kim Ebert <ki...@perfectsearchcorp.com>
wrote:

> Hi Sean,
>
> No, you're not a jerk. These are things worth considering, and I
> understand your concerns with touching various points of the codebase.
>
> I'll talk with our group over here and see where we want to go. We are
> really interested in cTakes behaving well, so we are usually pretty
> careful in testing our changes before committing anything.
>
> Thanks,
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 10:46 AM, Finan, Sean wrote:
> > Hi Kim,
> >
> >> It concerns me a bit that making the code return consistent results would
> >> be so concerning.
> > Could you please clarify what you mean by "consistent results"?  Do you
> mean ordering and IDs or are you talking about actual type values not
> matching?
> >
> >> This should be the default mode of operation.
> > Depending upon what you meant above, I may agree or disagree.
> >
> >> Since it doesn't appear that there are any consequences with moving
> forward with changing the code
> > Why do you say this?
> >
> > I think that there may be more required changes than you realize.  Every
> insertion into the CAS must be of ordered data.  This means that, for
> instance, named entities discovered by dictionary will need to be inserted
> in some predictable order, such as by alphabetized cui per every
> alphabetized tui (and other code) per ordered text span.  You will need to
> check and recheck every point at which the CAS is modified by every
> module.  Right now there are at least three or four places in two cTakes
> dictionary modules where a change would be required - and that doesn't
> include YTEX lookup.
> >
> > If you really feel strongly about this and are going to change cTakes
> code, then I suggest (at the risk of sounding like a complete jerk) that
> you also consider the following:
> > 1.  Don't check anything into trunk until all is well with your changes
> and tests
> > Just in case you abandon the effort
> > 2.  Write unit tests for every change
> > True, Map to LinkedMap shouldn't break anything, but they are good to
> have, and may prevent others in the future from switching back to a
> non-linked map or any unordered collection (set not list, etc.).  It also
> makes a better place for explanation in Javadoc than inlines above the code.
> > 3.  Run memory requirement tests before all of your changes and then
> again after your changes
> > I'm actually curious about how much memory might be eaten with linkages
> everywhere
> > 4.  Run performance (speed) tests before and after
> > On a large corpus to ensure that garbage collection is involved
> > 5.  Do the above with every combination possible in current workflows:
> every combination of available sentence detector, pos tagger, smoking
> status detector, dictionary lookup, cas consumer, etc.
> > As soon as somebody says "all output is consistently ordered between
> runs" it had better be so for every possible workflow
> > 6.  Write system tests to ensure ordered/predicted outputs with each
> combination
> > Otherwise somebody may break it
> > 7.  Document the what, how, and why for future development
> > Otherwise somebody won't know to stick to the new rules
> > 8.  Assist anybody as needed that in the future breaks one of these unit
> or system tests with a fix or new feature
> > By mandating such a rule you are assuming responsibility for it
> > 9.  Assist anybody as needed that in the future adds a new module or
> workflow to cTakes to abide by the ordering requirement
> > By mandating such a rule you are assuming responsibility for it
> > 10.  Assist anybody as needed that in the future adds a new module or
> workflow to add system tests to ensure maintenance of the ordering
> requirement
> > By mandating such a rule you are assuming responsibility for it
> >
> >
> > -----Original Message-----
> > From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
> > Sent: Tuesday, October 07, 2014 11:57 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: cTakes output predictability
> >
> > I think we may really prefer the first method. Since it doesn't appear
> that there are any consequences with moving forward with changing the code,
> we would really like to move forward with this approach.
> >
> > Kim Ebert
> > 1.801.669.7342
> > Perfect Search Corp
> > http://www.perfectsearchcorp.com/
> >
> > On 10/07/2014 09:35 AM, britt fitch wrote:
> >> The option Sean mentioned of writing your own custom consumer (without
> >> the UIMA id that is causing your issues) should meet these needs I
> >> believe.
> >>
> >>
> >>
> >> Britt Fitch
> >> Wired Informatics
> >> 265 Franklin St Ste 1702
> >> Boston, MA 02110
> >> http://wiredinformatics.com
> >> Britt.Fitch@wiredinformatics.com
> >>
> >> On Oct 7, 2014, at 11:29 AM, Kim Ebert
> >> <kim.ebert@perfectsearchcorp.com
> >> <ma...@perfectsearchcorp.com>> wrote:
> >>
> >>> Hi Sean,
> >>>
> >>> Well of course that makes plenty of sense. Testing different cTakes
> >>> configurations you would expect different output. In our testing
> >>> we've found several cases where running with the same configuration
> >>> outputs different data under different moons. Having consistent
> >>> results helps us know if we've made improvements to our quality or
> >>> not. Having output that is in a predictable order makes checking to
> >>> see if there are differences much cheaper when you are dealing with
> larger data sets.
> >>>
> >>> Kim Ebert
> >>> 1.801.669.7342
> >>> Perfect Search Corp
> >>> http://www.perfectsearchcorp.com/
> >>>
> >>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
> >>>> Hi Kim,
> >>>>
> >>>> One might want to compare the Sentence detector that uses end of line
> >>>> characters as sentence splitters with one that does not.  Such a
> >>>> change in sentence splitting would not only affect the sentence type
> >>>> discoveries but also practically every type that follows.
> >>>>
> >>>> Another might want to compare a note with "skin cancer" vs. one in
> >>>> which you replace "skin cancer" with "melanoma" just to see what the
> >>>> CUI differences might be.  There are changes in two words vs. one,
> >>>> 11 characters vs. 8, a removed adjective(?), and of course changes
> >>>> in CUIs.
> >>>>
> >>>> Of course, if you are just running notes on a new moon and then
> >>>> again on a full moon ...
> >>>>
> >>>> Sean
> >>>>
> >>>> -----Original Message-----
> >>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
> >>>> Sent: Tuesday, October 07, 2014 10:41 AM
> >>>> To: dev@ctakes.apache.org
> >>>> Subject: Re: cTakes output predictability
> >>>>
> >>>> Sean,
> >>>>
> >>>> "...being different because of a possibly intentional difference."
> >>>>
> >>>> I would like you to elaborate a bit on what would be
> >>>> intentionally different between the processing of the same document
> >>>> multiple times. It would help my understanding of cTakes.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Kim Ebert
> >>>> 1.801.669.7342
> >>>> Perfect Search Corp
> >>>> http://www.perfectsearchcorp.com/
> >>>>
> >>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
> >>>>> Steve Bethard wrote:
> >>>>>> I spent some time writing a script for diff-ing CASes
> >>>>> I urge anyone interested in comparing cTakes CASes / output to use
> >>>>> this type of approach.  Comparison of program output is a
> >>>>> post-process task, and unless absolutely necessary code to juggle
> >>>>> data and metadata belongs there.  Attempts to force every module
> >>>>> past, present and Future to abide by fixed orderings, enumerations
> >>>>> etc. is not as simple a task as one might initially think -
> >>>>> especially if third-party libraries are involved.  I won't get into
> >>>>> problems associated with why one is comparing output (swapped
> >>>>> module?) and IDs, orders etc. being different because of a possibly
> >>>>> intentional difference.
> >>>>>
> >>>>> In addition to or instead of creating a post-processing script, one
> >>>>> could write a new "cas-consumer" that writes output in a desired
> >>>>> format - but this should not require changes to engines.
> >>>>>
> >>>>> "If it ain't broke, don't fix it"
> >>>>>
> >>>>> Sean
> >>>>>
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
> >>>>> Sent: Monday, October 06, 2014 11:23 PM
> >>>>> To: dev@ctakes.apache.org
> >>>>> Subject: Re: cTakes output predictability
> >>>>>
> >>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
> >>>>> <br...@perfectsearchcorp.com> wrote:
> >>>>>> Since I started working with cTakes some time ago, I have found it
> >>>>>> difficult to compare the output between subsequent runs on the
> >>>>>> same files because annotations are often assigned different IDs,
> >>>>>> are listed in different order, etc.
> >>>>> At one point, I spent some time writing a script for diff-ing CASes
> >>>>> that intended to address some of these kinds of issues. It's still
> >>>>> here in cTAKES:
> >>>>>
> >>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analy
> >>>>> sis
> >>>>> /CompareFeatureStructures.java
> >>>>>
> >>>>> You might see if you could use or adapt that to your needs.
> >>>>>
> >>>>> Steve
> >
>
>

Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi Sean,

No, you're not a jerk. These are things worth considering, and I
understand your concerns with touching various points of the codebase.

I'll talk with our group over here and see where we want to go. We are
really interested in cTakes behaving well, so we are usually pretty
careful in testing our changes before committing anything.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 10:46 AM, Finan, Sean wrote:
> Hi Kim,
>
>> It concerns me a bit by making the code return consistent results would be so concerning. 
> Could you please clarify what you mean by "consistent results"?  Do you mean ordering and IDs or are you talking about actual type values not matching?
>
>> This should be the default mode of operation.
> Depending upon what you meant above, I may agree or disagree.
>
>> Since it doesn't appear that there are any consequences with moving forward with changing the code
> Why do you say this?  
>
> I think that there may be more required changes than you realize.  Every insertion into the CAS must be of ordered data.  This means that, for instance, named entities discovered by dictionary will need to be inserted in some predictable order, such as by alphabetized cui per every alphabetized tui (and other code) per ordered text span.  You will need to check and recheck every point at which the CAS is modified by every module.  Right now there are at least three or four places in two cTakes dictionary modules where a change would be required - and that doesn't include YTEX lookup.
>
> If you really feel strongly about this and are going to change cTakes code, then I suggest (at the risk of sounding like a complete jerk) that you also consider the following:
> 1.  Don't check anything into trunk until all is well with your changes and tests
> Just in case you abandon the effort
> 2.  Write unit tests for every change   
> True, Map to LinkedMap shouldn't break anything, but they are good to have, and may prevent others in the future from switching back to a non-linked map or any unordered collection (set not list, etc.).  It also makes a better place for explanation in Javadoc than inlines above the code.
> 3.  Run memory requirement tests before all of your changes and then again after your changes
> I'm actually curious about how much memory might be eaten with linkages everywhere
> 4.  Run performance (speed) tests before and after
> On a large corpus to ensure that garbage collection is involved
> 5.  Do the above with every combination possible in current workflows: every combination of available sentence detector, pos tagger, smoking status detector, dictionary lookup, cas consumer, etc.
> As soon as somebody says "all output is consistently ordered between runs" it had better be so for every possible workflow
> 6.  Write system tests to ensure ordered/predicted outputs with each combination
> Otherwise somebody may break it
> 7.  Document the what, how, and why for future development
> Otherwise somebody won't know to stick to the new rules
> 8.  Assist anybody as needed that in the future breaks one of these unit or system tests with a fix or new feature
> By mandating such a rule you are assuming responsibility for it
> 9.  Assist anybody as needed that in the future adds a new module or workflow to cTakes to abide by the ordering requirement
> By mandating such a rule you are assuming responsibility for it
> 10.  Assist anybody as needed that in the future adds a new module or workflow to add system tests to ensure maintenance of the ordering requirement
> By mandating such a rule you are assuming responsibility for it
>
>
> -----Original Message-----
> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
> Sent: Tuesday, October 07, 2014 11:57 AM
> To: dev@ctakes.apache.org
> Subject: Re: cTakes output predictability
>
> I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach.
>
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
>
> On 10/07/2014 09:35 AM, britt fitch wrote:
>> The option Sean mentioned of writing your own custom consumer (without 
>> the UIMA id that is causing your issues) should meet these needs I 
>> believe.
>>
>>   	  	  	 
>>
>> Britt Fitch
>> Wired Informatics
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> http://wiredinformatics.com
>> Britt.Fitch@wiredinformatics.com
>>
>> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
>> <kim.ebert@perfectsearchcorp.com 
>> <ma...@perfectsearchcorp.com>> wrote:
>>
>>> Hi Sean,
>>>
>>> Well of course that makes plenty of sense. Testing different cTakes 
>>> configurations you would expect different output. In our testing 
>>> we've found several cases where running with the same configuration 
>>> outputs different data under different moons. Having consistent 
>>> results helps us know if we've made improvements to our quality or 
>>> not. Having output that is in a predictable order makes checking to 
>>> see if there are differences much cheaper when you are dealing with larger data sets.
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>>> Hi Kim,
>>>>
>>>> One might want to compare the Sentence detector that uses end of line
>>>> characters as sentence splitters with one that does not.  Such a
>>>> change in sentence splitting would not only affect the sentence type
>>>> discoveries but also practically every type that follows.
>>>>
>>>> Another might want to compare a note with "skin cancer" vs. one in 
>>>> which you replace "skin cancer" with "melanoma" just to see what the 
>>>> CUI differences might be.  There are changes in two words vs. one,
>>>> 11 characters vs. 8, a removed adjective(?), and of course changes 
>>>> in CUIs.
>>>>
>>>> Of course, if you are just running notes on a new moon and then 
>>>> again on a full moon ...
>>>>
>>>> Sean
>>>>
>>>> -----Original Message-----
>>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> Sean,
>>>>
>>>> "...being different because of a possibly intentional difference."
>>>>
>>>> I would like you to elaborate a bit on what would be
>>>> intentionally different between the processing of the same document 
>>>> multiple times. It would help my understanding of cTakes.
>>>>
>>>> Thanks,
>>>>
>>>> Kim Ebert
>>>> 1.801.669.7342
>>>> Perfect Search Corp
>>>> http://www.perfectsearchcorp.com/
>>>>
>>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>>> Steve Bethard wrote:
>>>>>> I spent some time writing a script for diff-ing CASes
>>>>> I urge anyone interested in comparing cTakes CASes / output to use 
>>>>> this type of approach.  Comparison of program output is a 
>>>>> post-process task, and unless absolutely necessary code to juggle 
>>>>> data and metadata belongs there.  Attempts to force every module 
>>>>> past, present and Future to abide by fixed orderings, enumerations 
>>>>> etc. is not as simple a task as one might initially think - 
>>>>> especially if third-party libraries are involved.  I won't get into 
>>>>> problems associated with why one is comparing output (swapped
>>>>> module?) and IDs, orders etc. being different because of a possibly 
>>>>> intentional difference.
>>>>>
>>>>> In addition to or instead of creating a post-processing script, one 
>>>>> could write a new "cas-consumer" that writes output in a desired 
>>>>> format - but this should not require changes to engines.
>>>>>
>>>>> "If it ain't broke, don't fix it"
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: cTakes output predictability
>>>>>
>>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>>> <br...@perfectsearchcorp.com> wrote:
>>>>>> Since I started working with cTakes some time ago, I have found it 
>>>>>> difficult to compare the output between subsequent runs on the 
>>>>>> same files because annotations are often assigned different IDs, 
>>>>>> are listed in different order, etc.
>>>>> At one point, I spent some time writing a script for diff-ing CASes 
>>>>> that intended to address some of these kinds of issues. It's still 
>>>>> here in cTAKES:
>>>>>
>>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analy
>>>>> sis
>>>>> /CompareFeatureStructures.java
>>>>>
>>>>> You might see if you could use or adapt that to your needs.
>>>>>
>>>>> Steve
>


RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Kim,

> It concerns me a bit by making the code return consistent results would be so concerning. 
Could you please clarify what you mean by "consistent results"?  Do you mean ordering and IDs or are you talking about actual type values not matching?

>This should be the default mode of operation.
Depending upon what you meant above, I may agree or disagree.

> Since it doesn't appear that there are any consequences with moving forward with changing the code
Why do you say this?  

I think that there may be more required changes than you realize.  Every insertion into the CAS must be of ordered data.  This means that, for instance, named entities discovered by dictionary lookup will need to be inserted in some predictable order, such as by alphabetized CUI within each alphabetized TUI (and other codes) within each ordered text span.  You will need to check and recheck every point at which the CAS is modified by every module.  Right now there are at least three or four places in two cTakes dictionary modules where a change would be required, and that doesn't include YTEX lookup.
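
As a rough illustration of the ordering described above, here is a minimal Java sketch that sorts dictionary hits by text span, then TUI, then CUI before they would be added to the CAS. The `Hit` class, the `sortForInsertion` method, and the sample CUI values are hypothetical and exist only for this example; they are not cTakes or UIMA API, and real code would have to sort its own annotation objects in a comparable way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedInsertionSketch {

    // Hypothetical stand-in for a dictionary hit; not a cTakes type.
    static final class Hit {
        final int begin;
        final int end;
        final String tui;
        final String cui;

        Hit(int begin, int end, String tui, String cui) {
            this.begin = begin;
            this.end = end;
            this.tui = tui;
            this.cui = cui;
        }

        @Override
        public String toString() {
            return begin + "-" + end + " " + tui + " " + cui;
        }
    }

    // Order by text span first, then alphabetically by TUI, then by CUI.
    static final Comparator<Hit> INSERTION_ORDER =
            Comparator.comparingInt((Hit h) -> h.begin)
                    .thenComparingInt(h -> h.end)
                    .thenComparing(h -> h.tui)
                    .thenComparing(h -> h.cui);

    // Returns a sorted copy; a caller would add to the CAS in this order.
    static List<Hit> sortForInsertion(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(INSERTION_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample values, deliberately out of order.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(10, 21, "T191", "C0000003"));
        hits.add(new Hit(0, 4, "T047", "C0000002"));
        hits.add(new Hit(10, 21, "T191", "C0000001"));
        for (Hit hit : sortForInsertion(hits)) {
            System.out.println(hit);
        }
    }
}
```

The point of the sketch is only that every place that writes to the CAS would need some total ordering like this one; which fields take part in it is a design decision for each module.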

If you really feel strongly about this and are going to change cTakes code, then I suggest (at the risk of sounding like a complete jerk) that you also consider the following:
1.  Don't check anything into trunk until all is well with your changes and tests
    Just in case you abandon the effort
2.  Write unit tests for every change
    True, switching HashMap to LinkedHashMap shouldn't break anything, but unit tests are good to have, and may prevent others in the future from switching back to a non-linked map or any unordered collection (a set rather than a list, etc.).  A test's Javadoc is also a better place for the explanation than inline comments above the code.
3.  Run memory requirement tests before all of your changes and then again after your changes
    I'm actually curious about how much memory might be eaten with linkages everywhere
4.  Run performance (speed) tests before and after
    On a large corpus to ensure that garbage collection is involved
5.  Do the above with every combination possible in current workflows: every combination of available sentence detector, pos tagger, smoking status detector, dictionary lookup, cas consumer, etc.
    As soon as somebody says "all output is consistently ordered between runs" it had better be so for every possible workflow
6.  Write system tests to ensure ordered, predictable output with each combination
    Otherwise somebody may break it
7.  Document the what, how, and why for future development
    Otherwise somebody won't know to stick to the new rules
8.  Assist, as needed, anybody who in the future breaks one of these unit or system tests with a fix or new feature
    By mandating such a rule you are assuming responsibility for it
9.  Assist, as needed, anybody who in the future adds a new module or workflow to cTakes in abiding by the ordering requirement
    By mandating such a rule you are assuming responsibility for it
10.  Assist, as needed, anybody who in the future adds a new module or workflow in adding system tests that ensure the ordering requirement is maintained
    By mandating such a rule you are assuming responsibility for it
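
To make item 2 concrete, here is a minimal Java sketch of the kind of check a unit test could make. `buildTermCounts` is a hypothetical stand-in for any cTakes code that builds a map and later iterates over it; it is not an existing method. The check passes because `LinkedHashMap` iterates in first-insertion order, and it would most likely fail if somebody switched the map back to a plain `HashMap`, although `HashMap` order is unspecified rather than guaranteed to differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IterationOrderCheck {

    // Hypothetical code under test: counts terms, keeping first-seen order.
    static Map<String, Integer> buildTermCounts(List<String> terms) {
        // LinkedHashMap iterates in first-insertion order;
        // HashMap makes no promise about iteration order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    // True when the map's keys come back in the order they were first seen.
    static boolean keepsFirstSeenOrder(List<String> terms) {
        List<String> expected = new ArrayList<>();
        for (String term : terms) {
            if (!expected.contains(term)) {
                expected.add(term);
            }
        }
        List<String> actual = new ArrayList<>(buildTermCounts(terms).keySet());
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("zeta", "alpha", "mid", "alpha");
        System.out.println(keepsFirstSeenOrder(terms));
    }
}
```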


-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 11:57 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

I think we may really prefer the first method. Since it doesn't appear that there are any consequences with moving forward with changing the code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without 
> the UIMA id that is causing your issues) should meet these needs I 
> believe.
>
>   	  	  	 
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert 
> <kim.ebert@perfectsearchcorp.com 
> <ma...@perfectsearchcorp.com>> wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes 
>> configurations you would expect different output. In our testing 
>> we've found several cases where running with the same configuration 
>> outputs different data under different moons. Having consistent 
>> results helps us know if we've made improvements to our quality or 
>> not. Having output that is in a predictable order makes checking to 
>> see if there are differences much cheaper when you are dealing with larger data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>> One might want to compare the Sentence detector that uses end of line
>>> characters as sentence splitters with one that does not.  Such a
>>> change in sentence splitting would not only affect the sentence type
>>> discoveries but also practically every type that follows.
>>>
>>> Another might want to compare a note with "skin cancer" vs. one in 
>>> which you replace "skin cancer" with "melanoma" just to see what the 
>>> CUI differences might be.  There are changes in two words vs. one,
>>> 11 characters vs. 8, a removed adjective(?), and of course changes 
>>> in CUIs.
>>>
>>> Of course, if you are just running notes on a new moon and then 
>>> again on a full moon ...
>>>
>>> Sean
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> Sean,
>>>
>>> "...being different because of a possibly intentional difference."
>>>
>>> I would like you to elaborate a bit on what would be
>>> intentionally different between the processing of the same document 
>>> multiple times. It would help my understanding of cTakes.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>> Steve Bethard wrote:
>>>>> I spent some time writing a script for diff-ing CASes
>>>> I urge anyone interested in comparing cTakes CASes / output to use 
>>>> this type of approach.  Comparison of program output is a 
>>>> post-process task, and unless absolutely necessary code to juggle 
>>>> data and metadata belongs there.  Attempts to force every module 
>>>> past, present and Future to abide by fixed orderings, enumerations 
>>>> etc. is not as simple a task as one might initially think - 
>>>> especially if third-party libraries are involved.  I won't get into 
>>>> problems associated with why one is comparing output (swapped
>>>> module?) and IDs, orders etc. being different because of a possibly 
>>>> intentional difference.
>>>>
>>>> In addition to or instead of creating a post-processing script, one 
>>>> could write a new "cas-consumer" that writes output in a desired 
>>>> format - but this should not require changes to engines.
>>>>
>>>> "If it ain't broke, don't fix it"
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>>> <br...@perfectsearchcorp.com> wrote:
>>>>> Since I started working with cTakes some time ago, I have found it 
>>>>> difficult to compare the output between subsequent runs on the 
>>>>> same files because annotations are often assigned different IDs, 
>>>>> are listed in different order, etc.
>>>> At one point, I spent some time writing a script for diff-ing CASes 
>>>> that intended to address some of these kinds of issues. It's still 
>>>> here in cTAKES:
>>>>
>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analy
>>>> sis
>>>> /CompareFeatureStructures.java
>>>>
>>>> You might see if you could use or adapt that to your needs.
>>>>
>>>> Steve
>>
>


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
I think we may really prefer the first method. Since it doesn't appear
that there are any consequences with moving forward with changing the
code, we would really like to move forward with this approach.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 09:35 AM, britt fitch wrote:
> The option Sean mentioned of writing your own custom consumer (without
> the UIMA id that is causing your issues) should meet these needs I
> believe. 
>
>   	  	  	 
>
> Britt Fitch
> Wired Informatics
> 265 Franklin St Ste 1702
> Boston, MA 02110
> http://wiredinformatics.com
> Britt.Fitch@wiredinformatics.com
>
> On Oct 7, 2014, at 11:29 AM, Kim Ebert
> <kim.ebert@perfectsearchcorp.com
> <ma...@perfectsearchcorp.com>> wrote:
>
>> Hi Sean,
>>
>> Well of course that makes plenty of sense. Testing different cTakes
>> configurations you would expect different output. In our testing we've
>> found several cases where running with the same configuration outputs
>> different data under different moons. Having consistent results helps us
>> know if we've made improvements to our quality or not. Having output
>> that is in a predictable order makes checking to see if there are
>> differences much cheaper when you are dealing with larger data sets.
>>
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>>
>> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>>> Hi Kim,
>>>
>>> One might want to compare the Sentence detector that uses end of line
>>> characters as sentence splitters with one that does not.  Such a
>>> change in sentence splitting would not only affect the sentence type
>>> discoveries but also practically every type that follows.
>>>
>>> Another might want to compare a note with "skin cancer" vs. one in
>>> which you replace "skin cancer" with "melanoma" just to see what the
>>> CUI differences might be.  There are changes in two words vs. one,
>>> 11 characters vs. 8, a removed adjective(?), and of course changes
>>> in CUIs.
>>>
>>> Of course, if you are just running notes on a new moon and then
>>> again on a full moon ...
>>>
>>> Sean
>>>
>>> -----Original Message-----
>>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com]
>>> Sent: Tuesday, October 07, 2014 10:41 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>>
>>> Sean,
>>>
>>> "...being different because of a possibly intentional difference."
>>>
>>> I would like you to elaborate a bit on what would be
>>> intentionally different between the processing of the same document
>>> multiple times. It would help my understanding of cTakes.
>>>
>>> Thanks,
>>>
>>> Kim Ebert
>>> 1.801.669.7342
>>> Perfect Search Corp
>>> http://www.perfectsearchcorp.com/
>>>
>>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>>> Steve Bethard wrote:
>>>>> I spent some time writing a script for diff-ing CASes
>>>> I urge anyone interested in comparing cTakes CASes / output to use
>>>> this type of approach.  Comparison of program output is a
>>>> post-process task, and unless absolutely necessary code to juggle
>>>> data and metadata belongs there.  Attempts to force every module
>>>> past, present and Future to abide by fixed orderings, enumerations
>>>> etc. is not as simple a task as one might initially think -
>>>> especially if third-party libraries are involved.  I won't get into
>>>> problems associated with why one is comparing output (swapped
>>>> module?) and IDs, orders etc. being different because of a possibly
>>>> intentional difference.
>>>>
>>>> In addition to or instead of creating a post-processing script, one
>>>> could write a new "cas-consumer" that writes output in a desired
>>>> format - but this should not require changes to engines.
>>>>
>>>> "If it ain't broke, don't fix it"
>>>>
>>>> Sean
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>>> Sent: Monday, October 06, 2014 11:23 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: cTakes output predictability
>>>>
>>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
>>>> <br...@perfectsearchcorp.com> wrote:
>>>>> Since I started working with cTakes some time ago, I have found it
>>>>> difficult to compare the output between subsequent runs on the same
>>>>> files because annotations are often assigned different IDs, are
>>>>> listed in different order, etc.
>>>> At one point, I spent some time writing a script for diff-ing CASes
>>>> that intended to address some of these kinds of issues. It's still
>>>> here in cTAKES:
>>>>
>>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>>> /CompareFeatureStructures.java
>>>>
>>>> You might see if you could use or adapt that to your needs.
>>>>
>>>> Steve
>>
>


Re: cTakes output predictability

Posted by britt fitch <br...@wiredinformatics.com>.
I believe the option Sean mentioned, writing your own custom consumer (one that leaves out the UIMA id that is causing your issues), should meet these needs.
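
A minimal Java sketch of that idea, under stated assumptions: the `Ann` class below is a hypothetical stand-in for whatever a custom consumer would read from the CAS, and `canonicalLines` and `sameOutput` are illustrative names, not UIMA or cTakes API. Each annotation is written without its run-dependent id, and the lines are sorted, so two runs can be compared regardless of id assignment or listing order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CanonicalOutputSketch {

    // Hypothetical annotation; 'id' stands for the run-dependent UIMA id.
    static final class Ann {
        final int id;
        final int begin;
        final int end;
        final String type;

        Ann(int id, int begin, int end, String type) {
            this.id = id;
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    // Render each annotation without its id. Offsets are zero-padded so
    // that plain string sorting matches numeric span order.
    static List<String> canonicalLines(List<Ann> anns) {
        List<String> lines = new ArrayList<>();
        for (Ann a : anns) {
            lines.add(String.format("%08d-%08d %s", a.begin, a.end, a.type));
        }
        Collections.sort(lines);
        return lines;
    }

    // Two runs match when their id-free, sorted renderings are identical.
    static boolean sameOutput(List<Ann> runA, List<Ann> runB) {
        return canonicalLines(runA).equals(canonicalLines(runB));
    }

    public static void main(String[] args) {
        // Same annotations, different ids and different listing order.
        List<Ann> runA = Arrays.asList(
                new Ann(17, 0, 4, "WordToken"),
                new Ann(42, 5, 16, "DiseaseDisorderMention"));
        List<Ann> runB = Arrays.asList(
                new Ann(903, 5, 16, "DiseaseDisorderMention"),
                new Ann(8, 0, 4, "WordToken"));
        System.out.println(sameOutput(runA, runB));
    }
}
```

Sorted text files written this way could also be compared with an ordinary diff tool, which keeps the comparison logic out of the engines.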

 	 	 	 
Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

On Oct 7, 2014, at 11:29 AM, Kim Ebert <ki...@perfectsearchcorp.com> wrote:

> Hi Sean,
> 
> Well of course that makes plenty of sense. Testing different cTakes
> configurations you would expect different output. In our testing we've
> found several cases where running with the same configuration outputs
> different data under different moons. Having consistent results helps us
> know if we've made improvements to our quality or not. Having output
> that is in a predictable order makes checking to see if there are
> differences much cheaper when you are dealing with larger data sets.
> 
> Kim Ebert
> 1.801.669.7342
> Perfect Search Corp
> http://www.perfectsearchcorp.com/
> 
> On 10/07/2014 08:50 AM, Finan, Sean wrote:
>> Hi Kim,
>> 
>> One might want to compare the Sentence detector that uses end of line characters as sentence splitters with one that does not.  Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows.
>> 
>> Another might want to compare a note with "skin cancer" vs. one in which you replace "skin cancer" with "melanoma" just to see what the CUI differences might be.  There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs.
>> 
>> Of course, if you are just running notes on a new moon and then again on a full moon ...
>> 
>> Sean
>> 
>> -----Original Message-----
>> From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
>> Sent: Tuesday, October 07, 2014 10:41 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: cTakes output predictability
>> 
>> Sean,
>> 
>> "...being different because of a possibly intentional difference."
>> 
>> I would like you to elaborate a bit on what would be intentionally different between the processing of the same document multiple times. It would help my understanding of cTakes.
>> 
>> Thanks,
>> 
>> Kim Ebert
>> 1.801.669.7342
>> Perfect Search Corp
>> http://www.perfectsearchcorp.com/
>> 
>> On 10/07/2014 07:30 AM, Finan, Sean wrote:
>>> Steve Bethard wrote:
>>>> I spent some time writing a script for diff-ing CASes
>>> I urge anyone interested in comparing cTakes CASes / output to use this type of approach.  Comparison of program output is a post-process task, and unless absolutely necessary code to juggle data and metadata belongs there.  Attempts to force every module past, present and Future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved.  I won't get into problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.
>>> 
>>> In addition to or instead of creating a post-processing script, one could write a new "cas-consumer" that writes output in a desired format - but this should not require changes to engines.
>>> 
>>> "If it ain't broke, don't fix it"
>>> 
>>> Sean
>>> 
>>> 
>>> -----Original Message-----
>>> From: Steven Bethard [mailto:steven.bethard@gmail.com]
>>> Sent: Monday, October 06, 2014 11:23 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: cTakes output predictability
>>> 
>>> On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen 
>>> <br...@perfectsearchcorp.com> wrote:
>>>> Since I started working with cTakes some time ago, I have found it 
>>>> difficult to compare the output between subsequent runs on the same 
>>>> files because annotations are often assigned different IDs, are 
>>>> listed in different order, etc.
>>> At one point, I spent some time writing a script for diff-ing CASes 
>>> that intended to address some of these kinds of issues. It's still 
>>> here in cTAKES:
>>> 
>>> ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis
>>> /CompareFeatureStructures.java
>>> 
>>> You might see if you could use or adapt that to your needs.
>>> 
>>> Steve
> 


Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Hi Sean,

Well of course that makes plenty of sense. Testing different cTakes
configurations you would expect different output. In our testing we've
found several cases where running with the same configuration outputs
different data under different moons. Having consistent results helps us
know if we've made improvements to our quality or not. Having output
that is in a predictable order makes checking to see if there are
differences much cheaper when you are dealing with larger data sets.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 08:50 AM, Finan, Sean wrote:
> Hi Kim,
>
> One might want to compare the Sentence detector that uses end-of-line characters as sentence splitters with one that does not.  Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows.
>
> Another might want to compare a note with "skin cancer" vs. one in which you replace "skin cancer" with "melanoma" just to see what the CUI differences might be.  There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs.
>
> Of course, if you are just running notes on a new moon and then again on a full moon ...
>
> Sean


RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Kim,

One might want to compare the Sentence detector that uses end-of-line characters as sentence splitters with one that does not.  Such a change in sentence splitting would not only affect the sentence type discoveries but also practically every type that follows.

Another might want to compare a note with "skin cancer" vs. one in which you replace "skin cancer" with "melanoma" just to see what the CUI differences might be.  There are changes in two words vs. one, 11 characters vs. 8, a removed adjective(?), and of course changes in CUIs.

Of course, if you are just running notes on a new moon and then again on a full moon ...

Sean

-----Original Message-----
From: Kim Ebert [mailto:kim.ebert@perfectsearchcorp.com] 
Sent: Tuesday, October 07, 2014 10:41 AM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

Sean,

"...being different because of a possibly intentional difference."

I would like you to elaborate a bit on what would be intentionally different between the processing of the same document multiple times. It would help my understanding of cTakes.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/



Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Sean,

"...being different because of a possibly intentional difference."

I would like you to elaborate a bit on what would be intentionally
different between the processing of the same document multiple times. It
would help my understanding of cTakes.

Thanks,

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/07/2014 07:30 AM, Finan, Sean wrote:
> Steve Bethard wrote:
>> I spent some time writing a script for diff-ing CASes
> I urge anyone interested in comparing cTakes CASes / output to use this type of approach.  Comparison of program output is a post-process task, and unless absolutely necessary, code to juggle data and metadata belongs there.  Attempting to force every module past, present and future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved.  I won't get into the problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.
>
> In addition to or instead of creating a post-processing script, one could write a new "cas-consumer" that writes output in a desired format - but this should not require changes to engines.
>
> "If it ain't broke, don't fix it"
>
> Sean


RE: cTakes output predictability

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Steve Bethard wrote:
> I spent some time writing a script for diff-ing CASes

I urge anyone interested in comparing cTakes CASes / output to use this type of approach.  Comparison of program output is a post-process task, and unless absolutely necessary, code to juggle data and metadata belongs there.  Attempting to force every module past, present and future to abide by fixed orderings, enumerations etc. is not as simple a task as one might initially think - especially if third-party libraries are involved.  I won't get into the problems associated with why one is comparing output (swapped module?) and IDs, orders etc. being different because of a possibly intentional difference.

In addition to or instead of creating a post-processing script, one could write a new "cas-consumer" that writes output in a desired format - but this should not require changes to engines.
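
To illustrate the idea of writing output in a comparison-friendly format, here is a minimal sketch in plain Java. It does not use the UIMA or cTAKES APIs: Span, CANONICAL_ORDER and canonicalLines are hypothetical names made up for this example, and a real cas-consumer would read its annotations from the CAS instead. The assumption is that annotations are sorted by offsets and type and written without run-specific IDs, so that two runs which found the same annotations produce identical lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CanonicalOutputSketch {

    // Hypothetical stand-in for an annotation: a type name plus character offsets.
    // It deliberately carries no run-specific ID.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Order by begin offset, then end offset, then type name.
    static final Comparator<Span> CANONICAL_ORDER =
            Comparator.comparingInt((Span s) -> s.begin)
                    .thenComparingInt((Span s) -> s.end)
                    .thenComparing((Span s) -> s.type);

    // Returns one tab-separated line per span, in canonical order.
    // The input list is copied, so the caller's list is left untouched.
    static List<String> canonicalLines(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(CANONICAL_ORDER);
        List<String> lines = new ArrayList<>();
        for (Span s : sorted) {
            lines.add(s.begin + "\t" + s.end + "\t" + s.type);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The same two made-up annotations, discovered in a different order.
        List<Span> runA = Arrays.asList(new Span("Sentence", 0, 20), new Span("Token", 0, 4));
        List<Span> runB = Arrays.asList(new Span("Token", 0, 4), new Span("Sentence", 0, 20));
        System.out.println(canonicalLines(runA).equals(canonicalLines(runB)));
    }
}
```

Because the ordering is applied only when writing, nothing in the engines themselves has to change.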

"If it ain't broke, don't fix it"

Sean


-----Original Message-----
From: Steven Bethard [mailto:steven.bethard@gmail.com] 
Sent: Monday, October 06, 2014 11:23 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes output predictability

On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
<br...@perfectsearchcorp.com> wrote:
> Since I started working with cTakes some time ago, I have found it
> difficult to compare the output between subsequent runs on the same files
> because annotations are often assigned different IDs, are listed in
> different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that was intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.

Steve

Re: cTakes output predictability

Posted by Steven Bethard <st...@gmail.com>.
On Mon, Oct 6, 2014 at 3:59 PM, Bruce Tietjen
<br...@perfectsearchcorp.com> wrote:
> Since I started working with cTakes some time ago, I have found it
> difficult to compare the output between subsequent runs on the same files
> because annotations are often assigned different IDs, are listed in
> different order, etc.

At one point, I spent some time writing a script for diff-ing CASes
that was intended to address some of these kinds of issues. It's still
here in cTAKES:

ctakes-temporal/src/main/java/org/apache/ctakes/temporal/data/analysis/CompareFeatureStructures.java

You might see if you could use or adapt that to your needs.
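
For illustration only, a post-processing comparison of two runs can be sketched in plain Java as below. This is not the code in CompareFeatureStructures.java. It assumes each annotation from a run has already been reduced to an ID-free string (for example offsets plus type), and RunDiffSketch and onlyIn are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class RunDiffSketch {

    // Returns the entries present in 'first' but missing from 'second', sorted.
    // Treating the runs as sets ignores listing order; note it also collapses
    // exact duplicate entries within a run.
    static SortedSet<String> onlyIn(Collection<String> first, Collection<String> second) {
        SortedSet<String> result = new TreeSet<>(first);
        result.removeAll(second);
        return result;
    }

    public static void main(String[] args) {
        // Made-up ID-free annotation strings from two runs over the same note.
        List<String> run1 = Arrays.asList("0\t4\tToken", "0\t20\tSentence");
        List<String> run2 = Arrays.asList("0\t20\tSentence", "5\t9\tToken");
        System.out.println("only in run 1: " + onlyIn(run1, run2));
        System.out.println("only in run 2: " + onlyIn(run2, run1));
    }
}
```

If both result sets are empty, the two runs found the same annotations regardless of IDs or listing order.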

Steve

Re: cTakes output predictability

Posted by Britt Fitch <br...@wiredinformatics.com>.
Before making changes to the data structure I think it would be good to
understand the use case.

Bruce, can you give a high-level description of the issue you are
trying to solve?

Cheers,

Britt


On Mon, Oct 6, 2014 at 7:38 PM, jay vyas <ja...@gmail.com>
wrote:

> I'm not a ctakes expert by any means, but in general, I like that idea...
> predictable and deterministic ordering of mapped elements almost always
> leads to less buggy applications.
> As Groovy has shown (LinkedHashMap is the default data structure and it's
> much easier imo to get reproducible Groovy unit tests etc. b/c of that).

Re: cTakes output predictability

Posted by Kim Ebert <ki...@perfectsearchcorp.com>.
Jay,

I agree. This does lead to reproducible unit tests, which helps us out
in the long term.

Kim Ebert
1.801.669.7342
Perfect Search Corp
http://www.perfectsearchcorp.com/

On 10/06/2014 05:38 PM, jay vyas wrote:
> I'm not a ctakes expert by any means, but in general, I like that idea...
> predictable and deterministic ordering of mapped elements almost always
> leads to less buggy applications.
> As Groovy has shown (LinkedHashMap is the default data structure and it's
> much easier imo to get reproducible Groovy unit tests etc. b/c of that).


Re: cTakes output predictability

Posted by jay vyas <ja...@gmail.com>.
I'm not a ctakes expert by any means, but in general, I like that idea...
predictable and deterministic ordering of mapped elements almost always
leads to less buggy applications.
As Groovy has shown (LinkedHashMap is the default data structure and it's
much easier imo to get reproducible Groovy unit tests etc. b/c of that).
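
A small Java sketch of the difference, for illustration. The labels are made up and MapOrderDemo, fill and iterationOrder are hypothetical names. HashMap makes no guarantee about iteration order, while LinkedHashMap iterates in insertion order. One caveat: with String keys a HashMap will often iterate the same way from run to run on the same JDK, so run-to-run variation is more likely where keys rely on identity hash codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapOrderDemo {

    // Puts the same made-up annotation labels into any map, in a fixed order.
    static void fill(Map<String, Integer> map) {
        map.put("Sentence", 1);
        map.put("Token", 2);
        map.put("Chunk", 3);
        map.put("NamedEntity", 4);
    }

    // Returns the keys in whatever order the map iterates them.
    static List<String> iterationOrder(Map<String, Integer> map) {
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> hashed = new HashMap<>();
        Map<String, Integer> linked = new LinkedHashMap<>();
        fill(hashed);
        fill(linked);
        // The HashMap order is unspecified and need not match insertion order.
        System.out.println("HashMap:       " + iterationOrder(hashed));
        // The LinkedHashMap order is the insertion order.
        System.out.println("LinkedHashMap: " + iterationOrder(linked));
    }
}
```

Swapping the map type changes only the iteration order; lookups by key behave the same in both.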


On Mon, Oct 6, 2014 at 4:59 PM, Bruce Tietjen <
bruce.tietjen@perfectsearchcorp.com> wrote:

> Since I started working with cTakes some time ago, I have found it
> difficult to compare the output between subsequent runs on the same files
> because annotations are often assigned different IDs, are listed in
> different order, etc.
>
> One area that seems to be a cause for at least some of these differences is
> the common use of HashMap where enumerating the contents is not guaranteed
> to return items in the same order they were added.
>
> I would like to work towards addressing this issue by changing those areas
> of the code where it matters to use a LinkedHashMap instead.
>
> Is this something the community would be interested in and find helpful?
>
> Thanks,
>
> Bruce Tietjen
> Perfect Search Corp.
>



-- 
jay vyas