Posted to dev@ctakes.apache.org by Pei Chen <ch...@apache.org> on 2015/10/01 16:20:00 UTC

Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Hi Azad,
This is awesome news.  Thanks for adding in the code that was
referenced by the paper.  I'll create a Jira to track the work needed
to port it over to UIMA/Ruta.

In the meantime, the link is at:
http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
who may be interested in helping out...

--Pei

Hello Pei,

I hope all is well.

I have now uploaded the source code for cDeid
(http://sourceforge.net/p/clinical-deid/code/ci/master/tree/). I have
tried to make the code as portable and modular as possible, at some
cost to performance. This should help with porting the code to
cTAKES/UIMA.

Once you let the community know, I will try to get involved and help
with translating JAPE to RUTA, etc.

Best,
Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

great :-)

The ANNIE Tokenizer and Sentence Splitter are probably best replaced by
the corresponding cTAKES components. The Ruta word-level features can
then additionally come in handy for token classes.
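
To give a rough idea of what "token classes" means here, the following is a
minimal plain-Java sketch. It is not Ruta and does not use the Ruta API; it
only borrows the names of some of Ruta's basic annotation types (NUM, CAP, CW,
SW, SPECIAL). The class name TokenClassSketch, the regular expressions and the
sample sentence are assumptions made for illustration, and Ruta's own lexer
will differ in the details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenClassSketch {

    // Assumption: a token is a run of letters, a run of digits,
    // or a single other non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{L}+|\\p{Nd}+|[^\\s\\p{L}\\p{Nd}]");

    /** Splits the text into tokens according to the pattern above. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Assigns a coarse class; labels are borrowed from Ruta, rules are approximations. */
    static String classify(String token) {
        if (token.matches("\\p{Nd}+")) return "NUM";        // digits only
        if (token.matches("\\p{Lu}{2,}")) return "CAP";     // two or more letters, all upper case
        if (token.matches("\\p{Lu}\\p{L}*")) return "CW";   // starts with an upper-case letter
        if (token.matches("\\p{L}+")) return "SW";          // any other word
        return "SPECIAL";                                    // punctuation and everything else
    }

    public static void main(String[] args) {
        for (String t : tokenize("Dr. SMITH saw the patient on 12/03")) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```

Rules in a de-identification rule-set can then refer to such classes (for
example a capitalized word following a title) instead of depending on a
GATE-specific tokenizer.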

Best,

Peter

Am 09.10.2015 um 15:42 schrieb Azad Dehghan:
> Peter,
>
> I do have full IP for the files that matter: rule-set, dictionaries, and
> the TwoPass implementation. ANNIE Tokeniser and Sentence splitter won't be
> 'ported' (?) as RUTA provides the required word-level features used by the
> rule-set.
>
> Azad
>
> On 9 October 2015 at 14:32, Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>> do you have full IP for all files in the sourceforge project? ... e.g.,
>> the files in GATE/plugins/ANNIE/ or GATE/plugins/ANNIE/resources/gazetteer/
>>
>> Best,
>>
>> Peter
>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Peter,

I do have full IP for the files that matter: rule-set, dictionaries, and
the TwoPass implementation. ANNIE Tokeniser and Sentence splitter won't be
'ported' (?) as RUTA provides the required word-level features used by the
rule-set.

Azad

On 9 October 2015 at 14:32, Peter Klügl <pe...@averbis.com> wrote:

> Hi,
>
> do you have full IP for all files in the sourceforge project? ... e.g.,
> the files in GATE/plugins/ANNIE/ or GATE/plugins/ANNIE/resources/gazetteer/
>
> Best,
>
> Peter
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Do you have full IP for all files in the SourceForge project, e.g.,
the files in GATE/plugins/ANNIE/ or GATE/plugins/ANNIE/resources/gazetteer/?

Best,

Peter

Am 08.10.2015 um 21:44 schrieb Azad Dehghan:
> Hi Pei,
>
> The licence has now been updated.
>
> @Andy the licencing is up to the IP holder.
>
> Cheers,
> Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Hi Pei,

The licence has now been updated.

@Andy the licencing is up to the IP holder.

Cheers,
Azad

On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
wrote:

> This is great news!
> > What is the current status and procedure? Is there an explicit
> > contribution to cTAKES? Is there an ICLA? What about the license of the
> > sourceforge project?
> A Jira has been opened to track this:
> https://issues.apache.org/jira/browse/CTAKES-384
>
> 1) Azad, would you be willing to switch licenses?  I believe it's
> currently GNU3 -> ASL 2.0?
> 2) Create a project/module in the cTAKES sandbox for this.
> 3) Export/import the sourceforge code and attach it to the Jira initially.
> One of the current cTAKES committers can commit it to the repo (until folks
> can commit directly to the cTAKES repo going forward).
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Thursday, October 08, 2015 8:06 AM
> To: dev@ctakes.apache.org
> Subject: Re: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
>
> Hi,
>
> I can offer my help here if required.
>
> I have experience in translating JAPE rules to UIMA Ruta and already
> worked with clinical notes, e.g., also concerning deidentification.
>
> The problem is that I can only invest a few hours in the next two weeks.
> I will have more time next month or even more next year.
>
> What is the current status and procedure? Is there an explicit
> contribution to cTAKES? Is there an ICLA? What about the license of the
> sourceforge project?
>
> Best,
>
> Peter
>
> Am 01.10.2015 um 16:20 schrieb Pei Chen:
> > Hi Azad,
> > This is awesome news.  Thanks for adding in the code that was
> > referenced by the paper.  I'll create a Jira to track we need to port
> > it over to UIMA/Ruta.
> >
> > In the meantime, the link is at:
> > http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> > who may be interested in helping out...
> >
> > --Pei
> >
> > Hello Pei,
> >
> > I hope all is well.
> >
> > I have now uploaded the source code for cDeid
> > (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/); I have
> > tried to make the code as portable and modular as possible with some
> > trade-off for performance. This should help with porting the code to
> > cTAKES/UIMA.
> >
> > Once you let the community know I will try to get involved to help
> > with translating JAPE to RUTA, etc.
> >
> > Best,
> > Azad
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Ruta requires a newer UIMA version because Marshall fixed some bugs there.
Previous Ruta versions contained workarounds, but I removed them in the
latest version(s). Thus, using Ruta with such an old UIMA version could
lead to flawed rule inference.

I would recommend upgrading to the latest UIMA version, but this is not a
trivial task and I do not know how good the test coverage of cTAKES is.
For now, I have set the managed UIMA version to 2.8.1 in the deid project.

There's a ClassNotFoundException for a class (org.w3c.dom.ElementTraversal)
that is part of xml-apis 1.4 but not of xml-apis 1.0. I added the
stacktrace below.
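
For anyone who wants to check their own build machine, a small diagnostic
along the following lines may help. It is only a sketch using plain JDK calls;
the class name ElementTraversalCheck and the helper locate are made up for
illustration. It reports whether org.w3c.dom.ElementTraversal is visible on
the classpath and, if so, which jar or directory it was loaded from.

```java
import java.security.CodeSource;

public class ElementTraversalCheck {

    /** Returns the location a class is loaded from, or "MISSING" if it cannot be loaded. */
    static String locate(String className) {
        try {
            // initialize=false: only resolve the class, do not run its static initializers
            Class<?> c = Class.forName(className, false,
                    ElementTraversalCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK usually have no code source
            return src == null ? "bootstrap/JDK" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "MISSING";
        }
    }

    public static void main(String[] args) {
        System.out.println("org.w3c.dom.ElementTraversal -> "
                + locate("org.w3c.dom.ElementTraversal"));
        // A DOM class that also exists in old xml-apis jars, for comparison
        System.out.println("org.w3c.dom.Node -> " + locate("org.w3c.dom.Node"));
    }
}
```

Run with the same classpath as the failing test. If the first line prints
MISSING while the second points at an old xml-apis jar, that would be
consistent with the version conflict described above.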

There was also a problem when merging the patch. I will fix that with
the next patch.

Best,

Peter


log4j:WARN Error during default initialization
java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source)
    at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source)
    at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at org.apache.log4j.xml.DOMConfigurator$2.parse(DOMConfigurator.java:767)
    at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:866)
    at org.apache.log4j.xml.DOMConfigurator.doConfigure(DOMConfigurator.java:773)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:289)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:109)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.commons.logging.impl.LogFactoryImpl.createLogFromClass(LogFactoryImpl.java:1116)
    at org.apache.commons.logging.impl.LogFactoryImpl.discoverLogImplementation(LogFactoryImpl.java:914)
    at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:604)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:336)
    at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:310)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
    at org.springframework.core.io.support.PathMatchingResourcePatternResolver.<clinit>(PathMatchingResourcePatternResolver.java:169)
    at org.apache.uima.fit.internal.MetaDataUtil.resolve(MetaDataUtil.java:92)
    at org.apache.uima.fit.internal.MetaDataUtil.scanImportsAndManifests(MetaDataUtil.java:65)
    at org.apache.uima.fit.internal.MetaDataUtil.scanDescriptors(MetaDataUtil.java:170)
    at org.apache.uima.fit.factory.TypeSystemDescriptionFactory.scanTypeDescriptors(TypeSystemDescriptionFactory.java:131)
    at org.apache.uima.fit.factory.TypeSystemDescriptionFactory.createTypeSystemDescription(TypeSystemDescriptionFactory.java:102)
    at org.apache.uima.fit.factory.JCasFactory.createJCas(JCasFactory.java:50)
    at org.apache.ctakes.deid.DeidPipelineTest.test(DeidPipelineTest.java:42)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 70 more

Am 11.01.2016 um 23:52 schrieb Pei Chen:
> Patch applied:
> http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-clinical-deid/
> Thanks Peter.
>
> What error did you get with xml-apis?  Do you mean upgrading cTAKES to
> the latest version of UIMA instead of 2.4.0?
>
> —Pei

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <pe...@wiredinformatics.com>.

Patch applied:
http://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-clinical-deid/
Thanks Peter.

What error did you get with xml-apis?  Do you mean upgrading cTAKES to the latest version of UIMA instead of 2.4.0?

—Pei

> On Jan 11, 2016, at 12:39 PM, Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> I just added a small patch which adds a maven build process and a dummy
> unit test.
> 
> I had some problems with the version of xml-apis. Is this known or
> rather a local problem on my build machine?
> Is there a reason why cTAKES requires uima 2.4.0?
> 
> Next step would be translating the rules. Azad mentioned that he already
> started with that :-)
> 
> Best,
> 
> Peter
> 
> 
> Am 18.12.2015 um 11:01 schrieb Peter Klügl:
>> Hi,
>> 
>> sorry, there was no free time left in December for this issue, but I
>> will be able to provide the patches in January (for real).
>> 
>> Best,
>> 
>> Peter
>> 
>> Am 24.11.2015 um 11:37 schrieb Azad Dehghan:
>>> This is on my todo list for Dec. as well. If there are any more volunteers
>>> for translating JAPE to RUTA, please get in touch.
>>> 
>>> Cheers,
>>> Azad
>>> 
>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>>> Hi,
>>>> 
>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>>>> there is just no spare time right now. I hope I will be able to provide
>>>> the patches in December.
>>>> 
>>>> Best,
>>>> 
>>>> Peter
>>>> 
>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>> Hi Peter,
>>>>> I think the ctakes-examples is probably a good starting point at least
>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>> uimaFIT style as primary approach to wiring components together and
>>>>> generate desc's as secondary...
>>>>> I think the actual components that would be required is probably best
>>>>> left up to what is actually required for best performing c-deid.  The
>>>>> output would be interesting, I'm not sure if we should treat this as
>>>>> an independent preprocessing component or part of a pipeline (in which
>>>>> case, we may need to propose a change to the type system or perhaps an
>>>>> alternative JCas view.  You can probably open up that discussion to
>>>>> the dev group as you see fit.)
>>>>> 
>>>>> My 2 cents...
>>>>> 
>>>>> 
>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Is there a cTAKES project that may serve as an example of how the cTAKES
>>>>>> community develops or what a project should look like?
>>>>>> I learned that different people set up UIMA projects in quite different
>>>>>> ways, and I do not want to get inspired by "some sort of out-dated"
>>>>>> approach in the cTAKES repo.
>>>>>>
>>>>>> Are there restrictions or preferences about the preprocessing components
>>>>>> that should be used and the kind of "output" of the project?
>>>>>> Components: On which components may the new components rely: tokenizer, ...
>>>>>> parser, ... dict lookup?
>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>> 
>>>>>> More comments below.
>>>>>> 
>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>>>>>>> and to coordinate the efforts ...
>>>>>>>>
>>>>>>> I would like to help with translating JAPE to RUTA.
>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>>> wait until I set up the project with ruta integration.
>>>>>> 
>>>>>> If any questions arise, just ask :-)
>>>>>> 
>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>> 
>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
>>>>>>> sets 12 months after a given challenge; this is done on an individual basis
>>>>>>> and involves a Data Use Agreement.
>>>>>>> 
>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>> 
>>>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>>> 
>>>>>> 
>>>>>>>> My first step would be:
>>>>>>>> - set up a maven project
>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>>>>>>>> at that over the next few weeks.
>>>>>>>> 
>>>>>>>> —Pei
>>>>>>>> 
>>>>>>>> 
>>>>>>> @Pei - once ANNIE components are replaced there should not be a
>>>>>>> need to worry about the 3rd party libs.
>>>>>>>
>>>>>>> Also, just a thought: we may want to create an independent component
>>>>>>> for the Two Pass recognition (TwoPass.java), as this method has proven
>>>>>>> useful for general NER on longitudinal data and is surely useful
>>>>>>> independent of the deid component.
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Azad
>>>>>>> 
>
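
The Two Pass recognition idea mentioned in the quoted message above can be sketched roughly as follows. This is only an illustration and not the contents of TwoPass.java: it assumes that a first pass finds names through a context rule, that the matched strings are collected into a run-time dictionary for one patient's records, and that a second pass looks those strings up everywhere. The rule, the function names and the sample records are all hypothetical.

```python
import re

# Hypothetical first-pass rule: a title followed by a capitalised surname.
TITLE_NAME_RE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)")


def first_pass(doc):
    """Return (start, end) spans of names found by the context-based rule."""
    return [m.span(1) for m in TITLE_NAME_RE.finditer(doc)]


def two_pass(docs):
    """Tag names in all documents of one patient using a run-time dictionary."""
    # Pass 1: apply the rule to each document and collect the matched strings.
    per_doc = [first_pass(doc) for doc in docs]
    dictionary = {doc[s:e] for doc, spans in zip(docs, per_doc) for s, e in spans}

    # Pass 2: look up every dictionary entry in every document, so that
    # mentions without a helpful context are found as well.
    results = []
    for doc, spans in zip(docs, per_doc):
        found = set(spans)
        for name in dictionary:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", doc):
                found.add(m.span())
        results.append(sorted(found))
    return results


records = [
    "Seen by Dr. Hale on admission.",
    "Hale reviewed the labs the next day.",
]
for doc, spans in zip(records, two_pass(records)):
    print([doc[s:e] for s, e in spans])
```

In this sketch the second record has no title in front of the name, so the first pass alone finds nothing there; the mention is only picked up through the dictionary built from the first record.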

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I just added a small patch which adds a Maven build process and a dummy
unit test.

I had some problems with the version of xml-apis. Is this a known issue or
rather a local problem on my build machine?
Is there a reason why cTAKES requires UIMA 2.4.0?

Next step would be translating the rules. Azad mentioned that he already
started with that :-)

Best,

Peter



Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Sorry, there was no free time left in December for this issue, but I
will be able to provide the patches in January (for real).

Best,

Peter

Am 24.11.2015 um 11:37 schrieb Azad Dehghan:
> This is on my todo list for Dec. as well. If there are any more volunteers
> for translating JAPE to RUTA, please get in touch.
>
> Cheers,
> Azad
>
> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>> Hi,
>>
>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>> there is just no spare time right now. I hope I will be able to provide
>> the patches in December.
>>
>> Best,
>>
>> Peter
>>
>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>> Hi Peter,
>>> I think the ctakes-examples is probably a good starting point at least
>>> in terms of maven modules, etc.  I think it would be good if we use
>>> uimaFIT style as primary approach to wiring components together and
>>> generate desc's as secondary...
>>> I think the actual components that would be required is probably best
>>> left up to what is actually required for best performing c-deid.  The
>>> output would be interesting, I'm not sure if we should treat this as
>>> an independent preprocessing component or part of a pipeline (in which
>>> case, we may need to propose a change to the type system or perhaps an
>>> alternative JCas view.  You can probably open up that discussion to
>>> the dev group as you see fit.)
>>>
>>> My 2 cents...
>>>
>>>
>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>> Hi,
>>>>
>>>> Is there a cTAKES project that may serve as an example of how the
>>>> cTAKES community develops or what a project should look like?
>>>> I learned that different people set up UIMA projects in quite
>>>> different ways, and I do not want to get inspired by "some sort of
>>>> out-dated" approach in the cTAKES repo.
>>>>
>>>> Are there restrictions or preferences about the preprocessing components
>>>> that should be used and the kind of "output" of the project?
>>>> Components: On which components may the components rely: tokenizer, ...
>>>> parser, ... dict lookup?
>>>> "output": Should the project provide a pipeline or a single AE?
>>>>
>>>> More comments below.
>>>>
>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>>>>>> work and to coordinate the efforts ...
>>>>>>
>>>>> I would like to help with translating JAPE to RUTA.
>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>> wait until I set up the project with ruta integration.
>>>>
>>>> If any questions arise, just ask :-)
>>>>
>>>>>> Is there a development dataset which was utilized for the initial
>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>
>>>>> The data set is unfortunately not publicly available; i2b2
>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
>>>>> data sets 12 months after a given challenge; this is done on an
>>>>> individual basis and involves a Data Use Agreement.
>>>>>
>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>
>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>
>>>>
>>>>>> My first step would be:
>>>>>> - set up a maven project
>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>
>>>>>>
>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>>>>>> at that over the next few weeks.
>>>>>>
>>>>>> —Pei
>>>>>>
>>>>>>
>>>>> @Pei - once ANNIE components are replaced there should not be a
>>>>> need to worry about the 3rd party libs.
>>>>>
>>>>> Also, just a thought: we may want to create an independent component
>>>>> for the Two Pass recognition (TwoPass.java), as this method has proven
>>>>> useful for general NER on longitudinal data and is surely useful
>>>>> independent of the deid component.
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Azad
>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

> I integrated your Ruta scripts and added a new patch (includes and
> replaces my last one).
>
Ok.

> I noticed some semantic differences between the Ruta rules and their
> JAPE originals, e.g., the brackets for the user name. Are they intended?
>
Brackets should be included. I was getting some inconsistent output, so I
removed them for the time being.

> I needed to change some rule elements, e.g., "M.D." does not work as a
> literal rule element match (a very old restriction of Ruta which should be
> removed some day...). These literal string matches should be avoided
> altogether if possible, or at least the start anchor should be set to a
> different rule element.
>
OK.

> Ok, I'll let you know when I start with a rule set.

Great!

Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I integrated your Ruta scripts and added a new patch (includes and
replaces my last one).

I noticed some semantic differences between the Ruta rules and their
JAPE originals, e.g., the brackets for the user name. Are they intended?

I needed to change some rule elements, e.g., "M.D." does not work as a
literal rule element match (a very old restriction of Ruta which should be
removed some day...). These literal string matches should be avoided
altogether if possible, or at least the start anchor should be set to a
different rule element.
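
One plausible reading of this restriction, stated here as an assumption rather than as a description of Ruta's internals: if the text is first split into small tokens and a literal is compared against one token at a time, a string such as "M.D." can never match, because it spans several tokens. The Python sketch below only illustrates that idea; the tokenization scheme and the function names are hypothetical and are not Ruta code.

```python
import re


def tokenize(text):
    """Split text into word, number and single-punctuation tokens (assumed scheme)."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)


def match_literal(tokens, literal):
    """Single-token literal matching: succeeds only where one token equals `literal`."""
    return [i for i, tok in enumerate(tokens) if tok == literal]


def match_sequence(tokens, pattern):
    """Match a sequence of token literals and return the start indices."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


tokens = tokenize("John Smith, M.D. signed")
print(match_literal(tokens, "M.D."))                 # no single token equals "M.D."
print(match_sequence(tokens, ["M", ".", "D", "."]))  # the token sequence does match
```

Under that assumption, writing the pattern as a sequence of smaller elements (or anchoring the rule on a different element) sidesteps the problem.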

Ok, I'll let you know when I start with a rule set.

Best,

Peter

Am 20.01.2016 um 01:47 schrieb Azad Dehghan:
> Peter,
>
> So, we have Email, Url, Profession, Street, Zip, State and Username
> completed so far.
>
> The following NERs remain:
> Country, Age, Doctor, Fax, Id_num, Medicalrec_num, Patient, and Phone.
>
> I will do Country next. If you are able to translate the rest quickly,
> please do :) Otherwise, just keep me posted on which ones you are working on
> to avoid duplicate work... and we can work through the remaining NERs.
>
> Also, once the NERs are translated I will prepare a number of examples for
> unit testing -- I will also validate the NERs using the i2b2 research
> dataset.
>
> Cheers,
> Azad
>
> On 19 January 2016 at 09:01, Peter Klügl <pe...@averbis.com> wrote:
>
>> Ok, let me know which ones I should translate.
>>
>> Best,
>>
>> Peter
>>
>> Am 18.01.2016 um 20:13 schrieb Azad Dehghan:
>>> Peter,
>>>
>>> Thanks for pushing things!
>>>
>>> I would rather split the rules/NERs to get things moving quicker (as I
>>> am a newbie to Ruta). I will be uploading another NER (Username)
>>> shortly. I will look at your changes to follow suit.
>>>
>>> Best,
>>> Azad
>>>
>>> On 18 January 2016 at 14:06, Peter Klügl <pe...@averbis.com> wrote:
>>>> Hi,
>>>>
>>>> a new patch is attached.
>>>>
>>>> @Pei:
>>>> are there suitable annotation types in the cTAKES type system? Some
>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>> IdentifiedAnnotation right now, but there are many empty features...
>>>>
>>>> @Azad:
>>>> I changed the rules a bit, especially the capitalization, to match how I
>>>> normally use it in Ruta. The wordlists are compiled to a trie by the Maven
>>>> plugin. I also added the two regexes for URL and email, and I extended
>>>> the regex for the URL. I also changed the evaluation order of some rules
>>>> (with @). Feel free to add simple examples to examples.csv for the unit
>>>> tests.
>>>>
>>>> Let me know if you need more information about the changes.
>>>>
>>>> Do you want help with the other rule sets? Or should we split them
>>>> up?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>> Hi,
>>>>>
>>>>> great. I will integrate them in the project and in the next patch.
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>> Three NERs translated and uploaded.
>>>>>>
>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>
>>>>>> Cheers,
>>>>>> Azad
>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Peter,

So, we have Email, Url, Profession, Street, Zip, State and Username
completed so far.

The following NERs remain:
Country, Age, Doctor, Fax, Id_num, Medicalrec_num, Patient, and Phone.

I will do Country next. If you are able to translate the rest quickly,
please do :) Otherwise, just keep me posted on which ones you are working on
to avoid duplicate work... and we can work through the remaining NERs.

Also, once the NERs are translated I will prepare a number of examples for
unit testing -- I will also validate the NERs using the i2b2 research
dataset.
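
As a rough illustration of what example-driven unit checks could look like, here is a minimal Python sketch. It is not the project's test code: the CSV layout (text, label, expected), the simplified email pattern and all function names are assumptions made for this example, and the pattern merely stands in for the actual rule.

```python
import csv
import io
import re

# Simplified stand-in for the email rule; the real regex is not shown in this thread.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical examples file: input text, label, expected covered text (may be empty).
EXAMPLES_CSV = """text,label,expected
Contact me at jane.doe@example.org today.,Email,jane.doe@example.org
No address is mentioned here.,Email,
"""

RECOGNIZERS = {"Email": EMAIL_RE}


def find_mentions(label, text):
    """Return every span of `text` that the recognizer for `label` matches."""
    return [m.group(0) for m in RECOGNIZERS[label].finditer(text)]


def run_examples(csv_text):
    """Return the rows whose actual mentions differ from the expected ones."""
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = [row["expected"]] if row["expected"] else []
        actual = find_mentions(row["label"], row["text"])
        if actual != expected:
            failures.append((row["text"], expected, actual))
    return failures


print(run_examples(EXAMPLES_CSV))  # an empty list means every example passed
```

Keeping positive and negative examples in one file per NER would make it easy to add cases as each rule set is translated.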

Cheers,
Azad

On 19 January 2016 at 09:01, Peter Klügl <pe...@averbis.com> wrote:

> Ok, let me know which ones I should translate.
>
> Best,
>
> Peter
>
> Am 18.01.2016 um 20:13 schrieb Azad Dehghan:
> > Peter,
> >
> > Thanks for pushing things!
> >
> > I would rather split the rules/NERs to get things moving quicker (as I
> > am a newbie to Ruta). I will be uploading another NER (Username)
> > shortly. I will look at your changes to follow suit.
> >
> > Best,
> > Azad
> >
> > On 18 January 2016 at 14:06, Peter Klügl <pe...@averbis.com> wrote:
> >
> >> Hi,
> >>
> >> a new patch is attached.
> >>
> >> @Pei:
> >> are there suitable annotation types in the cTAKES type system? Some
> >> project in cTAKES uses something like OntologyMatch... I map it to
> >> IdentifiedAnnotation right now, but there are many empty features...
> >>
> >> @Azad:
> >> I changed the rules a bit, especially the capitalization, to match how I
> >> normally use it in Ruta. The wordlists are compiled to a trie by the Maven
> >> plugin. I also added the two regexes for URL and email, and I extended
> >> the regex for the URL. I also changed the evaluation order of some rules
> >> (with @). Feel free to add simple examples to examples.csv for the unit
> >> tests.
> >>
> >> Let me know if you need more information about the changes.
> >>
> >> Do you want help with the other rule sets? Or should we split them
> >> up?
> >>
> >> Best,
> >>
> >> Peter
> >>
> >> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>> Hi,
> >>>
> >>> great. I will integrate them in the project and in the next patch.
> >>>
> >>> Best,
> >>>
> >>> Peter
> >>>
> >>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>> Three NERs translated and uploaded.
> >>>>
> >>>> PS. I will validate all NERs once we have them all completed.
> >>>>
> >>>> Cheers,
> >>>> Azad
> >>>>
> >>>> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:
> >>>>> This is on my todo list for Dec. as well. If there are any more
> >>>>> volunteers for translating JAPE to RUTA, please get in touch.
> >>>>>
> >>>>> Cheers,
> >>>>> Azad
> >>>>>
> >>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I just wanted to mention that I haven't forgotten about it.
> >>>>>> Unfortunately, there is just no spare time right now. I hope I will
> >>>>>> be able to provide the patches in December.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>> Hi Peter,
> >>>>>>> I think the ctakes-examples is probably a good starting point at
> >>>>>>> least in terms of maven modules, etc.  I think it would be good if we
> >>>>>>> use uimaFIT style as primary approach to wiring components together
> >>>>>>> and generate desc's as secondary...
> >>>>>>> I think the actual components that would be required is probably
> >>>>>>> best left up to what is actually required for best performing c-deid.
> >>>>>>> The output would be interesting, I'm not sure if we should treat this
> >>>>>>> as an independent preprocessing component or part of a pipeline (in
> >>>>>>> which case, we may need to propose a change to the type system or
> >>>>>>> perhaps an alternative JCas view.  You can probably open up that
> >>>>>>> discussion to the dev group as you see fit.)
> >>>>>>>
> >>>>>>> My 2 cents...
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Is there a cTAKES project that may serve as an example of how the
> >>>>>>>> cTAKES community develops or what a project should look like?
> >>>>>>>> I learned that different people set up UIMA projects in quite
> >>>>>>>> different ways, and I do not want to get inspired by "some sort of
> >>>>>>>> out-dated" approach in the cTAKES repo.
> >>>>>>>>
> >>>>>>>> Are there restrictions or preferences about the preprocessing
> >>>>>>>> components that should be used and the kind of "output" of the
> >>>>>>>> project?
> >>>>>>>> Components: On which components may the components rely: tokenizer,
> >>>>>>>> ... parser, ... dict lookup?
> >>>>>>>> "output": Should the project provide a pipeline or a single AE?
> >>>>>>>>
> >>>>>>>> More comments below.
> >>>>>>>>
> >>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>> Who else plans to provide patches for it? Just to avoid
> duplicate
> >>>>> work
> >>>>>>>>>> and to coordinate the efforts ...
> >>>>>>>>>>
> >>>>>>>>> I would like to help with the translating JAPE to RUTA.
> >>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want,
> >> or
> >>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>
> >>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>
> >>>>>>>>>> Is there a development dataset which was utilized for the
> initial
> >>>>>>>>>> development, and if yes, is it possible to contribute it too?
> >>>>>>>>>>
> >>>>>>>>> The data set is unfortunately not publicly available; i2b2
> >>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases
> >> the
> >>>>> data
> >>>>>>>>> sets 12 months after a given challenge; this is done on an
> >>>>> individual basis
> >>>>>>>>> and involve a Data Use Agreement.
> >>>>>>>>>
> >>>>>>>>> However, I will be able to conduct and coordinate the validation.
> >>>>>>>>>
> >>>>>>>> Ok, I'll investigate if we have already access to the dataset
> here.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>> My first step would be:
> >>>>>>>>>> - set up a maven project
> >>>>>>>>>> - set up a development pipeline in a test (with cTAKES
> components
> >>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> But one item that we need to review is the 3rd party libs jars
> >> that
> >>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a
> >> look
> >>>>> at
> >>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>
> >>>>>>>>>> —Pei
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> @Pei - once ANNIE components are replaced there should not be
> a
> >>>>> need to
> >>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>
> >>>>>>>>> Also, just a thought: we may want to create an independent
> >> component
> >>>>> for
> >>>>>>>>> the Two Pass recognition (TwoPass.java) as this method has proven
> >>>>> useful
> >>>>>>>>> for general NER on longitudinal data and is surely useful
> independent
> >>>>> of the
> >>>>>>>>> deid component.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Azad
> >>>>>>>>>
> >>
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Ok, let me know which ones I should translate.

Best,

Peter

Am 18.01.2016 um 20:13 schrieb Azad Dehghan:
> Peter,
>
> Thanks for pushing things!
>
> I would rather split the rules/NERs to get things moving quicker (as I am a
> newbie to Ruta). I will be uploading another NER (Username) shortly. I will
> look at your changes to follow suit.
>
> Best,
> Azad
>
> On 18 January 2016 at 14:06, Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>> a new patch is attached.
>>
>> @Pei:
>> are there suitable annotation types in the cTAKES type system? Some
>> project in cTAKES uses something like OntologyMatch... I map it to
>> IdentifiedAnnotation right now, but there are many empty features...
>>
>> @Azad:
>> I changed the rules a bit, especially the capitalization like I use it
>> in ruta normally. The wordlists are compiled to a trie by the maven
>> plugin. I also added the two regexes for url and email. I extended the
>> regex for the url. I also changed the evaluation order of some rules
>> (with @). Feel free to add simple examples to examples.csv for the unit
>> tests.
>>
>> Let me know if you need more information about the changes.
>>
>> Do you wanna have help with the other rule sets? Or should we split them
>> up?
>>
>> Best,
>>
>> Peter
>>
>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>> Hi,
>>>
>>> great. I will integrate them in the project and in the next patch.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>> Three NERs translated and uploaded.
>>>>
>>>> PS. I will validate all NERs once we have them all completed.
>>>>
>>>> Cheers,
>>>> Azad
>>>>
>>>> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com>
>> wrote:
>>>>> This is on my todo list for Dec. as well. If there are any more
>> volunteers
>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>
>>>>> Cheers,
>>>>> Azad
>>>>>
>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I just wanted to mention that I haven't forgot about it.
>> Unfortunately,
>>>>>> there is just no spare time right now. I hope I will be able to
>> provide
>>>>>> the patches in December.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>> Hi Peter,
>>>>>>> I think the ctakes-examples is probably a good starting point at
>> least
>>>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>>>> uimaFIT style as primary approach to wiring components together and
>>>>>>> generate desc's as secondary...
>>>>>>> I think the actual components that would be required is probably best
>>>>>>> left up to what is actually required for best performing c-deid.  The
>>>>>>> output would be interesting, I'm not sure if we should treat this as
>>>>>>> an independent preprocessing component or part of a pipeline (in
>> which
>>>>>>> case, we may need to propose a change to the type system or perhaps
>> an
>>>>>>> alternative JCas view.  You can probably open up that discussion to
>>>>>>> the dev group as you see fit.)
>>>>>>>
>>>>>>> My 2 cents...
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>> peter.kluegl@averbis.com>
>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there a cTAKES project that may serve as an example on how the
>>>>> cTAKES
>>>>>>>> community develops or what a project should look like?
>>>>>>>> I learned that different people set up UIMA projects in a quite
>>>>> different
>>>>>>>> manner and I do not want to get inspired by "some sort of out-dated"
>>>>>>>> approach in the cTAKES repo.
>>>>>>>>
>>>>>>>> Are there restrictions or preferences about the preprocessing
>>>>> components
>>>>>>>> that should be used and the kind of "output" of the project?
>>>>>>>> Components: On which components may the components rely: tokenizer,
>>>>> ...
>>>>>>>> parser, ... dict lookup?
>>>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>>>
>>>>>>>> More comments below.
>>>>>>>>
>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>>>>> work
>>>>>>>>>> and to coordinate the efforts ...
>>>>>>>>>>
>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want,
>> or
>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>
>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>
>>>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>>>>
>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases
>> the
>>>>> data
>>>>>>>>> sets 12 months after a given challenge; this is done on an
>>>>> individual basis
>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>
>>>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>>>>
>>>>>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> My first step would be:
>>>>>>>>>> - set up a maven project
>>>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But one item that we need to review is the 3rd party libs jars
>> that
>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a
>> look
>>>>> at
>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>
>>>>>>>>>> —Pei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> @Pei - once ANNIE components are replaced there should not be a
>>>>> need to
>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>
>>>>>>>>> Also, just a thought: we may want to create an independent
>> component
>>>>> for
>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method has proven
>>>>> useful
>>>>>>>>> for general NER on longitudinal data and is surely useful independent
>>>>> of the
>>>>>>>>> deid component.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Azad
>>>>>>>>>
>>
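
For readers unfamiliar with the Two Pass recognition idea mentioned in the quoted thread above: TwoPass.java itself is not shown in this thread, so the following is only a minimal Python sketch of one plausible reading. It assumes that surface forms found by a high-confidence rule in a first pass are collected per patient and then re-matched across all of that patient's documents in a second pass. The names first_pass and two_pass, the "Dr." trigger pattern, and the sample notes are hypothetical and not taken from cDeid.

```python
import re

# Hypothetical first-pass rule: a capitalised token right after "Dr." is a doctor name.
FIRST_PASS = re.compile(r"\bDr\.\s+([A-Z][a-z]+)")


def first_pass(text):
    """Return (start, end, surface) for each high-confidence match in one document."""
    return [(m.start(1), m.end(1), m.group(1)) for m in FIRST_PASS.finditer(text)]


def two_pass(docs_by_patient):
    """Map {patient_id: [text, ...]} to {patient_id: [spans_for_doc, ...]}."""
    result = {}
    for patient, docs in docs_by_patient.items():
        # Pass 1: collect surface forms across all documents of this patient.
        lexicon = set()
        for text in docs:
            lexicon.update(surface for _, _, surface in first_pass(text))
        # Pass 2: re-match every collected surface form in every document,
        # so mentions without the trigger context are found as well.
        per_doc = []
        for text in docs:
            spans = set()
            for name in lexicon:
                for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                    spans.add((m.start(), m.end(), name))
            per_doc.append(sorted(spans))
        result[patient] = per_doc
    return result


notes = {"p1": ["Seen by Dr. Smith today.", "Smith reviewed the labs."]}
print(two_pass(notes))
```

In this sketch the second note contains no "Dr." trigger, yet "Smith" is still found there because the name was learned from the first note. That is the property that makes such a step useful on longitudinal data, independent of de-identification.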

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Peter,

Thanks for pushing things!

I would rather split the rules/NERs to get things moving quicker (as I am a
newbie to Ruta). I will be uploading another NER (Username) shortly. I will
look at your changes to follow suit.

Best,
Azad

On 18 January 2016 at 14:06, Peter Klügl <pe...@averbis.com> wrote:

> Hi,
>
> a new patch is attached.
>
> @Pei:
> are there suitable annotation types in the cTAKES type system? Some
> project in cTAKES uses something like OntologyMatch... I map it to
> IdentifiedAnnotation right now, but there are many empty features...
>
> @Azad:
> I changed the rules a bit, especially the capitalization like I use it
> in ruta normally. The wordlists are compiled to a trie by the maven
> plugin. I also added the two regexes for url and email. I extended the
> regex for the url. I also changed the evaluation order of some rules
> (with @). Feel free to add simple examples to examples.csv for the unit
> tests.
>
> Let me know if you need more information about the changes.
>
> Do you wanna have help with the other rule sets? Or should we split them
> up?
>
> Best,
>
> Peter
>
> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> > Hi,
> >
> > great. I will integrate them in the project and in the next patch.
> >
> > Best,
> >
> > Peter
> >
> > Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >> Three NERs translated and uploaded.
> >>
> >> PS. I will validate all NERs once we have them all completed.
> >>
> >> Cheers,
> >> Azad
> >>
> >> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com>
> wrote:
> >>
> >>> This is on my todo list for Dec. as well. If there are any more
> volunteers
> >>> for translating JAPE to RUTA, please get in touch.
> >>>
> >>> Cheers,
> >>> Azad
> >>>
> >>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I just wanted to mention that I haven't forgot about it.
> Unfortunately,
> >>>> there is just no spare time right now. I hope I will be able to
> provide
> >>>> the patches in December.
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>> Hi Peter,
> >>>>> I think the ctakes-examples is probably a good starting point at
> least
> >>>>> in terms of maven modules, etc.  I think it would be good if we use
> >>>>> uimaFIT style as primary approach to wiring components together and
> >>>>> generate desc's as secondary...
> >>>>> I think the actual components that would be required is probably best
> >>>>> left up to what is actually required for best performing c-deid.  The
> >>>>> output would be interesting, I'm not sure if we should treat this as
> >>>>> an independent preprocessing component or part of a pipeline (in
> which
> >>>>> case, we may need to propose a change to the type system or perhaps
> an
> >>>>> alternative JCas view.  You can probably open up that discussion to
> >>>>> the dev group as you see fit.)
> >>>>>
> >>>>> My 2 cents...
> >>>>>
> >>>>>
> >>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> peter.kluegl@averbis.com>
> >>> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Is there a cTAKES project that may serve as an example on how the
> >>> cTAKES
> >>>>>> community develops or what a project should look like?
> >>>>>> I learned that different people set up UIMA projects in a quite
> >>> different
> >>>>>> manner and I do not want to get inspired by "some sort of out-dated"
> >>>>>> approach in the cTAKES repo.
> >>>>>>
> >>>>>> Are there restrictions or preferences about the preprocessing
> >>> components
> >>>>>> that should be used and the kind of "output" of the project?
> >>>>>> Components: On which components may the components rely: tokenizer,
> >>> ...
> >>>>>> parser, ... dict lookup?
> >>>>>> "output": Should the project provide a pipeline or a single AE?
> >>>>>>
> >>>>>> More comments below.
> >>>>>>
> >>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
> >>> work
> >>>>>>>> and to coordinate the efforts ...
> >>>>>>>>
> >>>>>>> I would like to help with the translating JAPE to RUTA.
> >>>>>> You can already go ahead with the UIMA Ruta Workbench if you want,
> or
> >>>>>> wait until I set up the project with ruta integration.
> >>>>>>
> >>>>>> If any questions arise, just ask :-)
> >>>>>>
> >>>>>>>> Is there a development dataset which was utilized for the initial
> >>>>>>>> development, and if yes, is it possible to contribute it too?
> >>>>>>>>
> >>>>>>> The data set is unfortunately not publicly available; i2b2
> >>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases
> the
> >>> data
> >>>>>>> sets 12 months after a given challenge; this is done on an
> >>> individual basis
> >>>>>>> and involve a Data Use Agreement.
> >>>>>>>
> >>>>>>> However, I will be able to conduct and coordinate the validation.
> >>>>>>>
> >>>>>> Ok, I'll investigate if we have already access to the dataset here.
> >>>>>>
> >>>>>>
> >>>>>>>> My first step would be:
> >>>>>>>> - set up a maven project
> >>>>>>>> - set up a development pipeline in a test (with cTAKES components
> >>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> But one item that we need to review is the 3rd party libs jars
> that
> >>>>>>>> were included to ensure compatibility.  I’ll be sure to take a
> look
> >>> at
> >>>>>>>> that over the next few weeks.
> >>>>>>>>
> >>>>>>>> —Pei
> >>>>>>>>
> >>>>>>>>
> >>>>>>> @Pei - once ANNIE components are replaced there should not be a
> >>> need to
> >>>>>>> worry about the 3rd party libs.
> >>>>>>>
> >>>>>>> Also, just a thought: we may want to create an independent
> component
> >>> for
> >>>>>>> the Two Pass recognition (TwoPass.java) as this method has proven
> >>> useful
> >>>>>>> for general NER on longitudinal data and is surely useful independent
> >>> of the
> >>>>>>> deid component.
> >>>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Azad
> >>>>>>>
>
>
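
The quoted message above mentions two regexes for URLs and e-mail addresses; the actual patterns from the patch are not shown in this thread. As a rough illustration only, here is a minimal Python sketch with deliberately simple, hypothetical patterns (EMAIL, URL and find_phi are made-up names, and a Ruta script would express this differently).

```python
import re

# Hypothetical, deliberately simple patterns; NOT the ones from the patch.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"(?:https?://|www\.)[^\s<>\"]+")


def find_phi(text):
    """Return sorted (start, end, label, surface) tuples for URL and e-mail matches."""
    hits = []
    for label, pattern in (("EMAIL", EMAIL), ("URL", URL)):
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), label, m.group()))
    return sorted(hits)


sample = "Contact jane.doe@example.org or see http://example.org/info today."
for hit in find_phi(sample):
    print(hit)
```

Patterns this simple over-match in places (for example, a URL directly followed by a full stop would include the full stop), which is one reason to collect small examples for unit tests, as suggested in the quoted message.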

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Am 11.03.2016 um 11:46 schrieb Azad Dehghan:
> Do we have a tracker; which ones remain?
>
>

No. This can best be checked by looking at the existing Ruta files in
src/main/ruta/org/apache/ctakes/deid

Right now, there are:
Age, Date (just a few simple rules I wrote), Doctor (empty), Phone
(empty), Street, UserName, ZipState
Profession is just a gazetteer without its own script file.

Best,

Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi Pei,

the content of the new files is duplicated again, e.g., see
I2B2Evaluation.java

No idea what caused that...

Best,

Peter

Am 11.03.2016 um 11:32 schrieb Peter Klügl:
> Hi,
>
> thanks for the notes and links, Andy and Guergana. The software and
> articles are very interesting, but, as for my personal interest, we have
> our own clinical deidentification software solution at our company
> (which works well enough as far as I know). My focus is rather on
> helping out in translating the contribution from GATE/JAPE to UIMA/Ruta.
> Thus, I concentrate on the existing functionality for now.
>
> What is the final goal of the cTAKES community concerning clinical deid
> components? Will both sandbox projects be merged, what about statistical
> approaches?
>
> @Pei: there was again a problem with the patch (I also forgot to add
> some files). I attached a new one.
>
> @Azad: I am just curious on which data the rules exactly rely. I think
> I'll find the information in the article.
> I assume that the 521 documents have been utilized to develop the rules
> and the 269 documents to evaluate them. Did you also correct the rules
> using the second set? I need to reread the article :-)
>
> Best,
>
> Peter
>
>
> Am 10.03.2016 um 23:22 schrieb andy mcmurry:
>> *For cross-validation, you can evaluate de-identified notes data from the
>> i2b2 challenge:*
>> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/data/models/
>>
>> *Methods for model generation of FeatureSet described here: *
>>
>> *Improved de-identification of physician notes through integrative modeling
>> of both public and private medical text*
>> http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-112
>>
>> Major objective of that study was to help provide external examples to
>> cross train / retrain other methods.
>>
>> hope this helps,
>> --Andy
>>
>>
>>
>> On Thu, Mar 10, 2016 at 1:27 PM, Savova, Guergana <
>> Guergana.Savova@childrens.harvard.edu> wrote:
>>
>>> You can re-build the models that feed into MIST. I personally would not
>>> use the default model that MIST comes with as it is not trained on clinical
>>> data. In our previous work we found that hand-annotating about 200 docs for
>>> PHI (representative of the sample you are going to run the models on)
>>> results in building a pretty good model - in the 90's for p, r and f1.
>>> However, even with that high performance, the institution that owns the
>>> data might be still reluctant to share as it might pose a violation of
>>> HIPAA through some potential PHI leaks. In cTAKES our approach has been to
>>> de-couple the de-identification from the NLP/information extraction. If a
>>> user has the need for de-identified data, they could choose their method --
>>> manual or otherwise -- and then process through cTAKES. Our focus is the
>>> NLP/IE space, while de-identification is a blend of that plus policy....
>>>
>>> --Guergana
>>>
>>> -----Original Message-----
>>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>>> Sent: Thursday, March 10, 2016 4:19 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Combining Knowledge- and Data-driven Methods for
>>> De-identification of Clinical Narratives
>>>
>>> Thanks Guergana.
>>>
>>>> Yes, the current release of cTAKES has a module for the temporal
>>> expressions which includes dates. The normalizer for the temporal
>>> expressions is Steven Bethard's timenorm code.
>>> Great.
>>>
>>>> However, if you do de-identification of dates/temporal expressions,
>>>> you
>>> run the risk of creating incorrect timelines as many of the relative
>>> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
>>> unlikely to be correctly shifted by any de-identification tool.
>>> Indeed, a reason I have not included the dates component.
>>>
>>>> One de-identification tool is MIST --
>>> http://mist-deid.sourceforge.net/
>>> .
>>> I don't remember them doing well in the community held evaluation in 2014.
>>> Hence, cDeid :)
>>>> Guergana Savova, PhD, FACMI
>>>> Associate Professor
>>>> PI Natural Language Processing Lab
>>>> Boston Children's Hospital and Harvard Medical School
>>>> 300 Longwood Avenue
>>>> Mailstop: BCH3092
>>>> Enders 144.1
>>>> Boston, MA 02115
>>>> Tel: (617) 919-2972
>>>> Fax: (617) 730-0817
>>>> Harvard Scholar:
>>>> http://scholar.harvard.edu/guergana_k_savova/biocv
>>>>
>>>> -----Original Message-----
>>>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>>>> Sent: Thursday, March 10, 2016 3:42 PM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>> De-identification of Clinical Narratives
>>>>> This means both training data folders? I have access to the data but
>>>>> not
>>>> to the challenge description.
>>>>
>>>> Yes. Is there any specific information that you are missing?
>>>>>> It would be good to incorporate/refactor (basically, GATE API needs
>>>>>> to be replaced with UIMA API to generate annotation) the two-pass
>>>>>> recognition method for cTAKES - which has a wider application on
>>> longitudinal data.
>>>>>> This method is used on-top of a number NERs.
>>>>> I'll take a look.
>>>>>
>>>>> I do not know how much time I can invest this month. Let's see how
>>>>> many
>>>> phases I can translate.
>>>>> I added the rules for age. Are there jape rules for creating date
>>>> annotations?
>>>> No. I believe cTAKES has existing component(s) to capture dates?
>>>>
>>>>> After all rules are translated, they need some major refactoring.
>>>>> Jape
>>>> and Ruta are quite different in some aspects.
>>>> Ok.
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Please let me know where I can help. I will be available again in
>>> April.
>>>>>> Cheers,
>>>>>> Azad
>>>>>>
>>>>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com>
>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> sorry, I was quite busy last month.
>>>>>>>
>>>>>>> I added a new patch, which needs to be applied.
>>>>>>>
>>>>>>> No new rules, but it's possible now to evaluate everything against
>>>>>>> the labelled data of the challenge.
>>>>>>>
>>>>>>> @Azad:
>>>>>>> Which documents exactly did you use to develop the rules?
>>>>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>>> testing-PHI-Gold-fixed?
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> the last patch fixed almost all problems.
>>>>>>>>
>>>>>>>> I added another one that adds the csv file for the unit test and
>>>> extends
>>>>>>>> svn-ignore.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I added another patch. I forgot to manually add one test file to
>>>> version
>>>>>>>>> control, and there are still duplicate lines.
>>>>>>>>> I hope this patch fixes the remaining problems.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry
>>>>>>>>>> for
>>>> the
>>>>>>>>>> trouble, I should have looked more closely at the complete patch.
>>>>>>>>>>
>>>>>>>>>> I attached a new patch created with command-line tools which
>>>>>>>>>> looks
>>>>>>> correct
>>>>>>>>>> now.
>>>>>>>>>>
>>>>>>>>>> Pei, can you apply the new patch?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>>>>>>>>> Thanks Pei.
>>>>>>>>>>>
>>>>>>>>>>> I fear there was again a problem with the patch. All new files
>>>>>>>>>>> are missing (and also the svn-ignore settings).
>>>>>>>>>>>
>>>>>>>>>>> Can you take a look?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>>>>>>>>> patch applied.
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pei
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>> Hi Pei,
>>>>>>>>>>>>>
>>>>>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>>>>>
>>>>>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>>>>>> But yeah, we can even create an extended type system to
>>>>>>>>>>>>>> store
>>>>>>> these items temporarily and add them into the main/core type
>>>>>>> system afterwards.
>>>>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
>>>>>>>>>>>>>> will
>>>>>>> require much more testing.  If it works, we can upgrade it in our
>>>> sandbox
>>>>>>> area or create a branch if necessary.
>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Pei:
>>>>>>>>>>>>>>> are there suitable annotation types in the cTAKES type
>>> system?
>>>>>>> Some
>>>>>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I
>>>>>>>>>>>>>>> map it
>>>> to
>>>>>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
>>>>>>> features...
>>>>>>>>>>>>>>> @Azad:
>>>>>>>>>>>>>>> I changed the rules a bit, especially the capitalization
>>>>>>>>>>>>>>> like I
>>>>>>> use it
>>>>>>>>>>>>>>> in ruta normally. The wordlists are compiled to a trie by
>>>>>>>>>>>>>>> the
>>>> maven
>>>>>>>>>>>>>>> plugin. I also added the two regexes for url and email. I
>>>>>>> extended the
>>>>>>>>>>>>>>> regex for the url. I also changed the evaluation order of
>>>>>>>>>>>>>>> some
>>>>>>> rules
>>>>>>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv
>>>>>>>>>>>>>>> for
>>>>>>> the unit
>>>>>>>>>>>>>>> tests.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you wanna have help with the other rule sets? Or should
>>>>>>>>>>>>>>> we
>>>>>>> split them up?
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> great. I will integrate them in the project and in the
>>>>>>>>>>>>>>>> next
>>>>>>> patch.
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all
>>> completed.
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>>>>>>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
>>>>>>>>>>>>>>>>>> any
>>>>>>> more volunteers
>>>>>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
>>>>>>>>>>>>>>>>>> <peter.kluegl@averbis.com
>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
>>>>>>> Unfortunately,
>>>>>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will
>>>>>>>>>>>>>>>>>>> be able
>>>>>>> to provide
>>>>>>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good
>>>>>>>>>>>>>>>>>>>> starting
>>>>>>> point at least
>>>>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
>>>>>>>>>>>>>>>>>>>> good
>>>> if
>>>>>>> we use
>>>>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring
>>>>>>>>>>>>>>>>>>>> components
>>>>>>> together and
>>>>>>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>>>>>>> I think the actual components that would be required
>>>>>>>>>>>>>>>>>>>> is
>>>>>>> probably best
>>>>>>>>>>>>>>>>>>>> left up to what is actually required for best
>>>>>>>>>>>>>>>>>>>> performing
>>>>>>> c-deid.  The
>>>>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we
>>>>>>>>>>>>>>>>>>>> should
>>>> treat
>>>>>>> this as
>>>>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
>>>> pipeline
>>>>>>> (in which
>>>>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type
>>>>>>>>>>>>>>>>>>>> system or
>>>>>>> perhaps an
>>>>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>>>>>>> discussion to
>>>>>>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>>>>>>> peter.kluegl@averbis.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on how the cTAKES
>>>>>>>>>>>>>>>>>>>>> community develops or what a project should look like?
>>>>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects in a quite different
>>>>>>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some sort of out-dated"
>>>>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Are there restrictions or preferences about the preprocessing components
>>>>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the project?
>>>>>>>>>>>>>>>>>>>>> Components: On which components may the components rely: tokenizer, ...
>>>>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>>>>>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to
>>> RUTA.
>>>>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta
>>>>>>>>>>>>>>>>>>>>> Workbench if
>>>>>>> you want, or
>>>>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
>>>>>>>>>>>>>>>>>>>>>>> for
>>>> the
>>>>>>> initial
>>>>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
>>>>>>>>>>>>>>>>>>>>>>> contribute it
>>>>>>> too?
>>>>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly
>>>>>>>>>>>>>>>>>>>>>> available;
>>>> i2b2
>>>>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
>>>>>>>>>>>>>>>>>>>>>> typically
>>>>>>> releases the
>>>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is
>>>>>>>>>>>>>>>>>>>>>> done on
>>>> an
>>>>>>>>>>>>>>>>>> individual basis
>>>>>>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>> validation.
>>>>>>>>>>>>>>>>>>>>> Ok, I'll investigate whether we already have access to the
>>>>>>>>>>>>>>>>>>>>> dataset here.
>>>>>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
>>>>>>>>>>>>>>>>>>>>>>>   components replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs
>>>>>>>>>>>>>>>>>>>>>>> jars that were included to ensure compatibility.  I’ll be
>>>>>>>>>>>>>>>>>>>>>>> sure to take a look at that over the next few weeks.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should
>>>>>>>>>>>>>>>>>>>>>> not be a need to worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent
>>>>>>>>>>>>>>>>>>>>>> component for the Two Pass recognition (TwoPass.java), as
>>>>>>>>>>>>>>>>>>>>>> this method has proven useful for general NER on
>>>>>>>>>>>>>>>>>>>>>> longitudinal data and is surely useful independently of
>>>>>>>>>>>>>>>>>>>>>> the deid component.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

> My focus is rather on
> helping out in translating the contribution from GATE/JAPE to UIMA/Ruta.
> Thus, I concentrate on the existing functionality for now.
>

Do we have a tracker; which ones remain?


>
> What is the final goal of the cTAKES community concerning clinical deid
> components? Will both sandbox projects be merged, what about statistical
> approaches?
>

Does cTAKES provide APIs/integration of any statistical approaches?


>
> @Azad: I am just curious which data exactly the rules rely on. I think
> I'll find the information in the article.
> I assume that the 521 documents have been utilized to develop the rules
> and the 269 documents to evaluate them. Did you correct the rules also
> using the second set? I need to reread the article :-)
>


The entire training dataset was used: both sets, 521 + 269 documents.

Azad


>
> On 10.03.2016 at 23:22, andy mcmurry wrote:
> > *** For cross-validation, you can evaluate de-identified notes data from
> > i2b2 challenge** *
> >
> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/data/models/
> >
> > *Methods for model generation of FeatureSet described here: *
> >
> > *Improved de-identification of physician notes through integrative
> modeling
> > of both public and private medical text*
> >
> http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-112
> >
> > Major objective of that study was to help provide external examples to
> > cross train / retrain other methods.
> >
> > hope this helps,
> > --Andy
> >
> >
> >
> > On Thu, Mar 10, 2016 at 1:27 PM, Savova, Guergana <
> > Guergana.Savova@childrens.harvard.edu> wrote:
> >
> >> You can re-build the models that feed into MIST. I personally would not
> >> use the default model that MIST comes with as it is not trained on
> clinical
> >> data. In our previous work we found that hand-annotating about 200 docs
> for
> >> PHI (representative of the sample you are going to run the models on)
> >> results in building a pretty good model - in the 90's for p, r and f1.
> >> However, even with that high performance, the institution that owns the
> >> data might be still reluctant to share as it might pose a violation of
> >> HIPAA through some potential PHI leaks. In cTAKES our approach has been
> to
> >> de-couple the de-identification from the NLP/information extraction. If a
> >> user has the need for de-identified data, they could choose their
> method --
> >> manual or otherwise -- and then process through cTAKES. Our focus is the
> >> NLP/IE space, while de-identification is a blend of that plus policy....
> >>
> >> --Guergana
> >>
> >> -----Original Message-----
> >> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> >> Sent: Thursday, March 10, 2016 4:19 PM
> >> To: dev@ctakes.apache.org
> >> Subject: RE: Combining Knowledge- and Data-driven Methods for
> >> De-identification of Clinical Narratives
> >>
> >> Thanks Guergana.
> >>
> >>> Yes, the current release of cTAKES has a module for the temporal
> >> expressions which includes dates. The normalizer for the temporal
> >> expressions is Steven Bethard's timenorm code.
> >> Great.
> >>
> >>> However, if you do de-identification of dates/temporal expressions,
> >>> you
> >> run the risk of creating incorrect timelines as many of the relative
> >> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
> >> unlikely to be correctly shifted by any de-identification tool.
> >> Indeed, a reason I have not included the dates component.
> >>
> >>> One de-identification tool is MIST -- http://mist-deid.sourceforge.net/ .
> >> I don't remember them doing well in the community held evaluation in
> 2014.
> >> Hence, cDeid :)
> >>> Guergana Savova, PhD, FACMI
> >>> Associate Professor
> >>> PI Natural Language Processing Lab
> >>> Boston Children's Hospital and Harvard Medical School
> >>> 300 Longwood Avenue
> >>> Mailstop: BCH3092
> >>> Enders 144.1
> >>> Boston, MA 02115
> >>> Tel: (617) 919-2972
> >>> Fax: (617) 730-0817
> >>> Harvard Scholar:
> >>> http://scholar.harvard.edu/guergana_k_savova/biocv
> >>>
> >>> -----Original Message-----
> >>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> >>> Sent: Thursday, March 10, 2016 3:42 PM
> >>> To: dev@ctakes.apache.org
> >>> Subject: Re: Combining Knowledge- and Data-driven Methods for
> >> De-identification of Clinical Narratives
> >>>> This means both training data folders? I have access to the data but
> >>>> not
> >>> to the challenge description.
> >>>
> >>> Yes. Is there any specific information that you are missing?
> >>>>
> >>>>> It would be good to incorporate/refactor (basically, GATE API needs
> >>>>> to be replaced with UIMA API to generate annotation) the two-pass
> >>>>> recognition method for cTAKES - which has a wider application on
> >>>>> longitudinal data.
> >>>>> This method is used on top of a number of NERs.
> >>>>
> >>>> I'll take a look.
> >>>>
> >>>> I do not know how much time I can invest this month. Let's see how
> >>>> many
> >>> phases I can translate.
> >>>> I added the rules for age. Are there jape rules for creating date
> >>> annotations?
> >>> No. I believe cTAKES has existing component(s) to capture dates?
> >>>
> >>>> After all rules are translated, they need some major refactoring.
> >>>> Jape
> >>> and Ruta are quite different in some aspects.
> >>> Ok.
> >>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> Please let me know where I can help. I will be available again in
> >> April.
> >>>>> Cheers,
> >>>>> Azad
> >>>>>
> >>>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com>
> >> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> sorry, I was quite busy last month.
> >>>>>>
> >>>>>> I added a new patch, which needs to be applied.
> >>>>>>
> >>>>>> No new rules, but it's possible now to evaluate everything against
> >>>>>> the labelled data of the challenge.
> >>>>>>
> >>>>>> @Azad:
> >>>>>> Which documents exactly did you use to develop the rules?
> >>>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> >>> testing-PHI-Gold-fixed?
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> the last patch fixed almost all problems.
> >>>>>>>
> >>>>>>> I added another one that adds the csv file for the unit test and
> >>> extends
> >>>>>>> svn-ignore.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Peter
> >>>>>>>
> >>>>>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I added another patch. I missed to manually add one test file to
> >>> version
> >>>>>>>> control, and there are still duplicate lines.
> >>>>>>>> I hope this patch fixes the remaining problems.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry
> >>>>>>>>> for the trouble, I should have looked more closely at the complete
> >>>>>>>>> patch.
> >>>>>>>>>
> >>>>>>>>> I attached a new patch created with commandline tools which
> >>>>>>>>> looks correct now.
> >>>>>>>>>
> >>>>>>>>> Pei, can you apply the new patch?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Peter
> >>>>>>>>>
> >>>>>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>>>>>>>> Thanks Pei.
> >>>>>>>>>>
> >>>>>>>>>> I fear there was again a problem with the patch. All new files
> >>>>>>>>>> are missing (and also the svn-ignore settings).
> >>>>>>>>>>
> >>>>>>>>>> Can you take a look?
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Peter
> >>>>>>>>>>
> >>>>>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>>>>>>>> patch applied.
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Pei
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> >>>>>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>> Hi Pei,
> >>>>>>>>>>>>
> >>>>>>>>>>>> can you commit the recent patch for us?
> >>>>>>>>>>>>
> >>>>>>>>>>>> CTAKES-384-20160120.patch
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Peter
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>> Sorry I was swamped recently.
> >>>>>>>>>>>>> But yeah, we can even create an extended type system to
> >>>>>>>>>>>>> store
> >>>>>> these items temporarily and add them into the main/core type
> >>>>>> system afterwards.
> >>>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
> >>>>>>>>>>>>> will
> >>>>>> require much more testing.  If it works, we can upgrade it in our
> >>> sandbox
> >>>>>> area or create a branch if necessary.
> >>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> >>>>>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> a new patch is attached.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> @Pei:
> >>>>>>>>>>>>>> are there suitable annotation types in the cTAKES type
> >> system?
> >>>>>> Some
> >>>>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I
> >>>>>>>>>>>>>> map it
> >>> to
> >>>>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> >>>>>> features...
> >>>>>>>>>>>>>> @Azad:
> >>>>>>>>>>>>>> I changed the rules a bit, especially the capitalization
> >>>>>>>>>>>>>> like I
> >>>>>> use it
> >>>>>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by
> >>>>>>>>>>>>>> the
> >>> maven
> >>>>>>>>>>>>>> plugin. I also added the two regexes for url and email. I
> >>>>>> extended the
> >>>>>>>>>>>>>> regex for the url. I also changed the evaluation order of
> >>>>>>>>>>>>>> some
> >>>>>> rules
> >>>>>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv
> >>>>>>>>>>>>>> for
> >>>>>> the unit
> >>>>>>>>>>>>>> tests.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Do you wanna have help with the other rule sets? Or should
> >>>>>>>>>>>>>> we
> >>>>>> split them up?
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> great. I will integrate them in the project and in the
> >>>>>>>>>>>>>>> next
> >>>>>> patch.
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all
> >> completed.
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> >>>>>> azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
> >>>>>>>>>>>>>>>>> any
> >>>>>> more volunteers
> >>>>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
> >>>>>>>>>>>>>>>>> <peter.kluegl@averbis.com
> >>>>>> wrote:
> >>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
> >>>>>> Unfortunately,
> >>>>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will
> >>>>>>>>>>>>>>>>>> be able
> >>>>>> to provide
> >>>>>>>>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>>>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good
> >>>>>>>>>>>>>>>>>>> starting
> >>>>>> point at least
> >>>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
> >>>>>>>>>>>>>>>>>>> good
> >>> if
> >>>>>> we use
> >>>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring
> >>>>>>>>>>>>>>>>>>> components
> >>>>>> together and
> >>>>>>>>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>>>>>>>> I think the actual components that would be required
> >>>>>>>>>>>>>>>>>>> is
> >>>>>> probably best
> >>>>>>>>>>>>>>>>>>> left up to what is actually required for best
> >>>>>>>>>>>>>>>>>>> performing
> >>>>>> c-deid.  The
> >>>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we
> >>>>>>>>>>>>>>>>>>> should
> >>> treat
> >>>>>> this as
> >>>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
> >>> pipeline
> >>>>>> (in which
> >>>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type
> >>>>>>>>>>>>>>>>>>> system or
> >>>>>> perhaps an
> >>>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> >>>>>> discussion to
> >>>>>>>>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> >>>>>> peter.kluegl@averbis.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an
> >>>>>>>>>>>>>>>>>>>> example on
> >>>>>> how the
> >>>>>>>>>>>>>>>>> cTAKES
> >>>>>>>>>>>>>>>>>>>> community develops or how a project should look like?
> >>>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA project
> >>>>>>>>>>>>>>>>>>>> in a
> >>>>>> quite
> >>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>>> manner and I do not what to get inspired by "some
> >>>>>>>>>>>>>>>>>>>> sort of
> >>>>>> out-dated"
> >>>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Are there restriction or preferences about the
> >>> preprocessing
> >>>>>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> >>> project.
> >>>>>>>>>>>>>>>>>>>> Components: On which components may the componetns
> >> rely:
> >>>>>> tokenizer,
> >>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
> >>>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> >>> single
> >>>>>> AE?
> >>>>>>>>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
> >>>>>>>>>>>>>>>>>>>>>> avoid
> >>>>>> duplicate
> >>>>>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>>>> and to coordnate the efforts ...
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to
> >> RUTA.
> >>>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta
> >>>>>>>>>>>>>>>>>>>> Workbench if
> >>>>>> you want, or
> >>>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
> >>>>>>>>>>>>>>>>>>>>>> for
> >>> the
> >>>>>> initial
> >>>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
> >>>>>>>>>>>>>>>>>>>>>> contribute it
> >>>>>> too?
> >>>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
> >>>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
> >>>>>>>>>>>>>>>>>>>>> releases the data sets 12 months after a given challenge;
> >>>>>>>>>>>>>>>>>>>>> this is done on an individual basis and involves a Data
> >>>>>>>>>>>>>>>>>>>>> Use Agreement.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
> >>>>>>>>>>>>>>>>>>>>> the
> >>>>>> validation.
> >>>>>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>> dataset here.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
> >>>>>>>>>>>>>>>>>>>>>> cTAKES
> >>>>>> components
> >>>>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd
> >>>>>>>>>>>>>>>>>>>>>> party
> >>> libs
> >>>>>> jars that
> >>>>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be
> >>>>>>>>>>>>>>>>>>>>>> sure to
> >>>>>> take a look
> >>>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is
> >>>>>>>>>>>>>>>>>>>>> should
> >>>>>> not be a
> >>>>>>>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> >>> independent
> >>>>>> component
> >>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this
> >>>>>>>>>>>>>>>>>>>>> method
> >>>>>> have shown
> >>>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely
> >>>>>>>>>>>>>>>>>>>>> useful
> >>>>>> independent
> >>>>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>>>>>
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

thanks for the notes and links, Andy and Guergana. The software and
articles are very interesting, but as for my personal interest, we have
our own clinical de-identification software solution at our company
(which works well enough as far as I know). My focus is rather on
helping to translate the contribution from GATE/JAPE to UIMA/Ruta.
Thus, I am concentrating on the existing functionality for now.

What is the final goal of the cTAKES community concerning clinical deid
components? Will both sandbox projects be merged? And what about
statistical approaches?

@Pei: there was a problem with the patch again (I also forgot to add
some files). I attached a new one.

@Azad: I am just curious which data exactly the rules rely on. I think
I'll find the information in the article.
I assume that the 521 documents have been utilized to develop the rules
and the 269 documents to evaluate them. Did you correct the rules also
using the second set? I need to reread the article :-)

Best,

Peter


On 10.03.2016 at 23:22, andy mcmurry wrote:
> *** For cross-validation, you can evaluate de-identified notes data from
> i2b2 challenge** *
> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/data/models/
>
> *Methods for model generation of FeatureSet described here: *
>
> *Improved de-identification of physician notes through integrative modeling
> of both public and private medical text*
> http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-112
>
> Major objective of that study was to help provide external examples to
> cross train / retrain other methods.
>
> hope this helps,
> --Andy
>
>
>
> On Thu, Mar 10, 2016 at 1:27 PM, Savova, Guergana <
> Guergana.Savova@childrens.harvard.edu> wrote:
>
>> You can re-build the models that feed into MIST. I personally would not
>> use the default model that MIST comes with as it is not trained on clinical
>> data. In our previous work we found that hand-annotating about 200 docs for
>> PHI (representative of the sample you are going to run the models on)
>> results in building a pretty good model - in the 90's for p, r and f1.
>> However, even with that high performance, the institution that owns the
>> data might be still reluctant to share as it might pose a violation of
>> HIPAA through some potential PHI leaks. In cTAKES our approach has been to
>> de-couple the de-identification from the NLP/information extraction. If a
>> user has the need for de-identified data, they could choose their method --
>> manual or otherwise -- and then process through cTAKES. Our focus is the
>> NLP/IE space, while de-identification is a blend of that plus policy....
>>
>> --Guergana
>>
>> -----Original Message-----
>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>> Sent: Thursday, March 10, 2016 4:19 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Combining Knowledge- and Data-driven Methods for
>> De-identification of Clinical Narratives
>>
>> Thanks Guergana.
>>
>>> Yes, the current release of cTAKES has a module for the temporal
>>> expressions which includes dates. The normalizer for the temporal
>>> expressions is Steven Bethard's timenorm code.
>> Great.
>>
>>> However, if you do de-identification of dates/temporal expressions,
>>> you run the risk of creating incorrect timelines as many of the relative
>>> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
>>> unlikely to be correctly shifted by any de-identification tool.
>> Indeed, a reason I have not included the dates component.
>>
>>> One de-identification tool is MIST -- http://mist-deid.sourceforge.net/ .
>> I don't remember them doing well in the community held evaluation in 2014.
>> Hence, cDeid :)
>>> Guergana Savova, PhD, FACMI
>>> Associate Professor
>>> PI Natural Language Processing Lab
>>> Boston Children's Hospital and Harvard Medical School
>>> 300 Longwood Avenue
>>> Mailstop: BCH3092
>>> Enders 144.1
>>> Boston, MA 02115
>>> Tel: (617) 919-2972
>>> Fax: (617) 730-0817
>>> Harvard Scholar:
>>> http://scholar.harvard.edu/guergana_k_savova/biocv
>>>
>>> -----Original Message-----
>>> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
>>> Sent: Thursday, March 10, 2016 3:42 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>> De-identification of Clinical Narratives
>>>> This means both training data folders? I have access to the data but
>>>> not to the challenge description.
>>>
>>> Yes. Is there any specific information that you are missing?
>>>>
>>>>> It would be good to incorporate/refactor (basically, GATE API needs
>>>>> to be replaced with UIMA API to generate annotation) the two-pass
>>>>> recognition method for cTAKES - which has a wider application on
>>>>> longitudinal data.
>>>>> This method is used on top of a number of NERs.
>>>>
>>>> I'll take a look.
>>>>
>>>> I do not know how much time I can invest this month. Let's see how
>>>> many phases I can translate.
>>>> I added the rules for age. Are there jape rules for creating date
>>>> annotations?
>>> No. I believe cTAKES has existing component(s) to capture dates?
>>>
>>>> After all rules are translated, they need some major refactoring.
>>>> Jape and Ruta are quite different in some aspects.
>>> Ok.
>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Please let me know where I can help. I will be available again in
>>>>> April.
>>>>> Cheers,
>>>>> Azad
>>>>>
>>>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> sorry, I was quite busy last month.
>>>>>>
>>>>>> I added a new patch, which needs to be applied.
>>>>>>
>>>>>> No new rules, but it's possible now to evaluate everything against
>>>>>> the labelled data of the challenge.
>>>>>>
>>>>>> @Azad:
>>>>>> Which documents exactly did you use to develop the rules?
>>>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>>>>> testing-PHI-Gold-fixed?
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 03.02.2016 at 09:05, Peter Klügl wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> the last patch fixed almost all problems.
>>>>>>>
>>>>>>> I added another one that adds the csv file for the unit test and
>>>>>>> extends svn-ignore.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On 02.02.2016 at 09:16, Peter Klügl wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I added another patch. I forgot to manually add one test file to
>>>>>>>> version control, and there are still duplicate lines.
>>>>>>>> I hope this patch fixes the remaining problems.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On 29.01.2016 at 10:34, Peter Klügl wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry
>>>>>>>>> for the trouble, I should have looked more closely at the complete
>>>>>>>>> patch.
>>>>>>>>>
>>>>>>>>> I attached a new patch created with commandline tools which
>>>>>>>>> looks correct now.
>>>>>>>>>
>>>>>>>>> Pei, can you apply the new patch?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 28.01.2016 at 15:57, Peter Klügl wrote:
>>>>>>>>>> Thanks Pei.
>>>>>>>>>>
>>>>>>>>>> I fear there was again a problem with the patch. All new files
>>>>>>>>>> are missing (and also the svn-ignore settings).
>>>>>>>>>>
>>>>>>>>>> Can you take a look?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> On 28.01.2016 at 14:43, Pei Chen wrote:
>>>>>>>>>>> patch applied.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Pei
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>> Hi Pei,
>>>>>>>>>>>>
>>>>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>>>>
>>>>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>>>>> But yeah, we can even create an extended type system to
>>>>>>>>>>>>> store these items temporarily and add them into the main/core
>>>>>>>>>>>>> type system afterwards.
>>>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
>>>>>>>>>>>>> will require much more testing.  If it works, we can upgrade
>>>>>>>>>>>>> it in our sandbox area or create a branch if necessary.
>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Pei:
>>>>>>>>>>>>>> are there suitable annotation types in the cTAKES type
>>>>>>>>>>>>>> system? Some project in cTAKES uses something like
>>>>>>>>>>>>>> OntologyMatch... I map it to IdentifiedAnnotation right now,
>>>>>>>>>>>>>> but there are many empty features...
>>>>>>>>>>>>>> @Azad:
>>>>>>>>>>>>>> I changed the rules a bit, especially the capitalization, to
>>>>>>>>>>>>>> match how I normally use it in Ruta. The wordlists are
>>>>>>>>>>>>>> compiled to a trie by the maven plugin. I also added the two
>>>>>>>>>>>>>> regexes for url and email. I extended the regex for the url.
>>>>>>>>>>>>>> I also changed the evaluation order of some rules (with @).
>>>>>>>>>>>>>> Feel free to add simple examples to examples.csv for the
>>>>>>>>>>>>>> unit tests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you want help with the other rule sets? Or should we
>>>>>>>>>>>>>> split them up?
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> great. I will integrate them in the project and in the
>>>>>>>>>>>>>>> next patch.
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
>>>>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all
>>>>>>>>>>>>>>>> completed.
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>>>>>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
>>>>>>>>>>>>>>>>> any more volunteers for translating JAPE to RUTA, please
>>>>>>>>>>>>>>>>> get in touch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
>>>>>>>>>>>>>>>>> <peter.kluegl@averbis.com
>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about
>>>>>>>>>>>>>>>>>> it. Unfortunately, there is just no spare time right now.
>>>>>>>>>>>>>>>>>> I hope I will be able to provide the patches in December.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good
>>>>>>>>>>>>>>>>>>> starting
>>>>>> point at least
>>>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
>>>>>>>>>>>>>>>>>>> good
>>> if
>>>>>> we use
>>>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring
>>>>>>>>>>>>>>>>>>> components
>>>>>> together and
>>>>>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>>>>>> I think the actual components that would be required
>>>>>>>>>>>>>>>>>>> is
>>>>>> probably best
>>>>>>>>>>>>>>>>>>> left up to what is actually required for best
>>>>>>>>>>>>>>>>>>> performing
>>>>>> c-deid.  The
>>>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we
>>>>>>>>>>>>>>>>>>> should
>>> treat
>>>>>> this as
>>>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
>>> pipeline
>>>>>> (in which
>>>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type
>>>>>>>>>>>>>>>>>>> system or
>>>>>> perhaps an
>>>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>>>>>> discussion to
>>>>>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>>>>>> peter.kluegl@averbis.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an
>>>>>>>>>>>>>>>>>>>> example on
>>>>>> how the
>>>>>>>>>>>>>>>>> cTAKES
>>>>>>>>>>>>>>>>>>>> community develops or how a project should look like?
>>>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA project
>>>>>>>>>>>>>>>>>>>> in a
>>>>>> quite
>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>> manner and I do not what to get inspired by "some
>>>>>>>>>>>>>>>>>>>> sort of
>>>>>> out-dated"
>>>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are there restriction or preferences about the
>>> preprocessing
>>>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
>>> project.
>>>>>>>>>>>>>>>>>>>> Components: On which components may the componetns
>> rely:
>>>>>> tokenizer,
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
>>> single
>>>>>> AE?
>>>>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
>>>>>>>>>>>>>>>>>>>>>> avoid
>>>>>> duplicate
>>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>>> and to coordnate the efforts ...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to
>> RUTA.
>>>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta
>>>>>>>>>>>>>>>>>>>> Workbench if
>>>>>> you want, or
>>>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
>>>>>>>>>>>>>>>>>>>>>> for
>>> the
>>>>>> initial
>>>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
>>>>>>>>>>>>>>>>>>>>>> contribute it
>>>>>> too?
>>>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
>>>>>>>>>>>>>>>>>>>>> releases the data sets 12 months after a given challenge;
>>>>>>>>>>>>>>>>>>>>> this is done on an individual basis and involves a Data
>>>>>>>>>>>>>>>>>>>>> Use Agreement.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
>>>>>>>>>>>>>>>>>>>>> the
>>>>>> validation.
>>>>>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to
>>>>>>>>>>>>>>>>>>>> the
>>>>>> dataset here.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
>>>>>>>>>>>>>>>>>>>>>> cTAKES
>>>>>> components
>>>>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd
>>>>>>>>>>>>>>>>>>>>>> party
>>> libs
>>>>>> jars that
>>>>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be
>>>>>>>>>>>>>>>>>>>>>> sure to
>>>>>> take a look
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is
>>>>>>>>>>>>>>>>>>>>> should
>>>>>> not be a
>>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
>>> independent
>>>>>> component
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this
>>>>>>>>>>>>>>>>>>>>> method
>>>>>> have shown
>>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely
>>>>>>>>>>>>>>>>>>>>> useful
>>>>>> independent
>>>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Hi Peter,

I hope to pick this up soon after the summer.


Cheers,
Azad



2016-06-07 8:57 GMT+01:00 Peter Klügl <pe...@averbis.com>:

> Hi Azad,
>
>
> the basic rules are now translated. Do you want to take a look at them?
>
>
> Many issues still remain, and the F score is quite low on the dev
> set. I will continue improving the rules when I find the time.
>
>
> Best,
>
> Peter
>
>
> Am 15.03.2016 um 09:49 schrieb Peter Klügl:
> > Hi,
> >
> > this is essentially just a design decision. For a single longitudinal
> > record, there is no problem at all. We can solve this even with some
> > simple ruta rules, or with some custom analysis engine. If we want to
> > process a set of records of the same patient jointly, then we cannot
> > apply a single pipeline. I propose to postpone the decision and implement
> > it only for single documents for now.
> >
> > Best,
> >
> > Peter
> >
> >
> > Am 11.03.2016 um 20:03 schrieb Azad Dehghan:
> >>> I had a quick look at PassTwo. This is not directly translatable into
> >>> UIMA if the functionality is based on analysis engines. Normally,
> >>> analysis engines process one document at a time in a pipeline. My first
> >>> quick guess is that we either need two pipelines (result is a program
> >>> not a component) or we need a different definition of a CAS (joining all
> >>> documents of a patient). Overall, it depends on the targeted use case
> >>> of the project. Should it be usable in a cTAKES/uimaFIT pipeline?
> >>>
> >> The two pass method will have a broader applicability for NER on
> >> longitudinal records...
> >>
> >>
> >>> btw, the CRF models are not part of the contribution, right?
> >>>
> >>>
> >> The CRF  (UK,US) models will be released but this will be together with
> a
> >> more mature software planned for August 2016.
> >>
> >> Best,
> >>> Peter
> >>>
> >>> Am 10.03.2016 um 20:29 schrieb Azad Dehghan:
> >>>> Thanks Peter,
> >>>>
> >>>> The rules were modeled using the training data.
> >>>>
> >>>> It would be good to incorporate/refactor (basically, the GATE API needs
> >>>> to be replaced with the UIMA API to generate annotations) the two-pass
> >>>> recognition method for cTAKES - which has a wider application on
> >>>> longitudinal data. This method is used on top of a number of NERs.
> >>>>
> >>>> Please let me know where I can help. I will be available again in
> >>>> April.
> >>>>
> >>>> Cheers,
> >>>> Azad
> >>>>
> >>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com>
> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> sorry, I was quite busy last month.
> >>>>>
> >>>>> I added a new patch, which needs to be applied.
> >>>>>
> >>>>> No new rules, but it's possible now to evaluate everything against
> the
> >>>>> labelled data of the challenge.
> >>>>>
> >>>>> @Azad:
> >>>>> Which documents exactly did you use to develop the rules?
> >>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> >>> testing-PHI-Gold-fixed?
> >>>>> Best,
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> >>>>>> Hi,
> >>>>>>
> >>>>>> the last patch fixed almost all problems.
> >>>>>>
> >>>>>> I added another one that adds the csv file for the unit test and
> >>> extends
> >>>>>> svn-ignore.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I added another patch. I missed to manually add one test file to
> >>> version
> >>>>>>> control, and there are still duplicate lines.
> >>>>>>> I hope this patch fixes the remaining problems.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Peter
> >>>>>>>
> >>>>>>>
> >>>>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry
> for
> >>> the
> >>>>>>>> trouble, I should have looked more closely at the ciomplete patch.
> >>>>>>>>
> >>>>>>>> I attached a new patch created with commandline tools wich looks
> >>>>> correct
> >>>>>>>> now.
> >>>>>>>>
> >>>>>>>> Pei, can you apply the new patch?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>>>>>>> Thanks Pei.
> >>>>>>>>>
> >>>>>>>>> I fear there was again a problem with the patch. All new files
> are
> >>>>>>>>> missing (and also the svn-ignore settings).
> >>>>>>>>>
> >>>>>>>>> Can you take a look?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Peter
> >>>>>>>>>
> >>>>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>>>>>>> patch applied.
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Pei
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> >>>>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>> Hi Pei,
> >>>>>>>>>>>
> >>>>>>>>>>> can you commit the recent patch for us?
> >>>>>>>>>>>
> >>>>>>>>>>> CTAKES-384-20160120.patch
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Peter
> >>>>>>>>>>>
> >>>>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> Sorry I was swamped recently.
> >>>>>>>>>>>> But yeah, we can even create an extended type system to store
> >>>>> these items temporarily and add them into the main/core type system
> >>>>> afterwards.
> >>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
> will
> >>>>> require much more testing.  If it works, we can upgrade it in our
> >>> sandbox
> >>>>> area or create a branch if necessary.
> >>>>>>>>>>>> —Pei
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> >>>>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> a new patch is attached.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> @Pei:
> >>>>>>>>>>>>> are there suitable annotation types in the cTAKES type
> system?
> >>>>> Some
> >>>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map
> it
> >>> to
> >>>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> >>>>> features...
> >>>>>>>>>>>>> @Azad:
> >>>>>>>>>>>>> I changed the rules a bit, especially the capitalization
> like I
> >>>>> use it
> >>>>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the
> >>> maven
> >>>>>>>>>>>>> plugin. I also added the two regexes for url and email. I
> >>>>> extended the
> >>>>>>>>>>>>> regex for the url. I also changed the evaluation order of
> some
> >>>>> rules
> >>>>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv
> for
> >>>>> the unit
> >>>>>>>>>>>>> tests.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
> >>>>> split them up?
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> great. I will integrate them in the project and in the next
> >>>>> patch.
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all
> completed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> >>>>> azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
> >>>>> more volunteers
> >>>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <
> >>> peter.kluegl@averbis.com>
> >>>>> wrote:
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
> >>>>> Unfortunately,
> >>>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be
> able
> >>>>> to provide
> >>>>>>>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
> >>>>> point at least
> >>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
> good
> >>> if
> >>>>> we use
> >>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
> >>>>> together and
> >>>>>>>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>>>>>>> I think the actual components that would be required is
> >>>>> probably best
> >>>>>>>>>>>>>>>>>> left up to what is actually required for best performing
> >>>>> c-deid.  The
> >>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should
> >>> treat
> >>>>> this as
> >>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
> >>> pipeline
> >>>>> (in which
> >>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type
> system or
> >>>>> perhaps an
> >>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> >>>>> discussion to
> >>>>>>>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> >>>>> peter.kluegl@averbis.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example
> on
> >>>>> how the
> >>>>>>>>>>>>>>>> cTAKES
> >>>>>>>>>>>>>>>>>>> community develops or how a project should look like?
> >>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA project in
> a
> >>>>> quite
> >>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>> manner and I do not what to get inspired by "some sort
> of
> >>>>> out-dated"
> >>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Are there restriction or preferences about the
> >>> preprocessing
> >>>>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> >>> project.
> >>>>>>>>>>>>>>>>>>> Components: On which components may the componetns
> rely:
> >>>>> tokenizer,
> >>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
> >>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> >>> single
> >>>>> AE?
> >>>>>>>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
> avoid
> >>>>> duplicate
> >>>>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>>> and to coordnate the efforts ...
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to
> RUTA.
> >>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench
> if
> >>>>> you want, or
> >>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for
> >>> the
> >>>>> initial
> >>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
> contribute it
> >>>>> too?
> >>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available;
> >>> i2b2
> >>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
> typically
> >>>>> releases the
> >>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done
> on
> >>> an
> >>>>>>>>>>>>>>>> individual basis
> >>>>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
> >>>>> validation.
> >>>>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
> >>>>> dataset here.
> >>>>>>>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
> cTAKES
> >>>>> components
> >>>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party
> >>> libs
> >>>>> jars that
> >>>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure
> to
> >>>>> take a look
> >>>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is
> should
> >>>>> not be a
> >>>>>>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> >>> independent
> >>>>> component
> >>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
> >>>>> have shown
> >>>>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
> >>>>> independent
> >>>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>>>>
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi Azad,


the basic rules are now translated. Do you want to take a look at them?


Many issues still remain, and the F score is quite low on the dev
set. I will continue improving the rules when I find the time.
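
For reference, the F score here is the usual harmonic mean of precision and recall. The following is only a minimal sketch, assuming exact-match comparison of (begin, end, type) spans; the function name `prf` and the example spans are made up for illustration and are not the evaluation code from the patch.

```python
def prf(gold, predicted):
    """Return (precision, recall, F1) for exact-match span sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that match exactly in offsets and type
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (begin offset, end offset, PHI type)
gold = {(0, 4, "NAME"), (10, 20, "DATE"), (30, 35, "AGE")}
pred = {(0, 4, "NAME"), (10, 20, "DATE"), (40, 45, "NAME"), (50, 52, "AGE")}
p, r, f = prf(gold, pred)
```

With exact matching, a rule that finds the right entity but with slightly different boundaries counts as both a false positive and a false negative, which may be one reason for a low score early on.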


Best,

Peter


Am 15.03.2016 um 09:49 schrieb Peter Klügl:
> Hi,
>
> this is essentially just a design decision. For a single longitudinal
> record, there is no problem at all. We can solve this even with some
> simple ruta rules, or with some custom analysis engine. If we want to
> process a set of records of the same patient jointly, then we cannot
> apply a single pipeline. I propose to postpone the decision and implement
> it only for single documents for now.
>
> Best,
>
> Peter
>
>
> Am 11.03.2016 um 20:03 schrieb Azad Dehghan:
>>> I had a quick look at PassTwo. This is not directly translatable into
>>> UIMA if the functionality is based on analysis engines. Normally,
>>> analysis engines process one document at a time in a pipeline. My first
>>> quick guess is that we either need two pipelines (result is a program not
>>> a component) or we need a different definition of a CAS (joining all
>>> documents of a patient). Overall, it depends on the targeted use case of
>>> the project. Should it be usable in a cTAKES/uimaFIT pipeline?
>>>
>> The two pass method will have a broader applicability for NER on
>> longitudinal records...
>>
>>
>>> btw, the CRF models are not part of the contribution, right?
>>>
>>>
>> The CRF  (UK,US) models will be released but this will be together with a
>> more mature software planned for August 2016.
>>
>> Best,
>>> Peter
>>>
>>> Am 10.03.2016 um 20:29 schrieb Azad Dehghan:
>>>> Thanks Peter,
>>>>
>>>> The rules were modeled using the training data.
>>>>
>>>> It would be good to incorporate/refactor (basically, the GATE API needs to be
>>>> replaced with the UIMA API to generate annotations) the two-pass recognition
>>>> method for cTAKES - which has a wider application on longitudinal data.
>>>> This method is used on top of a number of NERs.
>>>>
>>>> Please let me know where I can help. I will be available again in April.
>>>>
>>>> Cheers,
>>>> Azad
>>>>
>>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> sorry, I was quite busy last month.
>>>>>
>>>>> I added a new patch, which needs to be applied.
>>>>>
>>>>> No new rules, but it's possible now to evaluate everything against the
>>>>> labelled data of the challenge.
>>>>>
>>>>> @Azad:
>>>>> Which documents exactly did you use to develop the rules?
>>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>> testing-PHI-Gold-fixed?
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
>>>>>> Hi,
>>>>>>
>>>>>> the last patch fixed almost all problems.
>>>>>>
>>>>>> I added another one that adds the csv file for the unit test and
>>> extends
>>>>>> svn-ignore.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I added another patch. I missed to manually add one test file to
>>> version
>>>>>>> control, and there are still duplicate lines.
>>>>>>> I hope this patch fixes the remaining problems.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry for
>>> the
>>>>>>>> trouble, I should have looked more closely at the ciomplete patch.
>>>>>>>>
>>>>>>>> I attached a new patch created with commandline tools wich looks
>>>>> correct
>>>>>>>> now.
>>>>>>>>
>>>>>>>> Pei, can you apply the new patch?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>>>>>>> Thanks Pei.
>>>>>>>>>
>>>>>>>>> I fear there was again a problem with the patch. All new files are
>>>>>>>>> missing (and also the svn-ignore settings).
>>>>>>>>>
>>>>>>>>> Can you take a look?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>>>>>>> patch applied.
>>>>>>>>>> Thanks,
>>>>>>>>>> Pei
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>> Hi Pei,
>>>>>>>>>>>
>>>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>>>
>>>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>>>> But yeah, we can even create an extended type system to store
>>>>> these items temporarily and add them into the main/core type system
>>>>> afterwards.
>>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
>>>>> require much more testing.  If it works, we can upgrade it in our
>>> sandbox
>>>>> area or create a branch if necessary.
>>>>>>>>>>>> --Pei
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Pei:
>>>>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
>>>>> Some
>>>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it
>>> to
>>>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
>>>>> features...
>>>>>>>>>>>>> @Azad:
>>>>>>>>>>>>> I changed the rules a bit, especially the capitalization like I
>>>>> use it
>>>>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the
>>> maven
>>>>>>>>>>>>> plugin. I also added the two regexes for url and email. I
>>>>> extended the
>>>>>>>>>>>>> regex for the url. I also changed the evaluation order of some
>>>>> rules
>>>>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv for
>>>>> the unit
>>>>>>>>>>>>> tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
>>>>> split them up?
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> great. I will integrate them in the project and in the next
>>>>> patch.
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>>>>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
>>>>> more volunteers
>>>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <
>>> peter.kluegl@averbis.com>
>>>>> wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
>>>>> Unfortunately,
>>>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able
>>>>> to provide
>>>>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
>>>>> point at least
>>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good
>>> if
>>>>> we use
>>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
>>>>> together and
>>>>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>>>>> I think the actual components that would be required is
>>>>> probably best
>>>>>>>>>>>>>>>>>> left up to what is actually required for best performing
>>>>> c-deid.  The
>>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should
>>> treat
>>>>> this as
>>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
>>> pipeline
>>>>> (in which
>>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or
>>>>> perhaps an
>>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>>>>> discussion to
>>>>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>>>>> peter.kluegl@averbis.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
>>>>> how the
>>>>>>>>>>>>>>>> cTAKES
>>>>>>>>>>>>>>>>>>> community develops or how a project should look like?
>>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA project in a
>>>>> quite
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>> manner and I do not what to get inspired by "some sort of
>>>>> out-dated"
>>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are there restriction or preferences about the
>>> preprocessing
>>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
>>> project.
>>>>>>>>>>>>>>>>>>> Components: On which components may the componetns rely:
>>>>> tokenizer,
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
>>> single
>>>>> AE?
>>>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
>>>>> duplicate
>>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>>> and to coordnate the efforts ...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
>>>>> you want, or
>>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for
>>> the
>>>>> initial
>>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
>>>>> too?
>>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available;
>>> i2b2
>>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
>>>>> releases the
>>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on
>>> an
>>>>>>>>>>>>>>>> individual basis
>>>>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
>>>>> validation.
>>>>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
>>>>> dataset here.
>>>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
>>>>> components
>>>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party
>>> libs
>>>>> jars that
>>>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I'll be sure to
>>>>> take a look
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --Pei
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should
>>>>> not be a
>>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
>>> independent
>>>>> component
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
>>>>> have shown
>>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
>>>>> independent
>>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

this is essentially just a design decision. For a single longitudinal
record, there is no problem at all. We can solve this even with some
simple Ruta rules, or with a custom analysis engine. If we want to
process a set of records of the same patient jointly, then we cannot
apply a single pipeline. I propose to postpone the decision and implement
it only for single documents for now.
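
To make the single-document case concrete, here is a minimal sketch in
plain Java. It assumes that the second pass simply re-matches, within the
same document, the surface strings that a first-pass recognizer already
found. That reading, the class name SingleDocTwoPass and the method
secondPass are illustrative assumptions; none of it is taken from
TwoPass.java, and no UIMA or GATE API is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the TwoPass.java shipped with cDeid.
final class SingleDocTwoPass {

    /**
     * Spans are int[]{begin, end} character offsets into text.
     * Returns every whole-word occurrence of any string covered by a
     * first-pass span, including the occurrences already found.
     */
    static List<int[]> secondPass(String text, List<int[]> firstPass) {
        // Collect the distinct surface strings recognized in the first pass.
        Set<String> surfaces = new LinkedHashSet<>();
        for (int[] span : firstPass) {
            surfaces.add(text.substring(span[0], span[1]));
        }
        // Second pass: look for the same strings anywhere in the document.
        List<int[]> result = new ArrayList<>();
        for (String surface : surfaces) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(surface) + "\\b");
            Matcher m = p.matcher(text);
            while (m.find()) {
                result.add(new int[] {m.start(), m.end()});
            }
        }
        return result;
    }
}
```

Everything this needs is contained in one document, so in UIMA terms it
could presumably run inside a single analysis engine on a single CAS.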

Best,

Peter


On 11.03.2016 at 20:03, Azad Dehghan wrote:
>>
>> I had a quick look at PassTwo. This is not directly translatable into
>> UIMA if the functionality is based on analysis engines. Normally,
>> analysis engines process one document at a time in a pipeline. My first
>> quick guess is that we either need two pipelines (result is a program not
>> a component) or we need a different definition of a CAS (joining all
>> documents of a patient). Overall, it depends on the targeted use case of
>> the project. Should it be usable in a cTAKES/uimaFIT pipeline?
>>
> The two pass method will have a broader applicability for NER on
> longitudinal records...
>
>
>> btw, the CRF models are not part of the contribution, right?
>>
>>
> The CRF  (UK,US) models will be released but this will be together with a
> more mature software planned for August 2016.
>
> Best,
>> Peter
>>
>> Am 10.03.2016 um 20:29 schrieb Azad Dehghan:
>>> Thanks Peter,
>>>
>>> The rules were modeled using the training data.
>>>
>>> It would be good to incorporate/refactor (basically, GATE API needs to be
>>> replaced with UIMA API to generate annotation) the two-pass recognition
>>> method for cTAKES - which has a wider application on longitudinal data.
>>> This method is used on top of a number of NERs.
>>>
>>> Please let me know where I can help. I will be available again in April.
>>>
>>> Cheers,
>>> Azad
>>>
>>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> sorry, I was quite busy last month.
>>>>
>>>> I added a new patch, which needs to be applied.
>>>>
>>>> No new rules, but it's possible now to evaluate everything against the
>>>> labelled data of the challenge.
>>>>
>>>> @Azad:
>>>> Which documents exactly did you use to develop the rules?
>>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>> testing-PHI-Gold-fixed?
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
>>>>> Hi,
>>>>>
>>>>> the last patch fixed almost all problems.
>>>>>
>>>>> I added another one that adds the csv file for the unit test and
>> extends
>>>>> svn-ignore.
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>>>>>> Hi,
>>>>>>
>>>>>> I added another patch. I missed to manually add one test file to
>> version
>>>>>> control, and there are still duplicate lines.
>>>>>> I hope this patch fixes the remaining problems.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>>>>>> Hi,
>>>>>>>
>>>>>>> the problems were caused by the svn client in my Eclipse. Sorry for
>> the
>>>>>>> trouble, I should have looked more closely at the complete patch.
>>>>>>>
>>>>>>> I attached a new patch created with command-line tools which looks
>>>> correct
>>>>>>> now.
>>>>>>>
>>>>>>> Pei, can you apply the new patch?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>>>>>> Thanks Pei.
>>>>>>>>
>>>>>>>> I fear there was again a problem with the patch. All new files are
>>>>>>>> missing (and also the svn-ignore settings).
>>>>>>>>
>>>>>>>> Can you take a look?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>>>>>> patch applied.
>>>>>>>>> Thanks,
>>>>>>>>> Pei
>>>>>>>>>
>>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>> Hi Pei,
>>>>>>>>>>
>>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>>
>>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>>>>>> Hi,
>>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>>> But yeah, we can even create an extended type system to store
>>>> these items temporarily and add them into the main/core type system
>>>> afterwards.
>>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
>>>> require much more testing.  If it works, we can upgrade it in our
>> sandbox
>>>> area or create a branch if necessary.
>>>>>>>>>>> —Pei
>>>>>>>>>>>
>>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>>>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> @Pei:
>>>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
>>>> Some
>>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it
>> to
>>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
>>>> features...
>>>>>>>>>>>> @Azad:
>>>>>>>>>>>> I changed the rules a bit, especially the capitalization like I
>>>> use it
>>>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the
>> maven
>>>>>>>>>>>> plugin. I also added the two regexes for url and email. I
>>>> extended the
>>>>>>>>>>>> regex for the url. I also changed the evaluation order of some
>>>> rules
>>>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv for
>>>> the unit
>>>>>>>>>>>> tests.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
>>>> split them up?
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> great. I will integrate them in the project and in the next
>>>> patch.
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>>>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
>>>> more volunteers
>>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <
>> peter.kluegl@averbis.com>
>>>> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
>>>> Unfortunately,
>>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able
>>>> to provide
>>>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
>>>> point at least
>>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good
>> if
>>>> we use
>>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
>>>> together and
>>>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>>>> I think the actual components that would be required is
>>>> probably best
>>>>>>>>>>>>>>>>> left up to what is actually required for best performing
>>>> c-deid.  The
>>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should
>> treat
>>>> this as
>>>>>>>>>>>>>>>>> an independent preprocessing component or part of a
>> pipeline
>>>> (in which
>>>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or
>>>> perhaps an
>>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>>>> discussion to
>>>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>>>> peter.kluegl@averbis.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
>>>> how the
>>>>>>>>>>>>>>> cTAKES
>>>>>>>>>>>>>>>>>> community develops or how a project should look like?
>>>>>>>>>>>>>>>>>> I learned that different people set up UIMA project in a
>>>> quite
>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some sort of
>>>> out-dated"
>>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are there restriction or preferences about the
>> preprocessing
>>>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
>> project.
>>>>>>>>>>>>>>>>>> Components: On which components may the components rely:
>>>> tokenizer,
>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
>> single
>>>> AE?
>>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
>>>> duplicate
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
>>>> you want, or
>>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for
>> the
>>>> initial
>>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
>>>> too?
>>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available;
>> i2b2
>>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
>>>> releases the
>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on
>> an
>>>>>>>>>>>>>>> individual basis
>>>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
>>>> validation.
>>>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
>>>> dataset here.
>>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
>>>> components
>>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party
>> libs
>>>> jars that
>>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to
>>>> take a look
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should
>>>> not be a
>>>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
>> independent
>>>> component
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
>>>> have shown
>>>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
>>>> independent
>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>>
>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

>
>
> I had a quick look at PassTwo. This is not directly translatable into
> UIMA if the functionality is based on analysis engines. Normally,
> analysis engines process one document at a time in a pipeline. My first
> quick guess is that we either need two pipelines (result is a program not
> a component) or we need a different definition of a CAS (joining all
> documents of a patient). Overall, it depends on the targeted use case of
> the project. Should it be usable in a cTAKES/uimaFIT pipeline?
>

The two pass method will have a broader applicability for NER on
longitudinal records...


>
> btw, the CRF models are not part of the contribution, right?
>
>
The CRF (UK, US) models will be released, but only together with a more
mature version of the software, which is planned for August 2016.

Best,
>
> Peter
>
> Am 10.03.2016 um 20:29 schrieb Azad Dehghan:
> > Thanks Peter,
> >
> > The rules were modeled using the training data.
> >
> > It would be good to incorporate/refactor (basically, GATE API needs to be
> > replaced with UIMA API to generate annotation) the two-pass recognition
> > method for cTAKES - which has a wider application on longitudinal data.
> > This method is used on top of a number of NERs.
> >
> > Please let me know where I can help. I will be available again in April.
> >
> > Cheers,
> > Azad
> >
> > On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
> >
> >> Hi,
> >>
> >> sorry, I was quite busy last month.
> >>
> >> I added a new patch, which needs to be applied.
> >>
> >> No new rules, but it's possible now to evaluate everything against the
> >> labelled data of the challenge.
> >>
> >> @Azad:
> >> Which documents exactly did you use to develop the rules?
> >> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> testing-PHI-Gold-fixed?
> >>
> >> Best,
> >>
> >> Peter
> >>
> >> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> >>> Hi,
> >>>
> >>> the last patch fixed almost all problems.
> >>>
> >>> I added another one that adds the csv file for the unit test and
> extends
> >>> svn-ignore.
> >>>
> >>> Best,
> >>>
> >>> Peter
> >>>
> >>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >>>> Hi,
> >>>>
> >>>> I added another patch. I missed to manually add one test file to
> version
> >>>> control, and there are still duplicate lines.
> >>>> I hope this patch fixes the remaining problems.
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>>>> Hi,
> >>>>>
> >>>>> the problems were caused by the svn client in my Eclipse. Sorry for
> the
> >>>>> trouble, I should have looked more closely at the complete patch.
> >>>>>
> >>>>> I attached a new patch created with command-line tools which looks
> >> correct
> >>>>> now.
> >>>>>
> >>>>> Pei, can you apply the new patch?
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>>>> Thanks Pei.
> >>>>>>
> >>>>>> I fear there was again a problem with the patch. All new files are
> >>>>>> missing (and also the svn-ignore settings).
> >>>>>>
> >>>>>> Can you take a look?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>>>> patch applied.
> >>>>>>> Thanks,
> >>>>>>> Pei
> >>>>>>>
> >>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> >> peter.kluegl@averbis.com> wrote:
> >>>>>>>> Hi Pei,
> >>>>>>>>
> >>>>>>>> can you commit the recent patch for us?
> >>>>>>>>
> >>>>>>>> CTAKES-384-20160120.patch
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
> >>>>>>>>> Hi,
> >>>>>>>>> Sorry I was swamped recently.
> >>>>>>>>> But yeah, we can even create an extended type system to store
> >> these items temporarily and add them into the main/core type system
> >> afterwards.
> >>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
> >> require much more testing.  If it works, we can upgrade it in our
> sandbox
> >> area or create a branch if necessary.
> >>>>>>>>> —Pei
> >>>>>>>>>
> >>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> >> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> a new patch is attached.
> >>>>>>>>>>
> >>>>>>>>>> @Pei:
> >>>>>>>>>> are there suitable annotation types in the cTAKES type system?
> >> Some
> >>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it
> to
> >>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> >> features...
> >>>>>>>>>> @Azad:
> >>>>>>>>>> I changed the rules a bit, especially the capitalization like I
> >> use it
> >>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the
> maven
> >>>>>>>>>> plugin. I also added the two regexes for url and email. I
> >> extended the
> >>>>>>>>>> regex for the url. I also changed the evaluation order of some
> >> rules
> >>>>>>>>>> (with @). Feel free to add simple examples to examples.csv for
> >> the unit
> >>>>>>>>>> tests.
> >>>>>>>>>>
> >>>>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>>>
> >>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
> >> split them up?
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Peter
> >>>>>>>>>>
> >>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> great. I will integrate them in the project and in the next
> >> patch.
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Peter
> >>>>>>>>>>>
> >>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>>>
> >>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Azad
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> >> azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
> >> more volunteers
> >>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <
> peter.kluegl@averbis.com>
> >> wrote:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
> >> Unfortunately,
> >>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able
> >> to provide
> >>>>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
> >> point at least
> >>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good
> if
> >> we use
> >>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
> >> together and
> >>>>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>>>> I think the actual components that would be required is
> >> probably best
> >>>>>>>>>>>>>>> left up to what is actually required for best performing
> >> c-deid.  The
> >>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should
> treat
> >> this as
> >>>>>>>>>>>>>>> an independent preprocessing component or part of a
> pipeline
> >> (in which
> >>>>>>>>>>>>>>> case, we may need to propose a change to the type system or
> >> perhaps an
> >>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> >> discussion to
> >>>>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> >> peter.kluegl@averbis.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
> >> how the
> >>>>>>>>>>>>> cTAKES
> >>>>>>>>>>>>>>>> community develops or how a project should look like?
> >>>>>>>>>>>>>>>> I learned that different people set up UIMA project in a
> >> quite
> >>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some sort of
> >> out-dated"
> >>>>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Are there restriction or preferences about the
> preprocessing
> >>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> project.
> >>>>>>>>>>>>>>>> Components: On which components may the components rely:
> >> tokenizer,
> >>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>> parser, ... dict lookup?
> >>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> single
> >> AE?
> >>>>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
> >> duplicate
> >>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
> >>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
> >> you want, or
> >>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for
> the
> >> initial
> >>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
> >> too?
> >>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available;
> i2b2
> >>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
> >> releases the
> >>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on
> an
> >>>>>>>>>>>>> individual basis
> >>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
> >> validation.
> >>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
> >> dataset here.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
> >> components
> >>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party
> libs
> >> jars that
> >>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to
> >> take a look
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should
> >> not be a
> >>>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> independent
> >> component
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
> >> have shown
> >>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
> >> independent
> >>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>
> >>
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I had a quick look at PassTwo. This is not directly translatable into
UIMA if the functionality is based on analysis engines. Normally,
analysis engines process one document at a time in a pipeline. My first
quick guess is that we either need two pipelines (result is a program not
a component) or we need a different definition of a CAS (joining all
documents of a patient). Overall, it depends on the targeted use case of
the project. Should it be usable in a cTAKES/uimaFIT pipeline?
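
As a rough illustration of the "two pipelines" option, the sketch below
runs a first pass over every document of one patient, pools what was
found, and then revisits each document with the pooled list. It is plain
Java with no UIMA types. The class PatientLevelTwoPass, the method run and
the firstPass function (standing in for whatever recognizer is used) are
hypothetical names, and the plain substring lookup is a simplification,
not the behaviour of PassTwo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical driver program: two loops over the same set of documents.
final class PatientLevelTwoPass {

    /**
     * docs maps a document id to its text (all from the same patient).
     * firstPass returns the entity strings recognized in one text.
     * The result maps each document id to the pooled entities it contains.
     */
    static Map<String, List<String>> run(
            Map<String, String> docs,
            Function<String, List<String>> firstPass) {
        // Pipeline 1: process each document on its own and pool the results.
        Set<String> patientDictionary = new LinkedHashSet<>();
        for (String text : docs.values()) {
            patientDictionary.addAll(firstPass.apply(text));
        }
        // Pipeline 2: revisit each document with the patient-level dictionary.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            List<String> found = new ArrayList<>();
            for (String entry : patientDictionary) {
                if (doc.getValue().contains(entry)) {
                    found.add(entry);
                }
            }
            result.put(doc.getKey(), found);
        }
        return result;
    }
}
```

The second loop cannot start before the first has seen all documents,
which is why this shape is a program around two pipelines rather than a
single component inside one.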

btw, the CRF models are not part of the contribution, right?

Best,

Peter

On 10.03.2016 at 20:29, Azad Dehghan wrote:
> Thanks Peter,
>
> The rules were modeled using the training data.
>
> It would be good to incorporate/refactor (basically, GATE API needs to be
> replaced with UIMA API to generate annotation) the two-pass recognition
> method for cTAKES - which has a wider application on longitudinal data.
> This method is used on top of a number of NERs.
>
> Please let me know where I can help. I will be available again in April.
>
> Cheers,
> Azad
>
> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>> sorry, I was quite busy last month.
>>
>> I added a new patch, which needs to be applied.
>>
>> No new rules, but it's possible now to evaluate everything against the
>> labelled data of the challenge.
>>
>> @Azad:
>> Which documents exactly did you use to develop the rules?
>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or testing-PHI-Gold-fixed?
>>
>> Best,
>>
>> Peter
>>
>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
>>> Hi,
>>>
>>> the last patch fixed almost all problems.
>>>
>>> I added another one that adds the csv file for the unit test and extends
>>> svn-ignore.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>>>> Hi,
>>>>
>>>> I added another patch. I missed to manually add one test file to version
>>>> control, and there are still duplicate lines.
>>>> I hope this patch fixes the remaining problems.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>>>> Hi,
>>>>>
>>>>> the problems were caused by the svn client in my Eclipse. Sorry for the
>>>>> trouble, I should have looked more closely at the complete patch.
>>>>>
>>>>> I attached a new patch created with command-line tools which looks
>> correct
>>>>> now.
>>>>>
>>>>> Pei, can you apply the new patch?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>>>> Thanks Pei.
>>>>>>
>>>>>> I fear there was again a problem with the patch. All new files are
>>>>>> missing (and also the svn-ignore settings).
>>>>>>
>>>>>> Can you take a look?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>>>> patch applied.
>>>>>>> Thanks,
>>>>>>> Pei
>>>>>>>
>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>> peter.kluegl@averbis.com> wrote:
>>>>>>>> Hi Pei,
>>>>>>>>
>>>>>>>> can you commit the recent patch for us?
>>>>>>>>
>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>>>> Hi,
>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>> But yeah, we can even create an extended type system to store
>> these items temporarily and add them into the main/core type system
>> afterwards.
>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
>> require much more testing.  If it works, we can upgrade it in our sandbox
>> area or create a branch if necessary.
>>>>>>>>> —Pei
>>>>>>>>>
>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> a new patch is attached.
>>>>>>>>>>
>>>>>>>>>> @Pei:
>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
>> Some
>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
>> features...
>>>>>>>>>> @Azad:
>>>>>>>>>> I changed the rules a bit, especially the capitalization like I
>> use it
>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the maven
>>>>>>>>>> plugin. I also added the two regexes for url and email. I
>> extended the
>>>>>>>>>> regex for the url. I also changed the evaluation order of some
>> rules
>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv for
>> the unit
>>>>>>>>>> tests.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>
>>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
>> split them up?
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> great. I will integrate them in the project and in the next
>> patch.
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>
>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Azad
>>>>>>>>>>>>
>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
>> more volunteers
>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com>
>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
>> Unfortunately,
>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able
>> to provide
>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
>> point at least
>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if
>> we use
>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
>> together and
>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>> I think the actual components that would be required is
>> probably best
>>>>>>>>>>>>>>> left up to what is actually required for best performing
>> c-deid.  The
>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat
>> this as
>>>>>>>>>>>>>>> an independent preprocessing component or part of a pipeline
>> (in which
>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or
>> perhaps an
>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>> discussion to
>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>> peter.kluegl@averbis.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
>> how the
>>>>>>>>>>>>> cTAKES
>>>>>>>>>>>>>>>> community develops or how a project should look like?
>>>>>>>>>>>>>>>> I learned that different people set up UIMA project in a
>> quite
>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some sort of
>> out-dated"
>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there restriction or preferences about the preprocessing
>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the project.
>>>>>>>>>>>>>>>> Components: On which components may the components rely:
>> tokenizer,
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single
>> AE?
>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 03.11.2015 at 16:54, Azad Dehghan wrote:
>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
>> duplicate
>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
>> you want, or
>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the
>> initial
>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
>> too?
>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
>> releases the
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an
>>>>>>>>>>>>> individual basis
>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
>> validation.
>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
>> dataset here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
>> components
>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs
>> jars that
>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to
>> take a look
>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced, there should
>> not be a
>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent
>> component
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
>> have shown
>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
>> independent
>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>
>>

RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

MIST did very well in the 2006 i2b2 challenge on a very limited set of PHI
entity types. The 2014 challenge evaluated a more comprehensive set of PHI
types, with a number of different methods being proposed.

The issue of PHI leaks is an interesting one that keeps recurring. I
cannot see clinical data being released as 'open data' without additional
safeguards such as data use agreements. Also, it is good that cTAKES has
started to take de-identification on board, as the removal of PHI remains a
hurdle for clinical data access at an increasing number of institutions;
after all, it is a problem that has an NLP solution.

Azad
On 10 Mar 2016 21:27, "Savova, Guergana" <
Guergana.Savova@childrens.harvard.edu> wrote:

> You can re-build the models that feed into MIST. I personally would not
> use the default model that MIST comes with as it is not trained on clinical
> data. In our previous work we found that hand-annotating about 200 docs for
> PHI (representative of the sample you are going to run the models on)
> results in building a pretty good model - in the 90's for p, r and f1.
> However, even with that high performance, the institution that owns the
> data might be still reluctant to share as it might pose a violation of
> HIPAA through some potential PHI leaks. In cTAKES our approach has been to
> de-couple the de-identification from the NLP/information extraction. If a
> user has the need for de-identified data, they could choose their method --
> manual or otherwise -- and then process through cTAKES. Our focus is the
> NLP/IE space, while de-identification is a blend of that plus policy....
>
> --Guergana
>
> -----Original Message-----
> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> Sent: Thursday, March 10, 2016 4:19 PM
> To: dev@ctakes.apache.org
> Subject: RE: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
>
> Thanks Guergana.
>
> > Yes, the current release of cTAKES has a module for the temporal
> expressions which includes dates. The normalizer for the temporal
> expressions is Steven Bethard's timenorm code.
> >
>
> Great.
>
> > However, if you do de-identification of dates/temporal expressions,
> > you
> run the risk of creating incorrect timelines as many of the relative
> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
> unlikely to be correctly shifted by any de-identification tool.
> >
> Indeed, a reason I have not included the dates component.
>
> > One de-identification tool is MIST --
> http://mist-deid.sourceforge.net/
> .
> >
> I don't remember them doing well in the community held evaluation in 2014.
> Hence, cDeid :)
> >
> > Guergana Savova, PhD, FACMI
> > Associate Professor
> > PI Natural Language Processing Lab
> > Boston Children's Hospital and Harvard Medical School
> > 300 Longwood Avenue
> > Mailstop: BCH3092
> > Enders 144.1
> > Boston, MA 02115
> > Tel: (617) 919-2972
> > Fax: (617) 730-0817
> > Harvard Scholar:
> > http://scholar.harvard.edu/guergana_k_savova/biocv
> >
> > -----Original Message-----
> > From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> > Sent: Thursday, March 10, 2016 3:42 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
> >
> > > This means both training data folders? I have access to the data but
> > > not
> > to the challenge description.
> >
> > Yes. Is there any specific information that you are missing?
> > >
> > >
> > >> It would be good to incorporate/refactor (basically, GATE API needs
> > >> to be replaced with UIMA API to generate annotation) the two-pass
> > >> recognition method for cTAKES - which has a wider application on
> longitudinal data.
> > >> This method is used on-top of a number NERs.
> > >
> > >
> > > I'll take a look.
> > >
> > > I do not know how much time I can invest this month. Let's see how
> > > many
> > phases I can translate.
> > >
> > > I added the rules for age. Are there jape rules for creating date
> > annotations?
> > >
> >
> > No. I believe cTAKES has existing component(s) to capture dates?
> >
> > > After all rules are translated, they need some major refactoring.
> > > Jape
> > and Ruta are quite different in some aspects.
> > >
> > Ok.
> >
> > >
> > >
> > >
> > >
> > >
> > >> Please let me know where I can help. I will be available again in
> April.
> > >>
> > >> Cheers,
> > >> Azad
> > >>
> > >> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com>
> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> sorry, I was quite busy last month.
> > >>>
> > >>> I added a new patch, which needs to be applied.
> > >>>
> > >>> No new rules, but it's possible now to evaluate everything against
> > >>> the labelled data of the challenge.
> > >>>
> > >>> @Azad:
> > >>> Which documents exactly did you use to develop the rules?
> > >>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> > testing-PHI-Gold-fixed?
> > >>>
> > >>> Best,
> > >>>
> > >>> Peter
> > >>>
> > >>> On 03.02.2016 at 09:05, Peter Klügl wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> the last patch fixed almost all problems.
> > >>>>
> > >>>> I added another one that adds the csv file for the unit test and
> > extends
> > >>>> svn-ignore.
> > >>>>
> > >>>> Best,
> > >>>>
> > >>>> Peter
> > >>>>
> > >>>> On 02.02.2016 at 09:16, Peter Klügl wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I added another patch. I missed to manually add one test file to
> > version
> > >>>>> control, and there are still duplicate lines.
> > >>>>> I hope this patch fixes the remaining problems.
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Peter
> > >>>>>
> > >>>>>
> > >>>>> On 29.01.2016 at 10:34, Peter Klügl wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> the problems were caused by the svn client in my Eclipse. Sorry
> > >>>>>> for
> > the
> > >>>>>> trouble, I should have looked more closely at the complete patch.
> > >>>>>>
> > >>>>>> I attached a new patch created with command-line tools which
> > >>>>>> looks
> > >>>
> > >>> correct
> > >>>>>>
> > >>>>>> now.
> > >>>>>>
> > >>>>>> Pei, can you apply the new patch?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>>
> > >>>>>> Peter
> > >>>>>>
> > >>>>>> On 28.01.2016 at 15:57, Peter Klügl wrote:
> > >>>>>>>
> > >>>>>>> Thanks Pei.
> > >>>>>>>
> > >>>>>>> I fear there was again a problem with the patch. All new files
> > >>>>>>> are missing (and also the svn-ignore settings).
> > >>>>>>>
> > >>>>>>> Can you take a look?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>>
> > >>>>>>> Peter
> > >>>>>>>
> > >>>>>>> On 28.01.2016 at 14:43, Pei Chen wrote:
> > >>>>>>>>
> > >>>>>>>> patch applied.
> > >>>>>>>> Thanks,
> > >>>>>>>> Pei
> > >>>>>>>>
> > >>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Pei,
> > >>>>>>>>>
> > >>>>>>>>> can you commit the recent patch for us?
> > >>>>>>>>>
> > >>>>>>>>> CTAKES-384-20160120.patch
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>>
> > >>>>>>>>> Peter
> > >>>>>>>>>
> > >>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>> Sorry I was swamped recently.
> > >>>>>>>>>> But yeah, we can even create an extended type system to
> > >>>>>>>>>> store
> > >>>
> > >>> these items temporarily and add them into the main/core type
> > >>> system afterwards.
> > >>>>>>>>>>
> > >>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
> > >>>>>>>>>> will
> > >>>
> > >>> require much more testing.  If it works, we can upgrade it in our
> > sandbox
> > >>> area or create a branch if necessary.
> > >>>>>>>>>>
> > >>>>>>>>>> —Pei
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> a new patch is attached.
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Pei:
> > >>>>>>>>>>> are there suitable annotation types in the cTAKES type
> system?
> > >>>
> > >>> Some
> > >>>>>>>>>>>
> > >>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I
> > >>>>>>>>>>> map it
> > to
> > >>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> > >>>
> > >>> features...
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Azad:
> > >>>>>>>>>>> I changed the rules a bit, especially the capitalization
> > >>>>>>>>>>> like I
> > >>>
> > >>> use it
> > >>>>>>>>>>>
> > >>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by
> > >>>>>>>>>>> the
> > maven
> > >>>>>>>>>>> plugin. I also added the two regexes for url and email. I
> > >>>
> > >>> extended the
> > >>>>>>>>>>>
> > >>>>>>>>>>> regex for the url. I also changed the evaluation order of
> > >>>>>>>>>>> some
> > >>>
> > >>> rules
> > >>>>>>>>>>>
> > >>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv
> > >>>>>>>>>>> for
> > >>>
> > >>> the unit
> > >>>>>>>>>>>
> > >>>>>>>>>>> tests.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Let me know if you need more information about the changes.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Do you wanna have help with the other rule sets? Or should
> > >>>>>>>>>>> we
> > >>>
> > >>> split them up?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Peter
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> great. I will integrate them in the project and in the
> > >>>>>>>>>>>> next
> > >>>
> > >>> patch.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Peter
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Three NERs translated and uploaded.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> PS. I will validate all NERs once we have them all
> completed.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> > >>>
> > >>> azad.dehghan@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
> > >>>>>>>>>>>>>> any
> > >>>
> > >>> more volunteers
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
> > >>>>>>>>>>>>>> <peter.kluegl@averbis.com
> > >
> > >>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it.
> > >>>
> > >>> Unfortunately,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> there is just no spare time right now. I hope I will
> > >>>>>>>>>>>>>>> be able
> > >>>
> > >>> to provide
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> the patches in December.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Peter
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 06.11.2015 at 16:40, Pei Chen wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Peter,
> > >>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good
> > >>>>>>>>>>>>>>>> starting
> > >>>
> > >>> point at least
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
> > >>>>>>>>>>>>>>>> good
> > if
> > >>>
> > >>> we use
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring
> > >>>>>>>>>>>>>>>> components
> > >>>
> > >>> together and
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> generate desc's as secondary...
> > >>>>>>>>>>>>>>>> I think the actual components that would be required
> > >>>>>>>>>>>>>>>> is
> > >>>
> > >>> probably best
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> left up to what is actually required for best
> > >>>>>>>>>>>>>>>> performing
> > >>>
> > >>> c-deid.  The
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we
> > >>>>>>>>>>>>>>>> should
> > treat
> > >>>
> > >>> this as
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> an independent preprocessing component or part of a
> > pipeline
> > >>>
> > >>> (in which
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> case, we may need to propose a change to the type
> > >>>>>>>>>>>>>>>> system or
> > >>>
> > >>> perhaps an
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> > >>>
> > >>> discussion to
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> the dev group as you see fit.)
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> My 2 cents...
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an
> > >>>>>>>>>>>>>>>>> example on
> > >>>
> > >>> how the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cTAKES
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> community develops or what a project should look like?
> > >>>>>>>>>>>>>>>>> I learned that different people set up UIMA project
> > >>>>>>>>>>>>>>>>> in a
> > >>>
> > >>> quite
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> different
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some
> > >>>>>>>>>>>>>>>>> sort of
> > >>>
> > >>> out-dated"
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Are there restriction or preferences about the
> > preprocessing
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> components
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> > project.
> > >>>>>>>>>>>>>>>>> Components: On which components may the components
> rely:
> > >>>
> > >>> tokenizer,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> parser, ... dict lookup?
> > >>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> > single
> > >>>
> > >>> AE?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> More comments below.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 03.11.2015 at 16:54, Azad Dehghan wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
> > >>>>>>>>>>>>>>>>>>> avoid
> > >>>
> > >>> duplicate
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to
> RUTA.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta
> > >>>>>>>>>>>>>>>>> Workbench if
> > >>>
> > >>> you want, or
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
> > >>>>>>>>>>>>>>>>>>> for
> > the
> > >>>
> > >>> initial
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
> > >>>>>>>>>>>>>>>>>>> contribute it
> > >>>
> > >>> too?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly
> > >>>>>>>>>>>>>>>>>> available;
> > i2b2
> > >>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
> > >>>>>>>>>>>>>>>>>> typically
> > >>>
> > >>> releases the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> data
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is
> > >>>>>>>>>>>>>>>>>> done on
> > an
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> individual basis
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
> > >>>>>>>>>>>>>>>>>> the
> > >>>
> > >>> validation.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to
> > >>>>>>>>>>>>>>>>> the
> > >>>
> > >>> dataset here.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> My first step would be:
> > >>>>>>>>>>>>>>>>>>> - set up a maven project
> > >>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
> > >>>>>>>>>>>>>>>>>>> cTAKES
> > >>>
> > >>> components
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd
> > >>>>>>>>>>>>>>>>>>> party
> > libs
> > >>>
> > >>> jars that
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be
> > >>>>>>>>>>>>>>>>>>> sure to
> > >>>
> > >>> take a look
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> at
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> that over the next few weeks.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> —Pei
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced, there
> > >>>>>>>>>>>>>>>>>> should
> > >>>
> > >>> not be a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> need to
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> > independent
> > >>>
> > >>> component
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this
> > >>>>>>>>>>>>>>>>>> method
> > >>>
> > >>> have shown
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> useful
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely
> > >>>>>>>>>>>>>>>>>> useful
> > >>>
> > >>> independent
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> of the
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> deid component.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>>>>>>
> > >>>
> > >
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by andy mcmurry <mc...@gmail.com>.

*For cross-validation, you can evaluate de-identified notes data from the
i2b2 challenge:*
https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-scrubber-deid/data/models/

*Methods for model generation of the FeatureSet are described here:*

*Improved de-identification of physician notes through integrative modeling
of both public and private medical text*
http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-13-112

A major objective of that study was to provide external examples for
cross-training or retraining other methods.

Hope this helps,
--Andy
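
For anyone cross-validating against these models, here is a minimal sketch of
how precision, recall and F1 (the "p, r and f1" mentioned in the quoted
message below) could be computed over PHI annotations. It assumes exact-match
scoring on (document id, start offset, end offset, PHI type) tuples; that
tuple layout and the function name are illustrative assumptions only. This is
not the official i2b2 evaluation script and not code from ctakes-scrubber-deid.

```python
from typing import Iterable, Tuple

# Hypothetical annotation layout: (doc_id, start_offset, end_offset, phi_type)
Span = Tuple[str, int, int, str]


def precision_recall_f1(gold: Iterable[Span], predicted: Iterable[Span]):
    """Exact-match precision, recall and F1 over sets of PHI spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = [("d1", 0, 4, "NAME"), ("d1", 10, 20, "DATE")]
    pred = [("d1", 0, 4, "NAME"), ("d1", 30, 35, "NAME")]
    print(precision_recall_f1(gold, pred))
```

Exact matching is the strictest choice; a token-level or overlap-based
criterion would give higher scores on the same output, so the matching rule
should be stated alongside any reported numbers.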



On Thu, Mar 10, 2016 at 1:27 PM, Savova, Guergana <
Guergana.Savova@childrens.harvard.edu> wrote:

> You can re-build the models that feed into MIST. I personally would not
> use the default model that MIST comes with as it is not trained on clinical
> data. In our previous work we found that hand-annotating about 200 docs for
> PHI (representative of the sample you are going to run the models on)
> results in building a pretty good model - in the 90's for p, r and f1.
> However, even with that high performance, the institution that owns the
> data might be still reluctant to share as it might pose a violation of
> HIPAA through some potential PHI leaks. In cTAKES our approach has been to
> de-couple the de-identification from the NLP/information extraction. If a
> user has the need for de-identified data, they could choose their method --
> manual or otherwise -- and then process through cTAKES. Our focus is the
> NLP/IE space, while de-identification is a blend of that plus policy....
>
> --Guergana
>
> -----Original Message-----
> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> Sent: Thursday, March 10, 2016 4:19 PM
> To: dev@ctakes.apache.org
> Subject: RE: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
>
> Thanks Guergana.
>
> > Yes, the current release of cTAKES has a module for the temporal
> expressions which includes dates. The normalizer for the temporal
> expressions is Steven Bethard's timenorm code.
> >
>
> Great.
>
> > However, if you do de-identification of dates/temporal expressions,
> > you
> run the risk of creating incorrect timelines as many of the relative
> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
> unlikely to be correctly shifted by any de-identification tool.
> >
> Indeed, a reason I have not included the dates component.
>
> > One de-identification tool is MIST --
> http://mist-deid.sourceforge.net/
> .
> >
> I don't remember them doing well in the community held evaluation in 2014.
> Hence, cDeid :)
> >
> > Guergana Savova, PhD, FACMI
> > Associate Professor
> > PI Natural Language Processing Lab
> > Boston Children's Hospital and Harvard Medical School
> > 300 Longwood Avenue
> > Mailstop: BCH3092
> > Enders 144.1
> > Boston, MA 02115
> > Tel: (617) 919-2972
> > Fax: (617) 730-0817
> > Harvard Scholar:
> > http://scholar.harvard.edu/guergana_k_savova/biocv
> >
> > -----Original Message-----
> > From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> > Sent: Thursday, March 10, 2016 3:42 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
> >
> > > This means both training data folders? I have access to the data but
> > > not
> > to the challenge description.
> >
> > Yes. Is there any specific information that you are missing?
> > >
> > >
> > >> It would be good to incorporate/refactor (basically, GATE API needs
> > >> to be replaced with UIMA API to generate annotation) the two-pass
> > >> recognition method for cTAKES - which has a wider application on
> longitudinal data.
> > >> This method is used on-top of a number NERs.
> > >
> > >
> > > I'll take a look.
> > >
> > > I do not know how much time I can invest this month. Let's see how
> > > many
> > phases I can translate.
> > >
> > > I added the rules for age. Are there jape rules for creating date
> > annotations?
> > >
> >
> > No. I believe cTAKES has existing component(s) to capture dates?
> >
> > > After all rules are translated, they need some major refactoring.
> > > Jape
> > and Ruta are quite different in some aspects.
> > >
> > Ok.
> >
> > >
> > >
> > >
> > >
> > >
> > >> Please let me know where I can help. I will be available again in
> April.
> > >>
> > >> Cheers,
> > >> Azad
> > >>
> > >> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com>
> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> sorry, I was quite busy last month.
> > >>>
> > >>> I added a new patch, which needs to be applied.
> > >>>
> > >>> No new rules, but it's possible now to evaluate everything against
> > >>> the labelled data of the challenge.
> > >>>
> > >>> @Azad:
> > >>> Which documents exactly did you use to develop the rules?
> > >>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> > testing-PHI-Gold-fixed?
> > >>>
> > >>> Best,
> > >>>
> > >>> Peter
> > >>>
> > >>> On 03.02.2016 at 09:05, Peter Klügl wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> the last patch fixed almost all problems.
> > >>>>
> > >>>> I added another one that adds the csv file for the unit test and
> > extends
> > >>>> svn-ignore.
> > >>>>
> > >>>> Best,
> > >>>>
> > >>>> Peter
> > >>>>
> > >>>> On 02.02.2016 at 09:16, Peter Klügl wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I added another patch. I missed to manually add one test file to
> > version
> > >>>>> control, and there are still duplicate lines.
> > >>>>> I hope this patch fixes the remaining problems.
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Peter
> > >>>>>
> > >>>>>
> > >>>>> On 29.01.2016 at 10:34, Peter Klügl wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> the problems were caused by the svn client in my Eclipse. Sorry
> > >>>>>> for
> > the
> > >>>>>> trouble, I should have looked more closely at the complete patch.
> > >>>>>>
> > >>>>>> I attached a new patch created with command-line tools which
> > >>>>>> looks
> > >>>
> > >>> correct
> > >>>>>>
> > >>>>>> now.
> > >>>>>>
> > >>>>>> Pei, can you apply the new patch?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>>
> > >>>>>> Peter
> > >>>>>>
> > >>>>>> On 28.01.2016 at 15:57, Peter Klügl wrote:
> > >>>>>>>
> > >>>>>>> Thanks Pei.
> > >>>>>>>
> > >>>>>>> I fear there was again a problem with the patch. All new files
> > >>>>>>> are missing (and also the svn-ignore settings).
> > >>>>>>>
> > >>>>>>> Can you take a look?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>>
> > >>>>>>> Peter
> > >>>>>>>
> > >>>>>>> On 28.01.2016 at 14:43, Pei Chen wrote:
> > >>>>>>>>
> > >>>>>>>> patch applied.
> > >>>>>>>> Thanks,
> > >>>>>>>> Pei
> > >>>>>>>>
> > >>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Pei,
> > >>>>>>>>>
> > >>>>>>>>> can you commit the recent patch for us?
> > >>>>>>>>>
> > >>>>>>>>> CTAKES-384-20160120.patch
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>>
> > >>>>>>>>> Peter
> > >>>>>>>>>
> > >>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>> Sorry I was swamped recently.
> > >>>>>>>>>> But yeah, we can even create an extended type system to
> > >>>>>>>>>> store
> > >>>
> > >>> these items temporarily and add them into the main/core type
> > >>> system afterwards.
> > >>>>>>>>>>
> > >>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it
> > >>>>>>>>>> will
> > >>>
> > >>> require much more testing.  If it works, we can upgrade it in our
> > sandbox
> > >>> area or create a branch if necessary.
> > >>>>>>>>>>
> > >>>>>>>>>> —Pei
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> a new patch is attached.
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Pei:
> > >>>>>>>>>>> are there suitable annotation types in the cTAKES type
> system?
> > >>>
> > >>> Some
> > >>>>>>>>>>>
> > >>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I
> > >>>>>>>>>>> map it
> > to
> > >>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> > >>>
> > >>> features...
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Azad:
> > >>>>>>>>>>> I changed the rules a bit, especially the capitalization
> > >>>>>>>>>>> like I
> > >>>
> > >>> use it
> > >>>>>>>>>>>
> > >>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by
> > >>>>>>>>>>> the
> > maven
> > >>>>>>>>>>> plugin. I also added the two regexes for url and email. I
> > >>>
> > >>> extended the
> > >>>>>>>>>>>
> > >>>>>>>>>>> regex for the url. I also changed the evaluation order of
> > >>>>>>>>>>> some
> > >>>
> > >>> rules
> > >>>>>>>>>>>
> > >>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv
> > >>>>>>>>>>> for
> > >>>
> > >>> the unit
> > >>>>>>>>>>>
> > >>>>>>>>>>> tests.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Let me know if you need more information about the changes.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Do you wanna have help with the other rule sets? Or should
> > >>>>>>>>>>> we
> > >>>
> > >>> split them up?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Peter
> > >>>>>>>>>>>
> > >>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> great. I will integrate them in the project and in the
> > >>>>>>>>>>>> next
> > >>>
> > >>> patch.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Peter
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Three NERs translated and uploaded.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> PS. I will validate all NERs once we have them all
> completed.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> > >>>
> > >>> azad.dehghan@gmail.com> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are
> > >>>>>>>>>>>>>> any
> > >>>
> > >>> more volunteers
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
> > >>>>>>>>>>>>>> <peter.kluegl@averbis.com
> > >
> > >>>
> > >>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it.
> > >>>
> > >>> Unfortunately,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> there is just no spare time right now. I hope I will
> > >>>>>>>>>>>>>>> be able
> > >>>
> > >>> to provide
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> the patches in December.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Peter
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Peter,
> > >>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good
> > >>>>>>>>>>>>>>>> starting
> > >>>
> > >>> point at least
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be
> > >>>>>>>>>>>>>>>> good
> > if
> > >>>
> > >>> we use
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring
> > >>>>>>>>>>>>>>>> components
> > >>>
> > >>> together and
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> generate desc's as secondary...
> > >>>>>>>>>>>>>>>> I think the actual components that would be required
> > >>>>>>>>>>>>>>>> is
> > >>>
> > >>> probably best
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> left up to what is actually required for best
> > >>>>>>>>>>>>>>>> performing
> > >>>
> > >>> c-deid.  The
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we
> > >>>>>>>>>>>>>>>> should
> > treat
> > >>>
> > >>> this as
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> an independent preprocessing component or part of a
> > pipeline
> > >>>
> > >>> (in which
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> case, we may need to propose a change to the type
> > >>>>>>>>>>>>>>>> system or
> > >>>
> > >>> perhaps an
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> > >>>
> > >>> discussion to
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> the dev group as you see fit.)
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> My 2 cents...
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> > >>>
> > >>> peter.kluegl@averbis.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an
> > >>>>>>>>>>>>>>>>> example on
> > >>>
> > >>> how the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> cTAKES
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> community develops or what a project should look like?
> > >>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects
> > >>>>>>>>>>>>>>>>> in a
> > >>>
> > >>> quite
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> different
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some
> > >>>>>>>>>>>>>>>>> sort of
> > >>>
> > >>> out-dated"
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Are there restrictions or preferences about the
> > preprocessing
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> components
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> > project.
> > >>>>>>>>>>>>>>>>> Components: On which components may the components
> rely:
> > >>>
> > >>> tokenizer,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> ...
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> parser, ... dict lookup?
> > >>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> > single
> > >>>
> > >>> AE?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> More comments below.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to
> > >>>>>>>>>>>>>>>>>>> avoid
> > >>>
> > >>> duplicate
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> work
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I would like to help with translating JAPE to
> RUTA.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta
> > >>>>>>>>>>>>>>>>> Workbench if
> > >>>
> > >>> you want, or
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized
> > >>>>>>>>>>>>>>>>>>> for
> > the
> > >>>
> > >>> initial
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to
> > >>>>>>>>>>>>>>>>>>> contribute it
> > >>>
> > >>> too?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly
> > >>>>>>>>>>>>>>>>>> available;
> > i2b2
> > >>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
> > >>>>>>>>>>>>>>>>>> typically
> > >>>
> > >>> releases the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> data
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is
> > >>>>>>>>>>>>>>>>>> done on
> > an
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> individual basis
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> and involves a Data Use Agreement.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate
> > >>>>>>>>>>>>>>>>>> the
> > >>>
> > >>> validation.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to
> > >>>>>>>>>>>>>>>>> the
> > >>>
> > >>> dataset here.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> My first step would be:
> > >>>>>>>>>>>>>>>>>>> - set up a maven project
> > >>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with
> > >>>>>>>>>>>>>>>>>>> cTAKES
> > >>>
> > >>> components
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd
> > >>>>>>>>>>>>>>>>>>> party
> > libs
> > >>>
> > >>> jars that
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be
> > >>>>>>>>>>>>>>>>>>> sure to
> > >>>
> > >>> take a look
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> at
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> that over the next few weeks.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> —Pei
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there
> > >>>>>>>>>>>>>>>>>> should
> > >>>
> > >>> not be a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> need to
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> > independent
> > >>>
> > >>> component
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this
> > >>>>>>>>>>>>>>>>>> method
> > >>>
> > >>> has proven
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> useful
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely
> > >>>>>>>>>>>>>>>>>> useful
> > >>>
> > >>> independent
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> of the
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> deid component.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>>> Azad
> > >>>>>>>>>>>>>>>>>>
> > >>>
> > >
>

RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.

You can re-build the models that feed into MIST. I personally would not use the default model that MIST comes with as it is not trained on clinical data. In our previous work we found that hand-annotating about 200 docs for PHI (representative of the sample you are going to run the models on) results in a pretty good model - in the 90s for p, r and f1. However, even with that high performance, the institution that owns the data might still be reluctant to share, as potential PHI leaks could constitute a HIPAA violation. In cTAKES our approach has been to de-couple the de-identification from the NLP/information extraction. If a user has the need for de-identified data, they could choose their method -- manual or otherwise -- and then process through cTAKES. Our focus is the NLP/IE space, while de-identification is a blend of that plus policy....

--Guergana

-----Original Message-----
From: Azad Dehghan [mailto:azad.dehghan@gmail.com] 
Sent: Thursday, March 10, 2016 4:19 PM
To: dev@ctakes.apache.org
Subject: RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Thanks Guergana.

> Yes, the current release of cTAKES has a module for the temporal
> expressions which includes dates. The normalizer for the temporal
> expressions is Steven Bethard's timenorm code.
>

Great.

> However, if you do de-identification of dates/temporal expressions, you
> run the risk of creating incorrect timelines as many of the relative
> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
> unlikely to be correctly shifted by any de-identification tool.
>
Indeed, a reason I have not included the dates component.

> One de-identification tool is MIST -- http://mist-deid.sourceforge.net/ .
>
I don't remember them doing well in the community held evaluation in 2014.
Hence, cDeid :)
>
> Guergana Savova, PhD, FACMI
> Associate Professor
> PI Natural Language Processing Lab
> Boston Children's Hospital and Harvard Medical School
> 300 Longwood Avenue
> Mailstop: BCH3092
> Enders 144.1
> Boston, MA 02115
> Tel: (617) 919-2972
> Fax: (617) 730-0817
> Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
>
> -----Original Message-----
> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> Sent: Thursday, March 10, 2016 3:42 PM
> To: dev@ctakes.apache.org
> Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives
>
> > This means both training data folders? I have access to the data but
> > not to the challenge description.
>
> Yes. Is there any specific information that you are missing?
> >
> >
> >> It would be good to incorporate/refactor (basically, GATE API needs
> >> to be replaced with UIMA API to generate annotation) the two-pass
> >> recognition method for cTAKES - which has a wider application on
> >> longitudinal data.
> >> This method is used on top of a number of NERs.
> >
> >
> > I'll take a look.
> >
> > I do not know how much time I can invest this month. Let's see how
> > many phases I can translate.
> >
> > I added the rules for age. Are there jape rules for creating date
> > annotations?
> >
>
> No. I believe cTAKES has existing component(s) to capture dates?
>
> > After all rules are translated, they need some major refactoring.
> > Jape and Ruta are quite different in some aspects.
> >
> Ok.
>
> >
> >
> >
> >
> >
> >> Please let me know where I can help. I will be available again in
> >> April.
> >>
> >> Cheers,
> >> Azad
> >>
> >> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> sorry, I was quite busy last month.
> >>>
> >>> I added a new patch, which needs to be applied.
> >>>
> >>> No new rules, but it's possible now to evaluate everything against 
> >>> the labelled data of the challenge.
> >>>
> >>> @Azad:
> >>> Which documents exactly did you use to develop the rules?
> >>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> testing-PHI-Gold-fixed?
> >>>
> >>> Best,
> >>>
> >>> Peter
> >>>
> >>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> >>>>
> >>>> Hi,
> >>>>
> >>>> the last patch fixed almost all problems.
> >>>>
> >>>> I added another one that adds the csv file for the unit test and
> extends
> >>>> svn-ignore.
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I added another patch. I forgot to manually add one test file to
> version
> >>>>> control, and there are still duplicate lines.
> >>>>> I hope this patch fixes the remaining problems.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>>
> >>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> the problems were caused by the svn client in my Eclipse. Sorry 
> >>>>>> for
> the
> >>>>>> trouble, I should have looked more closely at the complete patch.
> >>>>>>
> >>>>>> I attached a new patch created with command-line tools which
> >>>>>> looks
> >>>
> >>> correct
> >>>>>>
> >>>>>> now.
> >>>>>>
> >>>>>> Pei, can you apply the new patch?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>>>>>
> >>>>>>> Thanks Pei.
> >>>>>>>
> >>>>>>> I fear there was again a problem with the patch. All new files 
> >>>>>>> are missing (and also the svn-ignore settings).
> >>>>>>>
> >>>>>>> Can you take a look?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Peter
> >>>>>>>
> >>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>>>>>
> >>>>>>>> patch applied.
> >>>>>>>> Thanks,
> >>>>>>>> Pei
> >>>>>>>>
> >>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> >>>
> >>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Pei,
> >>>>>>>>>
> >>>>>>>>> can you commit the recent patch for us?
> >>>>>>>>>
> >>>>>>>>> CTAKES-384-20160120.patch
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Peter
> >>>>>>>>>
> >>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> Sorry I was swamped recently.
> >>>>>>>>>> But yeah, we can even create an extended type system to 
> >>>>>>>>>> store
> >>>
> >>> these items temporarily and add them into the main/core type 
> >>> system afterwards.
> >>>>>>>>>>
> >>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it 
> >>>>>>>>>> will
> >>>
> >>> require much more testing.  If it works, we can upgrade it in our
> sandbox
> >>> area or create a branch if necessary.
> >>>>>>>>>>
> >>>>>>>>>> —Pei
> >>>>>>>>>>
> >>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> >>>
> >>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> a new patch is attached.
> >>>>>>>>>>>
> >>>>>>>>>>> @Pei:
> >>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
> >>>
> >>> Some
> >>>>>>>>>>>
> >>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I 
> >>>>>>>>>>> map it
> to
> >>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
> >>>
> >>> features...
> >>>>>>>>>>>
> >>>>>>>>>>> @Azad:
> >>>>>>>>>>> I changed the rules a bit, especially the capitalization 
> >>>>>>>>>>> like I
> >>>
> >>> use it
> >>>>>>>>>>>
> >>>>>>>>>>> in ruta normally. The wordlists are compiled to a trie by
> >>>>>>>>>>> the
> maven
> >>>>>>>>>>> plugin. I also added the two regexes for url and email. I
> >>>
> >>> extended the
> >>>>>>>>>>>
> >>>>>>>>>>> regex for the url. I also changed the evaluation order of 
> >>>>>>>>>>> some
> >>>
> >>> rules
> >>>>>>>>>>>
> >>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv 
> >>>>>>>>>>> for
> >>>
> >>> the unit
> >>>>>>>>>>>
> >>>>>>>>>>> tests.
> >>>>>>>>>>>
> >>>>>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>>>>
> >>>>>>>>>>> Do you wanna have help with the other rule sets? Or should 
> >>>>>>>>>>> we
> >>>
> >>> split them up?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Peter
> >>>>>>>>>>>
> >>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> great. I will integrate them in the project and in the 
> >>>>>>>>>>>> next
> >>>
> >>> patch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Peter
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> PS. I will validate all NERs once we have them all
completed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> >>>
> >>> azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are 
> >>>>>>>>>>>>>> any
> >>>
> >>> more volunteers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl"
> >>>>>>>>>>>>>> <peter.kluegl@averbis.com
> >
> >>>
> >>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it.
> >>>
> >>> Unfortunately,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> there is just no spare time right now. I hope I will 
> >>>>>>>>>>>>>>> be able
> >>>
> >>> to provide
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good 
> >>>>>>>>>>>>>>>> starting
> >>>
> >>> point at least
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be 
> >>>>>>>>>>>>>>>> good
> if
> >>>
> >>> we use
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring 
> >>>>>>>>>>>>>>>> components
> >>>
> >>> together and
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>>>>> I think the actual components that would be required 
> >>>>>>>>>>>>>>>> is
> >>>
> >>> probably best
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> left up to what is actually required for best 
> >>>>>>>>>>>>>>>> performing
> >>>
> >>> c-deid.  The
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we 
> >>>>>>>>>>>>>>>> should
> treat
> >>>
> >>> this as
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> an independent preprocessing component or part of a
> pipeline
> >>>
> >>> (in which
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> case, we may need to propose a change to the type 
> >>>>>>>>>>>>>>>> system or
> >>>
> >>> perhaps an
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> >>>
> >>> discussion to
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> >>>
> >>> peter.kluegl@averbis.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an 
> >>>>>>>>>>>>>>>>> example on
> >>>
> >>> how the
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> cTAKES
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> community develops or what a project should look like?
> >>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects
> >>>>>>>>>>>>>>>>> in a
> >>>
> >>> quite
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> manner and I do not want to get inspired by "some
> >>>>>>>>>>>>>>>>> sort of
> >>>
> >>> out-dated"
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Are there restrictions or preferences about the
> preprocessing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> components
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the
> project.
> >>>>>>>>>>>>>>>>> Components: On which components may the components rely:
> >>>
> >>> tokenizer,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> parser, ... dict lookup?
> >>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a
> single
> >>>
> >>> AE?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to 
> >>>>>>>>>>>>>>>>>>> avoid
> >>>
> >>> duplicate
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> work
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I would like to help with translating JAPE to RUTA.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta 
> >>>>>>>>>>>>>>>>> Workbench if
> >>>
> >>> you want, or
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized 
> >>>>>>>>>>>>>>>>>>> for
> the
> >>>
> >>> initial
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to 
> >>>>>>>>>>>>>>>>>>> contribute it
> >>>
> >>> too?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly 
> >>>>>>>>>>>>>>>>>> available;
> i2b2
> >>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php>
> >>>>>>>>>>>>>>>>>> typically
> >>>
> >>> releases the
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is 
> >>>>>>>>>>>>>>>>>> done on
> an
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> individual basis
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> and involves a Data Use Agreement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate 
> >>>>>>>>>>>>>>>>>> the
> >>>
> >>> validation.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to 
> >>>>>>>>>>>>>>>>> the
> >>>
> >>> dataset here.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with 
> >>>>>>>>>>>>>>>>>>> cTAKES
> >>>
> >>> components
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd 
> >>>>>>>>>>>>>>>>>>> party
> libs
> >>>
> >>> jars that
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be 
> >>>>>>>>>>>>>>>>>>> sure to
> >>>
> >>> take a look
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there
> >>>>>>>>>>>>>>>>>> should
> >>>
> >>> not be a
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> need to
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an
> independent
> >>>
> >>> component
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this 
> >>>>>>>>>>>>>>>>>> method
> >>>
> >>> has proven
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> useful
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely 
> >>>>>>>>>>>>>>>>>> useful
> >>>
> >>> independent
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>>
> >>>
> >

RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Thanks Guergana.

> Yes, the current release of cTAKES has a module for the temporal
> expressions which includes dates. The normalizer for the temporal
> expressions is Steven Bethard's timenorm code.
>

Great.

> However, if you do de-identification of dates/temporal expressions, you
> run the risk of creating incorrect timelines as many of the relative
> temporal expressions (e.g. spring of this year, x-mas time, etc.) are
> unlikely to be correctly shifted by any de-identification tool.
>
Indeed, a reason I have not included the dates component.
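
To illustrate the timeline problem, here is a minimal sketch, assuming a hypothetical de-identifier that only shifts explicit ISO dates (YYYY-MM-DD) by a fixed per-patient offset. The function name shift_dates and the regex are made up for illustration and are not part of cDeid, MIST or cTAKES.

```python
import re
from datetime import date, timedelta

# Matches explicit ISO dates only; relative expressions are not covered.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_dates(text: str, offset_days: int) -> str:
    """Shift every explicit YYYY-MM-DD date by offset_days.

    Hypothetical helper: anything that is not an explicit ISO date
    (e.g. "spring of this year") is left untouched.
    """
    def _shift(match: re.Match) -> str:
        year, month, day = (int(g) for g in match.groups())
        # date() raises ValueError for impossible dates such as 2014-13-40.
        return (date(year, month, day) + timedelta(days=offset_days)).isoformat()

    return ISO_DATE.sub(_shift, text)


note = "Admitted 2014-03-10. Symptoms began in spring of this year."
shifted = shift_dates(note, 120)
print(shifted)
```

After the shift the admission date falls in July, while "spring of this year" is unchanged, so the shifted note no longer describes a consistent timeline.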

> One de-identification tool is MIST -- http://mist-deid.sourceforge.net/ .
>
I don't remember them doing well in the community held evaluation in 2014.
Hence, cDeid :)
>
> Guergana Savova, PhD, FACMI
> Associate Professor
> PI Natural Language Processing Lab
> Boston Children's Hospital and Harvard Medical School
> 300 Longwood Avenue
> Mailstop: BCH3092
> Enders 144.1
> Boston, MA 02115
> Tel: (617) 919-2972
> Fax: (617) 730-0817
> Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
>
> -----Original Message-----
> From: Azad Dehghan [mailto:azad.dehghan@gmail.com]
> Sent: Thursday, March 10, 2016 3:42 PM
> To: dev@ctakes.apache.org
> Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives
>
> > This means both training data folders? I have access to the data but
> > not to the challenge description.
>
> Yes. Is there any specific information that you are missing?
> >
> >
> >> It would be good to incorporate/refactor (basically, GATE API needs
> >> to be replaced with UIMA API to generate annotation) the two-pass
> >> recognition method for cTAKES - which has a wider application on
> >> longitudinal data.
> >> This method is used on top of a number of NERs.
> >
> >
> > I'll take a look.
> >
> > I do not know how much time I can invest this month. Let's see how
> > many phases I can translate.
> >
> > I added the rules for age. Are there jape rules for creating date
> > annotations?
> >
>
> No. I believe cTAKES has existing component(s) to capture dates?
>
> > After all rules are translated, they need some major refactoring. Jape
> > and Ruta are quite different in some aspects.
> >
> Ok.
>
> >
> >
> >
> >
> >
> >> Please let me know where I can help. I will be available again in
> >> April.
> >>
> >> Cheers,
> >> Azad
> >>
> >> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> sorry, I was quite busy last month.
> >>>
> >>> I added a new patch, which needs to be applied.
> >>>
> >>> No new rules, but it's possible now to evaluate everything against
> >>> the labelled data of the challenge.
> >>>
> >>> @Azad:
> >>> Which documents exactly did you use to develop the rules?
> >>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
> testing-PHI-Gold-fixed?
> >>>
> >>> Best,
> >>>
> >>> Peter
> >>>
> >>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> >>>>
> >>>> Hi,
> >>>>
> >>>> the last patch fixed almost all problems.
> >>>>
> >>>> I added another one that adds the csv file for the unit test and
> extends
> >>>> svn-ignore.
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I added another patch. I forgot to manually add one test file to
> version
> >>>>> control, and there are still duplicate lines.
> >>>>> I hope this patch fixes the remaining problems.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>>
> >>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> the problems were caused by the svn client in my Eclipse. Sorry
> >>>>>> for
> the
> >>>>>> trouble, I should have looked more closely at the complete patch.
> >>>>>>
> >>>>>> I attached a new patch created with command-line tools which looks
> >>>
> >>> correct
> >>>>>>
> >>>>>> now.
> >>>>>>
> >>>>>> Pei, can you apply the new patch?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>>>>>
> >>>>>>> Thanks Pei.
> >>>>>>>
> >>>>>>> I fear there was again a problem with the patch. All new files
> >>>>>>> are missing (and also the svn-ignore settings).
> >>>>>>>
> >>>>>>> Can you take a look?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Peter
> >>>>>>>
> >>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>>>>>
> >>>>>>>> patch applied.
> >>>>>>>> Thanks,
> >>>>>>>> Pei
> >>>>>>>>
> >>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> >>>
> >>> peter.kluegl@averbis.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Pei,
> >>>>>>>>>
> >>>>>>>>> can you commit the recent patch for us?
> >>>>>>>>>
> >>>>>>>>> CTAKES-384-20160120.patch
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Peter
> >>>>>>>>>
> >>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> Sorry I was swamped recently.
> >>>>>>>>>> But yeah, we can even create an extended type system to store these
> >>>>>>>>>> items temporarily and add them into the main/core type system
> >>>>>>>>>> afterwards.
> >>>>>>>>>>
> >>>>>>>>>> There was an existing item to upgrade UIMA, but agreed, it will
> >>>>>>>>>> require much more testing.  If it works, we can upgrade it in our
> >>>>>>>>>> sandbox area or create a branch if necessary.
> >>>>>>>>>>
> >>>>>>>>>> —Pei
> >>>>>>>>>>
> >>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> a new patch is attached.
> >>>>>>>>>>>
> >>>>>>>>>>> @Pei:
> >>>>>>>>>>> are there suitable annotation types in the cTAKES type system? Some
> >>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
> >>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty features...
> >>>>>>>>>>>
> >>>>>>>>>>> @Azad:
> >>>>>>>>>>> I changed the rules a bit, especially the capitalization, to match how
> >>>>>>>>>>> I normally use it in Ruta. The word lists are compiled to a trie by the
> >>>>>>>>>>> Maven plugin. I also added the two regexes for URL and email, and I
> >>>>>>>>>>> extended the regex for the URL. I also changed the evaluation order of
> >>>>>>>>>>> some rules (with @). Feel free to add simple examples to examples.csv
> >>>>>>>>>>> for the unit tests.
> >>>>>>>>>>>
> >>>>>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>>>>
> >>>>>>>>>>> Do you want help with the other rule sets? Or should we split them up?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Peter
> >>>>>>>>>>>
> >>>>>>>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> great. I will integrate them in the project and in the next patch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Peter
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any more volunteers
> >>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
> >>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able to provide
> >>>>>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Peter
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 06.11.2015 at 16:40, Pei Chen wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting point at least
> >>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if we use
> >>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components together and
> >>>>>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>>>>> I think the actual components that would be required are probably best
> >>>>>>>>>>>>>>>> left up to what is actually required for best performing c-deid.  The
> >>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat this as
> >>>>>>>>>>>>>>>> an independent preprocessing component or part of a pipeline (in which
> >>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or perhaps an
> >>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that discussion to
> >>>>>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example of how the cTAKES
> >>>>>>>>>>>>>>>>> community develops or what a project should look like?
> >>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects in quite different
> >>>>>>>>>>>>>>>>> ways, and I do not want to be inspired by "some sort of out-dated"
> >>>>>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Are there restrictions or preferences about the preprocessing components
> >>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the project?
> >>>>>>>>>>>>>>>>> Components: On which components may the components rely: tokenizer,
> >>>>>>>>>>>>>>>>> ... parser, ... dict lookup?
> >>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single AE?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 03.11.2015 at 16:54, Azad Dehghan wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate work
> >>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
> >>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the initial
> >>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it too?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
> >>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
> >>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an individual basis
> >>>>>>>>>>>>>>>>>> and involves a Data Use Agreement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the validation.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Ok, I'll investigate whether we already have access to the dataset here.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES components
> >>>>>>>>>>>>>>>>>>>   replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs jars that
> >>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look at
> >>>>>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should not be a need to
> >>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent component for
> >>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method has proven useful
> >>>>>>>>>>>>>>>>>> for general NER on longitudinal data and is surely useful independent of the
> >>>>>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>>>>
> >>>
> >

Re: developer installation instruction

Posted by Pei Chen <ch...@apache.org>.

Hi Dima,
Updated to http://subclipse.tigris.org/update_1.12.x
Thanks for pointed that out.
--Pei

On Thu, Mar 10, 2016 at 5:04 PM, Dligach, Dmitriy <dd...@luc.edu> wrote:
> It looks like a small update needs to be made to the developer installation.
>
> This page:
>
> https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Developer+Install+Guide
>
> references an older Subclipse update site. The newest one is
>
> http://subclipse.tigris.org/update_1.12.x
>
> Dima

developer installation instruction

Posted by "Dligach, Dmitriy" <dd...@luc.edu>.

It looks like a small update needs to be made to the developer installation. 

This page:

https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Developer+Install+Guide

references an older Subclipse update site. The newest one is

http://subclipse.tigris.org/update_1.12.x

Dima

RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.

Yes, the current release of cTAKES has a module for the temporal expressions which includes dates. The normalizer for the temporal expressions is Steven Bethard's timenorm code.

However, if you do de-identification of dates/temporal expressions, you run the risk of creating incorrect timelines as many of the relative temporal expressions (e.g. spring of this year, x-mas time, etc.) are unlikely to be correctly shifted by any de-identification tool.
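
The risk described above can be illustrated with a minimal Python sketch. It is not cTAKES, timenorm, or MIST code: the fixed 37-day offset, the ISO-only date pattern, and the function name shift_dates are assumptions made purely for illustration. The sketch shifts absolute dates by a constant offset and leaves everything else untouched, so a relative expression such as "x-mas time" passes through unchanged.

```python
import re
from datetime import date, timedelta

# Hypothetical fixed offset; a real tool would more likely use a per-patient offset.
SHIFT = timedelta(days=37)

# Matches only absolute ISO-style dates such as 2015-12-20 (an assumption of this sketch).
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_dates(text, shift=SHIFT):
    """Shift every absolute ISO date in text by `shift`; leave all other text as-is."""

    def repl(match):
        year, month, day = (int(g) for g in match.groups())
        try:
            shifted = date(year, month, day) + shift
        except ValueError:
            # Not a valid calendar date (e.g. 2015-13-45): keep the original text.
            return match.group(0)
        return shifted.isoformat()

    return ISO_DATE.sub(repl, text)


note = "Admitted 2015-12-20. Symptoms began around x-mas time."
print(shift_dates(note))
# prints: Admitted 2016-01-26. Symptoms began around x-mas time.
```

In the original note the symptoms begin a few days after admission; in the shifted note the unshifted "x-mas time" now falls about a month before the admission date, which is the kind of incorrect timeline mentioned above.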

One de-identification tool is MIST -- http://mist-deid.sourceforge.net/ . 

Hope this helps with the de-identification items....
--Guergana

Guergana Savova, PhD, FACMI
Associate Professor
PI Natural Language Processing Lab
Boston Children's Hospital and Harvard Medical School
300 Longwood Avenue
Mailstop: BCH3092
Enders 144.1
Boston, MA 02115
Tel: (617) 919-2972
Fax: (617) 730-0817
Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv

-----Original Message-----
From: Azad Dehghan [mailto:azad.dehghan@gmail.com] 
Sent: Thursday, March 10, 2016 3:42 PM
To: dev@ctakes.apache.org
Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

> This means both training data folders? I have access to the data but not
> to the challenge description.

Yes. Is there any specific information that you are missing?
>
>
>> It would be good to incorporate/refactor (basically, GATE API needs
>> to be replaced with UIMA API to generate annotations) the two-pass
>> recognition method for cTAKES, which has a wider application on longitudinal data.
>> This method is used on top of a number of NERs.
>
>
> I'll take a look.
>
> I do not know how much time I can invest this month. Let's see how many
> phases I can translate.
>
> I added the rules for age. Are there jape rules for creating date
> annotations?
>

No. I believe cTAKES has existing component(s) to capture dates?

> After all rules are translated, they need some major refactoring. Jape
> and Ruta are quite different in some aspects.
>
Ok.

>
>
>
>
>
>> Please let me know where I can help. I will be available again in April.
>>
>> Cheers,
>> Azad
>>
>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>>
>>> Hi,
>>>
>>> sorry, I was quite busy last month.
>>>
>>> I added a new patch, which needs to be applied.
>>>
>>> No new rules, but it's possible now to evaluate everything against 
>>> the labelled data of the challenge.
>>>
>>> @Azad:
>>> Which documents exactly did you use to develop the rules?
>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>> testing-PHI-Gold-fixed?
>>>
>>> Best,
>>>
>>> Peter
>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

> This means both training data folders? I have access to the data but not
> to the challenge description.

Yes. Is there any specific information that you are missing?
>
>
>> It would be good to incorporate/refactor (basically, GATE API needs to be
>> replaced with UIMA API to generate annotations) the two-pass recognition
>> method for cTAKES, which has a wider application on longitudinal data.
>> This method is used on top of a number of NERs.
>
>
> I'll take a look.
>
> I do not know how much time I can invest this month. Let's see how many
> phases I can translate.
>
> I added the rules for age. Are there jape rules for creating date
> annotations?
>

No. I believe cTAKES has existing component(s) to capture dates?

> After all rules are translated, they need some major refactoring. Jape
> and Ruta are quite different in some aspects.
>
Ok.

>
>
>
>
>
>> Please let me know where I can help. I will be available again in April.
>>
>> Cheers,
>> Azad
>>
>> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>>
>>> Hi,
>>>
>>> sorry, I was quite busy last month.
>>>
>>> I added a new patch, which needs to be applied.
>>>
>>> No new rules, but it's possible now to evaluate everything against the
>>> labelled data of the challenge.
>>>
>>> @Azad:
>>> Which documents exactly did you use to develop the rules?
>>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or
>>> testing-PHI-Gold-fixed?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 03.02.2016 at 09:05, Peter Klügl wrote:
>>>>
>>>> Hi,
>>>>
>>>> the last patch fixed almost all problems.
>>>>
>>>> I added another one that adds the csv file for the unit test and extends
>>>> svn-ignore.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> On 02.02.2016 at 09:16, Peter Klügl wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I added another patch. I forgot to manually add one test file to version
>>>>> control, and there are still duplicate lines.
>>>>> I hope this patch fixes the remaining problems.
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>
>>>>> On 29.01.2016 at 10:34, Peter Klügl wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the problems were caused by the svn client in my Eclipse. Sorry for the
>>>>>> trouble, I should have looked more closely at the complete patch.
>>>>>>
>>>>>> I attached a new patch created with command-line tools, which looks correct
>>>>>> now.
>>>>>>
>>>>>> Pei, can you apply the new patch?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 28.01.2016 at 15:57, Peter Klügl wrote:
>>>>>>>
>>>>>>> Thanks Pei.
>>>>>>>
>>>>>>> I fear there was again a problem with the patch. All new files are
>>>>>>> missing (and also the svn-ignore settings).
>>>>>>>
>>>>>>> Can you take a look?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On 28.01.2016 at 14:43, Pei Chen wrote:
>>>>>>>>
>>>>>>>> patch applied.
>>>>>>>> Thanks,
>>>>>>>> Pei
>>>>>>>>
>>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Pei,
>>>>>>>>>
>>>>>>>>> can you commit the recent patch for us?
>>>>>>>>>
>>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 20.01.2016 at 19:35, Pei Chen wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>>> But yeah, we can even create an extended type system to store these
>>>>>>>>>> items temporarily and add them into the main/core type system
>>>>>>>>>> afterwards.
>>>>>>>>>>
>>>>>>>>>> There was an existing item to upgrade UIMA, but agreed, it will
>>>>>>>>>> require much more testing.  If it works, we can upgrade it in our
>>>>>>>>>> sandbox area or create a branch if necessary.
>>>>>>>>>>
>>>>>>>>>> —Pei
>>>>>>>>>>
>>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> a new patch is attached.
>>>>>>>>>>>
>>>>>>>>>>> @Pei:
>>>>>>>>>>> are there suitable annotation types in the cTAKES type system? Some
>>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty features...
>>>>>>>>>>>
>>>>>>>>>>> @Azad:
>>>>>>>>>>> I changed the rules a bit, especially the capitalization, to match how
>>>>>>>>>>> I normally use it in Ruta. The word lists are compiled to a trie by the
>>>>>>>>>>> Maven plugin. I also added the two regexes for URL and email, and I
>>>>>>>>>>> extended the regex for the URL. I also changed the evaluation order of
>>>>>>>>>>> some rules (with @). Feel free to add simple examples to examples.csv
>>>>>>>>>>> for the unit tests.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>>
>>>>>>>>>>> Do you want help with the other rule sets? Or should we split them up?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> great. I will integrate them in the project and in the next patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any more volunteers
>>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able to provide
>>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06.11.2015 at 16:40, Pei Chen wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting point at least
>>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components together and
>>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>>> I think the actual components that would be required are probably best
>>>>>>>>>>>>>>>> left up to what is actually required for best performing c-deid.  The
>>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat this as
>>>>>>>>>>>>>>>> an independent preprocessing component or part of a pipeline (in which
>>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or perhaps an
>>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that discussion to
>>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example of how the cTAKES
>>>>>>>>>>>>>>>>> community develops or what a project should look like?
>>>>>>>>>>>>>>>>> I learned that different people set up UIMA projects in quite different
>>>>>>>>>>>>>>>>> ways, and I do not want to get inspired by "some sort of out-dated"
>>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are there restrictions or preferences about the preprocessing components
>>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the project?
>>>>>>>>>>>>>>>>> Components: On which components may the components rely: tokenizer, ...
>>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>>>>>>>>>>>>>>>>>> and to coordinate the efforts ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to help with translating JAPE to RUTA.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
>>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an individual basis
>>>>>>>>>>>>>>>>>> and involves a Data Use Agreement.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ok, I'll investigate whether we already have access to the dataset here.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look at
>>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --Pei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there should not be a need to
>>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent component for
>>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method has proven useful
>>>>>>>>>>>>>>>>>> for general NER on longitudinal data and is surely useful independent of the
>>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>>
>>>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Am 10.03.2016 um 20:29 schrieb Azad Dehghan:
> Thanks Peter,
>
> The rules were modeled using the training data.

Does this mean both training data folders? I have access to the data but not
to the challenge description.

> It would be good to incorporate/refactor the two-pass recognition method
> for cTAKES (basically, the GATE API needs to be replaced with the UIMA API
> to generate annotations); it has wider application to longitudinal data.
> This method is used on top of a number of NERs.

I'll take a look.

I do not know how much time I can invest this month. Let's see how many 
phases I can translate.

I added the rules for age. Are there JAPE rules for creating date
annotations?

After all rules are translated, they will need some major refactoring. JAPE
and Ruta are quite different in some aspects.

Best,

Peter




> Please let me know where I can help. I will be available again in April.
>
> Cheers,
> Azad
>
> On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:
>
>> Hi,
>>
>> sorry, I was quite busy last month.
>>
>> I added a new patch, which needs to be applied.
>>
>> No new rules, but it's possible now to evaluate everything against the
>> labelled data of the challenge.
>>
>> @Azad:
>> Which documents exactly did you use to develop the rules?
>> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or testing-PHI-Gold-fixed?
>>
>> Best,
>>
>> Peter
>>
>> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
>>> Hi,
>>>
>>> the last patch fixed almost all problems.
>>>
>>> I added another one that adds the csv file for the unit test and extends
>>> svn-ignore.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>>>> Hi,
>>>>
>>>> I added another patch. I forgot to manually add one test file to version
>>>> control, and there are still duplicate lines.
>>>> I hope this patch fixes the remaining problems.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>>>> Hi,
>>>>>
>>>>> the problems were caused by the svn client in my Eclipse. Sorry for the
>>>>> trouble, I should have looked more closely at the complete patch.
>>>>>
>>>>> I attached a new patch created with command-line tools, which looks
>>>>> correct now.
>>>>>
>>>>> Pei, can you apply the new patch?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>>>> Thanks Pei.
>>>>>>
>>>>>> I fear there was again a problem with the patch. All new files are
>>>>>> missing (and also the svn-ignore settings).
>>>>>>
>>>>>> Can you take a look?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>>>> patch applied.
>>>>>>> Thanks,
>>>>>>> Pei
>>>>>>>
>>>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
>> peter.kluegl@averbis.com> wrote:
>>>>>>>> Hi Pei,
>>>>>>>>
>>>>>>>> can you commit the recent patch for us?
>>>>>>>>
>>>>>>>> CTAKES-384-20160120.patch
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>>>> Hi,
>>>>>>>>> Sorry I was swamped recently.
>>>>>>>>> But yeah, we can even create an extended type system to store
>> these items temporarily and add them into the main/core type system
>> afterwards.
>>>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
>> require much more testing.  If it works, we can upgrade it in our sandbox
>> area or create a branch if necessary.
>>>>>>>>> —Pei
>>>>>>>>>
>>>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
>> peter.kluegl@averbis.com> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> a new patch is attached.
>>>>>>>>>>
>>>>>>>>>> @Pei:
>>>>>>>>>> are there suitable annotation types in the cTAKES type system?
>> Some
>>>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>>>>>>>> IdentifiedAnnotation right now, but there are many empty
>> features...
>>>>>>>>>> @Azad:
>>>>>>>>>> I changed the rules a bit, especially the capitalization like I
>> use it
>>>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the maven
>>>>>>>>>> plugin. I also added the two regexes for url and email. I
>> extended the
>>>>>>>>>> regex for the url. I also changed the evaluation order of some
>> rules
>>>>>>>>>> (with @). Feel free to add simple examples to examples.csv for
>> the unit
>>>>>>>>>> tests.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>>>
>>>>>>>>>> Do you wanna have help with the other rule sets? Or should we
>> split them up?
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> great. I will integrate them in the project and in the next
>> patch.
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>>>
>>>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Azad
>>>>>>>>>>>>
>>>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
>> azad.dehghan@gmail.com> wrote:
>>>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
>> more volunteers
>>>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com>
>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
>> Unfortunately,
>>>>>>>>>>>>>> there is just no spare time right now. I hope I will be able
>> to provide
>>>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
>> point at least
>>>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if
>> we use
>>>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
>> together and
>>>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>>>> I think the actual components that would be required is
>> probably best
>>>>>>>>>>>>>>> left up to what is actually required for best performing
>> c-deid.  The
>>>>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat
>> this as
>>>>>>>>>>>>>>> an independent preprocessing component or part of a pipeline
>> (in which
>>>>>>>>>>>>>>> case, we may need to propose a change to the type system or
>> perhaps an
>>>>>>>>>>>>>>> alternative JCas view.  You can probably open up that
>> discussion to
>>>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
>> peter.kluegl@averbis.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
>> how the
>>>>>>>>>>>>> cTAKES
>>>>>>>>>>>>>>>> community develops or how a project should look like?
>>>>>>>>>>>>>>>> I learned that different people set up UIMA project in a
>> quite
>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>> manner and I do not what to get inspired by "some sort of
>> out-dated"
>>>>>>>>>>>>>>>> approach in the cTAKES repo.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there restriction or preferences about the preprocessing
>>>>>>>>>>>>> components
>>>>>>>>>>>>>>>> that should be used and the kind of "output" of the project.
>>>>>>>>>>>>>>>> Components: On which components may the componetns rely:
>> tokenizer,
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single
>> AE?
>>>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
>> duplicate
>>>>>>>>>>>>> work
>>>>>>>>>>>>>>>>>> and to coordnate the efforts ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
>> you want, or
>>>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the
>> initial
>>>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
>> too?
>>>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
>> releases the
>>>>>>>>>>>>> data
>>>>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an
>>>>>>>>>>>>> individual basis
>>>>>>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
>> validation.
>>>>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
>> dataset here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
>> components
>>>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs
>> jars that
>>>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to
>> take a look
>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is should
>> not be a
>>>>>>>>>>>>> need to
>>>>>>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent
>> component
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
>> have shown
>>>>>>>>>>>>> useful
>>>>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
>> independent
>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>>>> deid component.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>>>
>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Thanks Peter,

The rules were modeled using the training data.

It would be good to incorporate/refactor the two-pass recognition method
for cTAKES (basically, the GATE API needs to be replaced with the UIMA API
to generate annotations); it has wider application to longitudinal data.
This method is used on top of a number of NERs.
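
For readers who have not seen the code, here is a minimal Java sketch of
what a two-pass step of this kind might look like. It rests on an
assumption that this thread does not confirm: that the second pass builds
a run-time dictionary from the strings the first-pass NERs found and
re-scans the text for further occurrences. TwoPassSketch, Span and
secondPass are hypothetical names. This is plain Java, not the GATE-based
TwoPass.java and not a UIMA annotator, and it handles a single document
rather than a patient's longitudinal records.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; not the TwoPass.java from the cDeid repository.
final class TwoPassSketch {

    /** A character span [begin, end) in the document text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Assumed second pass: collect the surface strings of the first-pass
     * spans into a run-time dictionary, then return every whole-word
     * occurrence of a dictionary entry that no first-pass span covers.
     */
    static List<Span> secondPass(String text, List<Span> firstPass) {
        // LinkedHashSet keeps first-seen order and drops repeated strings.
        Set<String> dictionary = new LinkedHashSet<>();
        for (Span s : firstPass) {
            dictionary.add(text.substring(s.begin, s.end));
        }

        List<Span> added = new ArrayList<>();
        for (String entry : dictionary) {
            // Pattern.quote treats the entry as a literal string.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b");
            Matcher m = p.matcher(text);
            while (m.find()) {
                if (!covered(firstPass, m.start(), m.end())) {
                    added.add(new Span(m.start(), m.end()));
                }
            }
        }
        return added;
    }

    /** True if [begin, end) lies inside one of the given spans. */
    private static boolean covered(List<Span> spans, int begin, int end) {
        for (Span s : spans) {
            if (s.begin <= begin && end <= s.end) {
                return true;
            }
        }
        return false;
    }
}
```

In a UIMA port, each returned span would presumably be turned into an
annotation on the CAS instead of a GATE annotation; that mapping is left
out here.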

Please let me know where I can help. I will be available again in April.

Cheers,
Azad

On 10 March 2016 at 13:13, Peter Klügl <pe...@averbis.com> wrote:

> Hi,
>
> sorry, I was quite busy last month.
>
> I added a new patch, which needs to be applied.
>
> No new rules, but it's possible now to evaluate everything against the
> labelled data of the challenge.
>
> @Azad:
> Which documents exactly did you use to develop the rules?
> training-PHI-Gold-Set1, training-PHI-Gold-Set2 or testing-PHI-Gold-fixed?
>
> Best,
>
> Peter
>
> Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> > Hi,
> >
> > the last patch fixed almost all problems.
> >
> > I added another one that adds the csv file for the unit test and extends
> > svn-ignore.
> >
> > Best,
> >
> > Peter
> >
> > Am 02.02.2016 um 09:16 schrieb Peter Klügl:
> >> Hi,
> >>
> >> I added another patch. I forgot to manually add one test file to version
> >> control, and there are still duplicate lines.
> >> I hope this patch fixes the remaining problems.
> >>
> >> Best,
> >>
> >> Peter
> >>
> >>
> >> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
> >>> Hi,
> >>>
> >>> the problems were caused by the svn client in my Eclipse. Sorry for the
> >>> trouble, I should have looked more closely at the complete patch.
> >>>
> >>> I attached a new patch created with command-line tools, which looks
> >>> correct now.
> >>>
> >>> Pei, can you apply the new patch?
> >>>
> >>> Best,
> >>>
> >>> Peter
> >>>
> >>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
> >>>> Thanks Pei.
> >>>>
> >>>> I fear there was again a problem with the patch. All new files are
> >>>> missing (and also the svn-ignore settings).
> >>>>
> >>>> Can you take a look?
> >>>>
> >>>> Best,
> >>>>
> >>>> Peter
> >>>>
> >>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
> >>>>> patch applied.
> >>>>> Thanks,
> >>>>> Pei
> >>>>>
> >>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <
> peter.kluegl@averbis.com> wrote:
> >>>>>> Hi Pei,
> >>>>>>
> >>>>>> can you commit the recent patch for us?
> >>>>>>
> >>>>>> CTAKES-384-20160120.patch
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
> >>>>>>> Hi,
> >>>>>>> Sorry I was swamped recently.
> >>>>>>> But yeah, we can even create an extended type system to store
> these items temporarily and add them into the main/core type system
> afterwards.
> >>>>>>> There was an existing item to upgrade UIMA, but agreed- it will
> require much more testing.  If it works, we can upgrade it in our sandbox
> area or create a branch if necessary.
> >>>>>>>
> >>>>>>> —Pei
> >>>>>>>
> >>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <
> peter.kluegl@averbis.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> a new patch is attached.
> >>>>>>>>
> >>>>>>>> @Pei:
> >>>>>>>> are there suitable annotation types in the cTAKES type system?
> Some
> >>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
> >>>>>>>> IdentifiedAnnotation right now, but there are many empty
> features...
> >>>>>>>>
> >>>>>>>> @Azad:
> >>>>>>>> I changed the rules a bit, especially the capitalization like I
> use it
> >>>>>>>> in ruta normally. The wordlist are compiled to a trie by the maven
> >>>>>>>> plugin. I also added the two regexes for url and email. I
> extended the
> >>>>>>>> regex for the url. I also changed the evaluation order of some
> rules
> >>>>>>>> (with @). Feel free to add simple examples to examples.csv for
> the unit
> >>>>>>>> tests.
> >>>>>>>>
> >>>>>>>> Let me know if you need more information about the changes.
> >>>>>>>>
> >>>>>>>> Do you wanna have help with the other rule sets? Or should we
> split them up?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> great. I will integrate them in the project and in the next
> patch.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Peter
> >>>>>>>>>
> >>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> >>>>>>>>>> Three NERs translated and uploaded.
> >>>>>>>>>>
> >>>>>>>>>> PS. I will validate all NERs once we have them all completed.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Azad
> >>>>>>>>>>
> >>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <
> azad.dehghan@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This is on my todo list for Dec. as well. If there are any
> more volunteers
> >>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Azad
> >>>>>>>>>>>
> >>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com>
> wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I just wanted to mention that I haven't forgot about it.
> Unfortunately,
> >>>>>>>>>>>> there is just no spare time right now. I hope I will be able
> to provide
> >>>>>>>>>>>> the patches in December.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Peter
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> >>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>> I think the ctakes-examples is probably a good starting
> point at least
> >>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if
> we use
> >>>>>>>>>>>>> uimaFIT style as primary approach to wiring components
> together and
> >>>>>>>>>>>>> generate desc's as secondary...
> >>>>>>>>>>>>> I think the actual components that would be required is
> probably best
> >>>>>>>>>>>>> left up to what is actually required for best performing
> c-deid.  The
> >>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat
> this as
> >>>>>>>>>>>>> an independent preprocessing component or part of a pipeline
> (in which
> >>>>>>>>>>>>> case, we may need to propose a change to the type system or
> perhaps an
> >>>>>>>>>>>>> alternative JCas view.  You can probably open up that
> discussion to
> >>>>>>>>>>>>> the dev group as you see fit.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> My 2 cents...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <
> peter.kluegl@averbis.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example on
> how the
> >>>>>>>>>>> cTAKES
> >>>>>>>>>>>>>> community develops or how a project should look like?
> >>>>>>>>>>>>>> I learned that different people set up UIMA project in a
> quite
> >>>>>>>>>>> different
> >>>>>>>>>>>>>> manner and I do not what to get inspired by "some sort of
> out-dated"
> >>>>>>>>>>>>>> approach in the cTAKES repo.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Are there restriction or preferences about the preprocessing
> >>>>>>>>>>> components
> >>>>>>>>>>>>>> that should be used and the kind of "output" of the project.
> >>>>>>>>>>>>>> Components: On which components may the componetns rely:
> tokenizer,
> >>>>>>>>>>> ...
> >>>>>>>>>>>>>> parser, ... dict lookup?
> >>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single
> AE?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> More comments below.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid
> duplicate
> >>>>>>>>>>> work
> >>>>>>>>>>>>>>>> and to coordnate the efforts ...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
> >>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if
> you want, or
> >>>>>>>>>>>>>> wait until I set up the project with ruta integration.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If any questions arise, just ask :-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the
> initial
> >>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it
> too?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
> >>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically
> releases the
> >>>>>>>>>>> data
> >>>>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an
> >>>>>>>>>>> individual basis
> >>>>>>>>>>>>>>> and involve a Data Use Agreement.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the
> validation.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ok, I'll investigate if we have already access to the
> dataset here.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My first step would be:
> >>>>>>>>>>>>>>>> - set up a maven project
> >>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES
> components
> >>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs
> jars that
> >>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to
> take a look
> >>>>>>>>>>> at
> >>>>>>>>>>>>>>>> that over the next few weeks.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> —Pei
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is should
> not be a
> >>>>>>>>>>> need to
> >>>>>>>>>>>>>>> worry about the 3rd party libs.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent
> component
> >>>>>>>>>>> for
> >>>>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method
> have shown
> >>>>>>>>>>> useful
> >>>>>>>>>>>>>>> for general NER on longitudinal data and surely useful
> independent
> >>>>>>>>>>> of the
> >>>>>>>>>>>>>>> deid component.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>> Azad
> >>>>>>>>>>>>>>>
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

sorry, I was quite busy last month.

I added a new patch, which needs to be applied.

No new rules, but it's possible now to evaluate everything against the
labelled data of the challenge.

@Azad:
Which documents exactly did you use to develop the rules?
training-PHI-Gold-Set1, training-PHI-Gold-Set2 or testing-PHI-Gold-fixed?

Best,

Peter

Am 03.02.2016 um 09:05 schrieb Peter Klügl:
> Hi,
>
> the last patch fixed almost all problems.
>
> I added another one that adds the csv file for the unit test and extends
> svn-ignore.
>
> Best,
>
> Peter
>
> Am 02.02.2016 um 09:16 schrieb Peter Klügl:
>> Hi,
>>
>> I added another patch. I forgot to manually add one test file to version
>> control, and there are still duplicate lines.
>> I hope this patch fixes the remaining problems.
>>
>> Best,
>>
>> Peter
>>
>>
>> Am 29.01.2016 um 10:34 schrieb Peter Klügl:
>>> Hi,
>>>
>>> the problems were caused by the svn client in my Eclipse. Sorry for the
>>> trouble, I should have looked more closely at the complete patch.
>>>
>>> I attached a new patch created with command-line tools, which looks correct
>>> now.
>>>
>>> Pei, can you apply the new patch?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 28.01.2016 um 15:57 schrieb Peter Klügl:
>>>> Thanks Pei.
>>>>
>>>> I fear there was again a problem with the patch. All new files are
>>>> missing (and also the svn-ignore settings).
>>>>
>>>> Can you take a look?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> Am 28.01.2016 um 14:43 schrieb Pei Chen:
>>>>> patch applied.
>>>>> Thanks,
>>>>> Pei
>>>>>
>>>>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>> Hi Pei,
>>>>>>
>>>>>> can you commit the recent patch for us?
>>>>>>
>>>>>> CTAKES-384-20160120.patch
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 20.01.2016 um 19:35 schrieb Pei Chen:
>>>>>>> Hi,
>>>>>>> Sorry I was swamped recently.
>>>>>>> But yeah, we can even create an extended type system to store these items temporarily and add them into the main/core type system afterwards.
>>>>>>> There was an existing item to upgrade UIMA, but agreed- it will require much more testing.  If it works, we can upgrade it in our sandbox area or create a branch if necessary.
>>>>>>>
>>>>>>> —Pei
>>>>>>>
>>>>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> a new patch is attached.
>>>>>>>>
>>>>>>>> @Pei:
>>>>>>>> are there suitable annotation types in the cTAKES type system? Some
>>>>>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>>>>>> IdentifiedAnnotation right now, but there are many empty features...
>>>>>>>>
>>>>>>>> @Azad:
>>>>>>>> I changed the rules a bit, especially the capitalization like I use it
>>>>>>>> in ruta normally. The wordlist are compiled to a trie by the maven
>>>>>>>> plugin. I also added the two regexes for url and email. I extended the
>>>>>>>> regex for the url. I also changed the evaluation order of some rules
>>>>>>>> (with @). Feel free to add simple examples to examples.csv for the unit
>>>>>>>> tests.
>>>>>>>>
>>>>>>>> Let me know if you need more information about the changes.
>>>>>>>>
>>>>>>>> Do you wanna have help with the other rule sets? Or should we split them up?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Am 18.01.2016 um 11:04 schrieb Peter Klügl:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> great. I will integrate them in the project and in the next patch.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>>>>>>>>>> Three NERs translated and uploaded.
>>>>>>>>>>
>>>>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Azad
>>>>>>>>>>
>>>>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is on my todo list for Dec. as well. If there are any more volunteers
>>>>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Azad
>>>>>>>>>>>
>>>>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>>>>>>>>>>>> there is just no spare time right now. I hope I will be able to provide
>>>>>>>>>>>> the patches in December.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>> I think the ctakes-examples is probably a good starting point at least
>>>>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>>>>>>>>>> uimaFIT style as primary approach to wiring components together and
>>>>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>>>>> I think the actual components that would be required is probably best
>>>>>>>>>>>>> left up to what is actually required for best performing c-deid.  The
>>>>>>>>>>>>> output would be interesting, I'm not sure if we should treat this as
>>>>>>>>>>>>> an independent preprocessing component or part of a pipeline (in which
>>>>>>>>>>>>> case, we may need to propose a change to the type system or perhaps an
>>>>>>>>>>>>> alternative JCas view.  You can probably open up that discussion to
>>>>>>>>>>>>> the dev group as you see fit.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> My 2 cents...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a cTAKES project that may serve as an example of how the
>>>>>>>>>>>>>> cTAKES community develops or what a project should look like?
>>>>>>>>>>>>>> I learned that different people set up UIMA projects in quite
>>>>>>>>>>>>>> different ways, and I do not want to get inspired by "some sort of
>>>>>>>>>>>>>> outdated" approach in the cTAKES repo.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are there restrictions or preferences about the preprocessing
>>>>>>>>>>>>>> components that should be used and the kind of "output" of the project?
>>>>>>>>>>>>>> Components: On which components may the components rely: tokenizer,
>>>>>>>>>>>>>> ... parser, ... dict lookup?
>>>>>>>>>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> More comments below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 03.11.2015 at 16:54, Azad Dehghan wrote:
>>>>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>>>>>>>>>>>>>>>> work and to coordinate the efforts ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to help with translating JAPE to RUTA.
>>>>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>>>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
>>>>>>>>>>>>>>> data sets 12 months after a given challenge; this is done on an
>>>>>>>>>>>>>>> individual basis and involves a Data Use Agreement.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I'll investigate whether we already have access to the dataset here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>>>>>>>>>>>>>>>> at that over the next few weeks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Pei - once the ANNIE components are replaced, there should not be a
>>>>>>>>>>>>>>> need to worry about the 3rd party libs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, just a thought: we may want to create an independent component
>>>>>>>>>>>>>>> for the Two Pass recognition (TwoPass.java), as this method has proven
>>>>>>>>>>>>>>> useful for general NER on longitudinal data and is surely useful
>>>>>>>>>>>>>>> independently of the deid component.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Azad
>>>>>>>>>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

The last patch fixed almost all of the problems.

I added another one that adds the CSV file for the unit test and extends
svn-ignore.

Best,

Peter

On 02.02.2016 at 09:16, Peter Klügl wrote:
> Hi,
>
> I added another patch. I forgot to manually add one test file to version
> control, and there are still duplicate lines.
> I hope this patch fixes the remaining problems.
>
> Best,
>
> Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I added another patch. I forgot to manually add one test file to version
control, and there are still duplicate lines.
I hope this patch fixes the remaining problems.

Best,

Peter


On 29.01.2016 at 10:34, Peter Klügl wrote:
> Hi,
>
> the problems were caused by the svn client in my Eclipse. Sorry for the
> trouble, I should have looked more closely at the complete patch.
>
> I attached a new patch created with command-line tools, which looks correct
> now.
>
> Pei, can you apply the new patch?
>
> Best,
>
> Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <pe...@wiredinformatics.com>.

CTAKES-384-20160129.patch applied.

> On Jan 29, 2016, at 4:34 AM, Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> the problems were caused by the svn client in my Eclipse. Sorry for the
> trouble, I should have looked more closely at the complete patch.
> 
> I attached a new patch created with command-line tools, which looks correct
> now.
> 
> Pei, can you apply the new patch?
> 
> Best,
> 
> Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

the problems were caused by the svn client in my Eclipse. Sorry for the
trouble, I should have looked more closely at the complete patch.

I attached a new patch created with command-line tools, which looks correct
now.

Pei, can you apply the new patch?

Best,

Peter

On 28.01.2016 at 15:57, Peter Klügl wrote:
> Thanks Pei.
>
> I fear there was again a problem with the patch. All new files are
> missing (and also the svn-ignore settings).
>
> Can you take a look?
>
> Best,
>
> Peter
>
> On 28.01.2016 at 14:43, Pei Chen wrote:
>> patch applied.
>> Thanks,
>> Pei
>>
>> On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <pe...@averbis.com> wrote:
>>> Hi Pei,
>>>
>>> can you commit the recent patch for us?
>>>
>>> CTAKES-384-20160120.patch
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 20.01.2016 at 19:35, Pei Chen wrote:
>>>> Hi,
>>>> Sorry I was swamped recently.
>>>> But yeah, we can even create an extended type system to store these items temporarily and add them into the main/core type system afterwards.
>>>> There was an existing item to upgrade UIMA, but agreed, it will require much more testing.  If it works, we can upgrade it in our sandbox area or create a branch if necessary.
>>>>
>>>> —Pei
>>>>
>>>>> On Jan 18, 2016, at 9:06 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> a new patch is attached.
>>>>>
>>>>> @Pei:
>>>>> Are there suitable annotation types in the cTAKES type system? Some
>>>>> project in cTAKES uses something like OntologyMatch... I map it to
>>>>> IdentifiedAnnotation right now, but there are many empty features...
>>>>>
>>>>> @Azad:
>>>>> I changed the rules a bit, especially the capitalization, to match how I
>>>>> normally use it in Ruta. The word lists are compiled to a trie by the Maven
>>>>> plugin. I also added the two regexes for URL and email, and I extended the
>>>>> regex for the URL. I also changed the evaluation order of some rules
>>>>> (with @). Feel free to add simple examples to examples.csv for the unit
>>>>> tests.
>>>>>
>>>>> Let me know if you need more information about the changes.
>>>>>
>>>>> Would you like help with the other rule sets? Or should we split them up?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> On 18.01.2016 at 11:04, Peter Klügl wrote:
>>>>>> Hi,
>>>>>>
>>>>>> great. I will integrate them in the project and in the next patch.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On 18.01.2016 at 00:58, Azad Dehghan wrote:
>>>>>>> Three NERs translated and uploaded.
>>>>>>>
>>>>>>> PS. I will validate all NERs once we have them all completed.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Azad
>>>>>>>
>>>>>>> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is on my todo list for Dec. as well. If there are any more volunteers
>>>>>>>> for translating JAPE to RUTA, please get in touch.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Azad
>>>>>>>>
>>>>>>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>>>>>>>>> there is just no spare time right now. I hope I will be able to provide
>>>>>>>>> the patches in December.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On 06.11.2015 at 16:40, Pei Chen wrote:
>>>>>>>>>> Hi Peter,
>>>>>>>>>> I think the ctakes-examples is probably a good starting point at least
>>>>>>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>>>>>>> uimaFIT style as primary approach to wiring components together and
>>>>>>>>>> generate desc's as secondary...
>>>>>>>>>> I think the choice of components is probably best left up to what is
>>>>>>>>>> actually required for the best-performing c-deid.  The output is an
>>>>>>>>>> interesting question: I'm not sure if we should treat this as
>>>>>>>>>> an independent preprocessing component or part of a pipeline (in which
>>>>>>>>>> case, we may need to propose a change to the type system or perhaps an
>>>>>>>>>> alternative JCas view).  You can probably open up that discussion to
>>>>>>>>>> the dev group as you see fit.
>>>>>>>>>>
>>>>>>>>>> My 2 cents...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Is there a cTAKES project that may serve as an example of how the
>>>>>>>>>>> cTAKES community develops or what a project should look like?
>>>>>>>>>>> I learned that different people set up UIMA projects in quite
>>>>>>>>>>> different ways, and I do not want to get inspired by "some sort of
>>>>>>>>>>> outdated" approach in the cTAKES repo.
>>>>>>>>>>>
>>>>>>>>>>> Are there restriction or preferences about the preprocessing
>>>>>>>> components
>>>>>>>>>>> that should be used and the kind of "output" of the project.
>>>>>>>>>>> Components: On which components may the componetns rely: tokenizer,
>>>>>>>> ...
>>>>>>>>>>> parser, ... dict lookup?
>>>>>>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>>>>>>
>>>>>>>>>>> More comments below.
>>>>>>>>>>>
>>>>>>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>>>>>>>> work
>>>>>>>>>>>>> and to coordnate the efforts ...
>>>>>>>>>>>>>
>>>>>>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>>>>>>>> wait until I set up the project with ruta integration.
>>>>>>>>>>>
>>>>>>>>>>> If any questions arise, just ask :-)
>>>>>>>>>>>
>>>>>>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>>>>>>>
>>>>>>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
>>>>>>>> data
>>>>>>>>>>>> sets 12 months after a given challenge; this is done on an
>>>>>>>> individual basis
>>>>>>>>>>>> and involve a Data Use Agreement.
>>>>>>>>>>>>
>>>>>>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>>>>>>>
>>>>>>>>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> My first step would be:
>>>>>>>>>>>>> - set up a maven project
>>>>>>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>>>>>>>> at
>>>>>>>>>>>>> that over the next few weeks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> —Pei
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> @Pei - once ANNIE components are replaced there is should not be a
>>>>>>>> need to
>>>>>>>>>>>> worry about the 3rd party libs.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, just a thought: we may want to create an independent component
>>>>>>>> for
>>>>>>>>>>>> the Two Pass recognition (TwoPass.java) as this method have shown
>>>>>>>> useful
>>>>>>>>>>>> for general NER on longitudinal data and surely useful independent
>>>>>>>> of the
>>>>>>>>>>>> deid component.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Azad
>>>>>>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Thanks Pei.

I fear there was again a problem with the patch. All new files are
missing (and also the svn-ignore settings).

Can you take a look?

Best,

Peter

Am 28.01.2016 um 14:43 schrieb Pei Chen:
> patch applied.
> Thanks,
> Pei

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <ch...@apache.org>.

patch applied.
Thanks,
Pei

On Thu, Jan 28, 2016 at 4:14 AM, Peter Klügl <pe...@averbis.com> wrote:
> Hi Pei,
>
> can you commit the recent patch for us?
>
> CTAKES-384-20160120.patch
>
> Best,
>
> Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi Pei,

can you commit the recent patch for us?

CTAKES-384-20160120.patch

Best,

Peter

Am 20.01.2016 um 19:35 schrieb Pei Chen:
> Hi,
> Sorry I was swamped recently.
> But yeah, we can even create an extended type system to store these items temporarily and add them into the main/core type system afterwards.
> There was an existing item to upgrade UIMA, but agreed, it will require much more testing.  If it works, we can upgrade it in our sandbox area or create a branch if necessary.
>
> —Pei

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <pe...@wiredinformatics.com>.

Hi,
Sorry I was swamped recently.
But yeah, we can even create an extended type system to store these items temporarily and add them into the main/core type system afterwards.
There was an existing item to upgrade UIMA, but agreed, it will require much more testing.  If it works, we can upgrade it in our sandbox area or create a branch if necessary.

—Pei

> On Jan 18, 2016, at 9:06 AM, Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> a new patch is attached.
> 
> @Pei:
> are there suitable annotation types in the cTAKES type system? Some
> project in cTAKES uses something like OntologyMatch... I map it to
> IdentifiedAnnotation right now, but there are many empty features...
> 
> @Azad:
> I changed the rules a bit, especially the capitalization, to match how I
> normally write Ruta rules. The wordlists are compiled to a trie by the Maven
> plugin. I also added the two regexes for URL and email, and extended the
> regex for the URL. I also changed the evaluation order of some rules
> (with @). Feel free to add simple examples to examples.csv for the unit
> tests.
> 
> Let me know if you need more information about the changes.
> 
> Do you want help with the other rule sets? Or should we split them up?
> 
> Best,
> 
> Peter

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

a new patch is attached.

@Pei:
are there suitable annotation types in the cTAKES type system? Some
project in cTAKES uses something like OntologyMatch... I map it to
IdentifiedAnnotation right now, but there are many empty features...

@Azad:
I changed the rules a bit, especially the capitalization, to match how I
normally write Ruta rules. The wordlists are compiled to a trie by the Maven
plugin. I also added the two regexes for URL and email, and extended the
regex for the URL. I also changed the evaluation order of some rules
(with @). Feel free to add simple examples to examples.csv for the unit
tests.
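
As a rough illustration of the "regex plus examples file" idea (not the
rules from the patch, which are written in Ruta and are not shown in this
thread), here is a minimal Python sketch. The two patterns, the CSV column
layout (text,type,span), and the helper names are all assumptions made up
for this example.

```python
import csv
import io
import re

# Hypothetical patterns; the real regexes in the patch are not shown here.
EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
# Match http(s):// or www. prefixes and avoid swallowing trailing punctuation.
URL = re.compile(r'(?:https?://|www\.)[^\s<>"]+[^\s<>".,;:!?)]')

PATTERNS = {"EMAIL": EMAIL, "URL": URL}

# Assumed layout for an examples file: text,type,span (NONE = no match expected).
EXAMPLES_CSV = """text,type,span
Contact jdoe@example.org for details.,EMAIL,jdoe@example.org
See http://example.org/path?id=1.,URL,http://example.org/path?id=1
No identifiers in this sentence.,NONE,
"""


def find_spans(text):
    """Return (type, matched text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


def check_examples(csv_text):
    """Run each CSV row through find_spans and collect any mismatches."""
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = [] if row["type"] == "NONE" else [(row["type"], row["span"])]
        actual = find_spans(row["text"])
        if actual != expected:
            failures.append((row["text"], expected, actual))
    return failures


if __name__ == "__main__":
    # An empty list means every example behaved as expected.
    print(check_examples(EXAMPLES_CSV))
```

Keeping the examples in a plain table like this means new cases can be added
without touching the test code itself.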

Let me know if you need more information about the changes.

Do you want help with the other rule sets? Or should we split them up?

Best,

Peter

Am 18.01.2016 um 11:04 schrieb Peter Klügl:
> Hi,
>
> great. I will integrate them in the project and in the next patch.
>
> Best,
>
> Peter
>
> Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
>> Three NERs translated and uploaded.
>>
>> PS. I will validate all NERs once we have them all completed.
>>
>> Cheers,
>> Azad
>>
>> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:
>>
>>> This is on my todo list for Dec. as well. If there are any more volunteers
>>> for translating JAPE to RUTA, please get in touch.
>>>
>>> Cheers,
>>> Azad
>>>
>>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>>> Hi,
>>>>
>>>> I just wanted to mention that I haven't forgotten about it. Unfortunately,
>>>> there is just no spare time right now. I hope I will be able to provide
>>>> the patches in December.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>>> Hi Peter,
>>>>> I think the ctakes-examples is probably a good starting point at least
>>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>>> uimaFIT style as primary approach to wiring components together and
>>>>> generate desc's as secondary...
>>>>> I think the actual components that would be required are probably best
>>>>> left up to what is actually required for the best-performing c-deid.  The
>>>>> output would be interesting; I'm not sure if we should treat this as
>>>>> an independent preprocessing component or part of a pipeline (in which
>>>>> case, we may need to propose a change to the type system or perhaps an
>>>>> alternative JCas view.  You can probably open up that discussion to
>>>>> the dev group as you see fit.)
>>>>>
>>>>> My 2 cents...
>>>>>
>>>>>
>>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Is there a cTAKES project that may serve as an example of how the cTAKES
>>>>>> community develops or what a project should look like?
>>>>>> I learned that different people set up UIMA projects in quite different
>>>>>> ways, and I do not want to get inspired by "some sort of out-dated"
>>>>>> approach in the cTAKES repo.
>>>>>>
>>>>>> Are there restrictions or preferences about the preprocessing components
>>>>>> that should be used and the kind of "output" of the project?
>>>>>> Components: On which components may the components rely: tokenizer, ...
>>>>>> parser, ... dict lookup?
>>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>>
>>>>>> More comments below.
>>>>>>
>>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>>> work
>>>>>>>> and to coordnate the efforts ...
>>>>>>>>
>>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>>> wait until I set up the project with ruta integration.
>>>>>>
>>>>>> If any questions arise, just ask :-)
>>>>>>
>>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>>
>>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
>>> data
>>>>>>> sets 12 months after a given challenge; this is done on an
>>> individual basis
>>>>>>> and involve a Data Use Agreement.
>>>>>>>
>>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>>
>>>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>>>
>>>>>>
>>>>>>>> My first step would be:
>>>>>>>> - set up a maven project
>>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>>
>>>>>>>>
>>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>>> at
>>>>>>>> that over the next few weeks.
>>>>>>>>
>>>>>>>> —Pei
>>>>>>>>
>>>>>>>>
>>>>>>> @Pei - once ANNIE components are replaced there is should not be a
>>> need to
>>>>>>> worry about the 3rd party libs.
>>>>>>>
>>>>>>> Also, just a thought: we may want to create an independent component
>>> for
>>>>>>> the Two Pass recognition (TwoPass.java) as this method have shown
>>> useful
>>>>>>> for general NER on longitudinal data and surely useful independent
>>> of the
>>>>>>> deid component.
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Azad
>>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

great. I will integrate them in the project and in the next patch.

Best,

Peter

Am 18.01.2016 um 00:58 schrieb Azad Dehghan:
> Three NERs translated and uploaded.
>
> PS. I will validate all NERs once we have them all completed.
>
> Cheers,
> Azad
>
> On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:
>
>> This is on my todo list for Dec. as well. If there are any more volunteers
>> for translating JAPE to RUTA, please get in touch.
>>
>> Cheers,
>> Azad
>>
>> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>>> Hi,
>>>
>>> I just wanted to mention that I haven't forgot about it. Unfortunately,
>>> there is just no spare time right now. I hope I will be able to provide
>>> the patches in December.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 06.11.2015 um 16:40 schrieb Pei Chen:
>>>> Hi Peter,
>>>> I think the ctakes-examples is probably a good starting point at least
>>>> in terms of maven modules, etc.  I think it would be good if we use
>>>> uimaFIT style as primary approach to wiring components together and
>>>> generate desc's as secondary...
>>>> I think the actual components that would be required is probably best
>>>> left up to what is actually required for best performing c-deid.  The
>>>> output would be interesting, I'm not sure if we should treat this as
>>>> an independent preprocessing component or part of a pipeline (in which
>>>> case, we may need to propose a change to the type system or perhaps an
>>>> alternative JCas view.  You can probably open up that discussion to
>>>> the dev group as you see fit.)
>>>>
>>>> My 2 cents...
>>>>
>>>>
>>>> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com>
>> wrote:
>>>>> Hi,
>>>>>
>>>>> Is there a cTAKES project that may serve as an example on how the
>> cTAKES
>>>>> community develops or how a project should look like?
>>>>> I learned that different people set up UIMA project in a quite
>> different
>>>>> manner and I do not what to get inspired by "some sort of out-dated"
>>>>> approach in the cTAKES repo.
>>>>>
>>>>> Are there restriction or preferences about the preprocessing
>> components
>>>>> that should be used and the kind of "output" of the project.
>>>>> Components: On which components may the componetns rely: tokenizer,
>> ...
>>>>> parser, ... dict lookup?
>>>>> "output": Should the project provide a pipeline or a single AE?
>>>>>
>>>>> More comments below.
>>>>>
>>>>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>>>> Who else plans to provide patches for it? Just to avoid duplicate
>> work
>>>>>>> and to coordnate the efforts ...
>>>>>>>
>>>>>> I would like to help with the translating JAPE to RUTA.
>>>>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>>>>> wait until I set up the project with ruta integration.
>>>>>
>>>>> If any questions arise, just ask :-)
>>>>>
>>>>>>> Is there a development dataset which was utilized for the initial
>>>>>>> development, and if yes, is it possible to contribute it too?
>>>>>>>
>>>>>> The data set is unfortunately not publicly available; i2b2
>>>>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
>> data
>>>>>> sets 12 months after a given challenge; this is done on an
>> individual basis
>>>>>> and involve a Data Use Agreement.
>>>>>>
>>>>>> However, I will be able to conduct and coordinate the validation.
>>>>>>
>>>>> Ok, I'll investigate if we have already access to the dataset here.
>>>>>
>>>>>
>>>>>>> My first step would be:
>>>>>>> - set up a maven project
>>>>>>> - set up a development pipeline in a test (with cTAKES components
>>>>>>> replacing the previous ANNIE preprocessing)
>>>>>>>
>>>>>>>
>>>>>>> But one item that we need to review is the 3rd party libs jars that
>>>>>>> were included to ensure compatibility.  I’ll be sure to take a look
>> at
>>>>>>> that over the next few weeks.
>>>>>>>
>>>>>>> —Pei
>>>>>>>
>>>>>>>
>>>>>> @Pei - once ANNIE components are replaced there is should not be a
>> need to
>>>>>> worry about the 3rd party libs.
>>>>>>
>>>>>> Also, just a thought: we may want to create an independent component
>> for
>>>>>> the Two Pass recognition (TwoPass.java) as this method have shown
>> useful
>>>>>> for general NER on longitudinal data and surely useful independent
>> of the
>>>>>> deid component.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Azad
>>>>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

Three NERs translated and uploaded.

PS. I will validate all NERs once we have them all completed.

Cheers,
Azad

On 24 November 2015 at 10:37, Azad Dehghan <az...@gmail.com> wrote:

> This is on my todo list for Dec. as well. If there are any more volunteers
> for translating JAPE to RUTA, please get in touch.
>
> Cheers,
> Azad
>
> On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
> >
> > Hi,
> >
> > I just wanted to mention that I haven't forgot about it. Unfortunately,
> > there is just no spare time right now. I hope I will be able to provide
> > the patches in December.
> >
> > Best,
> >
> > Peter
> >
> > Am 06.11.2015 um 16:40 schrieb Pei Chen:
> > > Hi Peter,
> > > I think the ctakes-examples is probably a good starting point at least
> > > in terms of maven modules, etc.  I think it would be good if we use
> > > uimaFIT style as primary approach to wiring components together and
> > > generate desc's as secondary...
> > > I think the actual components that would be required is probably best
> > > left up to what is actually required for best performing c-deid.  The
> > > output would be interesting, I'm not sure if we should treat this as
> > > an independent preprocessing component or part of a pipeline (in which
> > > case, we may need to propose a change to the type system or perhaps an
> > > alternative JCas view.  You can probably open up that discussion to
> > > the dev group as you see fit.)
> > >
> > > My 2 cents...
> > >
> > >
> > > On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com>
> wrote:
> > >> Hi,
> > >>
> > >> Is there a cTAKES project that may serve as an example on how the
> cTAKES
> > >> community develops or how a project should look like?
> > >> I learned that different people set up UIMA project in a quite
> different
> > >> manner and I do not what to get inspired by "some sort of out-dated"
> > >> approach in the cTAKES repo.
> > >>
> > >> Are there restriction or preferences about the preprocessing
> components
> > >> that should be used and the kind of "output" of the project.
> > >> Components: On which components may the componetns rely: tokenizer,
> ...
> > >> parser, ... dict lookup?
> > >> "output": Should the project provide a pipeline or a single AE?
> > >>
> > >> More comments below.
> > >>
> > >> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> > >>>>
> > >>>> Who else plans to provide patches for it? Just to avoid duplicate
> work
> > >>>> and to coordnate the efforts ...
> > >>>>
> > >>> I would like to help with the translating JAPE to RUTA.
> > >> You can already go ahead with the UIMA Ruta Workbench if you want, or
> > >> wait until I set up the project with ruta integration.
> > >>
> > >> If any questions arise, just ask :-)
> > >>
> > >>>> Is there a development dataset which was utilized for the initial
> > >>>> development, and if yes, is it possible to contribute it too?
> > >>>>
> > >>> The data set is unfortunately not publicly available; i2b2
> > >>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
> data
> > >>> sets 12 months after a given challenge; this is done on an
> individual basis
> > >>> and involve a Data Use Agreement.
> > >>>
> > >>> However, I will be able to conduct and coordinate the validation.
> > >>>
> > >> Ok, I'll investigate if we have already access to the dataset here.
> > >>
> > >>
> > >>>> My first step would be:
> > >>>> - set up a maven project
> > >>>> - set up a development pipeline in a test (with cTAKES components
> > >>>> replacing the previous ANNIE preprocessing)
> > >>>>
> > >>>>
> > >>>> But one item that we need to review is the 3rd party libs jars that
> > >>>> were included to ensure compatibility.  I’ll be sure to take a look
> at
> > >>>> that over the next few weeks.
> > >>>>
> > >>>> —Pei
> > >>>>
> > >>>>
> > >>> @Pei - once ANNIE components are replaced there is should not be a
> need to
> > >>> worry about the 3rd party libs.
> > >>>
> > >>> Also, just a thought: we may want to create an independent component
> for
> > >>> the Two Pass recognition (TwoPass.java) as this method have shown
> useful
> > >>> for general NER on longitudinal data and surely useful independent
> of the
> > >>> deid component.
> > >>>
> > >>>
> > >>> Cheers,
> > >>> Azad
> > >>>
> >
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

This is on my todo list for Dec. as well. If there are any more volunteers
for translating JAPE to RUTA, please get in touch.

Cheers,
Azad

On 24 Nov 2015 09:55, "Peter Klügl" <pe...@averbis.com> wrote:
>
> Hi,
>
> I just wanted to mention that I haven't forgot about it. Unfortunately,
> there is just no spare time right now. I hope I will be able to provide
> the patches in December.
>
> Best,
>
> Peter
>
> Am 06.11.2015 um 16:40 schrieb Pei Chen:
> > Hi Peter,
> > I think the ctakes-examples is probably a good starting point at least
> > in terms of maven modules, etc.  I think it would be good if we use
> > uimaFIT style as primary approach to wiring components together and
> > generate desc's as secondary...
> > I think the actual components that would be required is probably best
> > left up to what is actually required for best performing c-deid.  The
> > output would be interesting, I'm not sure if we should treat this as
> > an independent preprocessing component or part of a pipeline (in which
> > case, we may need to propose a change to the type system or perhaps an
> > alternative JCas view.  You can probably open up that discussion to
> > the dev group as you see fit.)
> >
> > My 2 cents...
> >
> >
> > On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com>
wrote:
> >> Hi,
> >>
> >> Is there a cTAKES project that may serve as an example on how the
cTAKES
> >> community develops or how a project should look like?
> >> I learned that different people set up UIMA project in a quite
different
> >> manner and I do not what to get inspired by "some sort of out-dated"
> >> approach in the cTAKES repo.
> >>
> >> Are there restriction or preferences about the preprocessing components
> >> that should be used and the kind of "output" of the project.
> >> Components: On which components may the componetns rely: tokenizer, ...
> >> parser, ... dict lookup?
> >> "output": Should the project provide a pipeline or a single AE?
> >>
> >> More comments below.
> >>
> >> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
> >>>>
> >>>> Who else plans to provide patches for it? Just to avoid duplicate
work
> >>>> and to coordnate the efforts ...
> >>>>
> >>> I would like to help with the translating JAPE to RUTA.
> >> You can already go ahead with the UIMA Ruta Workbench if you want, or
> >> wait until I set up the project with ruta integration.
> >>
> >> If any questions arise, just ask :-)
> >>
> >>>> Is there a development dataset which was utilized for the initial
> >>>> development, and if yes, is it possible to contribute it too?
> >>>>
> >>> The data set is unfortunately not publicly available; i2b2
> >>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the
data
> >>> sets 12 months after a given challenge; this is done on an individual
basis
> >>> and involve a Data Use Agreement.
> >>>
> >>> However, I will be able to conduct and coordinate the validation.
> >>>
> >> Ok, I'll investigate if we have already access to the dataset here.
> >>
> >>
> >>>> My first step would be:
> >>>> - set up a maven project
> >>>> - set up a development pipeline in a test (with cTAKES components
> >>>> replacing the previous ANNIE preprocessing)
> >>>>
> >>>>
> >>>> But one item that we need to review is the 3rd party libs jars that
> >>>> were included to ensure compatibility.  I’ll be sure to take a look
at
> >>>> that over the next few weeks.
> >>>>
> >>>> —Pei
> >>>>
> >>>>
> >>> @Pei - once ANNIE components are replaced there is should not be a
need to
> >>> worry about the 3rd party libs.
> >>>
> >>> Also, just a thought: we may want to create an independent component
for
> >>> the Two Pass recognition (TwoPass.java) as this method have shown
useful
> >>> for general NER on longitudinal data and surely useful independent of
the
> >>> deid component.
> >>>
> >>>
> >>> Cheers,
> >>> Azad
> >>>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I just wanted to mention that I haven't forgotten about it. Unfortunately,
there is just no spare time right now. I hope I will be able to provide
the patches in December.

Best,

Peter

Am 06.11.2015 um 16:40 schrieb Pei Chen:
> Hi Peter,
> I think the ctakes-examples is probably a good starting point at least
> in terms of maven modules, etc.  I think it would be good if we use
> uimaFIT style as primary approach to wiring components together and
> generate desc's as secondary...
> I think the actual components that would be required is probably best
> left up to what is actually required for best performing c-deid.  The
> output would be interesting, I'm not sure if we should treat this as
> an independent preprocessing component or part of a pipeline (in which
> case, we may need to propose a change to the type system or perhaps an
> alternative JCas view.  You can probably open up that discussion to
> the dev group as you see fit.)
>
> My 2 cents...
>
>
> On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
>> Hi,
>>
>> Is there a cTAKES project that may serve as an example on how the cTAKES
>> community develops or how a project should look like?
>> I learned that different people set up UIMA project in a quite different
>> manner and I do not what to get inspired by "some sort of out-dated"
>> approach in the cTAKES repo.
>>
>> Are there restriction or preferences about the preprocessing components
>> that should be used and the kind of "output" of the project.
>> Components: On which components may the componetns rely: tokenizer, ...
>> parser, ... dict lookup?
>> "output": Should the project provide a pipeline or a single AE?
>>
>> More comments below.
>>
>> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>>
>>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>>> and to coordnate the efforts ...
>>>>
>>> I would like to help with the translating JAPE to RUTA.
>> You can already go ahead with the UIMA Ruta Workbench if you want, or
>> wait until I set up the project with ruta integration.
>>
>> If any questions arise, just ask :-)
>>
>>>> Is there a development dataset which was utilized for the initial
>>>> development, and if yes, is it possible to contribute it too?
>>>>
>>> The data set is unfortunately not publicly available; i2b2
>>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
>>> sets 12 months after a given challenge; this is done on an individual basis
>>> and involve a Data Use Agreement.
>>>
>>> However, I will be able to conduct and coordinate the validation.
>>>
>> Ok, I'll investigate if we have already access to the dataset here.
>>
>>
>>>> My first step would be:
>>>> - set up a maven project
>>>> - set up a development pipeline in a test (with cTAKES components
>>>> replacing the previous ANNIE preprocessing)
>>>>
>>>>
>>>> But one item that we need to review is the 3rd party libs jars that
>>>> were included to ensure compatibility.  I’ll be sure to take a look at
>>>> that over the next few weeks.
>>>>
>>>> —Pei
>>>>
>>>>
>>> @Pei - once ANNIE components are replaced there is should not be a need to
>>> worry about the 3rd party libs.
>>>
>>> Also, just a thought: we may want to create an independent component for
>>> the Two Pass recognition (TwoPass.java) as this method have shown useful
>>> for general NER on longitudinal data and surely useful independent of the
>>> deid component.
>>>
>>>
>>> Cheers,
>>> Azad
>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <ch...@apache.org>.

Hi Peter,
I think ctakes-examples is probably a good starting point, at least
in terms of Maven modules, etc.  I think it would be good if we use
the uimaFIT style as the primary approach to wiring components together
and generate descriptors as secondary...
I think the choice of components is probably best left to whatever is
actually required for the best-performing c-deid.  The output question
is interesting: I'm not sure if we should treat this as an independent
preprocessing component or as part of a pipeline (in which case we may
need to propose a change to the type system, or perhaps an alternative
JCas view).  You can probably open up that discussion to the dev group
as you see fit.

My 2 cents...


On Fri, Nov 6, 2015 at 3:38 AM, Peter Klügl <pe...@averbis.com> wrote:
> Hi,
>
> Is there a cTAKES project that may serve as an example on how the cTAKES
> community develops or how a project should look like?
> I learned that different people set up UIMA project in a quite different
> manner and I do not what to get inspired by "some sort of out-dated"
> approach in the cTAKES repo.
>
> Are there restriction or preferences about the preprocessing components
> that should be used and the kind of "output" of the project.
> Components: On which components may the componetns rely: tokenizer, ...
> parser, ... dict lookup?
> "output": Should the project provide a pipeline or a single AE?
>
> More comments below.
>
> Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>>
>>>
>>> Who else plans to provide patches for it? Just to avoid duplicate work
>>> and to coordnate the efforts ...
>>>
>> I would like to help with the translating JAPE to RUTA.
>
> You can already go ahead with the UIMA Ruta Workbench if you want, or
> wait until I set up the project with ruta integration.
>
> If any questions arise, just ask :-)
>
>>
>>> Is there a development dataset which was utilized for the initial
>>> development, and if yes, is it possible to contribute it too?
>>>
>> The data set is unfortunately not publicly available; i2b2
>> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
>> sets 12 months after a given challenge; this is done on an individual basis
>> and involve a Data Use Agreement.
>>
>> However, I will be able to conduct and coordinate the validation.
>>
>
> Ok, I'll investigate if we have already access to the dataset here.
>
>
>>> My first step would be:
>>> - set up a maven project
>>> - set up a development pipeline in a test (with cTAKES components
>>> replacing the previous ANNIE preprocessing)
>>>
>>>
>>
>>> But one item that we need to review is the 3rd party libs jars that
>>> were included to ensure compatibility.  I’ll be sure to take a look at
>>> that over the next few weeks.
>>>
>>> —Pei
>>>
>>>
>> @Pei - once ANNIE components are replaced there is should not be a need to
>> worry about the 3rd party libs.
>>
>> Also, just a thought: we may want to create an independent component for
>> the Two Pass recognition (TwoPass.java) as this method have shown useful
>> for general NER on longitudinal data and surely useful independent of the
>> deid component.
>>
>>
>> Cheers,
>> Azad
>>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

Is there a cTAKES project that may serve as an example of how the cTAKES
community develops or what a project should look like?
I learned that different people set up UIMA projects in quite different
ways, and I do not want to get inspired by "some sort of out-dated"
approach in the cTAKES repo.

Are there restrictions or preferences about the preprocessing components
that should be used and the kind of "output" of the project?
Components: On which components may the components rely: tokenizer, ...
parser, ... dict lookup?
"output": Should the project provide a pipeline or a single AE?

More comments below.

Am 03.11.2015 um 16:54 schrieb Azad Dehghan:
>>
>>
>> Who else plans to provide patches for it? Just to avoid duplicate work
>> and to coordinate the efforts ...
>>
> I would like to help with the translating JAPE to RUTA.

You can already go ahead with the UIMA Ruta Workbench if you want, or
wait until I set up the project with ruta integration.

If any questions arise, just ask :-)

>
>> Is there a development dataset which was utilized for the initial
>> development, and if yes, is it possible to contribute it too?
>>
> The data set is unfortunately not publicly available; i2b2
> <https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
> sets 12 months after a given challenge; this is done on an individual basis
> and involves a Data Use Agreement.
>
> However, I will be able to conduct and coordinate the validation.
>

OK, I'll check whether we already have access to the dataset here.


>> My first step would be:
>> - set up a maven project
>> - set up a development pipeline in a test (with cTAKES components
>> replacing the previous ANNIE preprocessing)
>>
>>
>
>> But one item that we need to review is the 3rd party libs jars that
>> were included to ensure compatibility.  I’ll be sure to take a look at
>> that over the next few weeks.
>>
>> —Pei
>>
>>
> @Pei - once ANNIE components are replaced there should not be a need to
> worry about the 3rd party libs.
>
> Also, just a thought: we may want to create an independent component for
> the Two Pass recognition (TwoPass.java) as this method has proven useful
> for general NER on longitudinal data and is surely useful independent of the
> deid component.
>
>
> Cheers,
> Azad
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

>
>
>
> Who else plans to provide patches for it? Just to avoid duplicate work
> and to coordinate the efforts ...
>

I would like to help with translating JAPE to RUTA.


>
> Is there a development dataset which was utilized for the initial
> development, and if yes, is it possible to contribute it too?
>

The data set is unfortunately not publicly available; i2b2
<https://www.i2b2.org/NLP/DataSets/Main.php> typically releases the data
sets 12 months after a given challenge; this is done on an individual basis
and involves a Data Use Agreement.

However, I will be able to conduct and coordinate the validation.


>
> My first step would be:
> - set up a maven project
> - set up a development pipeline in a test (with cTAKES components
> replacing the previous ANNIE preprocessing)
>
>


>
> But one item that we need to review is the 3rd party libs jars that
> were included to ensure compatibility.  I’ll be sure to take a look at
> that over the next few weeks.
>
> —Pei
>
>
@Pei - once ANNIE components are replaced there should not be a need to
worry about the 3rd party libs.

Also, just a thought: we may want to create an independent component for
the Two Pass recognition (TwoPass.java) as this method has proven useful
for general NER on longitudinal data and is surely useful independent of the
deid component.
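
To make the idea of a standalone component concrete, here is a minimal
sketch in plain Java. It is not the TwoPass.java from the sourceforge
project, whose internals are not described in this thread. It assumes
that "two pass" means: a first pass collects mentions found by
high-precision rules, and a second pass re-scans all notes of the same
patient for those strings. The class name TwoPassSketch, the methods
firstPass/secondPass and the single "Dr. <Name>" rule are hypothetical
stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassSketch {

    // First pass (hypothetical stand-in for a real rule set):
    // one high-precision rule, "Dr. <Capitalised word>".
    public static Set<String> firstPass(List<String> notes) {
        Pattern rule = Pattern.compile("\\bDr\\.\\s+([A-Z][a-z]+)");
        Set<String> names = new LinkedHashSet<>();
        for (String note : notes) {
            Matcher m = rule.matcher(note);
            while (m.find()) {
                names.add(m.group(1)); // keep only the name itself
            }
        }
        return names;
    }

    // Second pass: re-scan every note for the strings collected above,
    // so mentions without the "Dr." cue are also found.
    public static List<String> secondPass(List<String> notes, Set<String> names) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < notes.size(); i++) {
            for (String name : names) {
                Pattern p = Pattern.compile("\\b" + Pattern.quote(name) + "\\b");
                Matcher m = p.matcher(notes.get(i));
                while (m.find()) {
                    hits.add("note " + i + " [" + m.start() + "," + m.end() + ") " + name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two notes for the same (made-up) patient.
        List<String> notes = List.of(
            "Seen by Dr. Smith in clinic.",
            "Smith reviewed the labs; follow-up arranged.");
        Set<String> names = firstPass(notes);
        secondPass(notes, names).forEach(System.out::println);
    }
}
```

Because the second pass only needs a list of texts and a set of strings,
a component along these lines would not depend on the deid rule-set, which
is the kind of independence suggested above.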


Cheers,
Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <ch...@apache.org>.

Thanks Peter.

I’ve been swamped lately.

But one item we need to review is the set of 3rd-party library jars that
were included, to ensure compatibility.  I’ll be sure to take a look at
that over the next few weeks.

—Pei

On Tue, Nov 3, 2015 at 10:10 AM, Peter Klügl <pe...@averbis.com> wrote:
> Yes, I will do that.
>
> Who else plans to provide patches for it? Just to avoid duplicate work
> and to coordinate the efforts ...
>
> Is there a development dataset which was utilized for the initial
> development, and if yes, is it possible to contribute it too?
>
> My first step would be:
> - set up a maven project
> - set up a development pipeline in a test (with cTAKES components
> replacing the previous ANNIE preprocessing)
>
> Best,
>
> Peter
>
> Am 02.11.2015 um 18:04 schrieb Pei Chen:
>> Hi Peter/Azad,
>> Per INFRA-10579, since there is only a limited history, they suggested
>> we do the export/import ourselves.
>> The initial code base has been imported into:
>> sandbox/ctakes-clinical-deid
>>
>> Feel free to do an svn co
>> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-clinical-deid/
>> and attach any patches to CTAKES-384 until you have commit access...
>>
>> --Pei
>>
>>
>> On Mon, Nov 2, 2015 at 3:56 AM, Peter Klügl <pe...@averbis.com> wrote:
>>> Hi,
>>>
>>> I just wanted to ask about the current status. I assume that the source
>>> is not yet in svn, right? Let me know when I can do something (in case I
>>> miss some mail on this list).
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> Am 13.10.2015 um 21:17 schrieb Pei Chen:
>>>> Thanks Azad.
>>>> I submitted a Jira to infra to help us do the import (that way we will try
>>>> and preserve the commit history).
>>>> In the meantime, would you mind filling out the ICLA[1].
>>>>
>>>> [Reminder: Let's keep it in sandbox and not release it until all of the 3rd
>>>> party dependencies licenses have been verified.]
>>>>
>>>> [1] http://www.apache.org/licenses/#clas
>>>>
>>>> Thanks,
>>>> Pei
>>>>
>>>>     Pei Chen
>>>> Wired Informatics <http://www.wiredinformatics.com>
>>>> 265 Franklin St Ste 1702
>>>> Boston, MA 02110
>>>> tel: (617) 433-7544
>>>> Pei.Chen@wiredinformatics.com
>>>>
>>>> On Sun, Oct 11, 2015 at 3:51 PM, Azad Dehghan <az...@gmail.com>
>>>> wrote:
>>>>
>>>>> 1: Yes. Sorted.
>>>>> 3: Code attached to the Jira.
>>>>>
>>>>> Azad
>>>>>
>>>>> On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>> This is great news!
>>>>>>> What is the current status and procedure? Is there an explicit
>>>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>>>> sourceforge project?
>>>>>> Jira has been opened to track this:
>>>>>> https://issues.apache.org/jira/browse/CTAKES-384
>>>>>>
>>>>>> 1) Azad, would you be willing to switch licenses?  I believe it's
>>>>>> currently GNU3 -> ASL 2.0?
>>>>>> 2) Create a project/module in cTAKES sandbox for this
>>>>>> 3) Export/Import sourceforge and attach the code to the Jira initially.
>>>>>> One of the current cTAKES committers can commit it to the repo (Until
>>>>> folks
>>>>>> can commit directly to the ctakes repo directly going forward.)
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>>>>>> Sent: Thursday, October 08, 2015 8:06 AM
>>>>>> To: dev@ctakes.apache.org
>>>>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>>>>> De-identification of Clinical Narratives
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I can offer my help here if required.
>>>>>>
>>>>>> I have experience in translating JAPE rules to UIMA Ruta and already
>>>>>> worked with clinical notes, e.g., also concerning deidentification.
>>>>>>
>>>>>> The problem is that I can only invest a few hours in the next two weeks.
>>>>>> I will have more time next month or even more next year.
>>>>>>
>>>>>> What is the current status and procedure? Is there an explicit
>>>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>>>> sourceforge project?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> Am 01.10.2015 um 16:20 schrieb Pei Chen:
>>>>>>> Hi Azad,
>>>>>>> This is awesome news.  Thanks for adding in the code that was
>>>>>>> referenced by the paper.  I'll create a Jira to track we need to port
>>>>>>> it over to UIMA/Ruta.
>>>>>>>
>>>>>>> In the meantime, the link is at:
>>>>>>> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
>>>>>>> who may be interested in helping out...
>>>>>>>
>>>>>>> --Pei
>>>>>>>
>>>>>>> Hello Pei,
>>>>>>>
>>>>>>> I hope all is well.
>>>>>>>
>>>>>>> I have now uploaded the source code for cDeid
>>>>>>> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
>>>>>>> tried to make the code as portable and modular as possible with some
>>>>>>> trade-off for performance. This should help with porting the code to
>>>>>>> cTAKES/UIMA.
>>>>>>> Once you let the community know I will try to get involved to help
>>>>>>> with translating JAPE to RUTA, etc.
>>>>>>>
>>>>>>> Best,
>>>>>>> Azad
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Yes, I will do that.

Who else plans to provide patches for it? Just to avoid duplicate work
and to coordinate the efforts ...

Is there a development dataset which was utilized for the initial
development, and if yes, is it possible to contribute it too?

My first steps would be:
- set up a Maven project
- set up a development pipeline in a test (with cTAKES components
replacing the previous ANNIE preprocessing)

Best,

Peter

On 02.11.2015 at 18:04, Pei Chen wrote:
> Hi Peter/Azad,
> Per INFRA-10579, since there is only a limited history, they suggested
> we do the export/import ourselves.
> The initial code base has been imported into:
> sandbox/ctakes-clinical-deid
>
> Feel free to do an svn co
> https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-clinical-deid/
> and attach any patches to CTAKES-384 until you have commit access...
>
> --Pei
>
>
> On Mon, Nov 2, 2015 at 3:56 AM, Peter Klügl <pe...@averbis.com> wrote:
>> Hi,
>>
>> I just wanted to ask about the current status. I assume that the source
>> is not yet in svn, right? Let me know when I can do something (in case I
>> miss some mail on this list).
>>
>> Best,
>>
>> Peter
>>
>> On 13.10.2015 at 21:17, Pei Chen wrote:
>>> Thanks Azad.
>>> I submitted a Jira to infra to help us do the import (that way we will try
>>> and preserve the commit history).
>>> In the meantime, would you mind filling out the ICLA[1].
>>>
>>> [Reminder: Let's keep it in sandbox and not release it until all of the 3rd
>>> party dependencies licenses have been verified.]
>>>
>>> [1] http://www.apache.org/licenses/#clas
>>>
>>> Thanks,
>>> Pei
>>>
>>>     Pei Chen
>>> Wired Informatics <http://www.wiredinformatics.com>
>>> 265 Franklin St Ste 1702
>>> Boston, MA 02110
>>> tel: (617) 433-7544
>>> Pei.Chen@wiredinformatics.com
>>>
>>> On Sun, Oct 11, 2015 at 3:51 PM, Azad Dehghan <az...@gmail.com>
>>> wrote:
>>>
>>>> 1: Yes. Sorted.
>>>> 3: Code attached to the Jira.
>>>>
>>>> Azad
>>>>
>>>> On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
>>>> wrote:
>>>>
>>>>> This is great news!
>>>>>> What is the current status and procedure? Is there an explicit
>>>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>>>> sourceforge project?
>>>>> Jira has been opened to track this:
>>>>> https://issues.apache.org/jira/browse/CTAKES-384
>>>>>
>>>>> 1) Azad, would you be willing to switch licenses?  I believe it's
>>>>> currently GNU3 -> ASL 2.0?
>>>>> 2) Create a project/module in cTAKES sandbox for this
>>>>> 3) Export/Import sourceforge and attach the code to the Jira initially.
>>>>> One of the current cTAKES committers can commit it to the repo (Until
>>>>> folks can commit directly to the ctakes repo directly going forward.)
>>>>>
>>>>> -----Original Message-----
>>>>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>>>>> Sent: Thursday, October 08, 2015 8:06 AM
>>>>> To: dev@ctakes.apache.org
>>>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>>>> De-identification of Clinical Narratives
>>>>>
>>>>> Hi,
>>>>>
>>>>> I can offer my help here if required.
>>>>>
>>>>> I have experience in translating JAPE rules to UIMA Ruta and already
>>>>> worked with clinical notes, e.g., also concerning deidentification.
>>>>>
>>>>> The problem is that I can only invest a few hours in the next two weeks.
>>>>> I will have more time next month or even more next year.
>>>>>
>>>>> What is the current status and procedure? Is there an explicit
>>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>>> sourceforge project?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>> On 01.10.2015 at 16:20, Pei Chen wrote:
>>>>>> Hi Azad,
>>>>>> This is awesome news.  Thanks for adding in the code that was
>>>>>> referenced by the paper.  I'll create a Jira to track we need to port
>>>>>> it over to UIMA/Ruta.
>>>>>>
>>>>>> In the meantime, the link is at:
>>>>>> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
>>>>>> who may be interested in helping out...
>>>>>>
>>>>>> --Pei
>>>>>>
>>>>>> Hello Pei,
>>>>>>
>>>>>> I hope all is well.
>>>>>>
>>>>>> I have now uploaded the source code for cDeid
>>>>>> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
>>>>>> tried to make the code as portable and modular as possible with some
>>>>>> trade-off for performance. This should help with porting the code to
>>>>>> cTAKES/UIMA.
>>>>>>
>>>>>> Once you let the community know I will try to get involved to help
>>>>>> with translating JAPE to RUTA, etc.
>>>>>>
>>>>>> Best,
>>>>>> Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <ch...@apache.org>.

Hi Peter/Azad,
Per INFRA-10579, since there is only a limited history, they suggested
we do the export/import ourselves.
The initial code base has been imported into:
sandbox/ctakes-clinical-deid

Feel free to do an svn co
https://svn.apache.org/repos/asf/ctakes/sandbox/ctakes-clinical-deid/
and attach any patches to CTAKES-384 until you have commit access...

--Pei


On Mon, Nov 2, 2015 at 3:56 AM, Peter Klügl <pe...@averbis.com> wrote:
> Hi,
>
> I just wanted to ask about the current status. I assume that the source
> is not yet in svn, right? Let me know when I can do something (in case I
> miss some mail on this list).
>
> Best,
>
> Peter
>
> On 13.10.2015 at 21:17, Pei Chen wrote:
>> Thanks Azad.
>> I submitted a Jira to infra to help us do the import (that way we will try
>> and preserve the commit history).
>> In the meantime, would you mind filling out the ICLA[1].
>>
>> [Reminder: Let's keep it in sandbox and not release it until all of the 3rd
>> party dependencies licenses have been verified.]
>>
>> [1] http://www.apache.org/licenses/#clas
>>
>> Thanks,
>> Pei
>>
>>     Pei Chen
>> Wired Informatics <http://www.wiredinformatics.com>
>> 265 Franklin St Ste 1702
>> Boston, MA 02110
>> tel: (617) 433-7544
>> Pei.Chen@wiredinformatics.com
>>
>> On Sun, Oct 11, 2015 at 3:51 PM, Azad Dehghan <az...@gmail.com>
>> wrote:
>>
>>> 1: Yes. Sorted.
>>> 3: Code attached to the Jira.
>>>
>>> Azad
>>>
>>> On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
>>> wrote:
>>>
>>>> This is great news!
>>>>> What is the current status and procedure? Is there an explicit
>>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>>> sourceforge project?
>>>> Jira has been opened to track this:
>>>> https://issues.apache.org/jira/browse/CTAKES-384
>>>>
>>>> 1) Azad, would you be willing to switch licenses?  I believe it's
>>>> currently GNU3 -> ASL 2.0?
>>>> 2) Create a project/module in cTAKES sandbox for this
>>>> 3) Export/Import sourceforge and attach the code to the Jira initially.
>>>> One of the current cTAKES committers can commit it to the repo (Until
>>>> folks can commit directly to the ctakes repo directly going forward.)
>>>>
>>>> -----Original Message-----
>>>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>>>> Sent: Thursday, October 08, 2015 8:06 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>>> De-identification of Clinical Narratives
>>>>
>>>> Hi,
>>>>
>>>> I can offer my help here if required.
>>>>
>>>> I have experience in translating JAPE rules to UIMA Ruta and already
>>>> worked with clinical notes, e.g., also concerning deidentification.
>>>>
>>>> The problem is that I can only invest a few hours in the next two weeks.
>>>> I will have more time next month or even more next year.
>>>>
>>>> What is the current status and procedure? Is there an explicit
>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>> sourceforge project?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> On 01.10.2015 at 16:20, Pei Chen wrote:
>>>>> Hi Azad,
>>>>> This is awesome news.  Thanks for adding in the code that was
>>>>> referenced by the paper.  I'll create a Jira to track we need to port
>>>>> it over to UIMA/Ruta.
>>>>>
>>>>> In the meantime, the link is at:
>>>>> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
>>>>> who may be interested in helping out...
>>>>>
>>>>> --Pei
>>>>>
>>>>> Hello Pei,
>>>>>
>>>>> I hope all is well.
>>>>>
>>>>> I have now uploaded the source code for cDeid
>>>>> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
>>>>> tried to make the code as portable and modular as possible with some
>>>>> trade-off for performance. This should help with porting the code to
>>>>> cTAKES/UIMA.
>>>>>
>>>>> Once you let the community know I will try to get involved to help
>>>>> with translating JAPE to RUTA, etc.
>>>>>
>>>>> Best,
>>>>> Azad
>>>>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I just wanted to ask about the current status. I assume that the source
is not yet in svn, right? Let me know when I can do something (in case I
miss some mail on this list).

Best,

Peter

On 13.10.2015 at 21:17, Pei Chen wrote:
> Thanks Azad.
> I submitted a Jira to infra to help us do the import (that way we will try
> and preserve the commit history).
> In the meantime, would you mind filling out the ICLA[1].
>
> [Reminder: Let's keep it in sandbox and not release it until all of the 3rd
> party dependencies licenses have been verified.]
>
> [1] http://www.apache.org/licenses/#clas
>
> Thanks,
> Pei
>
>     Pei Chen
> Wired Informatics <http://www.wiredinformatics.com>
> 265 Franklin St Ste 1702
> Boston, MA 02110
> tel: (617) 433-7544
> Pei.Chen@wiredinformatics.com
>
> On Sun, Oct 11, 2015 at 3:51 PM, Azad Dehghan <az...@gmail.com>
> wrote:
>
>> 1: Yes. Sorted.
>> 3: Code attached to the Jira.
>>
>> Azad
>>
>> On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
>> wrote:
>>
>>> This is great news!
>>>> What is the current status and procedure? Is there an explicit
>>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>>> sourceforge project?
>>> Jira has been opened to track this:
>>> https://issues.apache.org/jira/browse/CTAKES-384
>>>
>>> 1) Azad, would you be willing to switch licenses?  I believe it's
>>> currently GNU3 -> ASL 2.0?
>>> 2) Create a project/module in cTAKES sandbox for this
>>> 3) Export/Import sourceforge and attach the code to the Jira initially.
>>> One of the current cTAKES committers can commit it to the repo (Until
>>> folks can commit directly to the ctakes repo directly going forward.)
>>>
>>> -----Original Message-----
>>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>>> Sent: Thursday, October 08, 2015 8:06 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>>> De-identification of Clinical Narratives
>>>
>>> Hi,
>>>
>>> I can offer my help here if required.
>>>
>>> I have experience in translating JAPE rules to UIMA Ruta and already
>>> worked with clinical notes, e.g., also concerning deidentification.
>>>
>>> The problem is that I can only invest a few hours in the next two weeks.
>>> I will have more time next month or even more next year.
>>>
>>> What is the current status and procedure? Is there an explicit
>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>> sourceforge project?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>> On 01.10.2015 at 16:20, Pei Chen wrote:
>>>> Hi Azad,
>>>> This is awesome news.  Thanks for adding in the code that was
>>>> referenced by the paper.  I'll create a Jira to track we need to port
>>>> it over to UIMA/Ruta.
>>>>
>>>> In the meantime, the link is at:
>>>> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
>>>> who may be interested in helping out...
>>>>
>>>> --Pei
>>>>
>>>> Hello Pei,
>>>>
>>>> I hope all is well.
>>>>
>>>> I have now uploaded the source code for cDeid
>>>> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
>>>> tried to make the code as portable and modular as possible with some
>>>> trade-off for performance. This should help with porting the code to
>>>> cTAKES/UIMA.
>>>>
>>>> Once you let the community know I will try to get involved to help
>>>> with translating JAPE to RUTA, etc.
>>>>
>>>> Best,
>>>> Azad
>>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Pei Chen <pe...@wiredinformatics.com>.

Thanks Azad.
I submitted a Jira to infra to help us do the import (that way we will try
and preserve the commit history).
In the meantime, would you mind filling out the ICLA [1]?

[Reminder: Let's keep it in sandbox and not release it until the licenses of
all 3rd-party dependencies have been verified.]

[1] http://www.apache.org/licenses/#clas

Thanks,
Pei

    Pei Chen
Wired Informatics <http://www.wiredinformatics.com>
265 Franklin St Ste 1702
Boston, MA 02110
tel: (617) 433-7544
Pei.Chen@wiredinformatics.com

On Sun, Oct 11, 2015 at 3:51 PM, Azad Dehghan <az...@gmail.com>
wrote:

> 1: Yes. Sorted.
> 3: Code attached to the Jira.
>
> Azad
>
> On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
> wrote:
>
> > This is great news!
> > > What is the current status and procedure? Is there an explicit
> > > contribution to cTAKES? Is there an ICLA? What about the license of the
> > > sourceforge project?
> > Jira has been opened to track this:
> > https://issues.apache.org/jira/browse/CTAKES-384
> >
> > 1) Azad, would you be willing to switch licenses?  I believe it's
> > currently GNU3 -> ASL 2.0?
> > 2) Create a project/module in cTAKES sandbox for this
> > 3) Export/Import sourceforge and attach the code to the Jira initially.
> > One of the current cTAKES committers can commit it to the repo (Until
> > folks can commit directly to the ctakes repo directly going forward.)
> >
> > -----Original Message-----
> > From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> > Sent: Thursday, October 08, 2015 8:06 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Combining Knowledge- and Data-driven Methods for
> > De-identification of Clinical Narratives
> >
> > Hi,
> >
> > I can offer my help here if required.
> >
> > I have experience in translating JAPE rules to UIMA Ruta and already
> > worked with clinical notes, e.g., also concerning deidentification.
> >
> > The problem is that I can only invest a few hours in the next two weeks.
> > I will have more time next month or even more next year.
> >
> > What is the current status and procedure? Is there an explicit
> > contribution to cTAKES? Is there an ICLA? What about the license of the
> > sourceforge project?
> >
> > Best,
> >
> > Peter
> >
> > On 01.10.2015 at 16:20, Pei Chen wrote:
> > > Hi Azad,
> > > This is awesome news.  Thanks for adding in the code that was
> > > referenced by the paper.  I'll create a Jira to track we need to port
> > > it over to UIMA/Ruta.
> > >
> > > In the meantime, the link is at:
> > > http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> > > who may be interested in helping out...
> > >
> > > --Pei
> > >
> > > Hello Pei,
> > >
> > > I hope all is well.
> > >
> > > I have now uploaded the source code for cDeid
> > > (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
> > > tried to make the code as portable and modular as possible with some
> > > trade-off for performance. This should help with porting the code to
> > > cTAKES/UIMA.
> > >
> > > Once you let the community know I will try to get involved to help
> > > with translating JAPE to RUTA, etc.
> > >
> > > Best,
> > > Azad
> >
> >
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Azad Dehghan <az...@gmail.com>.

1: Yes. Sorted.
3: Code attached to the Jira.

Azad

On 8 October 2015 at 20:03, Chen, Pei <Pe...@childrens.harvard.edu>
wrote:

> This is great news!
> > What is the current status and procedure? Is there an explicit
> > contribution to cTAKES? Is there an ICLA? What about the license of the
> > sourceforge project?
> Jira has been opened to track this:
> https://issues.apache.org/jira/browse/CTAKES-384
>
> 1) Azad, would you be willing to switch licenses?  I believe it's
> currently GNU3 -> ASL 2.0?
> 2) Create a project/module in cTAKES sandbox for this
> 3) Export/Import sourceforge and attach the code to the Jira initially.
> One of the current cTAKES committers can commit it to the repo (Until folks
> can commit directly to the ctakes repo directly going forward.)
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Thursday, October 08, 2015 8:06 AM
> To: dev@ctakes.apache.org
> Subject: Re: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
>
> Hi,
>
> I can offer my help here if required.
>
> I have experience in translating JAPE rules to UIMA Ruta and already
> worked with clinical notes, e.g., also concerning deidentification.
>
> The problem is that I can only invest a few hours in the next two weeks.
> I will have more time next month or even more next year.
>
> What is the current status and procedure? Is there an explicit
> contribution to cTAKES? Is there an ICLA? What about the license of the
> sourceforge project?
>
> Best,
>
> Peter
>
> On 01.10.2015 at 16:20, Pei Chen wrote:
> > Hi Azad,
> > This is awesome news.  Thanks for adding in the code that was
> > referenced by the paper.  I'll create a Jira to track we need to port
> > it over to UIMA/Ruta.
> >
> > In the meantime, the link is at:
> > http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> > who may be interested in helping out...
> >
> > --Pei
> >
> > Hello Pei,
> >
> > I hope all is well.
> >
> > I have now uploaded the source code for cDeid
> > (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
> > tried to make the code as portable and modular as possible with some
> > trade-off for performance. This should help with porting the code to
> > cTAKES/UIMA.
> >
> > Once you let the community know I will try to get involved to help
> > with translating JAPE to RUTA, etc.
> >
> > Best,
> > Azad
>
>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Richard Eckart de Castilho <ri...@gmail.com>.

As far as I know, you can convert as long as ALL original authors / copyright holders agree to the conversion. Only the original authors may assign new licenses to their work. You might also want to double-check that the codebase doesn't contain any code copied and pasted from third-party sources.

As a third party, you cannot convert GPL code to ASL.

Mind, I am not a lawyer.

If you need more advice, post to legal-discuss@asf.

Cheers,

-- Richard

On 08.10.2015, at 21:32, andy mcmurry <mc...@gmail.com> wrote:

> caution: Im not sure you can convert GPL3 to ASL2
> anyone know for sure?
> 
> On Thu, Oct 8, 2015 at 12:03 PM, Chen, Pei <Pe...@childrens.harvard.edu>
> wrote:
> 
>> This is great news!
>>> What is the current status and procedure? Is there an explicit
>>> contribution to cTAKES? Is there an ICLA? What about the license of the
>>> sourceforge project?
>> Jira has been opened to track this:
>> https://issues.apache.org/jira/browse/CTAKES-384
>> 
>> 1) Azad, would you be willing to switch licenses?  I believe it's
>> currently GNU3 -> ASL 2.0?
>> 2) Create a project/module in cTAKES sandbox for this
>> 3) Export/Import sourceforge and attach the code to the Jira initially.
>> One of the current cTAKES committers can commit it to the repo (Until folks
>> can commit directly to the ctakes repo directly going forward.)
>> 
>> -----Original Message-----
>> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
>> Sent: Thursday, October 08, 2015 8:06 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: Combining Knowledge- and Data-driven Methods for
>> De-identification of Clinical Narratives
>> 
>> Hi,
>> 
>> I can offer my help here if required.
>> 
>> I have experience in translating JAPE rules to UIMA Ruta and already
>> worked with clinical notes, e.g., also concerning deidentification.
>> 
>> The problem is that I can only invest a few hours in the next two weeks.
>> I will have more time next month or even more next year.
>> 
>> What is the current status and procedure? Is there an explicit
>> contribution to cTAKES? Is there an ICLA? What about the license of the
>> sourceforge project?
>> 
>> Best,
>> 
>> Peter
>> 
>> On 01.10.2015 at 16:20, Pei Chen wrote:
>>> Hi Azad,
>>> This is awesome news.  Thanks for adding in the code that was
>>> referenced by the paper.  I'll create a Jira to track we need to port
>>> it over to UIMA/Ruta.
>>> 
>>> In the meantime, the link is at:
>>> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
>>> who may be interested in helping out...
>>> 
>>> --Pei
>>> 
>>> Hello Pei,
>>> 
>>> I hope all is well.
>>> 
>>> I have now uploaded the source code for cDeid
>>> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
>>> tried to make the code as portable and modular as possible with some
>>> trade-off for performance. This should help with porting the code to
>>> cTAKES/UIMA.
>>> 
>>> Once you let the community know I will try to get involved to help
>>> with translating JAPE to RUTA, etc.
>>> 
>>> Best,
>>> Azad
>> 
>>

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by andy mcmurry <mc...@gmail.com>.

Caution: I'm not sure you can convert GPL3 to ASL2.
Does anyone know for sure?

On Thu, Oct 8, 2015 at 12:03 PM, Chen, Pei <Pe...@childrens.harvard.edu>
wrote:

> This is great news!
> > What is the current status and procedure? Is there an explicit
> > contribution to cTAKES? Is there an ICLA? What about the license of the
> > sourceforge project?
> Jira has been opened to track this:
> https://issues.apache.org/jira/browse/CTAKES-384
>
> 1) Azad, would you be willing to switch licenses?  I believe it's
> currently GNU3 -> ASL 2.0?
> 2) Create a project/module in cTAKES sandbox for this
> 3) Export/Import sourceforge and attach the code to the Jira initially.
> One of the current cTAKES committers can commit it to the repo (Until folks
> can commit directly to the ctakes repo directly going forward.)
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.kluegl@averbis.com]
> Sent: Thursday, October 08, 2015 8:06 AM
> To: dev@ctakes.apache.org
> Subject: Re: Combining Knowledge- and Data-driven Methods for
> De-identification of Clinical Narratives
>
> Hi,
>
> I can offer my help here if required.
>
> I have experience in translating JAPE rules to UIMA Ruta and already
> worked with clinical notes, e.g., also concerning deidentification.
>
> The problem is that I can only invest a few hours in the next two weeks.
> I will have more time next month or even more next year.
>
> What is the current status and procedure? Is there an explicit
> contribution to cTAKES? Is there an ICLA? What about the license of the
> sourceforge project?
>
> Best,
>
> Peter
>
> On 01.10.2015 at 16:20, Pei Chen wrote:
> > Hi Azad,
> > This is awesome news.  Thanks for adding in the code that was
> > referenced by the paper.  I'll create a Jira to track we need to port
> > it over to UIMA/Ruta.
> >
> > In the meantime, the link is at:
> > http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> > who may be interested in helping out...
> >
> > --Pei
> >
> > Hello Pei,
> >
> > I hope all is well.
> >
> > I have now uploaded the source code for cDeid
> > (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
> > tried to make the code as portable and modular as possible with some
> > trade-off for performance. This should help with porting the code to
> > cTAKES/UIMA.
> >
> > Once you let the community know I will try to get involved to help
> > with translating JAPE to RUTA, etc.
> >
> > Best,
> > Azad
>
>

RE: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.

This is great news!
> What is the current status and procedure? Is there an explicit contribution to cTAKES? Is there an ICLA? What about the license of the sourceforge project?
Jira has been opened to track this: https://issues.apache.org/jira/browse/CTAKES-384

1) Azad, would you be willing to switch licenses?  I believe it's currently GNU GPL v3; could it move to ASL 2.0?
2) Create a project/module in cTAKES sandbox for this
3) Export/Import sourceforge and attach the code to the Jira initially.  One of the current cTAKES committers can commit it to the repo (until folks can commit directly to the cTAKES repo going forward).

-----Original Message-----
From: Peter Klügl [mailto:peter.kluegl@averbis.com] 
Sent: Thursday, October 08, 2015 8:06 AM
To: dev@ctakes.apache.org
Subject: Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Hi,

I can offer my help here if required.

I have experience in translating JAPE rules to UIMA Ruta and already worked with clinical notes, e.g., also concerning deidentification.

The problem is that I can only invest a few hours in the next two weeks.
I will have more time next month or even more next year.

What is the current status and procedure? Is there an explicit contribution to cTAKES? Is there an ICLA? What about the license of the sourceforge project?

Best,

Peter

On 01.10.2015 at 16:20, Pei Chen wrote:
> Hi Azad,
> This is awesome news.  Thanks for adding in the code that was 
> referenced by the paper.  I'll create a Jira to track we need to port 
> it over to UIMA/Ruta.
>
> In the meantime, the link is at:
> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> who may be interested in helping out...
>
> --Pei
>
> Hello Pei,
>
> I hope all is well.
>
> I have now uploaded the source code for cDeid 
> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
> tried to make the code as portable and modular as possible with some
> trade-off for performance. This should help with porting the code to
> cTAKES/UIMA.
>
> Once you let the community know I will try to get involved to help 
> with translating JAPE to RUTA, etc.
>
> Best,
> Azad

Re: Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives

Posted by Peter Klügl <pe...@averbis.com>.

Hi,

I can offer my help here if required.

I have experience in translating JAPE rules to UIMA Ruta and have already
worked with clinical notes, including de-identification.

The problem is that I can only invest a few hours in the next two weeks.
I will have more time next month or even more next year.

What is the current status and procedure? Is there an explicit
contribution to cTAKES? Is there an ICLA? What about the license of the
sourceforge project?

Best,

Peter

On 01.10.2015 at 16:20, Pei Chen wrote:
> Hi Azad,
> This is awesome news.  Thanks for adding in the code that was
> referenced by the paper.  I'll create a Jira to track we need to port
> it over to UIMA/Ruta.
>
> In the meantime, the link is at:
> http://sourceforge.net/p/clinical-deid/code/ci/master/tree/ for those
> who may be interested in helping out...
>
> --Pei
>
> Hello Pei,
>
> I hope all is well.
>
> I have now uploaded the source code for cDeid
> (http://sourceforge.net/p/clinical-deid/code/ci/master/tree/) ; I have
> tried to make the code as portable and modular as possible with some
> trade-off for performance. This should help with porting the code to
> cTAKES/UIMA.
>
> Once you let the community know I will try to get involved to help
> with translating JAPE to RUTA, etc.
>
> Best,
> Azad