Posted to dev@ctakes.apache.org by gandhi rajan <ga...@gmail.com> on 2019/07/17 16:26:08 UTC

Re: Synthetic replacement feature in cTAKES Scrubber

Hi Masoud, we had a similar requirement to identify patient names in
narrative text, and I had a discussion with Sean Finan about a patient name
identification feature in cTAKES. What he told me at that time was that
cTAKES did not support patient name identification. Also, I'm not really
sure whether Scrubber made it into the cTAKES codebase.

Sean, please correct me if I'm wrong.

On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:

> Dear cTAKES developer,
> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
> Institute for Clinical and Translational Research and work on
> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
> goals we are targeting is de-identification of a large number of notes
> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
> has been very helpful.
>
> One of our most desired features in the de-identification pipeline is
> synthetic replacement (e.g. Nancy->Sally; random female first name
> consistently replaces a female first name.). I wasn't able to find
> information about this feature in cTAKES Scrubber. Is synthetic replacement
> functionality part of the cTAKES Scrubber, or can it be added by
> post-processing the output? For instance, if we know the name Nancy is
> removed from multiple places, can we use a name dictionary to insert random
> female first names in those places (just a thought)?
> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
> candidates and I'm hoping that we could find ways to collaborate.
>
> Thank you very much,
> Masoud
>
> ----
> Masoud Rouhizadeh, PhD
> Faculty - Division of Health Science Informatics (DHSI)
> NLP Lead - Institute for Clinical and Translational Research (ICTR)
> Johns Hopkins University School of Medicine
> https://www.cs.jhu.edu/~mrou/
>
>

-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by Masoud Rouhizadeh <mr...@jhu.edu>.

Hi all,

Thank you so much for your great feedback. I've learned a lot from these real-world, hands-on research insights on de-id.

This is clearly not very related to cTAKES anymore, and I don't want to spam the cTAKES dev mailing list. Is there another mailing list where we could have these types of discussions?

Thanks,
Masoud



Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi all,

Replacement consistency on the patient vs. note level may or may not be important depending upon the project.

For instance, if you need to place patients in a corpus on a timeline, then it is definitely necessary to be consistent across patients - not just consistent with names and dates across a patient, but also with unique names across a corpus if there are no other (deid'd) unique identifiers.  Consistency is likewise important for other corpus-wide groupings, such as diagnosis counts, correlation counts, etc.

I should also point out that the size of the time shift can be very important depending upon the project.  For instance, headaches occur at any age, so if that is the study focus then a +/- 5 year shift may be ok.  However, a study related to something occurring during or around puberty may require a narrower shift, say +/- 1 year.  Also consider date shift amounts for studies involving any drugs that would have been new or discontinued during the time studied, changes in diagnostic criteria, etc.  It may be necessary to use smaller shifts or maybe only shifts forward and not backward, etc.
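
As a rough illustration of a bounded, per-patient date shift, here is a minimal sketch. It is not taken from cTAKES or from any tool mentioned in this thread; the function names, the secret key, and the idea of deriving the offset from a keyed hash are all assumptions made for the example.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_days(patient_id, secret, max_days, forward_only=False):
    """Return a stable day offset for one patient (hypothetical helper).

    The offset comes from a keyed hash, so the same patient always gets the
    same shift and intervals between that patient's dates are preserved.
    """
    digest = hmac.new(secret.encode("utf-8"), patient_id.encode("utf-8"),
                      hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")
    if forward_only:
        return 1 + value % max_days               # 1 .. max_days
    return value % (2 * max_days + 1) - max_days  # -max_days .. +max_days (0 is possible)


def shift_date(d, patient_id, secret, max_days, forward_only=False):
    """Shift a single date by the patient's offset."""
    return d + timedelta(days=shift_days(patient_id, secret, max_days, forward_only))


# Assumed study window of +/- 1 year; IDs and key are made up.
admitted = shift_date(date(2019, 12, 15), "patient-001", "project-key", 365)
discharged = shift_date(date(2019, 12, 27), "patient-001", "project-key", 365)
print((discharged - admitted).days)  # 12, same as in the original record
```

The `max_days` bound and the `forward_only` switch correspond to the narrower and forward-only shifts mentioned above. The sketch only moves explicit dates; relative or seasonal phrases in the note text would still need separate handling.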

Lastly, something (like an internal quality control run?) -could- even require physician deids to be consistent across the entire corpus (not just per-patient).   This is a really special case, but it could happen.

At any rate, the point is just that deid should not be viewed as simple from any angle.

Sean

    > On Jul 17, 2019, at 12:42 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
    >
    > Hi All,
    >
    > ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/
    >
    > From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance: "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.
    >
    > If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
    >
    > Sean
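
A post-processing step of the kind described above could look roughly like the following. This is only a sketch: the `[person]`-style tag format follows the non-verbatim example in the message, and the filler lists and function name are hypothetical, not part of ctakes-scrubber.

```python
import random
import re

# Hypothetical replacement pools, keyed by tag name.
FILLERS = {
    "person": ["Sally Reed", "Robert Quincy", "Maria Lopez"],
    "institution": ["General Hospital", "Lakeside Clinic"],
}
TAG = re.compile(r"\[(\w+)\]")


def fill_tags(text, rng):
    """Replace each [tag] with a random filler of that type.

    Tags with no filler pool are left untouched.
    """
    def pick(match):
        options = FILLERS.get(match.group(1).lower())
        return rng.choice(options) if options else match.group(0)
    return TAG.sub(pick, text)


print(fill_tags("[person] has a rash", random.Random(0)))
```

Because a generic tag carries no information about which original entity it replaced, this approach cannot guarantee that two mentions of the same person receive the same name.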

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by "Lingren, Todd" <To...@cchmc.org>.

Hi Masoud,

The replacement was the same within a note, but not standardized across the complete record for a patient. Date shifting was also within a note, not across a record. The NER task doesn't really matter in this regard, and even for more extensive time-series info extraction/prediction, that shouldn't be relying on PHI anyway.

One other point about addresses: we obfuscated the road type. For example, if the address said 123 Main Street, we would change that to 429 First Avenue, or something like that. And we wouldn't use Main Street (only Main Avenue/Road/Drive/Boulevard) in other replacements.
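
One possible reading of that road-type rule, sketched in code. The word lists and the function name are made up for illustration, and this is not the implementation used in the work described here.

```python
import random

ROAD_TYPES = ["Street", "Avenue", "Road", "Drive", "Boulevard"]
STREET_NAMES = ["First", "Main", "Oak", "Maple", "Lake"]  # hypothetical pool


def replace_address(road_type, rng):
    """Build a surrogate address whose road type differs from the original."""
    other_types = [t for t in ROAD_TYPES if t != road_type]
    number = rng.randint(1, 999)
    return f"{number} {rng.choice(STREET_NAMES)} {rng.choice(other_types)}"


# An original "... Street" can come back as "Main Avenue" but never "Main Street".
print(replace_address("Street", random.Random(7)))
```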



----------------------

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.lingren@cchmc.org
(513) 803-9032


________________________________
From: Masoud Rouhizadeh <mr...@jhu.edu>
Sent: Thursday, July 18, 2019 12:27:41 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Thanks, everyone, for your great feedback. Very helpful insights!

Here are a few comments and questions:

(1) Peter: great paper! I agree that replacing the same real person’s name by the same pseudonym makes the text easier to interpret but on the other hand, wouldn't it make the de-identification less robust? I think if we pick a random pseudonym in each instance, it would be difficult to find the real name (in case it is missed by the de-id system) when it is surrounded by (lots of) pseudonyms.

(2) Peter: I'd appreciate it if you could share your code. That would be helpful indeed.

(3) Todd: in your work, did you replace the same real person’s name by the same pseudonym across the note, or did you assign a random name each time?

(4) Date shifting can be complicated. In addition to the admission case that Peter pointed out, we would need to deal with consistency. Is shifting the dates by a random yet consistent number across a single note sufficient, or should we do this at the patient level? For instance, if some signs and symptoms were observed and reported 1 year before the diagnosis, this trajectory should be preserved. Age would be another issue. Some risk factors are age-specific.

(5) Does anyone have any thoughts on using metadata from structured fields (e.g. name, DOB, SSN, contact info) to help the note de-identification system? If the note de-id system is aware of the person's real name, we could make it more sensitive to that name, or if we know the street on which the person lives, we can pay more attention to that in the free text. Does any de-id tool use this information systematically?
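
To make the idea in (5) concrete, here is a minimal sketch of looking for values from the structured record in the free text. It is an assumption-laden illustration, not a description of any existing de-id tool; the function name and the minimum-length cutoff are hypothetical.

```python
import re


def known_identifier_spans(note, identifiers, min_length=3):
    """Find case-insensitive whole-token matches of known identifiers.

    Returns (start, end, matched_text) tuples sorted by position.
    Very short values are skipped to limit false positives.
    """
    spans = []
    for value in identifiers:
        value = value.strip()
        if len(value) < min_length:
            continue
        # Lookarounds instead of \b so values that begin or end with
        # punctuation (phone numbers, for example) still match.
        pattern = re.compile(r"(?<!\w)" + re.escape(value) + r"(?!\w)", re.IGNORECASE)
        spans.extend((m.start(), m.end(), m.group(0)) for m in pattern.finditer(note))
    return sorted(spans)


note = "Pt Nancy Smith lives on Elm Street. NANCY reports pain."
for span in known_identifier_spans(note, ["Nancy", "Smith", "Elm Street"]):
    print(span)
```

Spans found this way could be merged with the output of the main de-id system, so that a name missed by the model is still caught when it is known from the record.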

Thank you all!
Masoud

On 7/17/19, 3:01 PM, "Lingren, Todd" <To...@cchmc.org> wrote:

    We had some similar work on de-id and "re-id".

    The impact on performance for NER tasks was minimal.

    https://academic.oup.com/jamia/article/20/1/84/2909298

    The PHI replacement task used data based on the US Census distribution.

    https://www.sciencedirect.com/science/article/pii/S1532046414000161



    ----------------------

    Todd Lingren, M.S.
    Division of Biomedical Informatics
    Cincinnati Children's Hospital
    todd.lingren@cchmc.org
    (513) 803-9032


    ________________________________
    From: Peter Szolovits <ps...@mit.edu>
    Sent: Wednesday, July 17, 2019 1:12:21 PM
    To: dev@ctakes.apache.org
    Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

    My group has done considerable work on de-identification and on synthesizing pseudonymous data to replace the original PHI with plausible but inauthentic data (sometimes confusingly called re-identification).

    One conclusion I reached from that work is that the de-identification and the pseudonym generation should be tightly coupled. For example, if de-id replaces all people’s names by [person], then there is no way in the pseudonym generation to make sure that the same real person’s name is replaced by the same pseudonym in every occurrence, leading to much harder to interpret text.  The same goes for other PHI categories.

    I think it’s also important to keep similar formatting if the pseudonymized data are going to be used for NLP learning tasks.  So, for example, the format of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. Nicknames are a problem as well; if the same document also refers to Joe, and the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement for Joe should be Bob.  Gender is also tough because there are so many names that are either ambiguous or not in name dictionaries.

    Date shifting also introduces pseudonymization problems.  For example, a patient admitted on December 15 may have a note saying they are expected to be discharged right after Christmas. If the admission date is shifted, say to mid-January, then retaining the discharge expectation would imply a very long anticipated hospital stay.

    We published a paper on this topic:
    https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27 <https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27>

    I also have some old Java code that deal with a few of these issues, and would be happy to share with anyone interested, though it’s far from production quality and does not address all the issues we know.

    —Peter Szolovits

    > On Jul 17, 2019, at 12:42 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
    >
    > Hi All,
    >
    > ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/ <https://svn.apache.org/repos/asf/ctakes/sandbox/>
    >
    > From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance: "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.
    >
    > If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
    >
    > Sean
    > ________________________________________
    > From: gandhi rajan <gandhirajan.n@gmail.com <ma...@gmail.com>>
    > Sent: Wednesday, July 17, 2019 12:26 PM
    > To: dev@ctakes.apache.org <ma...@ctakes.apache.org>
    > Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
    >
    > Hi Masoud, we had a similar requirement to identify patient names in the
    > narratives text and I had a discussion with Sean Finan on patient name
    > identification feature in cTAKES. What he told at that point in time was
    > cTAKES dint supported patient name identification feature. Also as far as I
    > know, I m not really sure whether scrubber made it to the cTAKES codebase.
    >
    > Sean, Please correct me if I m wrong.
    >
    > On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
    >
    >> Dear cTAKES developer,
    >> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
    >> Institute for Clinical and Translational Research and work on
    >> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
    >> goals we are targeting is de-identification of a large number of notes
    >> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
    >> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
    >> has been very helpful.
    >>
    >> One of our most desired features in the de-identification pipeline is
    >> synthetic replacement (e.g. Nancy->Sally; random female first name
    >> consistently replaces a female first name.). I wasn't able to find
    >> information about this feature in cTAKES Scrubber. Is synthetic replacement
    >> functionality part of the cTAKES Scrubber, or can it be added by
    >> post-processing the output? For instance, if we know the name Nancy is
    >> removed from multiple places, can we use a name dictionary to insert random
    >> female first names in those places (just a thought)?
    >> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
    >> candidates and I'm hoping that we could find ways to collaborate.
    >>
    >> Thank you very much,
    >> Masoud
    >>
    >> ----
    >> Masoud Rouhizadeh, PhD
    >> Faculty - Division of Health Science Informatics (DHSI)
    >> NLP Lead - Institute for Clinical and Translational Research (ICTR)
    >> Johns Hopkins University School of Medicine
    >> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=>
    >>
    >>
    >
    > --
    > Regards,
    > Gandhi
    >
    > "The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by Masoud Rouhizadeh <mr...@jhu.edu>.

Thanks, everyone, for your great feedback. Very helpful insights!

Here are a few comments and questions: 

(1) Peter: great paper! I agree that replacing the same real person’s name with the same pseudonym makes the text easier to interpret, but on the other hand, wouldn't it make the de-identification less robust? I think if we pick a random pseudonym for each instance, a real name that the de-id system misses would be difficult to find, because it would be surrounded by (lots of) pseudonyms.

(2) Peter: I'd appreciate it if you could share your code. That would be helpful indeed.

(3) Todd: in your work, did you replace the same real person’s name with the same pseudonym across the note, or did you assign a random name each time?

(4) Date shifting can be complicated. In addition to the admission case that Peter pointed out, we would need to deal with consistency. Would shifting the dates by a random yet consistent number of days across a single note be sufficient, or should we do this at the patient level? For instance, if some signs and symptoms were observed and reported 1 year before the diagnosis, this trajectory should be preserved. Age would be another issue, since some risk factors are age-specific.

(5) Does anyone have any thoughts on using metadata from structured fields (e.g. name, DOB, SSN, contact info) to help the note de-identification system? If the note de-id system is aware of the person's real name, we could make it more sensitive to that name, or if we know the street on which the person lives, we can pay more attention to that in the free text. Just wondering whether any de-id tool uses this information systematically.
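
To make the patient-level option in (4) concrete, here is a minimal sketch, not taken from cTAKES or from any tool mentioned in this thread. It assumes each note carries a patient identifier, and it derives one fixed day offset per patient from a keyed hash, so every note for that patient is shifted by the same amount and intervals such as "1 year before the diagnosis" are preserved. The names SECRET_KEY, MAX_SHIFT_DAYS, patient_shift_days and shift_date are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; it would have to be stored apart from the released notes.
SECRET_KEY = b"replace-with-a-real-secret"
# Assumed upper bound on the shift, in days.
MAX_SHIFT_DAYS = 365


def patient_shift_days(patient_id: str) -> int:
    """Return a stable offset in [1, MAX_SHIFT_DAYS] for this patient."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % MAX_SHIFT_DAYS + 1


def shift_date(patient_id: str, original: date) -> date:
    """Shift a date backwards by the patient's fixed offset."""
    return original - timedelta(days=patient_shift_days(patient_id))


symptom_onset = date(2017, 3, 1)
diagnosis = date(2018, 3, 1)
# Both dates move by the same offset, so the gap between them is unchanged.
gap_after_shift = shift_date("patient-001", diagnosis) - shift_date("patient-001", symptom_onset)
```

Because the offset is at most a year here, ages stated in the text would be off by at most one year; whether that is acceptable for age-specific risk factors is a separate decision that this sketch does not address.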

Thank you all! 
Masoud

On 7/17/19, 3:01 PM, "Lingren, Todd" <To...@cchmc.org> wrote:

    We had some similar work on de-id and "re-id".
    
    The impact on performance for NER tasks was minimal.
    
    https://academic.oup.com/jamia/article/20/1/84/2909298
    
    The replacing PHI task was employed with data based on US CENSUS distribution.
    
    https://www.sciencedirect.com/science/article/pii/S1532046414000161
    
    
    
    ----------------------
    
    Todd Lingren, M.S.
    Division of Biomedical Informatics
    Cincinnati Children's Hospital
    todd.lingren@cchmc.org
    (513) 803-9032
    
    
    ________________________________
    From: Peter Szolovits <ps...@mit.edu>
    Sent: Wednesday, July 17, 2019 1:12:21 PM
    To: dev@ctakes.apache.org
    Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
    
    My group has done considerable work on de-identification and on synthesizing pseudonymous data to replace the original PHI with plausible but inauthentic data (sometimes confusingly called re-identification).
    
    One conclusion I reached from that work is that the de-identification and the pseudonym generation should be tightly coupled. For example, if de-id replaces all people’s names by [person], then there is no way in the pseudonym generation to make sure that the same real person’s name is replaced by the same pseudonym in every occurrence, leading to much harder to interpret text.  The same goes for other PHI categories.
    
    I think it’s also important to keep similar formatting if the pseudonymized data are going to be used for NLP learning tasks.  So, for example, the format of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. Nicknames are a problem as well; if the same document also refers to Joe, and the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement for Joe should be Bob.  Gender is also tough because there are so many names that are either ambiguous or not in name dictionaries.
    
    Date shifting also introduces pseudonymization problems.  For example, a patient admitted on December 15 may have a note saying they are expected to be discharged right after Christmas. If the admission date is shifted, say to mid-January, then retaining the discharge expectation would imply a very long anticipated hospital stay.
    
    We published a paper on this topic:
    https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27 <https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27>
    
    I also have some old Java code that deal with a few of these issues, and would be happy to share with anyone interested, though it’s far from production quality and does not address all the issues we know.
    
    —Peter Szolovits
    
    > On Jul 17, 2019, at 12:42 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
    >
    > Hi All,
    >
    > ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/ <https://svn.apache.org/repos/asf/ctakes/sandbox/>
    >
    > From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance: "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.
    >
    > If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
    >
    > Sean
    > ________________________________________
    > From: gandhi rajan <gandhirajan.n@gmail.com <ma...@gmail.com>>
    > Sent: Wednesday, July 17, 2019 12:26 PM
    > To: dev@ctakes.apache.org <ma...@ctakes.apache.org>
    > Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
    >
    > Hi Masoud, we had a similar requirement to identify patient names in the
    > narratives text and I had a discussion with Sean Finan on patient name
    > identification feature in cTAKES. What he told at that point in time was
    > cTAKES dint supported patient name identification feature. Also as far as I
    > know, I m not really sure whether scrubber made it to the cTAKES codebase.
    >
    > Sean, Please correct me if I m wrong.
    >
    > On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
    >
    >> Dear cTAKES developer,
    >> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
    >> Institute for Clinical and Translational Research and work on
    >> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
    >> goals we are targeting is de-identification of a large number of notes
    >> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
    >> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
    >> has been very helpful.
    >>
    >> One of our most desired features in the de-identification pipeline is
    >> synthetic replacement (e.g. Nancy->Sally; random female first name
    >> consistently replaces a female first name.). I wasn't able to find
    >> information about this feature in cTAKES Scrubber. Is synthetic replacement
    >> functionality part of the cTAKES Scrubber, or can it be added by
    >> post-processing the output? For instance, if we know the name Nancy is
    >> removed from multiple places, can we use a name dictionary to insert random
    >> female first names in those places (just a thought)?
    >> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
    >> candidates and I'm hoping that we could find ways to collaborate.
    >>
    >> Thank you very much,
    >> Masoud
    >>
    >> ----
    >> Masoud Rouhizadeh, PhD
    >> Faculty - Division of Health Science Informatics (DHSI)
    >> NLP Lead - Institute for Clinical and Translational Research (ICTR)
    >> Johns Hopkins University School of Medicine
    >> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=>
    >>
    >>
    >
    > --
    > Regards,
    > Gandhi
    >
    > "The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by "Lingren, Todd" <To...@cchmc.org>.

We had some similar work on de-id and "re-id".

The impact on performance for NER tasks was minimal.

https://academic.oup.com/jamia/article/20/1/84/2909298

The PHI replacement task used data based on the US Census distribution.

https://www.sciencedirect.com/science/article/pii/S1532046414000161
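
One plausible reading of "based on the US Census distribution" is that replacement names are drawn in proportion to how common they are, so the surrogate text keeps a realistic mix of names. The sketch below illustrates only that idea; the exact procedure used in the linked paper is not described in this thread. The function name sample_surnames is hypothetical, and the counts are made-up placeholders, not actual census figures.

```python
import random

# Placeholder frequency table: these counts are invented for illustration
# and are NOT real US Census figures.
SURNAME_COUNTS = {
    "Smith": 240,
    "Johnson": 190,
    "Garcia": 120,
    "Nguyen": 40,
    "Quincy": 1,
}


def sample_surnames(k, counts=SURNAME_COUNTS, seed=None):
    """Draw k surnames with probability proportional to their counts."""
    rng = random.Random(seed)
    names = list(counts)
    weights = [counts[name] for name in names]
    return rng.choices(names, weights=weights, k=k)
```

With a table like this, common surnames dominate the replacements and rare ones appear only occasionally, which is the property a frequency-based replacement would aim for.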

----------------------

Todd Lingren, M.S.
Division of Biomedical Informatics
Cincinnati Children's Hospital
todd.lingren@cchmc.org
(513) 803-9032

________________________________
From: Peter Szolovits <ps...@mit.edu>
Sent: Wednesday, July 17, 2019 1:12:21 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

My group has done considerable work on de-identification and on synthesizing pseudonymous data to replace the original PHI with plausible but inauthentic data (sometimes confusingly called re-identification).

One conclusion I reached from that work is that the de-identification and the pseudonym generation should be tightly coupled. For example, if de-id replaces all people’s names by [person], then there is no way in the pseudonym generation to make sure that the same real person’s name is replaced by the same pseudonym in every occurrence, leading to much harder to interpret text.  The same goes for other PHI categories.

I think it’s also important to keep similar formatting if the pseudonymized data are going to be used for NLP learning tasks.  So, for example, the format of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. Nicknames are a problem as well; if the same document also refers to Joe, and the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement for Joe should be Bob.  Gender is also tough because there are so many names that are either ambiguous or not in name dictionaries.

Date shifting also introduces pseudonymization problems.  For example, a patient admitted on December 15 may have a note saying they are expected to be discharged right after Christmas. If the admission date is shifted, say to mid-January, then retaining the discharge expectation would imply a very long anticipated hospital stay.

We published a paper on this topic:
https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27 <https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27>

I also have some old Java code that deal with a few of these issues, and would be happy to share with anyone interested, though it’s far from production quality and does not address all the issues we know.

—Peter Szolovits

> On Jul 17, 2019, at 12:42 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
>
> Hi All,
>
> ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/ <https://svn.apache.org/repos/asf/ctakes/sandbox/>
>
> From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance: "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.
>
> If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
>
> Sean
> ________________________________________
> From: gandhi rajan <gandhirajan.n@gmail.com <ma...@gmail.com>>
> Sent: Wednesday, July 17, 2019 12:26 PM
> To: dev@ctakes.apache.org <ma...@ctakes.apache.org>
> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
>
> Hi Masoud, we had a similar requirement to identify patient names in the
> narratives text and I had a discussion with Sean Finan on patient name
> identification feature in cTAKES. What he told at that point in time was
> cTAKES dint supported patient name identification feature. Also as far as I
> know, I m not really sure whether scrubber made it to the cTAKES codebase.
>
> Sean, Please correct me if I m wrong.
>
> On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
>
>> Dear cTAKES developer,
>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
>> Institute for Clinical and Translational Research and work on
>> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
>> goals we are targeting is de-identification of a large number of notes
>> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
>> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
>> has been very helpful.
>>
>> One of our most desired features in the de-identification pipeline is
>> synthetic replacement (e.g. Nancy->Sally; random female first name
>> consistently replaces a female first name.). I wasn't able to find
>> information about this feature in cTAKES Scrubber. Is synthetic replacement
>> functionality part of the cTAKES Scrubber, or can it be added by
>> post-processing the output? For instance, if we know the name Nancy is
>> removed from multiple places, can we use a name dictionary to insert random
>> female first names in those places (just a thought)?
>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
>> candidates and I'm hoping that we could find ways to collaborate.
>>
>> Thank you very much,
>> Masoud
>>
>> ----
>> Masoud Rouhizadeh, PhD
>> Faculty - Division of Health Science Informatics (DHSI)
>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
>> Johns Hopkins University School of Medicine
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=>
>>
>>
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by Peter Szolovits <ps...@mit.edu>.

My group has done considerable work on de-identification and on synthesizing pseudonymous data to replace the original PHI with plausible but inauthentic data (sometimes confusingly called re-identification). 

One conclusion I reached from that work is that the de-identification and the pseudonym generation should be tightly coupled. For example, if de-id replaces all people’s names by [person], then the pseudonym generation step has no way to make sure that the same real person’s name is replaced by the same pseudonym in every occurrence, which makes the text much harder to interpret.  The same goes for other PHI categories.
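
The coupling argument above can be sketched as follows. This is an illustration under stated assumptions, not the code referred to later in this message: it assumes the de-id step hands over the token positions of detected names while the real names are still visible, which is exactly the information lost once everything has become [person]. PseudonymMapper and replace_names are hypothetical names, and the candidate list is a toy example.

```python
import random


class PseudonymMapper:
    """Assigns each real name one pseudonym and reuses it on every occurrence."""

    def __init__(self, candidates, seed=None):
        self._rng = random.Random(seed)
        self._candidates = list(candidates)
        self._assigned = {}  # lower-cased real name -> pseudonym

    def pseudonym_for(self, real_name):
        key = real_name.lower()
        if key not in self._assigned:
            taken = set(self._assigned.values())
            free = [c for c in self._candidates
                    if c not in taken and c.lower() != key]
            # Fall back to reusing a pseudonym if the candidate list runs out.
            self._assigned[key] = self._rng.choice(free or self._candidates)
        return self._assigned[key]


def replace_names(tokens, name_positions, mapper):
    """Replace the tokens at name_positions (as found by the de-id step)."""
    out = list(tokens)
    for i in name_positions:
        out[i] = mapper.pseudonym_for(tokens[i])
    return out


tokens = ["Nancy", "called", "Joan", "after", "Nancy", "fell"]
result = replace_names(tokens, [0, 2, 4], PseudonymMapper(["Sally", "Maria", "Helen"], seed=0))
```

Both occurrences of "Nancy" receive the same replacement and "Joan" receives a different one, so a reader can still tell who did what.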

I think it’s also important to keep similar formatting if the pseudonymized data are going to be used for NLP learning tasks.  So, for example, the format of names should be preserved; e.g., Smith, Joseph P. vs Joseph P. Smith. Nicknames are a problem as well; if the same document also refers to Joe, and the generated pseudonym for Mr. Smith is Robert J. Quincy, then the replacement for Joe should be Bob.  Gender is also tough because there are so many names that are either ambiguous or not in name dictionaries.
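
As a rough illustration of the formatting point, the sketch below renders a pseudonym in the same surface form as the original mention, using the Smith/Quincy example from the paragraph above. It is a toy under stated assumptions: the Name record, the two regular expressions and render_like are hypothetical, the nickname is supplied by hand rather than looked up, and titles, hyphenated names and gender are not handled.

```python
import re
from typing import NamedTuple


class Name(NamedTuple):
    first: str
    middle_initial: str
    last: str
    nickname: str


# "Smith, Joseph P." and "Joseph P. Smith"; the middle initial is optional.
LAST_FIRST = re.compile(r"^[A-Z][a-z]+, [A-Z][a-z]+(?P<mi> [A-Z]\.)?$")
FIRST_LAST = re.compile(r"^[A-Z][a-z]+(?P<mi> [A-Z]\.)? [A-Z][a-z]+$")


def render_like(mention: str, real: Name, fake: Name) -> str:
    """Render the pseudonym `fake` in the same surface form as `mention`."""
    single = {real.nickname: fake.nickname, real.first: fake.first, real.last: fake.last}
    if mention in single:
        return single[mention]
    m = LAST_FIRST.match(mention)
    if m:
        mi = f" {fake.middle_initial}." if m.group("mi") else ""
        return f"{fake.last}, {fake.first}{mi}"
    m = FIRST_LAST.match(mention)
    if m:
        mi = f" {fake.middle_initial}." if m.group("mi") else ""
        return f"{fake.first}{mi} {fake.last}"
    return f"{fake.first} {fake.last}"  # unknown format: fall back to "First Last"


REAL = Name("Joseph", "P", "Smith", "Joe")
FAKE = Name("Robert", "J", "Quincy", "Bob")
formal = render_like("Smith, Joseph P.", REAL, FAKE)  # "Quincy, Robert J."
casual = render_like("Joe", REAL, FAKE)               # "Bob"
```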

Date shifting also introduces pseudonymization problems.  For example, a patient admitted on December 15 may have a note saying they are expected to be discharged right after Christmas. If the admission date is shifted, say to mid-January, then retaining the discharge expectation would imply a very long anticipated hospital stay.
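
The arithmetic behind this example can be worked through in a few lines. The year and the 30-day shift are assumptions chosen for illustration, and "right after Christmas" is taken to mean December 26; next_day_after_christmas is a hypothetical helper.

```python
from datetime import date, timedelta


def next_day_after_christmas(on_or_after: date) -> date:
    """Return the first December 26 that falls on or after the given date."""
    candidate = date(on_or_after.year, 12, 26)
    if candidate < on_or_after:
        candidate = date(on_or_after.year + 1, 12, 26)
    return candidate


admission = date(2018, 12, 15)              # year assumed for illustration
shifted_admission = admission + timedelta(days=30)  # 2019-01-14

# Stay implied by "discharged right after Christmas", before and after shifting.
original_stay = (next_day_after_christmas(admission) - admission).days                # 11
implied_stay = (next_day_after_christmas(shifted_admission) - shifted_admission).days  # 346
```

An 11-day expected stay turns into an implied stay of nearly a year, because the explicit date was shifted while the holiday reference in the free text was not.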

We published a paper on this topic:
https://link.springer.com/chapter/10.1007/978-3-319-23633-9_27

I also have some old Java code that deals with a few of these issues, and would be happy to share it with anyone interested, though it’s far from production quality and does not address all the issues we know of.

—Peter Szolovits

> On Jul 17, 2019, at 12:42 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
> 
> Hi All,
> 
> ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/ <https://svn.apache.org/repos/asf/ctakes/sandbox/>
> 
> From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance: "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.
> 
> If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
> 
> Sean
> ________________________________________
> From: gandhi rajan <gandhirajan.n@gmail.com <ma...@gmail.com>>
> Sent: Wednesday, July 17, 2019 12:26 PM
> To: dev@ctakes.apache.org <ma...@ctakes.apache.org>
> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
> 
> Hi Masoud, we had a similar requirement to identify patient names in the
> narratives text and I had a discussion with Sean Finan on patient name
> identification feature in cTAKES. What he told at that point in time was
> cTAKES dint supported patient name identification feature. Also as far as I
> know, I m not really sure whether scrubber made it to the cTAKES codebase.
> 
> Sean, Please correct me if I m wrong.
> 
> On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
> 
>> Dear cTAKES developer,
>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
>> Institute for Clinical and Translational Research and work on
>> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
>> goals we are targeting is de-identification of a large number of notes
>> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
>> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
>> has been very helpful.
>> 
>> One of our most desired features in the de-identification pipeline is
>> synthetic replacement (e.g. Nancy->Sally; random female first name
>> consistently replaces a female first name.). I wasn't able to find
>> information about this feature in cTAKES Scrubber. Is synthetic replacement
>> functionality part of the cTAKES Scrubber, or can it be added by
>> post-processing the output? For instance, if we know the name Nancy is
>> removed from multiple places, can we use a name dictionary to insert random
>> female first names in those places (just a thought)?
>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
>> candidates and I'm hoping that we could find ways to collaborate.
>> 
>> Thank you very much,
>> Masoud
>> 
>> ----
>> Masoud Rouhizadeh, PhD
>> Faculty - Division of Health Science Informatics (DHSI)
>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
>> Johns Hopkins University School of Medicine
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=>
>> 
>> 
> 
> --
> Regards,
> Gandhi
> 
> "The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by gandhi rajan <ga...@gmail.com>.

Hi Ravi,

Send an email to dev-unsubscribe@ctakes.apache.org to unsubscribe.

On Wednesday, July 17, 2019, Ravi Tejwani <ra...@icloud.com.invalid>
wrote:

> How can I un-subscribe from this? Any help would be kindly appreciated.
>
> - Ravi
>
> > On Jul 17, 2019, at 12:53 PM, gandhi rajan <ga...@gmail.com>
> wrote:
> >
> > Thanks for the insight Sean.
> >
> > On Wednesday, July 17, 2019, Finan, Sean <
> Sean.Finan@childrens.harvard.edu>
> > wrote:
> >
> >> Hi All,
> >>
> >> ctakes-scrubber is not in any ctakes release and it is not in the main
> >> repository.  It never went beyond experimental and resides within the
> >> ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/
> >>
> >> From what I recall, scrubber does not have "real" name replacement, but
> >> instead de-identifies entities by removing them and inserting a tag
> >> indicating the type of entity.  For instance:  "John has a rash" ->
> >> "[person] has a rash".   That is not verbatim, but it is the general
> idea.
> >>
> >> If you can get ctakes-scrubber working in your project then it would be
> >> pretty easy to create an engine that does nothing except replace such
> >> generic tags with random names, dates, institutions, etc.
> >>
> >> Sean
> >> ________________________________________
> >> From: gandhi rajan <ga...@gmail.com>
> >> Sent: Wednesday, July 17, 2019 12:26 PM
> >> To: dev@ctakes.apache.org
> >> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
> >>
> >> Hi Masoud, we had a similar requirement to identify patient names in the
> >> narratives text and I had a discussion with Sean Finan on patient name
> >> identification feature in cTAKES. What he told at that point in time was
> >> cTAKES dint supported patient name identification feature. Also as far
> as I
> >> know, I m not really sure whether scrubber made it to the cTAKES
> codebase.
> >>
> >> Sean, Please correct me if I m wrong.
> >>
> >> On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
> >>
> >>> Dear cTAKES developer,
> >>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
> >>> Institute for Clinical and Translational Research and work on
> >>> enterprise-level NLP projects at Johns Hopkins Medicine. One of the
> major
> >>> goals we are targeting is de-identification of a large number of notes
> >>> (350M) to prepare them for search and indexing (Elasticsearch and
> Solr).
> >> I
> >>> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and
> she
> >>> has been very helpful.
> >>>
> >>> One of our most desired features in the de-identification pipeline is
> >>> synthetic replacement (e.g. Nancy->Sally; random female first name
> >>> consistently replaces a female first name.). I wasn't able to find
> >>> information about this feature in cTAKES Scrubber. Is synthetic
> >> replacement
> >>> functionality part of the cTAKES Scrubber, or can it be added by
> >>> post-processing the output? For instance, if we know the name Nancy is
> >>> removed from multiple places, can we use a name dictionary to insert
> >> random
> >>> female first names in those places (just a thought)?
> >>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
> >>> candidates and I'm hoping that we could find ways to collaborate.
> >>>
> >>> Thank you very much,
> >>> Masoud
> >>>
> >>> ----
> >>> Masoud Rouhizadeh, PhD
> >>> Faculty - Division of Health Science Informatics (DHSI)
> >>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
> >>> Johns Hopkins University School of Medicine
> >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.
> >> jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_
> 3xhKwEW14JZMSdioCoppxeFU&r=
> >> fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=
> >> aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_
> >> sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=
> >>>
> >>>
> >>
> >> --
> >> Regards,
> >> Gandhi
> >>
> >> "The best way to find urself is to lose urself in the service of others
> >> !!!"
> >>
> >
> >
> > --
> > Regards,
> > Gandhi
> >
> > "The best way to find urself is to lose urself in the service of others
> !!!"
>
>

-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by Ravi Tejwani <ra...@icloud.com.INVALID>.

How can I unsubscribe from this list? Any help would be greatly appreciated.

- Ravi

> On Jul 17, 2019, at 12:53 PM, gandhi rajan <ga...@gmail.com> wrote:
> 
> Thanks for the insight Sean.
> 
> On Wednesday, July 17, 2019, Finan, Sean <Se...@childrens.harvard.edu>
> wrote:
> 
>> Hi All,
>> 
>> ctakes-scrubber is not in any ctakes release and it is not in the main
>> repository.  It never went beyond experimental and resides within the
>> ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/
>> 
>> From what I recall, scrubber does not have "real" name replacement, but
>> instead de-identifies entities by removing them and inserting a tag
>> indicating the type of entity.  For instance:  "John has a rash" ->
>> "[person] has a rash".   That is not verbatim, but it is the general idea.
>> 
>> If you can get ctakes-scrubber working in your project then it would be
>> pretty easy to create an engine that does nothing except replace such
>> generic tags with random names, dates, institutions, etc.
>> 
>> Sean
>> ________________________________________
>> From: gandhi rajan <ga...@gmail.com>
>> Sent: Wednesday, July 17, 2019 12:26 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
>> 
>> Hi Masoud, we had a similar requirement to identify patient names in
>> narrative text, and I had a discussion with Sean Finan about a patient name
>> identification feature in cTAKES. What he told me at the time was that
>> cTAKES didn't support patient name identification. Also, I'm not really
>> sure whether Scrubber made it into the cTAKES codebase.
>> 
>> Sean, please correct me if I'm wrong.
>> 
>> On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
>> 
>>> Dear cTAKES developer,
>>> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
>>> Institute for Clinical and Translational Research and work on
>>> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
>>> goals we are targeting is de-identification of a large number of notes
>>> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
>>> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
>>> has been very helpful.
>>> 
>>> One of our most desired features in the de-identification pipeline is
>>> synthetic replacement (e.g. Nancy->Sally; random female first name
>>> consistently replaces a female first name.). I wasn't able to find
>>> information about this feature in cTAKES Scrubber. Is synthetic replacement
>>> functionality part of the cTAKES Scrubber, or can it be added by
>>> post-processing the output? For instance, if we know the name Nancy is
>>> removed from multiple places, can we use a name dictionary to insert random
>>> female first names in those places (just a thought)?
>>> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
>>> candidates and I'm hoping that we could find ways to collaborate.
>>> 
>>> Thank you very much,
>>> Masoud
>>> 
>>> ----
>>> Masoud Rouhizadeh, PhD
>>> Faculty - Division of Health Science Informatics (DHSI)
>>> NLP Lead - Institute for Clinical and Translational Research (ICTR)
>>> Johns Hopkins University School of Medicine
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=
>>> 
>>> 
>> 
>> --
>> Regards,
>> Gandhi
>> 
>> "The best way to find urself is to lose urself in the service of others !!!"
>> 
> 
> 
> -- 
> Regards,
> Gandhi
> 
> "The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by gandhi rajan <ga...@gmail.com>.

Thanks for the insight, Sean.

On Wednesday, July 17, 2019, Finan, Sean <Se...@childrens.harvard.edu>
wrote:

> Hi All,
>
> ctakes-scrubber is not in any ctakes release and it is not in the main
> repository.  It never went beyond experimental and resides within the
> ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/
>
> From what I recall, scrubber does not have "real" name replacement, but
> instead de-identifies entities by removing them and inserting a tag
> indicating the type of entity.  For instance:  "John has a rash" ->
> "[person] has a rash".   That is not verbatim, but it is the general idea.
>
> If you can get ctakes-scrubber working in your project then it would be
> pretty easy to create an engine that does nothing except replace such
> generic tags with random names, dates, institutions, etc.
>
> Sean
> ________________________________________
> From: gandhi rajan <ga...@gmail.com>
> Sent: Wednesday, July 17, 2019 12:26 PM
> To: dev@ctakes.apache.org
> Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]
>
> Hi Masoud, we had a similar requirement to identify patient names in
> narrative text, and I had a discussion with Sean Finan about a patient name
> identification feature in cTAKES. What he told me at the time was that
> cTAKES didn't support patient name identification. Also, I'm not really
> sure whether Scrubber made it into the cTAKES codebase.
>
> Sean, please correct me if I'm wrong.
>
> On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:
>
> > Dear cTAKES developer,
> > This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
> > Institute for Clinical and Translational Research and work on
> > enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
> > goals we are targeting is de-identification of a large number of notes
> > (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
> > have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
> > has been very helpful.
> >
> > One of our most desired features in the de-identification pipeline is
> > synthetic replacement (e.g. Nancy->Sally; random female first name
> > consistently replaces a female first name.). I wasn't able to find
> > information about this feature in cTAKES Scrubber. Is synthetic replacement
> > functionality part of the cTAKES Scrubber, or can it be added by
> > post-processing the output? For instance, if we know the name Nancy is
> > removed from multiple places, can we use a name dictionary to insert random
> > female first names in those places (just a thought)?
> > Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
> > candidates and I'm hoping that we could find ways to collaborate.
> >
> > Thank you very much,
> > Masoud
> >
> > ----
> > Masoud Rouhizadeh, PhD
> > Faculty - Division of Health Science Informatics (DHSI)
> > NLP Lead - Institute for Clinical and Translational Research (ICTR)
> > Johns Hopkins University School of Medicine
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=
> >
> >
>
> --
> Regards,
> Gandhi
>
> "The best way to find urself is to lose urself in the service of others !!!"
>


-- 
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"

Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi All,

ctakes-scrubber is not in any ctakes release and it is not in the main repository.  It never went beyond experimental and resides within the ctakes sandbox.  https://svn.apache.org/repos/asf/ctakes/sandbox/

From what I recall, scrubber does not have "real" name replacement, but instead de-identifies entities by removing them and inserting a tag indicating the type of entity.  For instance:  "John has a rash" -> "[person] has a rash".   That is not verbatim, but it is the general idea.

If you can get ctakes-scrubber working in your project then it would be pretty easy to create an engine that does nothing except replace such generic tags with random names, dates, institutions, etc.
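
As a rough illustration of that idea (this is not code from cTAKES or ctakes-scrubber), a post-processing step could look something like the Python sketch below. It assumes the scrubbed text contains bracketed type tags such as "[person]", which is only the general shape described above and not the verbatim scrubber output. The SURROGATES lists, the SurrogateMapper class and the replace_tags function are all hypothetical names made up for the example. Note that once a name has been replaced by a bare type tag, the original is gone, so consistent replacement (Nancy always becoming the same surrogate) would need the original surface string to still be available at replacement time; the optional "original" argument shows one way that could be handled. Matching the gender of the surrogate is not attempted here.

```python
import random
import re

# Hypothetical surrogate dictionaries; a real pipeline would load much larger lists.
SURROGATES = {
    "person": ["Sally", "Maria", "James", "Robert"],
    "institution": ["General Hospital", "Lakeside Clinic"],
    "date": ["2001-01-01", "2002-02-02"],
}

# Assumed tag shape: a single word in square brackets, e.g. "[person]".
TAG_PATTERN = re.compile(r"\[(\w+)\]")


class SurrogateMapper:
    """Picks surrogate values, reusing the same one for a repeated original."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._cache = {}

    def surrogate(self, entity_type, original=None):
        choices = SURROGATES.get(entity_type)
        if choices is None:
            return None  # unknown entity type: the caller keeps the tag
        if original is None:
            # No original available: all we can do is pick at random.
            return self._rng.choice(choices)
        key = (entity_type, original.lower())
        if key not in self._cache:
            self._cache[key] = self._rng.choice(choices)
        return self._cache[key]


def replace_tags(text, mapper):
    """Replace each known "[type]" tag in text with a random surrogate."""

    def _sub(match):
        value = mapper.surrogate(match.group(1).lower())
        return value if value is not None else match.group(0)

    return TAG_PATTERN.sub(_sub, text)


if __name__ == "__main__":
    mapper = SurrogateMapper(seed=42)
    print(replace_tags("[person] has a rash", mapper))
    # Same original, same surrogate (case-insensitive in this sketch):
    print(mapper.surrogate("person", "Nancy") == mapper.surrogate("person", "nancy"))
```

Tags with an unknown type are left untouched rather than guessed at, and two different originals may still end up with the same surrogate because the lists here are tiny.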

Sean
________________________________________
From: gandhi rajan <ga...@gmail.com>
Sent: Wednesday, July 17, 2019 12:26 PM
To: dev@ctakes.apache.org
Subject: Re: Synthetic replacement feature in cTAKES Scrubber [EXTERNAL]

Hi Masoud, we had a similar requirement to identify patient names in
narrative text, and I had a discussion with Sean Finan about a patient name
identification feature in cTAKES. What he told me at the time was that
cTAKES didn't support patient name identification. Also, I'm not really
sure whether Scrubber made it into the cTAKES codebase.

Sean, please correct me if I'm wrong.

On Wednesday, July 17, 2019, Masoud Rouhizadeh <mr...@jhu.edu> wrote:

> Dear cTAKES developer,
> This is Masoud Rouhizadeh from JHU. I'm leading the NLP effort at the
> Institute for Clinical and Translational Research and work on
> enterprise-level NLP projects at Johns Hopkins Medicine. One of the major
> goals we are targeting is de-identification of a large number of notes
> (350M) to prepare them for search and indexing (Elasticsearch and Solr). I
> have been in touch with Dr. Guergana Savova about cTAKES Scrubber and she
> has been very helpful.
>
> One of our most desired features in the de-identification pipeline is
> synthetic replacement (e.g. Nancy->Sally; random female first name
> consistently replaces a female first name.). I wasn't able to find
> information about this feature in cTAKES Scrubber. Is synthetic replacement
> functionality part of the cTAKES Scrubber, or can it be added by
> post-processing the output? For instance, if we know the name Nancy is
> removed from multiple places, can we use a name dictionary to insert random
> female first names in those places (just a thought)?
> Overall, I wanted to emphasize that cTAKES Scrubber is one of our main
> candidates and I'm hoping that we could find ways to collaborate.
>
> Thank you very much,
> Masoud
>
> ----
> Masoud Rouhizadeh, PhD
> Faculty - Division of Health Science Informatics (DHSI)
> NLP Lead - Institute for Clinical and Translational Research (ICTR)
> Johns Hopkins University School of Medicine
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cs.jhu.edu_-7Emrou_&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=aIXsCuGWJqYNNtMb1ZfvZ0gAiw57gtrpZGqLVZjn5o4&s=9mLpsY5OPs7_sAMhA60kB0PJcsttBBK6BYRN_xThZSo&e=
>
>

--
Regards,
Gandhi

"The best way to find urself is to lose urself in the service of others !!!"