Posted to user@ctakes.apache.org by "Hari, Sekhar" <se...@cgi.com> on 2017/06/26 00:30:02 UTC

Visit segregation and extraction

Hello there -

I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.

The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -

"Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."

Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?

Many thanks,
Sekhar H.

RE: Visit segregation and extraction [EXTERNAL]

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.

You probably have to add some logic on top of the cTAKES extracted information to distinguish inpatient v outpatient text. 
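
As a rough illustration of what such logic on top of the extracted information might look like, here is a minimal Python sketch of the final selection step of the measure. It does not use any cTAKES API. The BPReading record, its visit_type labels and the function name are all hypothetical, and it assumes that each reading has already been extracted and tagged with a date and a visit type upstream. It also assumes "after the diagnosis" means strictly later than the diagnosis date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class BPReading:
    """One extracted blood pressure reading (hypothetical record layout)."""
    systolic: int
    diastolic: int
    taken_on: date
    visit_type: str  # assumed label, e.g. "outpatient", "inpatient", "ed"


def most_recent_outpatient_bp(
    readings: Iterable[BPReading], hypertension_dx: date
) -> Optional[BPReading]:
    """Return the latest outpatient reading dated after the diagnosis.

    Readings from any visit type other than "outpatient" are ignored,
    which also drops readings whose visit type could not be determined.
    """
    eligible = [
        r for r in readings
        if r.visit_type == "outpatient" and r.taken_on > hypertension_dx
    ]
    # max() with default=None returns None when nothing qualifies.
    return max(eligible, key=lambda r: r.taken_on, default=None)
```

The hard part remains assigning visit_type reliably; this sketch only shows that once each reading carries that label, the measure itself reduces to a filter and a maximum.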
--Guergana


Guergana Savova, PhD, FACMI
Associate Professor
PI Natural Language Processing Lab
Boston Children's Hospital and Harvard Medical School
300 Longwood Avenue
Mailstop: BCH3092
Enders 144.1
Boston, MA 02115
Tel: (617) 919-2972
Fax: (617) 730-0817
Guergana.Savova@childrens.harvard.edu
Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
http://ctakes.apache.org  
http://thyme.healthnlp.org 
http://cancer.healthnlp.org 
http://share.healthnlp.org
http://center.healthnlp.org  


-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com] 
Sent: Monday, June 26, 2017 12:44 AM
To: user@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

These are already readable PDFs and not images. The clinical documents came through to me as scanned images. We then converted those images into readable PDFs using OCR. cTAKES is able to read the texts. But I want to understand if it can distinguish BP test result performed during an outpatient visit and in a non-outpatient visit (such as inpatient stay, ED visit, diagnostic test, or surgical procedure). The texts are cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?



On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -
    
    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.
    
    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -
    
    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."
    
    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?
    
    Many thanks,
    Sekhar H.

FW: Visit segregation and extraction [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

-----Original Message-----
From: Finan, Sean 
Sent: Monday, June 26, 2017 1:25 PM
To: 'Hari, Sekhar'
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sekhar,

I don't know of any open source straight convertors.  Perhaps somebody on the devlist does?  Does anybody have a special writer that can be plugged into ctakes?

Sean

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com] 
Sent: Monday, June 26, 2017 12:12 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org; Finan, Sean
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sean - Many thanks. I will study that approach further.

On another note, do you know of any Open Source software or a robust method to convert free text clinical documents (such as progress notes, H&P notes etc.) to a structured HL7 format (CCD or QRDA XML file)?

Thanks,
Sekhar H.
________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: 26 June 2017 20:42
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sekhar,

With regard to what Chris wrote ... He is correct, you should somehow get the patient notes into a raw text format.  ctakes does not directly handle pdf files, which contain metadata and instructions in addition to the raw text.  If your ocr tool can save directly to text, please try to use that functionality.

To address your original question, ctakes does not know what kind of note is fed in without some help.  If the notes have some kind of header or footer that distinguishes the type of note then you can write a parser to handle that text and somehow pass the information through to the end of your pipeline.  If you trust a same type of data (e.g. BP) to be in differently named sections depending upon the note type then you can use a sectionizer and store and use the section name.  There may be better techniques that others can recommend.

Sean

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com]
Sent: Monday, June 26, 2017 12:44 AM
To: user@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

These are already readable PDFs and not images. The clinical documents came through to me as scanned images. We then converted those images into readable PDFs using OCR. cTAKES is able to read the texts. But I want to understand if it can distinguish BP test result performed during an outpatient visit and in a non-outpatient visit (such as inpatient stay, ED visit, diagnostic test, or surgical procedure). The texts are cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?

On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -

    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.

    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -

    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."

    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?

    Many thanks,
    Sekhar H.

RE: Visit segregation and extraction [EXTERNAL]

Posted by Brian Wilson <Br...@hsl.harvard.edu>.

Roll your own with http://hl7api.sourceforge.net ?
-B

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com] 
Sent: Monday, June 26, 2017 12:12 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org; Sean.Finan@childrens.harvard.edu
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sean - Many thanks. I will study that approach further.

On another note, do you know of any Open Source software or a robust method to convert free text clinical documents (such as progress notes, H&P notes etc.) to a structured HL7 format (CCD or QRDA XML file)?

Thanks,
Sekhar H.
________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: 26 June 2017 20:42
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sekhar,

With regard to what Chris wrote ... He is correct, you should somehow get the patient notes into a raw text format.  ctakes does not directly handle pdf files, which contain metadata and instructions in addition to the raw text.  If your ocr tool can save directly to text, please try to use that functionality.

To address your original question, ctakes does not know what kind of note is fed in without some help.  If the notes have some kind of header or footer that distinguishes the type of note then you can write a parser to handle that text and somehow pass the information through to the end of your pipeline.  If you trust a same type of data (e.g. BP) to be in differently named sections depending upon the note type then you can use a sectionizer and store and use the section name.  There may be better techniques that others can recommend.

Sean

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com]
Sent: Monday, June 26, 2017 12:44 AM
To: user@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

These are already readable PDFs and not images. The clinical documents came through to me as scanned images. We then converted those images into readable PDFs using OCR. cTAKES is able to read the texts. But I want to understand if it can distinguish BP test result performed during an outpatient visit and in a non-outpatient visit (such as inpatient stay, ED visit, diagnostic test, or surgical procedure). The texts are cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?

On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -

    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.

    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -

    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."

    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?

    Many thanks,
    Sekhar H.

----------------------------------------------------------------------
CONFIDENTIAL NOTICE:

This electronic mail transmission contains confidential information 
including Protected Health Information (PHI) that is legally privileged. 
If you are not the intended recipient, or designee, you are hereby 
notified that any disclosure, copying, distribution or use of any and
all attachments to this transmission is STRICTLY PROHIBITED. If you 
have received this transmission in error, please notify the sender 
immediately to arrange for return or destruction of these documents.

RE: Visit segregation and extraction [EXTERNAL]

Posted by "Hari, Sekhar" <se...@cgi.com>.

Hi Sean - Many thanks. I will study that approach further.

On another note, do you know of any Open Source software or a robust method to convert free text clinical documents (such as progress notes, H&P notes etc.) to a structured HL7 format (CCD or QRDA XML file)?

Thanks,
Sekhar H.
________________________________________
From: Finan, Sean [Sean.Finan@childrens.harvard.edu]
Sent: 26 June 2017 20:42
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

Hi Sekhar,

With regard to what Chris wrote ... He is correct, you should somehow get the patient notes into a raw text format.  ctakes does not directly handle pdf files, which contain metadata and instructions in addition to the raw text.  If your ocr tool can save directly to text, please try to use that functionality.

To address your original question, ctakes does not know what kind of note is fed in without some help.  If the notes have some kind of header or footer that distinguishes the type of note then you can write a parser to handle that text and somehow pass the information through to the end of your pipeline.  If you trust a same type of data (e.g. BP) to be in differently named sections depending upon the note type then you can use a sectionizer and store and use the section name.  There may be better techniques that others can recommend.

Sean

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com]
Sent: Monday, June 26, 2017 12:44 AM
To: user@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

These are already readable PDFs and not images. The clinical documents came through to me as scanned images. We then converted those images into readable PDFs using OCR. cTAKES is able to read the texts. But I want to understand if it can distinguish BP test result performed during an outpatient visit and in a non-outpatient visit (such as inpatient stay, ED visit, diagnostic test, or surgical procedure). The texts are cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?

On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -

    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.

    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -

    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."

    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?

    Many thanks,
    Sekhar H.

RE: Visit segregation and extraction [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Sekhar,

With regard to what Chris wrote ... He is correct, you should somehow get the patient notes into a raw text format.  ctakes does not directly handle pdf files, which contain metadata and instructions in addition to the raw text.  If your ocr tool can save directly to text, please try to use that functionality.

To address your original question, cTAKES does not know what kind of note is fed in without some help.  If the notes have some kind of header or footer that distinguishes the type of note, then you can write a parser to handle that text and pass the information through to the end of your pipeline.  If you trust the same type of data (e.g. BP) to be in differently named sections depending upon the note type, then you can use a sectionizer and store and use the section name.  There may be better techniques that others can recommend.
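
For illustration only, a header-based parser along these lines could be sketched in Python as below. This is not cTAKES code. The "NOTE TYPE:" header format, the note type names and the mapping to visit categories are all assumptions made up for the example; real notes would need patterns written against their actual headers or footers.

```python
import re
from typing import List, Tuple

# Hypothetical header line, e.g. "NOTE TYPE: Office Visit Note".
HEADER_RE = re.compile(
    r"^NOTE TYPE:[ \t]*(.+?)[ \t]*$", re.MULTILINE | re.IGNORECASE
)

# Hypothetical mapping from note type to visit category.
VISIT_CATEGORY = {
    "office visit note": "outpatient",
    "discharge summary": "inpatient",
    "ed note": "ed",
    "radiology note": "diagnostic_test",
    "operative note": "surgical_procedure",
}


def split_notes(patient_text: str) -> List[Tuple[str, str, str]]:
    """Split one patient's text into (note_type, visit_category, body).

    Each note runs from its header line to the next header (or the end
    of the text). Note types missing from the mapping get "unknown".
    Text before the first header is not returned.
    """
    matches = list(HEADER_RE.finditer(patient_text))
    notes = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(patient_text)
        note_type = m.group(1).lower()
        category = VISIT_CATEGORY.get(note_type, "unknown")
        body = patient_text[m.end():end].strip()
        notes.append((note_type, category, body))
    return notes
```

Each body could then be run through the pipeline as its own document, with the visit category carried alongside it so that downstream steps can ignore readings from non-outpatient notes.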

Sean

-----Original Message-----
From: Hari, Sekhar [mailto:sekhar.hari@cgi.com] 
Sent: Monday, June 26, 2017 12:44 AM
To: user@ctakes.apache.org; dev@ctakes.apache.org
Subject: RE: Visit segregation and extraction [EXTERNAL]

These are already readable PDFs and not images. The clinical documents came through to me as scanned images. We then converted those images into readable PDFs using OCR. cTAKES is able to read the texts. But I want to understand if it can distinguish BP test result performed during an outpatient visit and in a non-outpatient visit (such as inpatient stay, ED visit, diagnostic test, or surgical procedure). The texts are cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?



On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -
    
    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.
    
    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -
    
    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."
    
    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?
    
    Many thanks,
    Sekhar H.

RE: Visit segregation and extraction

Posted by "Hari, Sekhar" <se...@cgi.com>.

These are already readable PDFs and not images. The clinical documents came to me as scanned images, and we then converted those images into readable PDFs using OCR. cTAKES is able to read the text. But I want to understand whether it can distinguish a BP reading taken during an outpatient visit from one taken during a non-outpatient visit (such as an inpatient stay, ED visit, diagnostic test, or surgical procedure). The text is cluttered with different types of clinical documents (progress notes, radiology notes, H&P notes, etc.).

Thanks
Sekhar Hari | Program Lead
Health Sciences Business Innovation
ASDC CGI Health Solutions
Electronic City, Bangalore
Karnataka, India 560100

814 7027 779 (C)
080 6642 2536 (D)

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: 26 June 2017 10:03
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Visit segregation and extraction

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?



On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -
    
    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.
    
    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -
    
    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."
    
    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?
    
    Many thanks,
    Sekhar H.

Re: Visit segregation and extraction

Posted by Chris Mattmann <ma...@apache.org>.

Maybe start out with Apache Tika for text extraction from the PDFs, then run Apache cTAKES on the resultant text?
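
Once plain text is available, the outpatient versus non-outpatient filtering asked about in this thread would likely be custom logic layered on top of the extracted text. The following is only a minimal Python sketch of that idea, not a cTAKES or Tika API. It assumes each patient record has already been split into individual notes, that each note starts with a title naming its type, and that visit dates appear in YYYY-MM-DD form. The keyword lists, regular expressions, and function names (classify_setting, latest_outpatient_bp) are all hypothetical and would need tuning and validation against real notes.

```python
import re
from datetime import date

# Hypothetical title keywords; a real vocabulary would have to be built
# and validated against the actual note headers in the corpus.
NON_OUTPATIENT = re.compile(
    r"\b(inpatient|emergency department|ed visit|radiology|operative|discharge summary)\b",
    re.IGNORECASE,
)
OUTPATIENT = re.compile(
    r"\b(outpatient|office visit|clinic visit|progress note)\b",
    re.IGNORECASE,
)

# Assumed formats: "BP: 138/84", "Blood pressure 132/80", and ISO dates.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\D{0,15}(\d{2,3})\s*/\s*(\d{2,3})", re.IGNORECASE
)
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def classify_setting(note_text):
    """Label a note from its first 200 characters (assumed to hold the title)."""
    header = note_text[:200]
    # Exclusions are checked first, so "Outpatient Radiology" is still excluded.
    if NON_OUTPATIENT.search(header):
        return "non-outpatient"
    if OUTPATIENT.search(header):
        return "outpatient"
    return "unknown"


def latest_outpatient_bp(notes):
    """Return (visit_date, systolic, diastolic) for the newest outpatient note, or None."""
    readings = []
    for note in notes:
        if classify_setting(note) != "outpatient":
            continue
        date_match = DATE_PATTERN.search(note)
        bp_match = BP_PATTERN.search(note)
        if date_match and bp_match:
            visit_date = date(*(int(part) for part in date_match.groups()))
            readings.append(
                (visit_date, int(bp_match.group(1)), int(bp_match.group(2)))
            )
    # Pick the reading with the latest visit date.
    return max(readings, key=lambda r: r[0]) if readings else None


if __name__ == "__main__":
    sample_notes = [
        "Progress Note - Office Visit\nDate: 2017-03-10\nBP: 138/84 mmHg",
        "Emergency Department Note\nDate: 2017-05-02\nBP 170/100",
        "Progress Note - Clinic Visit\nDate: 2017-04-21\nBlood pressure 132/80",
    ]
    print(latest_outpatient_bp(sample_notes))
```

The sketch does not check that the reading falls after the hypertension diagnosis date, and notes labeled "unknown" are skipped rather than guessed at; both would need handling in a real measure calculation.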



On 6/25/17, 5:30 PM, "Hari, Sekhar" <se...@cgi.com> wrote:

    Hello there -
    
    I have a task in hand to process 7,000,000 patient records (PDF files) containing different clinical documents. Each PDF has 20 pages and one PDF = one patient.
    
    The information to retrieve from these documents is like this for a patient quality measure namely 'Controlling High Blood Pressure' -
    
    "Extract most recently documented blood pressure occurring after the diagnosis of hypertension (Do not use BP readings from inpatient stay, ED visit, diagnostic test, or surgical procedure). Blood pressure should be routinely assessed as part of a physical exam at each outpatient visit."
    
    Can cTAKES identify non-outpatient visits and outpatient visits separately? Are there specific pipelines that we should use to solve this problem?
    
    Many thanks,
    Sekhar H.
