Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2017/10/16 14:46:47 UTC

RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

I have only used Tesseract.  Unfortunately, no OCR is perfect.
I am by no means an expert on Tesseract, but perhaps I can help to get you started ...

There are tricks that you can use to get it to work better with medical notes (besides training on fonts).  Possibly the most effective is to restrict recognition to a whitelist of desired characters by setting tessedit_char_whitelist to a series of characters that doesn't include things like hash, dollar, bar ...  Another is to add a wordlist that contains words pertinent to your domain.  See:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html
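
As a rough illustration of the two tricks above, here is a minimal Python sketch that builds a tesseract command line with a character whitelist and an optional user word list.  The whitelist contents, the file names and the helper names are assumptions made for the example, not a tested clinical configuration, and how the whitelist and word list are honored can differ between Tesseract versions, so check the documentation for yours.

```python
import string
import subprocess

# Assumed whitelist: letters, digits and some basic punctuation.
# Characters such as '#', '$' and '|' are deliberately left out.
CLINICAL_WHITELIST = string.ascii_letters + string.digits + ".,:;()/%+-"


def build_tesseract_command(image_path, whitelist=CLINICAL_WHITELIST,
                            user_words=None, language="eng"):
    """Return the argument list for one tesseract call that prints text to stdout."""
    command = ["tesseract", image_path, "stdout", "-l", language]
    if user_words is not None:
        # Plain text file with one domain word per line (hypothetical file).
        command += ["--user-words", user_words]
    # -c sets a single config variable for this run.
    command += ["-c", "tessedit_char_whitelist=" + whitelist]
    return command


def run_ocr(image_path, **options):
    """Run tesseract (must be installed and on PATH) and return the recognized text."""
    result = subprocess.run(build_tesseract_command(image_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run_ocr("page-000.png", user_words="medical-words.txt") would OCR one page image with both settings applied; both file names are placeholders.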

Good luck,
Sean

-----Original Message-----
From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com] 
Sent: Monday, October 16, 2017 10:13 AM
To: dev@ctakes.apache.org
Subject: OCR engine used [EXTERNAL]

Hi All,

Can you guys name some of the OCR engines used for medical record text extraction from images? I am currently using Tesseract and seeing some text extraction quality issues.

Thanks,
Abilash Mathew
This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.

RE: OCR engine used [EXTERNAL]

Posted by Ab...@cognizant.com.

Sure Sean, will update you on the progress.

Regards,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Friday, October 20, 2017 6:41 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

If you do find a particular configuration that works well for clinical notes, could you please share it?  Somebody might be able to put together a preprocessor flow or even a ctakes collection reader to use it.

Thanks,
Sean

-----Original Message-----
From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com]
Sent: Friday, October 20, 2017 6:45 AM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Thank you Sean for sharing your experience. I will reach out to the Tesseract forum for further queries.

Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Thursday, October 19, 2017 6:52 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

I haven't evaluated a per-character (or per-word) accuracy, but things look fair (80-90%? per char) from spot-checks.  Obviously the accuracy is dependent upon the quality of the input, and sometimes you get a lousy 300dpi scan made from 200dpi printed fax ...

I do run the PDF through ImageMagick first, but that is more to get pagination than anything else, as sometimes an entire patient history is in one giant PDF.

I am not an expert on OCR or any particular tool or method.  There is a lot of Tesseract discussion on the web if you can find it.

Sean

-----Original Message-----
From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com]
Sent: Thursday, October 19, 2017 1:20 AM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Sean,

What accuracy do you get from OCR? We are at 60-70% accuracy.  Most of the documents are 200 DPI ones. Also, are you using any other software, like Matlab, for OCR pre- or post-processing?

Thanks,
Abilash Mathew

-----Original Message-----
From: Mathew, Abilash (Cognizant)
Sent: Monday, October 16, 2017 8:37 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Thanks Sean for the quick reply and for providing the valuable information.

Regards,
Abilash Mathew

RE: OCR engine used [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abilash Mathew,

If you do find a particular configuration that works well for clinical notes, could you please share it?  Somebody might be able to put together a preprocessor flow or even a ctakes collection reader to use it.

Thanks,
Sean

RE: OCR engine used [EXTERNAL]

Posted by Ab...@cognizant.com.

Thank you Sean for sharing your experience. I will reach out to the Tesseract forum for further queries.

Abilash Mathew

RE: OCR engine used [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abilash Mathew,

I haven't evaluated a per-character (or per-word) accuracy, but things look fair (80-90%? per char) from spot-checks.  Obviously the accuracy is dependent upon the quality of the input, and sometimes you get a lousy 300dpi scan made from 200dpi printed fax ...
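
For anyone who wants a number rather than a spot-check, one common way to score per-character accuracy is one minus the edit distance between a hand-corrected reference and the OCR output, divided by the reference length.  The sketch below is only an illustration of that idea; it is not how the figures quoted in this thread were obtained, and the function names are made up for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # drop a reference char
                               current[j - 1] + 1,       # extra char in OCR output
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Return 1 - distance / len(reference), floored at 0.0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    distance = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - distance / len(reference))
```

With a reference of "aspirin 81 mg" and OCR output "aspirin 8l mg" (digit one read as a lowercase L), the distance is 1 over 13 characters, so the accuracy is 12/13, roughly 92%.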

I do run the PDF through ImageMagick first, but that is more to get pagination than anything else, as sometimes an entire patient history is in one giant PDF.
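
The exact ImageMagick invocation is not given in this thread.  As a sketch under that caveat, the Python below builds a typical command that rasterizes a multi-page PDF into one numbered image per page; the 300 dpi density, the output pattern and the helper names are assumptions, and ImageMagick needs a PDF delegate such as Ghostscript installed to read PDFs.

```python
import subprocess


def build_split_command(pdf_path, output_pattern="page-%03d.png",
                        density=300, executable="convert"):
    """Return the ImageMagick argument list that writes one image per PDF page."""
    # -density has to come before the input file so it applies to rasterization.
    # The %03d in the output name is filled in with the page index.
    return [executable, "-density", str(density), pdf_path, output_pattern]


def split_pdf(pdf_path, **options):
    """Run ImageMagick (must be installed and on PATH) to split the PDF into pages."""
    subprocess.run(build_split_command(pdf_path, **options), check=True)
```

On ImageMagick 7 installs the executable is usually "magick" rather than "convert", which is why it is a parameter here.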

I am not an expert on OCR or any particular tool or method.  There is a lot of Tesseract discussion on the web if you can find it.

Sean

RE: OCR engine used [EXTERNAL]

Posted by Ab...@cognizant.com.

Yes, that is correct. But we are looking for non-cloud options.

Thanks,
Abilash Mathew

-----Original Message-----
From: Melvin Ma [mailto:ma.qianfan@gmail.com]
Sent: Thursday, October 19, 2017 11:03 AM
To: dev@ctakes.apache.org
Subject: Re: OCR engine used [EXTERNAL]

In the context of ctakes, I am not sure. But recently I have been using Google text recognition services, and the results (on printed text) are really good.
Maybe you could try that. Melvin

Re: OCR engine used [EXTERNAL]

Posted by Melvin Ma <ma...@gmail.com>.

In the context of cTAKES, I am not sure. But recently I have been using
Google's text recognition services, and the results (on printed text) are
really good. Maybe you could try that.

Melvin

On Wed, Oct 18, 2017 at 10:19 PM, <Ab...@cognizant.com> wrote:

> Sean,
>
> What accuracy do you get from OCR? We are at 60-70% accuracy.
> Most of the documents are 200 DPI. Also, are you using any other
> software, such as Matlab, for OCR pre- or post-processing?
>
> Thanks,
> Abilash Mathew
>
> -----Original Message-----
> From: Mathew, Abilash (Cognizant)
> Sent: Monday, October 16, 2017 8:37 PM
> To: dev@ctakes.apache.org
> Subject: RE: OCR engine used [EXTERNAL]
>
> Thanks, Sean, for the quick reply and the valuable information.
>
> Regards,
> Abilash Mathew
>
> -----Original Message-----
> From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
> Sent: Monday, October 16, 2017 8:17 PM
> To: dev@ctakes.apache.org
> Subject: RE: OCR engine used [EXTERNAL]
>
> Hi Abilash Mathew,
>
> I have only used Tesseract.  Unfortunately, no OCR is perfect.
> I am by no means an expert on Tesseract, but perhaps I can help to get you
> started ...
>
> There are tricks that you can use to get it to work better with medical
> notes (besides training on fonts).  Possibly the most effective is
> restricting output to a whitelist of desired characters: set
> tessedit_char_whitelist to a series of characters that doesn't include
> things like hash, dollar, bar ...  Another is to add a wordlist that
> contains words pertinent to your domain.
> See:
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
> https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
> https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
> https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html
>
> Good luck,
> Sean
>
> -----Original Message-----
> From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com]
> Sent: Monday, October 16, 2017 10:13 AM
> To: dev@ctakes.apache.org
> Subject: OCR engine used [EXTERNAL]
>
> Hi All,
>
> Can you suggest some of the OCR engines used for medical record text
> extraction from images? I am currently using Tesseract and seeing some
> text extraction quality issues.
>
> Thanks,
> Abilash Mathew
>

RE: OCR engine used [EXTERNAL]

Posted by Ab...@cognizant.com.

Sean,

What accuracy do you get from OCR? We are at 60-70% accuracy. Most of the documents are 200 DPI. Also, are you using any other software, such as Matlab, for OCR pre- or post-processing?

Thanks,
Abilash Mathew

-----Original Message-----
From: Mathew, Abilash (Cognizant)
Sent: Monday, October 16, 2017 8:37 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Thanks, Sean, for the quick reply and the valuable information.

Regards,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Monday, October 16, 2017 8:17 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

I have only used Tesseract.  Unfortunately, no OCR is perfect.
I am by no means an expert on Tesseract, but perhaps I can help to get you started ...

There are tricks that you can use to get it to work better with medical notes (besides training on fonts).  Possibly the most effective is restricting output to a whitelist of desired characters: set tessedit_char_whitelist to a series of characters that doesn't include things like hash, dollar, bar ...  Another is to add a wordlist that contains words pertinent to your domain.  See:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html
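
As a rough illustration of the two tricks above, here is a minimal Python sketch that only assembles a Tesseract command line; it does not run OCR itself. The helper names (build_whitelist, build_tesseract_command), the file names, and the particular punctuation kept in the whitelist are assumptions made for this example. The -c and --user-words options are taken from the Tesseract command-line documentation, but whether the whitelist and word list are honored may depend on the Tesseract version and engine mode, so please check against your own install.

```python
import shlex
import string


def build_whitelist(extra_punctuation=".,:;()/%+-"):
    """Return letters, digits and a little punctuation.

    The punctuation set is an assumption for this sketch; it leaves out
    characters such as '#', '$' and '|' that rarely belong in clinical notes.
    """
    return string.ascii_letters + string.digits + extra_punctuation


def build_tesseract_command(image_path, output_base, whitelist,
                            user_words_path=None, lang="eng"):
    """Assemble (but do not run) a tesseract command as an argument list."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if user_words_path is not None:
        # Plain-text file with one domain word per line (hypothetical path).
        cmd += ["--user-words", user_words_path]
    # Restrict recognition to the whitelisted characters.
    cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_tesseract_command(
        "note.png", "note", build_whitelist(), "medical_words.txt")
    # Print a shell-safe version that could be pasted into a terminal.
    print(" ".join(shlex.quote(part) for part in command))
```

Passing the returned list to subprocess.run would execute it, assuming the tesseract binary is on the PATH.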

Good luck,
Sean

-----Original Message-----
From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com]
Sent: Monday, October 16, 2017 10:13 AM
To: dev@ctakes.apache.org
Subject: OCR engine used [EXTERNAL]

Hi All,

Can you suggest some of the OCR engines used for medical record text extraction from images? I am currently using Tesseract and seeing some text extraction quality issues.

Thanks,
Abilash Mathew

RE: OCR engine used [EXTERNAL]

Posted by Ab...@cognizant.com.

Thanks, Sean, for the quick reply and the valuable information.

Regards,
Abilash Mathew

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu]
Sent: Monday, October 16, 2017 8:17 PM
To: dev@ctakes.apache.org
Subject: RE: OCR engine used [EXTERNAL]

Hi Abilash Mathew,

I have only used Tesseract.  Unfortunately, no OCR is perfect.
I am by no means an expert on Tesseract, but perhaps I can help to get you started ...

There are tricks that you can use to get it to work better with medical notes (besides training on fonts).  Possibly the most effective is restricting output to a whitelist of desired characters: set tessedit_char_whitelist to a series of characters that doesn't include things like hash, dollar, bar ...  Another is to add a wordlist that contains words pertinent to your domain.  See:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html

Good luck,
Sean

-----Original Message-----
From: Abilash.Mathew@cognizant.com [mailto:Abilash.Mathew@cognizant.com]
Sent: Monday, October 16, 2017 10:13 AM
To: dev@ctakes.apache.org
Subject: OCR engine used [EXTERNAL]

Hi All,

Can you suggest some of the OCR engines used for medical record text extraction from images? I am currently using Tesseract and seeing some text extraction quality issues.

Thanks,
Abilash Mathew