Posted to dev@tika.apache.org by "Sal (Jira)" <ji...@apache.org> on 2021/05/31 19:31:00 UTC
[jira] [Updated] (TIKA-3427) Duplicate characters in some words
[ https://issues.apache.org/jira/browse/TIKA-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sal updated TIKA-3427:
----------------------
Description:
When processing a PDF document to extract text using Tika Server, the output contains words with duplicated characters and partial words.
I send the PDF in a POST request to a Tika Server running locally at [http://localhost:9998/tika], with the PDF as the request body and the following headers:
Content-Type: application/pdf
X-Tika-PDFextractInlineImages: true
X-Tika-PDFOcrStrategy: ocr_and_text_extraction
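
For anyone trying to reproduce this, the request described above can be sketched in Python with only the standard library. This is an approximation of the setup, not the reporter's actual client: the helper names `build_tika_request` and `extract_text` are hypothetical, and the HTTP method is left as a parameter in case a given server version expects a different verb. The URL and header values are taken from the description.

```python
import urllib.request

# URL and header values are copied from the report; everything else
# (function names, defaults) is illustrative.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes,
                       url: str = TIKA_URL,
                       method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) the request described in the report."""
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    return urllib.request.Request(url, data=pdf_bytes,
                                  headers=headers, method=method)


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send the PDF to a running Tika Server and return the response body."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url=url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text` with the path to the attached PDF requires a Tika Server listening at the URL above; `build_tika_request` alone can be inspected without one.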
A PDF document is attached as an example.
The output looks like this; the incorrect text is shown in red:
{color:#de350b}*PPAATIENTTIENT*{color}
DISEASE Lung cancer (NOS)
NAME
DATE OF BIRTH
SEX Male
MEDICAL RECORD # Not given
{color:#de350b}*PHYPHYSICIANSICIAN*{color}
ORDERING PHYSICIAN
MEDICAL FACILITY
ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID
PATHOLOGIST Not Provided
{color:#de350b}*SPESPECIMENCIMEN*{color}
SPECIMEN ID
SPECIMEN TYPE Blood
DATE OF COLLECTION
SPECIMEN RECEIVED
MEDIAN EXON COVERAGE
Biomarker Findings
*{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
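
One way to characterize the red samples above: each reads as the intended text split into chunks, with every chunk emitted twice in a row. The sketch below checks that pattern. The intended strings ("PATIENT", "PHYSICIAN", "SPECIMEN", "MSI Status Undetermined.") are inferred from the garbled output and are not stated in the report, and `is_chunk_doubled` is a hypothetical helper name. It only describes the symptom and makes no claim about the cause.

```python
from functools import lru_cache


def is_chunk_doubled(garbled: str, expected: str) -> bool:
    """Return True if `garbled` equals `expected` cut into chunks,
    with each chunk written twice in a row (c1 c1 c2 c2 ...)."""
    if len(garbled) != 2 * len(expected):
        return False

    @lru_cache(maxsize=None)
    def match(i: int) -> bool:
        # `i` characters of `expected` are consumed, so 2*i of `garbled` are.
        if i == len(expected):
            return True
        j = 2 * i
        for n in range(1, len(expected) - i + 1):
            chunk = expected[i:i + n]
            if (garbled[j:j + n] == chunk
                    and garbled[j + n:j + 2 * n] == chunk
                    and match(i + n)):
                return True
        return False

    return match(0)


# Intended strings are assumptions inferred from the samples in the report.
samples = {
    "PPAATIENTTIENT": "PATIENT",
    "PHYPHYSICIANSICIAN": "PHYSICIAN",
    "SPESPECIMENCIMEN": "SPECIMEN",
    "MSI SMSI Statatus Undettus Undetermined.ermined.": "MSI Status Undetermined.",
}
for garbled, expected in samples.items():
    print(repr(garbled), is_chunk_doubled(garbled, expected))
```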
was:
When processing a PDF document to extract text using Tika Server, the output contains words with duplicated characters and partial words.
I send the PDF in a POST request to a Tika Server running locally at [http://localhost:9998/tika], with the PDF as the request body and the following headers:
Content-Type: application/pdf
X-Tika-PDFextractInlineImages: true
X-Tika-PDFOcrStrategy: ocr_and_text_extraction
A PDF document is attached as an example.
The output looks like this:
{color:#de350b}*PPAATIENTTIENT*{color}
DISEASE Lung cancer (NOS)
NAME
DATE OF BIRTH
SEX Male
MEDICAL RECORD # Not given
{color:#de350b}*PHYPHYSICIANSICIAN*{color}
ORDERING PHYSICIAN
MEDICAL FACILITY
ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID
PATHOLOGIST Not Provided
{color:#de350b}*SPESPECIMENCIMEN*{color}
SPECIMEN ID
SPECIMEN TYPE Blood
DATE OF COLLECTION
SPECIMEN RECEIVED
MEDIAN EXON COVERAGE
Biomarker Findings
*{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
> Duplicate characters in some words
> ----------------------------------
>
> Key: TIKA-3427
> URL: https://issues.apache.org/jira/browse/TIKA-3427
> Project: Tika
> Issue Type: Bug
> Components: tika-server
> Affects Versions: 1.26
> Environment: Windows 10 x64
> Reporter: Sal
> Priority: Major
> Attachments: 1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf
>
>
> When processing a PDF document to extract text using Tika Server, the output contains words with duplicated characters and partial words.
> I send the PDF in a POST request to a Tika Server running locally at [http://localhost:9998/tika], with the PDF as the request body and the following headers:
> Content-Type: application/pdf
> X-Tika-PDFextractInlineImages: true
> X-Tika-PDFOcrStrategy: ocr_and_text_extraction
> A PDF document is attached as an example.
> The output looks like this; the incorrect text is shown in red:
>
> {color:#de350b}*PPAATIENTTIENT*{color}
> DISEASE Lung cancer (NOS)
> NAME
> DATE OF BIRTH
> SEX Male
> MEDICAL RECORD # Not given
> {color:#de350b}*PHYPHYSICIANSICIAN*{color}
> ORDERING PHYSICIAN
> MEDICAL FACILITY
> ADDITIONAL RECIPIENT None
> MEDICAL FACILITY ID
> PATHOLOGIST Not Provided
> {color:#de350b}*SPESPECIMENCIMEN*{color}
> SPECIMEN ID
> SPECIMEN TYPE Blood
> DATE OF COLLECTION
> SPECIMEN RECEIVED
> MEDIAN EXON COVERAGE
> Biomarker Findings
> *{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
--
This message was sent by Atlassian Jira
(v8.3.4#803005)