Posted to dev@tika.apache.org by "Sal (Jira)" <ji...@apache.org> on 2021/05/31 19:31:00 UTC
[jira] [Updated] (TIKA-3427) Duplicate characters in some words

     [ https://issues.apache.org/jira/browse/TIKA-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sal updated TIKA-3427:
----------------------
    Description: 
When a PDF document is processed with Tika Server to extract text, the output contains words with duplicated characters and partial words.

I send the PDF in a POST request to the Tika Server running locally at [http://localhost:9998/tika], with the PDF as the body of the request and the following headers:

Content-Type: application/pdf

X-Tika-PDFextractInlineImages: true

X-Tika-PDFOcrStrategy: ocr_and_text_extraction
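
For anyone trying to reproduce this, the request described above can be sketched in Python roughly as follows. The URL and the three headers are taken from this report; the function names, the returned dictionary layout, and the file path in the usage note are assumptions for illustration only. The report uses POST, while the Tika Server documentation examples use PUT for this endpoint, so the method is left as a parameter. `http.client` is used because it sends header names exactly as written.

```python
import http.client
from urllib.parse import urlsplit

# URL and headers as given in the report.
TIKA_URL = "http://localhost:9998/tika"
REPORT_HEADERS = {
    "Content-Type": "application/pdf",
    "X-Tika-PDFextractInlineImages": "true",
    "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
}


def build_request(pdf_bytes, method="POST"):
    """Describe the request without sending it (hypothetical helper)."""
    parts = urlsplit(TIKA_URL)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "method": method,
        "path": parts.path,
        "headers": dict(REPORT_HEADERS),
        "body": pdf_bytes,
    }


def extract_text(pdf_path, method="POST"):
    """Send the PDF to a running Tika Server; return (status, response text)."""
    with open(pdf_path, "rb") as handle:
        req = build_request(handle.read(), method)
    conn = http.client.HTTPConnection(req["host"], req["port"])
    try:
        conn.request(req["method"], req["path"],
                     body=req["body"], headers=req["headers"])
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

With a Tika Server listening on port 9998, `extract_text("report.pdf")` (the path is a placeholder) would return the HTTP status and the extracted text; passing `method="PUT"` tries the other verb.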

A PDF document is attached as an example.

The output looks like this; the incorrect text is shown in red:

 

{color:#de350b}*PPAATIENTTIENT*{color}

DISEASE Lung cancer (NOS) 
 NAME 
 DATE OF BIRTH 
 SEX Male
 MEDICAL RECORD # Not given

{color:#de350b}*PHYPHYSICIANSICIAN*{color}

ORDERING PHYSICIAN 
 MEDICAL FACILITY 
 ADDITIONAL RECIPIENT None
 MEDICAL FACILITY ID 
 PATHOLOGIST Not Provided

{color:#de350b}*SPESPECIMENCIMEN*{color}

SPECIMEN ID 
 SPECIMEN TYPE Blood
 DATE OF COLLECTION 
 SPECIMEN RECEIVED 
 MEDIAN EXON COVERAGE

Biomarker Findings
 *{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
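
The red strings above share a regular shape: each one can be split into two complete copies of the expected text whose characters are interleaved (for example, PPAATIENTTIENT reads as P, PA, ATIENT, TIENT). The Python sketch below checks that property. The function name and the expected strings are assumptions made for illustration, not something Tika reports, and the check only describes the symptom; it does not identify the cause.

```python
def is_double_interleaving(garbled, expected):
    """Return True if `garbled` is two copies of `expected` interleaved.

    Tracks every pair (i, j) = number of characters already consumed
    from the first and second copy after reading each character.
    """
    n = len(expected)
    if len(garbled) != 2 * n:
        return False
    reachable = {(0, 0)}
    for ch in garbled:
        following = set()
        for i, j in reachable:
            if i < n and expected[i] == ch:
                following.add((i + 1, j))
            if j < n and expected[j] == ch:
                following.add((i, j + 1))
        if not following:
            return False
        reachable = following
    return (n, n) in reachable


# Garbled strings from the output above, paired with the assumed intended text.
SAMPLES = {
    "PPAATIENTTIENT": "PATIENT",
    "PHYPHYSICIANSICIAN": "PHYSICIAN",
    "SPESPECIMENCIMEN": "SPECIMEN",
    "MSI SMSI Statatus Undettus Undetermined.ermined.": "MSI Status Undetermined.",
}

if __name__ == "__main__":
    for garbled, expected in SAMPLES.items():
        print(garbled, "->", is_double_interleaving(garbled, expected))
```

All four red strings from this report satisfy the check, which suggests each affected run of text is being emitted twice rather than corrupted at random.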

  was:
When processing PDF document to extract text using Tika Server, the output contains words with some duplicated characters and partial words.  

I am sending the PDF using a POST request to the Tika Server running locally at url [http://localhost:9998/tika] with the PDF attached to the body of the message and headers 

Content-Type : application/pdf

X-Tika-PDFextractInlineImages : true

X-Tika-PDFOcrStrategy: ocr_and_text_extraction

An attached PDF document  is provided as an example

The output looks like this



{color:#de350b}*PPAATIENTTIENT*{color}

DISEASE Lung cancer (NOS) 
NAME 
DATE OF BIRTH 
 SEX Male
MEDICAL RECORD # Not given

{color:#de350b}*PHYPHYSICIANSICIAN*{color}

ORDERING PHYSICIAN 
MEDICAL FACILITY 
 ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID 
PATHOLOGIST Not Provided

{color:#de350b}*SPESPECIMENCIMEN*{color}

SPECIMEN ID 
SPECIMEN TYPE Blood
DATE OF COLLECTION 
 SPECIMEN RECEIVED 
 MEDIAN EXON COVERAGE

Biomarker Findings
*{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*


> Duplicate characters in some words
> ----------------------------------
>
>                 Key: TIKA-3427
>                 URL: https://issues.apache.org/jira/browse/TIKA-3427
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>    Affects Versions: 1.26
>         Environment: Windows 10 x64
>            Reporter: Sal
>            Priority: Major
>         Attachments: 1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)