Posted to dev@tika.apache.org by "Sal (Jira)" <ji...@apache.org> on 2021/05/31 19:29:00 UTC

[jira] [Created] (TIKA-3427) Duplicate characters in some words

Sal created TIKA-3427:
-------------------------

             Summary: Duplicate characters in some words
                 Key: TIKA-3427
                 URL: https://issues.apache.org/jira/browse/TIKA-3427
             Project: Tika
          Issue Type: Bug
          Components: tika-server
    Affects Versions: 1.26
         Environment: Windows 10 x64
            Reporter: Sal
         Attachments: 1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf

When processing a PDF document to extract text with Tika Server, the output contains words with duplicated characters and partial words.

I am sending the PDF in a POST request to the Tika Server running locally at [http://localhost:9998/tika], with the PDF as the body of the request and the following headers:

Content-Type: application/pdf

X-Tika-PDFextractInlineImages: true

X-Tika-PDFOcrStrategy: ocr_and_text_extraction

The attached PDF document is provided as an example.
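
For anyone trying to reproduce this, the request described above could be sent with something like the following Python sketch. It uses only the standard library. The URL and the three headers are the ones quoted in this report; the function names build_tika_request and extract_text are hypothetical, and decoding the response as UTF-8 is an assumption.

```python
import urllib.request

# Endpoint quoted in the report (Tika Server running locally).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request described in the report."""
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    # The PDF bytes are the raw request body, not a multipart form.
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send the PDF at pdf_path to the server and return the response body."""
    with open(pdf_path, "rb") as pdf_file:
        request = build_tika_request(pdf_file.read(), url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

Calling extract_text("1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf") against a running server should return the extracted text for the attached sample.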

The output looks like this:



{color:#de350b}*PPAATIENTTIENT*{color}

DISEASE Lung cancer (NOS) 
NAME 
DATE OF BIRTH 
 SEX Male
MEDICAL RECORD # Not given

{color:#de350b}*PHYPHYSICIANSICIAN*{color}

ORDERING PHYSICIAN 
MEDICAL FACILITY 
 ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID 
PATHOLOGIST Not Provided

{color:#de350b}*SPESPECIMENCIMEN*{color}

SPECIMEN ID 
SPECIMEN TYPE Blood
DATE OF COLLECTION 
 SPECIMEN RECEIVED 
 MEDIAN EXON COVERAGE

Biomarker Findings
*{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
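
One way to read the highlighted strings is that each one is a sequence of short chunks written twice in a row, for example "P", "P", "A", "A", "TIENT", "TIENT". This is only an observation about the sample output above, not a diagnosis of the cause. The Python sketch below checks that reading. The helper name collapse_doubled_chunks is hypothetical and is not part of Tika.

```python
from functools import lru_cache
from typing import Optional


def collapse_doubled_chunks(text: str) -> Optional[str]:
    """Undo chunk doubling.

    Returns the original string if `text` can be split into consecutive
    chunks that each appear twice in a row, otherwise None.
    """

    @lru_cache(maxsize=None)
    def solve(start: int) -> Optional[str]:
        if start == len(text):
            return ""
        remaining = len(text) - start
        # Try every chunk size whose doubled form fits in the remaining text.
        for size in range(1, remaining // 2 + 1):
            chunk = text[start:start + size]
            if text[start + size:start + 2 * size] == chunk:
                rest = solve(start + 2 * size)
                if rest is not None:
                    return chunk + rest
        return None

    return solve(0)


if __name__ == "__main__":
    for sample in ("PPAATIENTTIENT", "PHYPHYSICIANSICIAN", "SPESPECIMENCIMEN"):
        print(sample, "->", collapse_doubled_chunks(sample))
```

Run against the four highlighted strings, it returns PATIENT, PHYSICIAN, SPECIMEN and "MSI Status Undetermined.", which matches the text expected from the report.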



--
This message was sent by Atlassian Jira
(v8.3.4#803005)