Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/06/01 03:23:00 UTC

[jira] [Commented] (TIKA-3427) Duplicate characters in some words

    [ https://issues.apache.org/jira/browse/TIKA-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354770#comment-17354770 ] 

Tilman Hausherr commented on TIKA-3427:
---------------------------------------

Try setting the {{suppressDuplicateOverlappingText}} option.
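
For tika-server, one way this might be done is with a request header, in the same style as the {{X-Tika-PDFextractInlineImages}} header used in the report below. The following is a minimal sketch, not a confirmed recipe: the header name {{X-Tika-PDFsuppressDuplicateOverlappingText}} is an assumption formed by analogy with the reporter's headers, and the helper names ({{build_headers}}, {{extract_text}}) are hypothetical. The HTTP method defaults to POST only because the report uses it; Tika Server examples usually show PUT for /tika, so it is left as a parameter.

```python
import http.client

# Headers exactly as listed in the original report.
REPORTED_HEADERS = {
    "Content-Type": "application/pdf",
    "X-Tika-PDFextractInlineImages": "true",
    "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
}

# ASSUMPTION: header name built by analogy with X-Tika-PDFextractInlineImages;
# verify it against the Tika Server version in use.
SUPPRESS_HEADER = "X-Tika-PDFsuppressDuplicateOverlappingText"


def build_headers(suppress_duplicates=True):
    """Return a copy of the reported headers, optionally adding the suggested option."""
    headers = dict(REPORTED_HEADERS)
    if suppress_duplicates:
        headers[SUPPRESS_HEADER] = "true"
    return headers


def extract_text(pdf_bytes, host="localhost", port=9998, path="/tika", method="POST"):
    """Send the PDF to a locally running Tika Server and return the response body as text."""
    # http.client sends header names with the case given here.
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request(method, path, body=pdf_bytes, headers=build_headers())
        response = conn.getresponse()
        return response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

Calling {{extract_text(open("report.pdf", "rb").read())}} against a running server would then show whether headings such as PATIENT still come out doubled; {{report.pdf}} is a placeholder file name.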

> Duplicate characters in some words
> ----------------------------------
>
>                 Key: TIKA-3427
>                 URL: https://issues.apache.org/jira/browse/TIKA-3427
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-server
>    Affects Versions: 1.26
>         Environment: Windows 10 x64
>            Reporter: Sal
>            Priority: Minor
>         Attachments: 1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf
>
>
> When processing a PDF document to extract text using Tika Server, the output contains words with some duplicated characters and partial words.
> I am sending the PDF in a POST request to the Tika Server running locally at the URL [http://localhost:9998/tika], with the PDF in the body of the message and the following headers:
> Content-Type: application/pdf
> X-Tika-PDFextractInlineImages: true
> X-Tika-PDFOcrStrategy: ocr_and_text_extraction
> A PDF document is attached as an example.
> The output looks like this; the incorrect text is shown in red:
>  
> {color:#de350b}*PPAATIENTTIENT*{color}
> DISEASE Lung cancer (NOS) 
>  NAME 
>  DATE OF BIRTH 
>  SEX Male
>  MEDICAL RECORD # Not given
> {color:#de350b}*PHYPHYSICIANSICIAN*{color}
> ORDERING PHYSICIAN 
>  MEDICAL FACILITY 
>  ADDITIONAL RECIPIENT None
>  MEDICAL FACILITY ID 
>  PATHOLOGIST Not Provided
> {color:#de350b}*SPESPECIMENCIMEN*{color}
> SPECIMEN ID 
>  SPECIMEN TYPE Blood
>  DATE OF COLLECTION 
>  SPECIMEN RECEIVED 
>  MEDIAN EXON COVERAGE
> Biomarker Findings
>  *{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)