Posted to dev@tika.apache.org by "Sal (Jira)" <ji...@apache.org> on 2021/05/31 19:29:00 UTC
[jira] [Created] (TIKA-3427) Duplicate characters in some words
Sal created TIKA-3427:
-------------------------
Summary: Duplicate characters in some words
Key: TIKA-3427
URL: https://issues.apache.org/jira/browse/TIKA-3427
Project: Tika
Issue Type: Bug
Components: tika-server
Affects Versions: 1.26
Environment: Windows 10 x64
Reporter: Sal
Attachments: 1_PDFsam_FoundationOne_Liquid_Sample_Report.pdf
When processing a PDF document to extract text using Tika Server, the output contains words with duplicated characters and partial words.
I am sending the PDF in a POST request to a Tika Server running locally at [http://localhost:9998/tika], with the PDF as the body of the request and the following headers:
Content-Type : application/pdf
X-Tika-PDFextractInlineImages : true
X-Tika-PDFOcrStrategy: ocr_and_text_extraction
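For anyone trying to reproduce this, the request described above could be sketched roughly as follows. This is only a minimal sketch using the Python standard library, not the client actually used for this report. The URL, the headers, and the POST method are taken from the description. The function names build_tika_request and extract_text are hypothetical, and the method is a parameter in case your server setup expects a different verb.

```python
import urllib.request

# Endpoint as given in the report (Tika Server running locally).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes,
                       url: str = TIKA_URL,
                       method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) the request described in the report.

    Hypothetical helper: the header names and values are copied from the
    report, everything else is an assumption made for illustration.
    """
    headers = {
        "Content-Type": "application/pdf",
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    }
    return urllib.request.Request(url, data=pdf_bytes,
                                  headers=headers, method=method)


def extract_text(pdf_path: str) -> str:
    """Send the PDF at pdf_path to the server and return the response text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text with the path to the attached PDF requires the server to be running. build_tika_request alone can be used to inspect the headers without any network access.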
The attached PDF document is provided as an example.
The output looks like this:
{color:#de350b}*PPAATIENTTIENT*{color}
DISEASE Lung cancer (NOS)
NAME
DATE OF BIRTH
SEX Male
MEDICAL RECORD # Not given
{color:#de350b}*PHYPHYSICIANSICIAN*{color}
ORDERING PHYSICIAN
MEDICAL FACILITY
ADDITIONAL RECIPIENT None
MEDICAL FACILITY ID
PATHOLOGIST Not Provided
{color:#de350b}*SPESPECIMENCIMEN*{color}
SPECIMEN ID
SPECIMEN TYPE Blood
DATE OF COLLECTION
SPECIMEN RECEIVED
MEDIAN EXON COVERAGE
Biomarker Findings
*{color:#de350b}MSI SMSI Statatus Undettus Undetermined.ermined.{color}*
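The highlighted strings above all look like two copies of the expected text interleaved with each other (for example, PPAATIENTTIENT can be split into PATIENT and PATIENT). The sketch below is one possible way to flag that pattern in a test. It is only an observation about the symptom under that assumption, not a statement about its cause, and the function names are hypothetical.

```python
def is_interleaving(s: str, a: str, b: str) -> bool:
    """Return True if s can be formed by interleaving a and b in order."""
    if len(s) != len(a) + len(b):
        return False
    # reachable[j] is True when s[:i + j] is an interleaving of a[:i] and b[:j]
    reachable = [False] * (len(b) + 1)
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                reachable[j] = True
                continue
            # Either the last character came from a (previous row, same j) ...
            from_a = i > 0 and reachable[j] and a[i - 1] == s[i + j - 1]
            # ... or from b (same row, previous j).
            from_b = j > 0 and reachable[j - 1] and b[j - 1] == s[i + j - 1]
            reachable[j] = from_a or from_b
    return reachable[len(b)]


def looks_doubled(observed: str, expected: str) -> bool:
    """Return True if observed is two interleaved copies of expected."""
    return is_interleaving(observed, expected, expected)


# Headings from the output above, paired with the text that was expected.
samples = [
    ("PPAATIENTTIENT", "PATIENT"),
    ("PHYPHYSICIANSICIAN", "PHYSICIAN"),
    ("SPESPECIMENCIMEN", "SPECIMEN"),
]
results = [looks_doubled(observed, expected) for observed, expected in samples]
```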
--
This message was sent by Atlassian Jira
(v8.3.4#803005)