Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2022/02/22 08:00:00 UTC
[jira] [Comment Edited] (TIKA-3682) PDFParser is extracting each char of a word in a new line
[ https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495936#comment-17495936 ]
Tilman Hausherr edited comment on TIKA-3682 at 2/22/22, 7:59 AM:
-----------------------------------------------------------------
Please share the PDF. This should work since TIKA-2779, but maybe the sorting interferes with the rotation detection.
was (Author: tilman):
Please share the PDF, this should work since TIKA-2779.
> PDFParser is extracting each char of a word in a new line
> ---------------------------------------------------------
>
> Key: TIKA-3682
> URL: https://issues.apache.org/jira/browse/TIKA-3682
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.26, 2.3.0
> Reporter: Sree Harsha
> Priority: Major
> Attachments: image-2022-02-22-13-14-14-067.png
>
>
> When the PDF parser extracts text from a PDF document in which some text has a different orientation, each character of a word is extracted onto a new line.
> For example, the text is extracted like this:
> TO
> P
> LA
> C
> E
> A
> N
> O
> R
> D
> E
> R
> where the original text looks like this:
> !image-2022-02-22-13-14-14-067.png!
> setExtractBookmarksText(false);
> getPDFParserConfig().setEnableAutoSpace(true);
>
> After adding the options below:
> setSortByPosition(true);
> setSuppressDuplicateOverlappingText(true);
> setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>
> The text is extracted like:
> TO PLACE xxxxxxx
> yyyyyyy AN ORDER
>
> where xxxxxxx and yyyyyyy refer to other text at the same level in the PDF document.
> If I search for TO PLACE AN ORDER in Acrobat Reader, it works, but if I search for the same text in the extracted text content, it does not.
> Is there any option to exclude the unnecessary newline characters shown in the first example and also to solve the side effect of sorting by position?
> The output should look like:
> TO PLACE AN ORDER
> xxxxxx yyyyyyyy
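
As an illustration of the search problem described above, here is a minimal sketch of a whitespace-insensitive search over already-extracted text. It is not a Tika feature and not a fix for the extraction itself. The class and method names (WhitespaceInsensitiveSearch, stripWhitespace, containsIgnoringWhitespace) are hypothetical, and the sample strings are copied from the two outputs in the report.

```java
import java.util.regex.Pattern;

public class WhitespaceInsensitiveSearch {

    // Matches any run of whitespace, including spaces, tabs and line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes all whitespace from the input.
    static String stripWhitespace(String text) {
        return WHITESPACE.matcher(text).replaceAll("");
    }

    // Returns true if the phrase occurs in the text once whitespace
    // is ignored in both the text and the phrase.
    static boolean containsIgnoringWhitespace(String text, String phrase) {
        String needle = stripWhitespace(phrase);
        return !needle.isEmpty() && stripWhitespace(text).contains(needle);
    }

    public static void main(String[] args) {
        // First output from the report: characters split across lines.
        String fragmented = "TO\nP\nLA\nC\nE\nA\nN\nO\nR\nD\nE\nR\n";
        // Second output from the report: phrase interleaved with other text.
        String interleaved = "TO PLACE xxxxxxx\nyyyyyyy AN ORDER\n";

        System.out.println(containsIgnoringWhitespace(fragmented, "TO PLACE AN ORDER"));  // true
        System.out.println(containsIgnoringWhitespace(interleaved, "TO PLACE AN ORDER")); // false
    }
}
```

The sketch only helps with the first output, where the characters are in the right order but split by line breaks. It cannot help with the sorted output, where unrelated text sits between the two halves of the phrase. Ignoring whitespace also loses word boundaries, so it can report matches that span unrelated words.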
--
This message was sent by Atlassian Jira
(v8.20.1#820001)