Posted to dev@tika.apache.org by "Sree Harsha (Jira)" <ji...@apache.org> on 2022/03/03 14:30:00 UTC

[jira] [Commented] (TIKA-3682) PDFParser is extracting each char of a word in a new line

    [ https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500792#comment-17500792 ] 

Sree Harsha commented on TIKA-3682:
-----------------------------------

Apologies for the delayed response.

I am creating a sample file and will upload it in a couple of days.

> PDFParser is extracting each char of a word in a new line
> ---------------------------------------------------------
>
>                 Key: TIKA-3682
>                 URL: https://issues.apache.org/jira/browse/TIKA-3682
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.26, 2.3.0
>            Reporter: Sree Harsha
>            Priority: Major
>         Attachments: image-2022-02-22-13-14-14-067.png
>
>
> When the PDF parser extracts text from a PDF document in which the text has a different orientation, each character of a word is extracted onto a new line.
> For example, the text is extracted like this:
> TO
>  P
> LA
> C
> E
> A
> N
>  O
> R
> D
> E
> R
> whereas the original text looks like this:
> !image-2022-02-22-13-14-14-067.png!
> setExtractBookmarksText(false);
> getPDFParserConfig().setEnableAutoSpace(true);
>  
> After adding the options below:
> setSortByPosition(true);
> setSuppressDuplicateOverlappingText(true);
> setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>  
> The text is extracted like:
> TO PLACE xxxxxxx
> yyyyyyy AN ORDER
>  
> where xxxxxxx and yyyyyyy stand for other text at the same level in the PDF document.
> If I search for TO PLACE AN ORDER in Acrobat Reader, the text is found, but the same search in the extracted text content fails.
> Is there any option to exclude the unnecessary newline characters shown in the first example, and also to avoid the side effect of sorting by position?
> The output should look like:
> TO PLACE AN ORDER
> xxxxxx yyyyyyyy
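
Until the parser output itself is fixed, one possible workaround for the search problem described above is to compare the extracted text and the query with all whitespace removed. The sketch below is not a Tika option or API; the class and method names (WhitespaceInsensitiveSearch, containsIgnoringWhitespace) are hypothetical and use only the Java standard library. It only addresses the first example, where stray newlines split the words.

```java
// Hypothetical helper, not part of Tika: searches extracted text while
// ignoring all whitespace (spaces, tabs, newlines) in both text and query.
final class WhitespaceInsensitiveSearch {

    private WhitespaceInsensitiveSearch() {}

    // Removes every run of whitespace characters from the input.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Returns true if the query occurs in the text once whitespace is
    // removed from both. A query with no non-whitespace characters never matches.
    static boolean containsIgnoringWhitespace(String text, String query) {
        String compactQuery = stripWhitespace(query);
        return !compactQuery.isEmpty()
                && stripWhitespace(text).contains(compactQuery);
    }

    public static void main(String[] args) {
        // Extracted text from the first example in the report.
        String perCharLines = "TO\n P\nLA\nC\nE\nA\nN\n O\nR\nD\nE\nR";
        // Extracted text after enabling sort by position.
        String sorted = "TO PLACE xxxxxxx\nyyyyyyy AN ORDER";

        System.out.println(containsIgnoringWhitespace(perCharLines, "TO PLACE AN ORDER"));
        System.out.println(containsIgnoringWhitespace(sorted, "TO PLACE AN ORDER"));
    }
}
```

Two limits of this approach: it can produce false positives across word boundaries (for example, "TOP LACE" would match "TO PLACE"), and it does not help with the sort-by-position output, where other text is interleaved between "TO PLACE" and "AN ORDER".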



--
This message was sent by Atlassian Jira
(v8.20.1#820001)