Posted to dev@tika.apache.org by "Jeremy Naylor (Jira)" <ji...@apache.org> on 2022/03/15 16:16:00 UTC

[jira] [Updated] (TIKA-3682) PDFParser is extracting each char of a word in a new line

     [ https://issues.apache.org/jira/browse/TIKA-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Naylor updated TIKA-3682:
--------------------------------
    Attachment: 00000000_BN_TextSearch.pdf
                00000000_CX_TextSearch.pdf

> PDFParser is extracting each char of a word in a new line
> ---------------------------------------------------------
>
>                 Key: TIKA-3682
>                 URL: https://issues.apache.org/jira/browse/TIKA-3682
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.26, 2.3.0
>            Reporter: Sree Harsha
>            Priority: Major
>         Attachments: 00000000_BN_TextSearch.pdf, 00000000_CX_TextSearch.pdf, image-2022-02-22-13-14-14-067.png
>
>
> When the PDF parser extracts text from a PDF document that contains text in a different orientation, each character of a word is extracted onto a new line.
> For example, the text is extracted like this:
> TO
>  P
> LA
> C
> E
> A
> N
>  O
> R
> D
> E
> R
> whereas the original text looks like this:
> !image-2022-02-22-13-14-14-067.png!
> setExtractBookmarksText(false);
> getPDFParserConfig().setEnableAutoSpace(true);
>  
> After adding the options below:
> setSortByPosition(true);
> setSuppressDuplicateOverlappingText(true);
> setOcrStrategy(OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>  
> The text is extracted like:
> TO PLACE xxxxxxx
> yyyyyyy AN ORDER
>  
> where xxxxxx and yyyyyyy refer to other text at the same level in the PDF document.
> If I search for TO PLACE AN ORDER in Acrobat Reader, it is found, but if I search for the same text in the extracted text content, it is not.
> Is there an option to exclude the unnecessary newline characters shown in the first example, and also to resolve the sort-by-position side effect shown in the second?
> The output should look like:
> TO PLACE AN ORDER
> xxxxxx yyyyyyyy
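
For illustration only, here is a minimal sketch of a search-side workaround for the first symptom: comparing the query and the extracted text with all whitespace removed. This is plain Java and not a Tika or PDFBox feature; the class and method names (WhitespaceInsensitiveSearch, stripWhitespace, containsIgnoringWhitespace) are hypothetical. It assumes the extracted characters are at least in reading order, which holds for the first example but not for the sorted output in the second.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: searches extracted text while
// ignoring all whitespace, including the stray line breaks between characters.
public class WhitespaceInsensitiveSearch {

    // Matches any run of whitespace, including line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes all whitespace so that only the visible characters remain. */
    public static String stripWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll("");
    }

    /** Returns true if query occurs in text once whitespace is ignored on both sides. */
    public static boolean containsIgnoringWhitespace(String text, String query) {
        String needle = stripWhitespace(query);
        return !needle.isEmpty() && stripWhitespace(text).contains(needle);
    }

    public static void main(String[] args) {
        // Text as extracted in the first example of the report.
        String extracted = "TO\n P\nLA\nC\nE\nA\nN\n O\nR\nD\nE\nR";

        // A literal search fails because of the inserted line breaks.
        System.out.println(extracted.contains("TO PLACE AN ORDER"));

        // The whitespace-insensitive search sees TOPLACEANORDER on both sides.
        System.out.println(containsIgnoringWhitespace(extracted, "TO PLACE AN ORDER"));
    }
}
```

Ignoring whitespace can also match across word boundaries, so this trades precision for recall. It does not help with the second output, where unrelated text is interleaved between TO PLACE and AN ORDER; that case still needs to be addressed during extraction.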



--
This message was sent by Atlassian Jira
(v8.20.1#820001)