Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/08/17 17:20:00 UTC

[jira] [Commented] (TIKA-3170) PDF extraction space issue

    [ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179139#comment-17179139 ] 

Tilman Hausherr commented on TIKA-3170:
---------------------------------------

This happens because the glyphs are spaced so far apart. You and I, as humans, can tell that this is a design trick, but to the text extraction algorithm the gaps look like word boundaries.

You could try adjusting spacingTolerance and averageCharTolerance, but there is a risk that other files, or other parts of this document, will no longer be extracted the way you expect.
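
To illustrate the trade-off, here is a minimal, self-contained Java sketch of a gap-based word splitter. It is a deliberately simplified model and not the actual PDFBox or Tika algorithm: the class WordGapSketch, the Glyph type, the layout helper and the single "tolerance" factor are all hypothetical names invented for this example. It only assumes that a space is emitted whenever the gap between two glyphs exceeds the tolerance times the average glyph width.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of gap-based word splitting. NOT the actual PDFBox/Tika algorithm.
final class WordGapSketch {

    /** Hypothetical glyph: character, left edge and width, in arbitrary units. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Places glyphs on a line; a ' ' in the input adds wordGap on top of letterGap. */
    static List<Glyph> layout(String text, double glyphWidth, double letterGap, double wordGap) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += wordGap; // real word boundary: extra horizontal distance, no glyph
                continue;
            }
            glyphs.add(new Glyph(c, x, glyphWidth));
            x += glyphWidth + letterGap;
        }
        return glyphs;
    }

    /** Emits a space wherever the gap to the previous glyph exceeds tolerance * average width. */
    static String extract(List<Glyph> glyphs, double tolerance) {
        double avgWidth = glyphs.stream().mapToDouble(g -> g.width).average().orElse(0);
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > tolerance * avgWidth) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> heading = layout("CHAMBRE 2e", 5, 4, 6); // widely letter-spaced heading
        List<Glyph> body = layout("de la", 5, 0.5, 3);       // normally set body text
        for (double tol : new double[] {0.5, 1.0}) {
            System.out.println(tol + " -> [" + extract(heading, tol) + "] [" + extract(body, tol) + "]");
        }
    }
}
```

In this toy model, a tolerance of 0.5 splits the heading into single letters but reads the body text correctly, while a tolerance of 1.0 repairs the heading and merges "de la" into "dela". That is the kind of side effect to watch for when changing the real settings.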

> PDF extraction space issue
> --------------------------
>
>                 Key: TIKA-3170
>                 URL: https://issues.apache.org/jira/browse/TIKA-3170
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: document_example.pdf
>
>
> While extracting PDF files, we are seeing spaces inserted between some letters.
> According to the documentation below,
> [https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html]
> we can resolve this by disabling the Enable Auto Space property. But when we disable it, we get a problem with other text.
> With Enable Auto Space 
> < <p>*2014 C H A M B* R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015
> Without Enable Auto Space
> > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e ZITTINGSPERIODE2015
>  
> Now there is no space between 2014 and CHAMBRE.
>  
> Is there some configuration to overcome this issue?


