Posted to dev@tika.apache.org by "Akash (Jira)" <ji...@apache.org> on 2020/08/18 18:08:00 UTC

[jira] [Comment Edited] (TIKA-3170) PDF extraction space issue

    [ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990 ] 

Akash edited comment on TIKA-3170 at 8/18/20, 6:07 PM:
-------------------------------------------------------

It seems this issue has already been fixed as part of this commit: [https://github.com/apache/tika/commit/5f747ac3c7d19224cd9d9086346251096c1109fc]

If someone still wants to use 1.24.1, please add the details below to the Tika config file.
{code:java}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="averageCharTolerance" type="float">0.3f</param>
        <param name="spacingTolerance" type="float">0.5f</param>
      </params>
    </parser>
  </parsers>
</properties>
{code}
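Before pointing Tika at this file, it can help to confirm that the XML is well-formed (no stray characters around the root element) and that the two tolerance params sit under the PDFParser entry. Below is a minimal sketch that uses only the JDK's XML parser and does not invoke Tika itself. The class name TikaConfigCheck and the method readPdfParserParams are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: a sanity check for the config above.
public class TikaConfigCheck {
    static final String PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser";

    /** Returns name -> value for each <param> under the PDFParser <parser> element. */
    static Map<String, String> readPdfParserParams(String xml) throws Exception {
        // parse() throws a SAXParseException if the XML is not well-formed,
        // for example when there is a stray character before the root element.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!PDF_PARSER.equals(parser.getAttribute("class"))) {
                continue; // skip the DefaultParser entry
            }
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        String xml = String.join("\n",
            "<properties>",
            "  <parsers>",
            "    <parser class=\"org.apache.tika.parser.DefaultParser\">",
            "      <parser-exclude class=\"org.apache.tika.parser.pdf.PDFParser\"/>",
            "    </parser>",
            "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">",
            "      <params>",
            "        <param name=\"averageCharTolerance\" type=\"float\">0.3f</param>",
            "        <param name=\"spacingTolerance\" type=\"float\">0.5f</param>",
            "      </params>",
            "    </parser>",
            "  </parsers>",
            "</properties>");
        Map<String, String> params = readPdfParserParams(xml);
        System.out.println(params); // {averageCharTolerance=0.3f, spacingTolerance=0.5f}
        // Float.parseFloat accepts the trailing 'f' used in the values above.
        System.out.println(Float.parseFloat(params.get("spacingTolerance"))); // 0.5
    }
}
```

With Tika on the classpath, the same file would typically be loaded through TikaConfig and handed to an AutoDetectParser. That step is left out here so the sketch stays runnable without extra dependencies.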
Closing this Jira.


was (Author: akki1607):
It seems this issue has already been fixed as part of this commit: [https://github.com/apache/tika/commit/5f747ac3c7d19224cd9d9086346251096c1109fc]

Closing this Jira.

> PDF extraction space issue
> --------------------------
>
>                 Key: TIKA-3170
>                 URL: https://issues.apache.org/jira/browse/TIKA-3170
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: document_example.pdf, image-2020-08-18-20-23-16-159.png
>
>
> While extracting text from PDF files, we are observing spaces between some letters.
> As per the documentation below,
> [https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html]
> we can resolve this by disabling the Enable Auto Space property. But when we disable it, we get an issue with other text.
> With Enable Auto Space 
> < <p>*2014 C H A M B* R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015
> Without Enable Auto Space
> > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e ZITTINGSPERIODE2015
>  
> Now there is no space between 2014 and CHAMBRE.
>  
> Is there some configuration to overcome this issue?


