Posted to dev@tika.apache.org by "Ray Gauss II (JIRA)" <ji...@apache.org> on 2014/05/12 19:40:18 UTC

[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

    [ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995298#comment-13995298 ] 

Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM:
-------------------------------------------------------------

Hi [~tallison@apache.org],

I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people.

We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful.
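
To illustrate the reasoning above, here is a minimal sketch of the "only override when explicitly set" approach. It assumes hypothetical stand-in classes ({{StripperStub}}, {{ToleranceConfigSketch}}) rather than the actual Tika or PDFBox code, and the default values in the stub are placeholders chosen for the example, not a statement of what PDFBox ships with.

```java
// Hypothetical stand-in for a PDFBox-style text stripper; not the real class.
class StripperStub {
    // Defaults owned by the "library". The values are placeholders for this sketch.
    private float averageCharTolerance = 0.3f;
    private float spacingTolerance = 0.5f;

    void setAverageCharTolerance(float value) { averageCharTolerance = value; }
    void setSpacingTolerance(float value) { spacingTolerance = value; }
    float getAverageCharTolerance() { return averageCharTolerance; }
    float getSpacingTolerance() { return spacingTolerance; }
}

// Hypothetical config holder: a null field means "no override requested",
// so the library's own default (whatever it is at the time) stays in effect.
class ToleranceConfigSketch {
    private Float averageCharTolerance; // null unless the caller sets it
    private Float spacingTolerance;     // null unless the caller sets it

    void setAverageCharTolerance(Float value) { averageCharTolerance = value; }
    void setSpacingTolerance(Float value) { spacingTolerance = value; }

    // Non-final so a subclass could extend the configuration behavior.
    void configure(StripperStub stripper) {
        if (averageCharTolerance != null) {
            stripper.setAverageCharTolerance(averageCharTolerance);
        }
        if (spacingTolerance != null) {
            stripper.setSpacingTolerance(spacingTolerance);
        }
    }
}

class ToleranceSketchDemo {
    public static void main(String[] args) {
        // No overrides: the stripper keeps its own defaults.
        StripperStub untouched = new StripperStub();
        new ToleranceConfigSketch().configure(untouched);
        System.out.println("default spacing tolerance: " + untouched.getSpacingTolerance());

        // Explicit override of one parameter only.
        ToleranceConfigSketch config = new ToleranceConfigSketch();
        config.setSpacingTolerance(0.7f);
        StripperStub overridden = new StripperStub();
        config.configure(overridden);
        System.out.println("overridden spacing tolerance: " + overridden.getSpacingTolerance());
    }
}
```

Because nothing is written to the stripper unless a value was supplied, a change to the library's defaults would pass through untouched, which is the concern with hard-coding values in a properties file.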


> Expose PDF Avg Char and Spacing Tolerance Config Params
> -------------------------------------------------------
>
>                 Key: TIKA-1278
>                 URL: https://issues.apache.org/jira/browse/TIKA-1278
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>            Reporter: Ray Gauss II
>            Assignee: Ray Gauss II
>             Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior.


