Posted to dev@tika.apache.org by "Annie Didier (JIRA)" <ji...@apache.org> on 2018/06/14 19:01:00 UTC

[jira] [Created] (TIKA-2669) Tika JAX-RS PDF parser option / custom config issue

Annie Didier created TIKA-2669:
----------------------------------

             Summary: Tika JAX-RS PDF parser option / custom config issue
                 Key: TIKA-2669
                 URL: https://issues.apache.org/jira/browse/TIKA-2669
             Project: Tika
          Issue Type: Bug
          Components: config
    Affects Versions: 1.18
            Reporter: Annie Didier


PDF parsing with a custom config file behaves differently in the Tika app than in Tika server. Tika server reads the custom config file, but the PDF parsing options it specifies (here, sortByPosition) are not applied.

Here is an excerpt of output from the app:

<p>WINS No: B29017 APACHE 27-38 UNIT 1H Date: 5/4/2017

</p>

<p>AFE No: 1704555 Daily Completion and Workover Report DOL: 

</p>

However, with the same configuration file, the output from Tika server is:

<p>Daily Completion and Workover Report

</p>

<p>WINS No: 

</p>

<p>AFE No: 

</p>

<p>Date: 

</p>

<p>DOL: 

</p>

<p>APACHE 27-38 UNIT B29017

</p>

<p>1704555

</p>

<p>5/4/2017

</p>

The Tika config is:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
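
To rule out a malformed config as the cause, the following is a minimal sketch of a check that does not involve Tika at all. It parses the config above with the JDK's DOM parser and looks up the sortByPosition parameter on the PDFParser entry. The names ConfigCheck and parserParam are hypothetical and not part of Tika. The sketch only confirms what the file declares; it says nothing about how the app or the server applies it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: inspects a tika-config document with the JDK DOM parser.
final class ConfigCheck {

    // The config from this report, inlined so the sketch is self-contained.
    static final String CONFIG =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<properties><parsers>"
        + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
        + "<params><param name=\"sortByPosition\" type=\"bool\">true</param></params>"
        + "</parser>"
        + "</parsers></properties>";

    // Returns the trimmed text of the named <param> under the <parser> with the
    // given class attribute, or null if no such parser or param is declared.
    static String parserParam(String xml, String parserClass, String paramName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if (paramName.equals(param.getAttribute("name"))) {
                    return param.getTextContent().trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String value = parserParam(CONFIG, "org.apache.tika.parser.pdf.PDFParser", "sortByPosition");
        System.out.println("sortByPosition declared as: " + value);
    }
}
```

If the lookup finds the parameter, the file itself is well-formed and declares the option, which would point at how the server handles the config rather than at the config.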


