Posted to dev@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2021/05/26 21:24:48 UTC

Re: Question on custom tika-python configs for OMB PDF

Hannah, I am pushing your question upstream to the dev@tika list. I think what you need is for them to look
at your config file, which I’ve pasted below, and see whether it looks OK. Then, in Tika Python, you need
to give it this config file before your server starts up; or, outside of Python, just start your server with
this config file, and Tika Python will pick it up:

 

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Exclude default values -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <!-- <property-exclude name="sortByPosition"/> -->
            <mime-exclude>application/pdf</mime-exclude>
        </parser>
        <!-- Ensure that sorts by position -->
        <parser class="org.apache.tika.parser.EmptyParser">
            <mime>application/pdf</mime>
            <property name="sortByPosition" value="true"/>
        </parser>
    </parsers>
</properties>
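
For comparison while reviewing, here is a minimal sketch of how a PDF-specific setting is usually written in a Tika 1.x-style tika-config.xml. It is not a confirmed fix for this file. It differs from the config above in two ways, both of which are assumptions to check against the Tika version in use: the PDF handler is org.apache.tika.parser.pdf.PDFParser instead of EmptyParser (EmptyParser is a placeholder that emits no text), and the setting is passed as a typed <param> inside <params> instead of a <property> element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Let the default parser handle everything except PDF -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime-exclude>application/pdf</mime-exclude>
        </parser>
        <!-- Assumption: route PDFs to PDFParser with sortByPosition enabled -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <mime>application/pdf</mime>
            <params>
                <param name="sortByPosition" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

If the server is started by hand with this file, whether the setting took effect can be judged by comparing the extracted text order with and without the config.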

 

 

Cheers,
Chris

From: Hannah Eli <el...@gmail.com>
Date: Wednesday, May 26, 2021 at 1:47 PM
To: "Mattmann, Chris A (US 1740)" <ch...@jpl.nasa.gov>
Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF

 

Hi Chris,

Hope you're well. I'm trying to use Tika to parse the table of contents for the Office of Management and Budget's A-11 Circular PDF <https://www.whitehouse.gov/wp-content/uploads/2018/06/a11_web_toc.pdf> (I know it's short enough to parse manually, but we're building a repeatable extract). When I do so, the text is parsed out of order. I was trying to fix this by creating a custom config file with the sortByPosition property (see attached), but I'm not an XML guru and don't believe it's working properly. I've also tried changing the Windows environment variables to point to this file.

Any guidance would be much appreciated.

Thank you!
Hannah

-- 
Hannah Eli


Re: Question on custom tika-python configs for OMB PDF

Posted by Hannah Eli <el...@gmail.com>.
Hi Chris - thank you for forwarding the request! Once the team has reviewed
it, I'll give it another try.

Thank you,
Hannah

On Wed, May 26, 2021 at 5:24 PM Chris Mattmann <ma...@apache.org> wrote:

> Hannah, I am pushing your question upstream to the dev@tika list. I think
> what you need is for them to look at your config file, which I’ve pasted
> below, and see whether it looks OK. Then, in Tika Python, you need to give
> it this config file before your server starts up; or, outside of Python,
> just start your server with this config file, and Tika Python will pick
> it up:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <!-- Exclude default values -->
>         <parser class="org.apache.tika.parser.DefaultParser">
>             <!-- <property-exclude name="sortByPosition"/> -->
>             <mime-exclude>application/pdf</mime-exclude>
>         </parser>
>         <!-- Ensure that sorts by position -->
>         <parser class="org.apache.tika.parser.EmptyParser">
>             <mime>application/pdf</mime>
>             <property name="sortByPosition" value="true"/>
>         </parser>
>     </parsers>
> </properties>
>
> Cheers,
> Chris
>
> From: Hannah Eli <el...@gmail.com>
> Date: Wednesday, May 26, 2021 at 1:47 PM
> To: "Mattmann, Chris A (US 1740)" <ch...@jpl.nasa.gov>
> Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF
>
> Hi Chris,
>
> Hope you're well. I'm trying to use Tika to parse the table of contents
> for the Office of Management and Budget's A-11 Circular PDF
> <https://www.whitehouse.gov/wp-content/uploads/2018/06/a11_web_toc.pdf> (I
> know it's short enough to parse manually, but we're building a repeatable
> extract). When I do so, the text is parsed out of order. I was trying to
> fix this by creating a custom config file with the sortByPosition property
> (see attached), but I'm not an XML guru and don't believe it's working
> properly. I've also tried changing the Windows environment variables to
> point to this file.
>
> Any guidance would be much appreciated.
>
> Thank you!
> Hannah
>
> --
> Hannah Eli


-- 
*Hannah Eli*
(317) 656-1366 | elihannahr@gmail.com