Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/09/03 19:48:46 UTC

[jira] [Comment Edited] (TIKA-1657) Allow easier XML serialization of TikaConfig

    [ https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729452#comment-14729452 ] 

Tim Allison edited comment on TIKA-1657 at 9/3/15 5:48 PM:
-----------------------------------------------------------

Here is the current output of the Tika config serialization ("effective config") when run against TIKA-1558-blacklist.xml:

{noformat}
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
{noformat}
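
For anyone who wants to experiment with this nested shape outside of Tika, here is a minimal sketch that builds the same structure with the JDK's DOM and Transformer classes. It is not the serializer from the patch: the class and helper names (EffectiveConfigSketch, parser, text) are made up for illustration, and the parser list is hard-coded rather than read from a live TikaConfig.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch; none of these class or method names come from Tika.
class EffectiveConfigSketch {

    // Builds the nested <properties>/<parsers>/<parser> layout and returns it as a string.
    static String serialize() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element properties = doc.createElement("properties");
            doc.appendChild(properties);
            Element parsers = doc.createElement("parsers");
            properties.appendChild(parsers);

            // The composite parser carries its exclusions as child elements.
            Element defaultParser = parser(doc, parsers, "org.apache.tika.parser.DefaultParser");
            text(doc, defaultParser, "mime-exclude", "image/jpeg");
            text(doc, defaultParser, "mime-exclude", "application/pdf");
            Element exclude = doc.createElement("parser-exclude");
            exclude.setAttribute("class", "org.apache.tika.parser.executable.ExecutableParser");
            defaultParser.appendChild(exclude);

            Element emptyParser = parser(doc, parsers, "org.apache.tika.parser.EmptyParser");
            text(doc, emptyParser, "mime", "application/pdf");

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize config sketch", e);
        }
    }

    // Appends <parser class="..."> to the parent and returns the new element.
    private static Element parser(Document doc, Element parent, String className) {
        Element e = doc.createElement("parser");
        e.setAttribute("class", className);
        parent.appendChild(e);
        return e;
    }

    // Appends <name>value</name> to the parent.
    private static void text(Document doc, Element parent, String name, String value) {
        Element e = doc.createElement(name);
        e.setTextContent(value);
        parent.appendChild(e);
    }

    public static void main(String[] args) {
        System.out.print(serialize());
    }
}
```

Exact whitespace and indentation depend on the JDK's Transformer implementation, so treat the printed layout as approximate.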

I couldn't easily get back to our old flat list of parsers.  The nested layout above is where I think we might be heading in 2.0.  What do you think?
If this looks good, I'll add a deserializer.  Any config file that doesn't declare version 2.0 can still go through the legacy Tika loading mechanisms.
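
To make the fallback idea concrete, here is a rough sketch of how a loader might be chosen. It assumes the version is carried as a "version" attribute on the root <properties> element, which has not been decided anywhere in this issue; ConfigVersionSniffer and its Loader enum are hypothetical names, and only JDK XML classes are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical helper; not part of Tika.
class ConfigVersionSniffer {

    enum Loader { LEGACY, V2 }

    // Picks the loading path for a config document given as an XML string.
    static Loader chooseLoader(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // Assumption: the version lives in a "version" attribute on the root element.
            // getAttribute returns an empty string when the attribute is absent.
            String version = root.getAttribute("version");
            return "2.0".equals(version) ? Loader.V2 : Loader.LEGACY;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseLoader("<properties><parsers/></properties>"));
        System.out.println(chooseLoader("<properties version=\"2.0\"><parsers/></properties>"));
    }
}
```

With this shape, every existing config file (none of which declares a version) would keep taking the legacy path.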

[~gagravarr], if you have a chance, please take a look.  I know that you've added quite a bit of capability in this area, and I don't want to ruin it. :)

The biggest changes:
# Hierarchical parsers (Composite, Decorator) are represented hierarchically.
# I've added a params section (TIKA-1508) that supports str, int, long, float, double and boolean values; see PDFParser and RTFParser.
# There's room to grow for potentially new types of CompositeParsers (fallback, etc.) in the "type" attribute.
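
As a rough illustration of the typed params in point 2, the sketch below converts a (type, raw value) pair into the matching Java object. The type names are the six listed above; the ParamValues class and its convert method are hypothetical and say nothing about how the params are actually written in the XML or wired into PDFParser and RTFParser.

```java
// Hypothetical sketch; not the TIKA-1508 implementation.
class ParamValues {

    // Converts the raw string from the config into an object of the declared type.
    static Object convert(String type, String raw) {
        switch (type) {
            case "str":
                return raw;
            case "int":
                return Integer.valueOf(raw);
            case "long":
                return Long.valueOf(raw);
            case "float":
                return Float.valueOf(raw);
            case "double":
                return Double.valueOf(raw);
            case "boolean":
                // Boolean.valueOf yields false for anything other than "true" (case-insensitive).
                return Boolean.valueOf(raw);
            default:
                throw new IllegalArgumentException("Unsupported param type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("int", "42"));
        System.out.println(convert("boolean", "true"));
    }
}
```

Malformed numbers surface as NumberFormatException, which a real deserializer would probably want to wrap in a config-specific error.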


> Allow easier XML serialization of TikaConfig
> --------------------------------------------
>
>                 Key: TIKA-1657
>                 URL: https://issues.apache.org/jira/browse/TIKA-1657
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: TIKA-1558-blacklist-effective.xml
>
>
> In TIKA-1418, we added an example for how to dump the config file so that users could easily modify it.  I think we should go further and make this an option at the tika-core level with hooks for tika-app and tika-server.  I propose adding a main() to TikaConfig that will print the xml config file that Tika is currently using to stdout.
> I'd like to put this into core so that e.g. Solr's DIH users can get by without having to download tika-app separately.  
> There's every chance that I've not accounted for issues with dynamic loading etc.  Also, I'd be ok with only having this available in tika-app and tika-server if there are good reasons.
> Feedback?


