Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/09 14:56:40 UTC

[jira] [Comment Edited] (TIKA-1508) Add uniformity to parser parameter configuration

    [ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117 ] 

Tim Allison edited comment on TIKA-1508 at 3/9/16 1:56 PM:
-----------------------------------------------------------

[~thammegowda], this looks really good. I merged it on a local branch and made minimal modifications to the PDFParser to make this work...and it did...very straightforwardly.

Recommendations:
1) Let's not use ParseContext as the vehicle for param passing. We will have collisions between different parsers if anyone uses {{configure()}} outside of the normal course of events, so it is simpler to use Map<String,String>.  Or, if we do use the ParseContext, we should specify which parser the params are for, e.g. {{context.set(PDFParser.class, Map<String,String> params)}}.  I do like the dual use of configure with ParseContext to achieve Nick's recommendation elegantly.


2) We need to add a {{Map<String,String> getParams()}} to the {{Configurable}} interface so that when we serialize the config to XML, we can remember what the params were.  We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking into the {{AbstractParser}} or somewhere else.  I think a configurable (parser? or all configurables?) should register its valid configuration keys at initialization. We can then check the validity of the keys passed in during {{configure()}} once in the base class, so that each extending parser isn't required to do this on its own.

4) Let's add a TikaParameterConfigException that subclasses TikaException?  I don't feel strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration information to the wrapped parser in parser wrappers: ParserDecorator, DelegatingParser, ParserPostProcessor...any others?  Or, do we need to set the parameters in the wrapped parser before wrapping?
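
As a rough illustration of recommendations 2 through 4, here is a minimal, self-contained sketch that does not depend on Tika. The interface shape, the {{AbstractConfigurable}} base class, the {{ParamConfigException}} stand-in, and the key names are all assumptions made for the example, not the actual patch or the Tika API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the proposed TikaParameterConfigException;
// in Tika it would subclass TikaException rather than Exception.
class ParamConfigException extends Exception {
    ParamConfigException(String message) {
        super(message);
    }
}

// Assumed shape of the Configurable contract, with getParams() added (point 2).
interface Configurable {
    void configure(Map<String, String> params) throws ParamConfigException;

    Map<String, String> getParams();
}

// Hypothetical base class: subclasses register their valid keys once,
// and configure() rejects any other key in one place (point 3).
abstract class AbstractConfigurable implements Configurable {
    private final Set<String> validKeys = new LinkedHashSet<>();
    private final Map<String, String> params = new LinkedHashMap<>();

    protected AbstractConfigurable(String... keys) {
        Collections.addAll(validKeys, keys);
    }

    @Override
    public void configure(Map<String, String> newParams) throws ParamConfigException {
        // Validate every key before storing anything, so a failed call
        // leaves the existing parameters untouched.
        for (String key : newParams.keySet()) {
            if (!validKeys.contains(key)) {
                throw new ParamConfigException(
                        "Unknown parameter '" + key + "' for " + getClass().getSimpleName());
            }
        }
        params.putAll(newParams);
    }

    @Override
    public Map<String, String> getParams() {
        // Read-only view, e.g. for a serializer writing the config back to XML.
        return Collections.unmodifiableMap(params);
    }
}

// Toy parser-like class; the key names are made up for illustration.
class ToyPdfParser extends AbstractConfigurable {
    ToyPdfParser() {
        super("sortByPosition", "pageWidth");
    }
}
```

With this layout a concrete parser only declares its keys. The key check and the parameter bookkeeping needed for serialization live once in the base class.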

Questions for the broader dev community:

A) Are we ok with Map<String,String> parameters? Or should we follow, say, Solr's syntax for type checking?
{noformat}
<int name="pageWidth">10</int>
{noformat}

B) We could use reflection to get around each parser having to add its own configuration code.  We could create a static configurator that has a {{configure(Configurable configurable, Map<String, String> params)}} method.  That isn't quite right, because we'd have to know the type for each param (see above), but something along those lines.  Too complex?
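
To make option B more concrete, here is one possible sketch of a reflection-based static configurator. It sidesteps the type question by reading the type from a matching setter's parameter, which is only one way to do it and an assumption of this example. {{ReflectiveConfigurator}}, {{ToyPdfConfig}}, and the property names are hypothetical, and the target is a plain {{Object}} rather than a {{Configurable}} to keep the block self-contained.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class ReflectiveConfigurator {
    private ReflectiveConfigurator() {
    }

    // For each entry, find a one-argument setter named after the key and
    // invoke it with the string value converted to the setter's parameter type.
    static void configure(Object target, Map<String, String> params) {
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String key = entry.getKey();
            if (key.isEmpty()) {
                throw new IllegalArgumentException("Empty parameter name");
            }
            String setterName = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            Method setter = findSetter(target.getClass(), setterName);
            if (setter == null) {
                throw new IllegalArgumentException("No setter for parameter: " + key);
            }
            try {
                setter.invoke(target, convert(entry.getValue(), setter.getParameterTypes()[0]));
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException("Could not set parameter: " + key, ex);
            }
        }
    }

    // Returns the first public one-argument method with the given name, or null.
    private static Method findSetter(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }

    // Only a few types are handled here; a real version would need more.
    private static Object convert(String value, Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return Integer.valueOf(value);
        }
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.valueOf(value);
        }
        if (type == String.class) {
            return value;
        }
        throw new IllegalArgumentException("Unsupported parameter type: " + type.getName());
    }
}

// Toy config bean with made-up properties.
class ToyPdfConfig {
    private int pageWidth;
    private boolean sortByPosition;

    public void setPageWidth(int pageWidth) {
        this.pageWidth = pageWidth;
    }

    public void setSortByPosition(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    public int getPageWidth() {
        return pageWidth;
    }

    public boolean isSortByPosition() {
        return sortByPosition;
    }
}
```

Calling {{ReflectiveConfigurator.configure(config, params)}} with {{pageWidth=10}} would invoke {{setPageWidth(10)}}. The trade-off is that typos in parameter names only surface at configuration time, and overloaded setters would need a more careful lookup than this sketch performs.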


> Add uniformity to parser parameter configuration
> ------------------------------------------------
>
>                 Key: TIKA-1508
>                 URL: https://issues.apache.org/jira/browse/TIKA-1508
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>             Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, it would be great if we could specify parser parameters in the main config file, something along the lines of this:
> {noformat}
>     <parser class="org.apache.tika.parser.audio.AudioParser">
>       <params>
>         <int name="someparam1">2</int>
>         <str name="someOtherParam2">something or other</str>
>       </params>
>       <mime>audio/basic</mime>
>       <mime>audio/x-aiff</mime>
>       <mime>audio/x-wav</mime>
>     </parser>
> {noformat}


