Posted to dev@tika.apache.org by "Thamme Gowda N (JIRA)" <ji...@apache.org> on 2016/02/29 21:20:18 UTC

[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

    [ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527 ] 

Thamme Gowda N commented on TIKA-1663:
--------------------------------------

[~chrismattmann] [~tallison@mitre.org] We need a SHA digest of the raw content for the MEMEX project.
I tried to enable the DigestingParser by editing our config file:
{code}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DigestingParser">
            <parser class="org.apache.tika.parser.DefaultParser">
            </parser>
        </parser>
        .....
{code}

This doesn't work, for the obvious reason that the config never says which digest algorithm to use.
After checking https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java, I found that DigestingParser is a flexible framework that takes constructor args.
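
To illustrate why constructor args get in the way of config-only activation, here is a minimal, self-contained sketch. It is not Tika's DigestingParser: the class name DigestingWrapperSketch, the plain Map standing in for Metadata, and the "digest:" key prefix are all assumptions made for illustration. Because the algorithm is only available through the constructor, a loader that instantiates a class by name with a no-arg constructor has nothing to call.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a digesting wrapper; NOT Tika's DigestingParser.
final class DigestingWrapperSketch {
    private final String algorithm;

    // The algorithm is a constructor argument, so there is no no-arg constructor.
    DigestingWrapperSketch(String algorithm) throws NoSuchAlgorithmException {
        MessageDigest.getInstance(algorithm); // fail fast on an unknown algorithm name
        this.algorithm = algorithm;
    }

    // Digests the raw bytes and returns them in a map that stands in for Metadata.
    Map<String, String> digest(InputStream raw) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buf = new byte[8192];
        int n;
        while ((n = raw.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        Map<String, String> metadata = new HashMap<>();
        metadata.put("digest:" + algorithm, hex.toString()); // key name is an assumption
        return metadata;
    }
}
```

Application code can write new DigestingWrapperSketch("MD5") directly, but a config entry that names only the class has no way to pass "MD5" along.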

So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, that do not need constructor args. This would let us activate them by editing the config XML file instead of source code.
2. We enhance the Tika configuration framework and these flexible parsers to accept runtime arguments, so that both flexibility and ease of use are preserved. For instance, if we could supply the digest algorithm name from the config file and let the DigestingParser use it when it is instantiated, then we would not need to edit the source code of applications.
{code}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DigestingParser">
            <args>
                <digest>MD5</digest>
            </args>
            <parser class="org.apache.tika.parser.DefaultParser">
            </parser>
        </parser>
        .....
{code}
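
A rough sketch of what option 2 could look like on the loading side, using only the JDK. The args/digest element names come from the proposal above and are not an existing Tika config feature as far as I know; the class ConfigDigestSketch and its methods are hypothetical. The idea is simply to read the algorithm name from the XML and hand it to whatever needs it at construction time.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; the <args>/<digest> layout is the one proposed in this comment.
final class ConfigDigestSketch {

    // Returns the text of the first <digest> element found in the config XML.
    static String readDigestAlgorithm(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("digest");
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("config has no <digest> element");
        }
        return nodes.item(0).getTextContent().trim();
    }

    // Builds the digest from the configured name instead of a hard-coded one.
    static MessageDigest fromConfig(String xml) throws Exception {
        return MessageDigest.getInstance(readDigestAlgorithm(xml));
    }
}
```

With this in place, switching from MD5 to SHA-256 would be a one-line config change rather than a code change, which is the point of option 2.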

I vote for option 2: it is slightly more work, but I feel it is the way to go.
I do not know whether Tika already supports option 2 by accepting runtime arguments from the config file.
I faced a similar issue with NamedEntityParser, but found a workaround by using System properties.
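
For reference, the System-properties style of workaround looks roughly like the sketch below. The property name sketch.digest.algorithm, the SHA-256 default, and the class name are all made up for illustration; they are not the properties that NamedEntityParser actually reads.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of configuring a component through a JVM system property.
final class SysPropDigestSketch {
    // Assumed property name; set it with -Dsketch.digest.algorithm=MD5 on the command line.
    static final String PROP = "sketch.digest.algorithm";

    // A no-arg factory: the algorithm comes from the environment, not from a constructor arg.
    static MessageDigest create() throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(System.getProperty(PROP, "SHA-256"));
    }
}
```

This keeps a no-arg construction path, but the setting is global to the JVM and lives outside the config file, which is why option 2 seems cleaner to me.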

> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> -------------------------------------------------------------------
>
>                 Key: TIKA-1663
>                 URL: https://issues.apache.org/jira/browse/TIKA-1663
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)