
Posted to dev@tika.apache.org by "Yahav Amsalem (JIRA)" <ji...@apache.org> on 2016/10/19 11:55:58 UTC

[jira] [Created] (TIKA-2123) CommonsDigester calculates wrong hashes on large files

Yahav Amsalem created TIKA-2123:
-----------------------------------

             Summary: CommonsDigester calculates wrong hashes on large files
                 Key: TIKA-2123
                 URL: https://issues.apache.org/jira/browse/TIKA-2123
             Project: Tika
          Issue Type: Bug
          Components: metadata
    Affects Versions: 1.13
            Reporter: Yahav Amsalem


When more than one algorithm is passed to the CommonsDigester constructor and
a file larger than 7.5 MB is then digested, the hashes are calculated
incorrectly for every algorithm except the first.

The following code reproduces the bug:

// The file used was a simple plain text file with size > 7.5 MB
File file = new File("c:\\testLargeFile.txt");

BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream(file));

Metadata metadata = new Metadata();

CommonsDigester digester = new CommonsDigester(20000000,
                CommonsDigester.DigestAlgorithm.MD5,
                CommonsDigester.DigestAlgorithm.SHA1,
                CommonsDigester.DigestAlgorithm.SHA256);

digester.digest(bufferedInputStream, metadata, null);

// Will print correct MD5 but wrong SHA1 and wrong SHA256
System.out.println(metadata);
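
To check which of the printed values are wrong, reference hashes can be computed with the JDK alone. The following is a minimal sketch, not part of Tika: the class name ReferenceDigests and its methods are hypothetical, and the algorithm names are the standard JCA names ("MD5", "SHA-1", "SHA-256"). It feeds every digest from a single read pass, so the stream never needs to be rewound.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Tika class): computes several digests in one pass.
public class ReferenceDigests {

    // Reads the stream once and updates every digest with each chunk.
    static Map<String, String> digestAll(InputStream in, String... algorithms)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest[] digests = new MessageDigest[algorithms.length];
        for (int i = 0; i < algorithms.length; i++) {
            digests[i] = MessageDigest.getInstance(algorithms[i]);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (MessageDigest digest : digests) {
                digest.update(buffer, 0, read);
            }
        }
        // LinkedHashMap keeps the results in the order the algorithms were given.
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < algorithms.length; i++) {
            result.put(algorithms[i], toHex(digests[i].digest()));
        }
        return result;
    }

    // Lower-case hexadecimal encoding of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ReferenceDigests <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(digestAll(in, "MD5", "SHA-1", "SHA-256"));
        }
    }
}
```

Running it against the same file used above gives values to compare with the metadata printed by the reproduction code.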

Initial direction: it seems that the inner buffered stream is not reset to position 0 after the first algorithm has been computed.
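
The effect of a stream that is not rewound between algorithms can be illustrated with the JDK alone. This is only a sketch of the suspected failure mode, not of what CommonsDigester actually does internally: the class RewindSketch is hypothetical and a small in-memory ByteArrayInputStream stands in for the large file. Without a reset, the second algorithm sees an exhausted stream and hashes zero bytes; with mark() and reset(), it sees the full content again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration (not Tika code) of digesting one stream twice.
public class RewindSketch {

    // Reads the stream to its end and returns the digest of the bytes read.
    static byte[] digest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "some file content".getBytes(StandardCharsets.UTF_8);
        byte[] expectedSha1 = MessageDigest.getInstance("SHA-1").digest(data);
        byte[] emptySha1 = MessageDigest.getInstance("SHA-1").digest(new byte[0]);

        // Case 1: no rewind between algorithms.
        InputStream notRewound = new ByteArrayInputStream(data);
        digest(notRewound, "MD5");
        byte[] second = digest(notRewound, "SHA-1");
        System.out.println("no reset, SHA-1 matches data:  " + Arrays.equals(second, expectedSha1));
        System.out.println("no reset, SHA-1 matches empty: " + Arrays.equals(second, emptySha1));

        // Case 2: mark the start, then reset before the second algorithm.
        InputStream rewound = new ByteArrayInputStream(data);
        rewound.mark(data.length);
        digest(rewound, "MD5");
        rewound.reset();
        byte[] afterReset = digest(rewound, "SHA-1");
        System.out.println("with reset, SHA-1 matches data: " + Arrays.equals(afterReset, expectedSha1));
    }
}
```

If the hypothesis holds, comparing the wrong SHA1 and SHA256 values from the reproduction with the hashes of a truncated or empty input may help narrow down where the stream position ends up.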



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)