Posted to dev@tika.apache.org by kevin slote <ks...@gmail.com> on 2014/07/31 18:15:51 UTC

Compress algorithm 'implode' not parsed.

Hi everyone.

I would like to talk about a compression algorithm that Tika doesn't parse
yet, but could.  The algorithm is called 'implode', and there is a patch for
Apache Commons Compress that handles it, which Tika does not yet leverage.

Tika has a unit test in ZipParserTest.java demonstrating that only the names
of the files get extracted when the zip is compressed with the 'implode'
compression algorithm.  The file moby.zip in the test data is compressed this
way:



    /**
     * Test case for the ability of the ZIP parser to extract the name of
     * a ZIP entry even if the content of the entry is unreadable due to an
     * unsupported compression method.
     *
     * @see <a href="https://issues.apache.org/jira/browse/TIKA-346">TIKA-346</a>
     */
    @Test
    public void testUnsupportedZipCompressionMethod() throws Exception {
        String content = new Tika().parseToString(
                ZipParserTest.class.getResourceAsStream(
                        "/test-documents/moby.zip"));
        assertTrue(content.contains("README"));
    }
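
As a side note, one way to confirm which compression method an archive's
first entry uses is to look at the local file header directly.  The sketch
below is only an illustration: ZipMethodPeek, firstEntryMethod and describe
are hypothetical names, not part of Tika or Commons Compress, and it assumes
the archive starts with a local file header at offset 0 (which is usual, but
not true for self-extracting archives, for example).  In the ZIP format the
two-byte compression method sits at offset 8 of that header, and code 6 means
"imploded".

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of Tika or Commons Compress. */
final class ZipMethodPeek {

    /** The bytes "PK\3\4" that open a ZIP local file header, read little-endian. */
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    /**
     * Returns the compression method code of the first entry, or -1 if the
     * data does not begin with a local file header (an assumption of this
     * sketch is that the first entry starts at offset 0).
     */
    static int firstEntryMethod(byte[] zipBytes) {
        if (zipBytes.length < 10) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            return -1;
        }
        // Layout: signature (4), version needed (2), flags (2), method (2).
        return buf.getShort(8) & 0xFFFF;
    }

    /** Names for a few well-known method codes from the ZIP specification. */
    static String describe(int method) {
        switch (method) {
            case 0: return "stored";
            case 1: return "shrunk";
            case 6: return "imploded";
            case 8: return "deflated";
            default: return "other (" + method + ")";
        }
    }
}
```

Passing the first ten bytes of a zip file to firstEntryMethod and the result
to describe is enough to tell an imploded entry from a deflated one.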


The implode compression algorithm is an old proprietary compression
algorithm that PKZIP used in the 1980s.

It uses Shannon-Fano coding, which has fallen out of favor since Huffman
coding is more efficient.
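
To make the efficiency difference concrete, here is a minimal sketch that
computes the total number of bits each scheme needs for a given set of symbol
frequencies.  It is not the implode algorithm itself and not code from Commons
Compress; PrefixCodeCost and its methods are hypothetical names, and the
Shannon-Fano part assumes the frequencies are passed in descending order.
Shannon-Fano splits the sorted symbols top-down into two groups of nearly
equal weight, while Huffman merges the two lightest groups bottom-up, which
always yields an optimal prefix code.

```java
import java.util.PriorityQueue;

/** Hypothetical helper for illustration only. */
final class PrefixCodeCost {

    /** Total encoded bits under Shannon-Fano; freqs must be sorted descending. */
    static long shannonFanoBits(int[] freqs) {
        return shannonFano(freqs, 0, freqs.length, 0);
    }

    // Split [from, to) where the two halves' sums are closest; every split
    // adds one bit to the code of each symbol in the range.
    private static long shannonFano(int[] f, int from, int to, int depth) {
        if (to - from <= 1) {
            return to > from ? (long) f[from] * depth : 0;
        }
        long total = 0;
        for (int i = from; i < to; i++) {
            total += f[i];
        }
        long left = 0;
        long bestDiff = Long.MAX_VALUE;
        int split = from + 1;
        for (int i = from; i < to - 1; i++) {
            left += f[i];
            long diff = Math.abs(total - 2 * left);
            if (diff < bestDiff) {
                bestDiff = diff;
                split = i + 1;
            }
        }
        return shannonFano(f, from, split, depth + 1)
                + shannonFano(f, split, to, depth + 1);
    }

    /** Total encoded bits under Huffman coding (order of freqs is irrelevant). */
    static long huffmanBits(int[] freqs) {
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (int f : freqs) {
            heap.add((long) f);
        }
        long bits = 0;
        while (heap.size() > 1) {
            long merged = heap.poll() + heap.poll();
            bits += merged; // one more bit for every symbol under this merge
            heap.add(merged);
        }
        return bits;
    }
}
```

For the frequencies {15, 7, 6, 6, 5}, shannonFanoBits gives 89 bits and
huffmanBits gives 87, so the difference is real but small.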

The point being, Tika 1.5 uses Apache Commons Compress 1.5. According to
the Commons Compress JIRA ticket below, Commons Compress can handle this
compression method in version 1.7 and later. I was wondering whether, if I
wrote a patch for this, I could contribute it to Tika, or whether this is
worth opening as an issue.




https://issues.apache.org/jira/browse/COMPRESS-115

http://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding


Re: Compress algorithm 'implode' not parsed.

Posted by kevin slote <ks...@gmail.com>.
Sorry for not responding to this sooner.  I tried updating the pom to use
the latest version of Commons Compress and it didn't change anything.  I
tested this on 1.5.  I just cloned the git repository and will test this
again on the latest version of the code.


On Thu, Jul 31, 2014 at 12:26 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 31 Jul 2014, kevin slote wrote:
>
>> Point being, Tika-1.5 uses apache-commons-compress 1-5. According to the
>> Apache compress jira ticket below, Apache compress can
>>
>
> Trunk currently uses Commons Compress 1.8, can you try with that?
>
> (Tika 1.6 should be out within about a week, based on trunk)
>
> Nick
>

Re: Compress algorithm 'implode' not parsed.

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 31 Jul 2014, kevin slote wrote:
> Point being, Tika-1.5 uses apache-commons-compress 1-5. According to the 
> Apache compress jira ticket below, Apache compress can

Trunk currently uses Commons Compress 1.8, can you try with that?

(Tika 1.6 should be out within about a week, based on trunk)

Nick