Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/12/16 14:56:46 UTC

[jira] [Updated] (TIKA-1813) Figure out file types for several unknown OLE files in Common Crawl

     [ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1813:
------------------------------
    Attachment: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB
                25JIANLV77U645GUSJ2E67YSM4B2TNSP
                2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA

Some examples.

The file lengths are suspiciously regular.  Given that these are Common Crawl docs, there's a chance that they were truncated.
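One rough way to test the truncation theory is to compare each file's length against the sectors its own OLE2 header points to. The sketch below uses only the JDK and a hypothetical class name (OleTruncationCheck is not part of Tika or POI). It assumes the standard compound file header layout: sector shift at byte 30, first directory sector at byte 48, and up to 109 FAT sector locations starting at byte 76. For what it's worth, 86528 is 169 x 512, which is where sector 168 would start with 512-byte sectors, so the exception in the issue description is at least consistent with a reference to a sector that is no longer in the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Tika or POI.
class OleTruncationCheck {
    // OLE2 (Compound File Binary) magic bytes at offset 0.
    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean isOle2(byte[] header) {
        if (header.length < 512) return false;
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) return false;
        }
        return true;
    }

    // True if the header references a sector that ends past fileLength.
    // Only sectors named in the header are checked, so "false" does not
    // prove that the file is complete.
    static boolean looksTruncated(byte[] header, long fileLength) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int shift = buf.getShort(30) & 0xFFFF; // 9 => 512-byte sectors, 12 => 4096
        if (shift != 9 && shift != 12) {
            throw new IllegalArgumentException("Unexpected sector shift: " + shift);
        }
        // First directory sector.
        if (pastEnd(buf.getInt(48) & 0xFFFFFFFFL, shift, fileLength)) return true;
        // FAT sector locations stored in the header (109 slots).
        for (int i = 0; i < 109; i++) {
            if (pastEnd(buf.getInt(76 + 4 * i) & 0xFFFFFFFFL, shift, fileLength)) return true;
        }
        return false;
    }

    private static boolean pastEnd(long sector, int shift, long fileLength) {
        if (sector >= 0xFFFFFFFAL) return false; // free/end-of-chain markers, not real sectors
        // Sector n starts at (n + 1) << shift because the header fills the first slot.
        return ((sector + 2) << shift) > fileLength;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OleTruncationCheck <file>");
            return;
        }
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            if (f.length() < 512) {
                System.out.println("too short for an OLE2 header");
                return;
            }
            byte[] header = new byte[512];
            f.readFully(header);
            if (!isOle2(header)) {
                System.out.println("not OLE2");
                return;
            }
            System.out.println(looksTruncated(header, f.length())
                ? "likely truncated"
                : "header-referenced sectors are present");
        }
    }
}
```

A "likely truncated" result for the attachments would point at the crawl rather than at an unknown format, although it would not tell us what the original format was.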

225... looks like it may be an SPSS output file (SPO), but I'm not sure.

[Gary Kessler|http://www.garykessler.net/library/file_sigs.html] has a helpful list of non-MS file types that rely on OLE.
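Because all of these formats share the same 8-byte OLE2 magic, the entry names in the directory are usually what tells them apart (as far as I know, that is also what our container detector keys on, e.g. "WordDocument" or "Workbook"). Here is a minimal JDK-only sketch for dumping the names in the first directory sector of a sample file. The class name OleEntryNames is hypothetical, and the sketch assumes the standard 128-byte directory entry layout (UTF-16LE name at byte 0, name length in bytes at byte 64, object type at byte 66). It does not follow the FAT chain, so it only sees the first few entries, and it assumes the file has already been identified as OLE2.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or POI.
class OleEntryNames {
    static final int ENTRY_SIZE = 128;

    // Returns the names of the allocated entries in one directory sector.
    static List<String> entryNames(byte[] dirSector) {
        List<String> names = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(dirSector).order(ByteOrder.LITTLE_ENDIAN);
        for (int off = 0; off + ENTRY_SIZE <= dirSector.length; off += ENTRY_SIZE) {
            int nameBytes = buf.getShort(off + 64) & 0xFFFF; // includes the 2-byte terminator
            int type = dirSector[off + 66] & 0xFF;           // 0 = unallocated entry
            if (type == 0 || nameBytes < 2 || nameBytes > 64) continue;
            names.add(new String(dirSector, off, nameBytes - 2, StandardCharsets.UTF_16LE));
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OleEntryNames <file>");
            return;
        }
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            byte[] header = new byte[512];
            f.readFully(header); // throws EOFException if the file is shorter than a header
            ByteBuffer h = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            int shift = h.getShort(30) & 0xFFFF;
            if (shift != 9 && shift != 12) {
                System.out.println("unexpected sector shift: " + shift);
                return;
            }
            long firstDir = h.getInt(48) & 0xFFFFFFFFL;
            long offset = (firstDir + 1) << shift; // the header fills the first sector slot
            if (offset + (1L << shift) > f.length()) {
                System.out.println("first directory sector is past the end of the file");
                return;
            }
            byte[] sector = new byte[1 << shift];
            f.seek(offset);
            f.readFully(sector);
            for (String name : entryNames(sector)) {
                System.out.println(name);
            }
        }
    }
}
```

If the directory sector is missing from a truncated file, this prints the "past the end" message instead, so it only helps on samples that still contain their directory.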

> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
>                 Key: TIKA-1813
>                 URL: https://issues.apache.org/jira/browse/TIKA-1813
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
>     at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file types and patching our OLE mime detector would be great.


