Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/08/30 14:46:00 UTC
[jira] [Comment Edited] (TIKA-2450) OfficeParser.parse called for zero-byte file with .doc extension

    [ https://issues.apache.org/jira/browse/TIKA-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147339#comment-16147339 ] 

Tim Allison edited comment on TIKA-2450 at 8/30/17 2:45 PM:
------------------------------------------------------------

Comment too late... keeping for posterity.  Y, let's add a normalized exception that extends TikaException.
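
A minimal sketch of what such a normalized exception could look like. Everything here is an assumption for illustration: {{StandInTikaException}} is a local placeholder for {{TikaException}} so that the snippet compiles without Tika on the classpath, and {{EmptyInputException}} and {{requireNonEmpty}} are hypothetical names, not existing Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Local placeholder for org.apache.tika.exception.TikaException (assumption:
// used only so this sketch compiles on its own).
class StandInTikaException extends Exception {
    StandInTikaException(String msg) {
        super(msg);
    }
}

// Hypothetical normalized exception: callers can catch this one type instead
// of digging a parser-specific cause out of a generic exception.
class EmptyInputException extends StandInTikaException {
    EmptyInputException() {
        super("input contains zero bytes");
    }
}

class NormalizedExceptionSketch {
    // Hypothetical guard: peeks at one byte and rewinds, so a non-empty
    // stream is handed on unchanged.
    static void requireNonEmpty(InputStream in) throws IOException, EmptyInputException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1);
        int first = in.read();
        in.reset();
        if (first == -1) {
            throw new EmptyInputException();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            requireNonEmpty(new ByteArrayInputStream(new byte[0]));
        } catch (StandInTikaException e) {
            // The caller sees the same exception type regardless of which
            // parser would have been selected.
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```

Because the hypothetical exception extends the base type, existing catch blocks for the base type would keep working, while callers that care about empty input can catch the narrower type.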

-Y, I see your point.  I agree about the preference for byte-based detection over extension-based detection.  As you know, we use a combination: extensions are all we have for, e.g., text files, and, IIRC, we rely on extensions for corrupt OOXML files.-

-And, y, I completely agree that, of course, a 0-byte file can't be a Word Document, but it is possible that it used to be and was somehow truncated to 0 bytes, e.g. by a failure of cp/mv or network transmission, or possibly in a forensics/carving context (??? I defer to those in that field, though).-

-So, to all, is there value in knowing what format a 0-byte file _might_ have been?  I'm willing to accept "no". :)-


was (Author: tallison@mitre.org):
Y, I see your point.  I agree about the preference for byte-based detection over extension-based detection.  As you know, we use a combination: extensions are all we have for, e.g., text files, and, IIRC, we rely on extensions for corrupt OOXML files.

And, y, I completely agree that, of course, a 0-byte file can't be a Word Document, but it is possible that it used to be and was somehow truncated to 0 bytes, e.g. by a failure of cp/mv or network transmission, or possibly in a forensics/carving context (??? I defer to those in that field, though).

So, to all, is there value in knowing what format a 0-byte file _might_ have been?  I'm willing to accept "no". :)
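
To make the trade-off concrete, here is a rough sketch of combined detection: content bytes win when present, the extension is only a fallback, and a 0-byte file keeps its extension as a hint rather than a verdict. This is not how Tika's detectors are implemented; the class, method, and return labels are invented for illustration. The only external fact relied on is the 8-byte OLE2 compound-file signature used by legacy .doc files.

```java
import java.util.Arrays;
import java.util.Locale;

class DetectionSketch {
    // OLE2 compound-file signature, the container format of legacy .doc files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Hypothetical detector: returns an illustrative label, not a real MIME type.
    static String detect(byte[] content, String fileName) {
        // 1. Byte-based detection takes precedence when the signature matches.
        if (content.length >= OLE2_MAGIC.length
                && Arrays.equals(Arrays.copyOf(content, OLE2_MAGIC.length), OLE2_MAGIC)) {
            return "ole2";
        }
        boolean docExtension = fileName.toLowerCase(Locale.ROOT).endsWith(".doc");
        // 2. No bytes at all: report "empty", keeping the extension as a hint
        //    about what the file might once have been.
        if (content.length == 0) {
            return docExtension ? "empty (extension hints at .doc)" : "empty";
        }
        // 3. Bytes present but unrecognized: fall back to the extension.
        return docExtension ? "doc (by extension only)" : "unknown";
    }
}
```

With this shape, a caller that has no use for the hint can treat both "empty" results alike, while a forensics-style caller can still see what the extension suggested.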

> OfficeParser.parse called for zero-byte file with .doc extension
> ----------------------------------------------------------------
>
>                 Key: TIKA-2450
>                 URL: https://issues.apache.org/jira/browse/TIKA-2450
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>            Priority: Minor
>
> A zero-byte (empty) file with a .doc extension is detected as a Word Document and the {{OfficeParser.parse}} method is called for this file.
> We then get a {{TikaException}}, with the cause given as an {{org.apache.poi.EmptyFileException}}.
> I think it would be more useful if the file were NOT detected as a Word Document, meaning that the {{AutoDetectParser}} would then fall back to whatever is set as the fallback parser in the parse context.
> This is more useful because the user can then trigger some special logic for handling empty files.
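
Until detection changes, a caller could approximate the "special logic for handling empty files" with a size check before any parser is invoked. The sketch below uses only the JDK; {{EmptyFileGuard}}, {{Route}} and {{route}} are hypothetical names, and the actual hand-off to a parser is deliberately left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class EmptyFileGuard {
    // Hypothetical routing decision made before any parser is invoked.
    enum Route { EMPTY_FILE_HANDLER, NORMAL_PARSE }

    // A zero-byte file goes to the empty-file handler whatever its extension
    // says; everything else proceeds to normal detection and parsing.
    static Route route(Path file) throws IOException {
        return Files.size(file) == 0 ? Route.EMPTY_FILE_HANDLER : Route.NORMAL_PARSE;
    }

    public static void main(String[] args) throws IOException {
        // createTempFile produces an empty file, which mimics the report's
        // zero-byte file with a .doc extension.
        Path empty = Files.createTempFile("truncated", ".doc");
        try {
            System.out.println(route(empty));
        } finally {
            Files.deleteIfExists(empty);
        }
    }
}
```

This only works when the input is a file with a known size; for arbitrary streams the check would have to peek at the first byte instead.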


