Posted to dev@tika.apache.org by "Torsten Krah (JIRA)" <ji...@apache.org> on 2012/05/07 15:24:50 UTC
[jira] [Created] (TIKA-913) MagicMime detection of msdos executables does not work
Torsten Krah created TIKA-913:
---------------------------------
Summary: MagicMime detection of msdos executables does not work
Key: TIKA-913
URL: https://issues.apache.org/jira/browse/TIKA-913
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.1
Environment: Linux, JDK 1.6
Reporter: Torsten Krah
Mime detection does not work as expected (at least by me), in contrast to e.g. the sourceforge mime-util detection or the "file" utility.
For example, the putty ms-dos executable yields wrong detections:
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty
application/octet-stream
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.jpg
image/jpeg
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.exe
application/x-msdownload
It is the same binary every time, only under different names.
In contrast, "file" outputs:
krah@sf050:~$ file /tmp/putty
/tmp/putty: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
krah@sf050:~$ file /tmp/putty.jpg
/tmp/putty.jpg: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
krah@sf050:~$ file /tmp/putty.exe
/tmp/putty.exe: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
So magic mime detection should be able to detect that this is actually an executable.
E.g. for a PDF it does work:
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.pdf
application/pdf
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print
application/pdf
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.jpg
application/pdf
Here Tika detects what is expected.
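To illustrate the behaviour expected above, here is a minimal sketch of content-based detection that ignores the file name entirely. It is not Tika's implementation: MagicSniffer and its two-entry magic table are hypothetical, and mapping the "MZ" prefix to application/x-msdownload is an assumption borrowed from the type Tika printed for putty.exe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps leading "magic" bytes to a media type.
final class MagicSniffer {
    // Assumed magic table; a real detector would know many more signatures.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("MZ", "application/x-msdownload");
    }

    // Returns the media type of the first matching prefix, else the generic fallback.
    static String sniff(byte[] head) {
        for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
            byte[] magic = entry.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, magic)) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to 8 leading bytes of each file named on the command line.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] head = new byte[8];
            int filled = 0;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                int n;
                while (filled < head.length
                        && (n = in.read(head, filled, head.length - filled)) > 0) {
                    filled += n;
                }
            }
            System.out.println(name + ": " + sniff(Arrays.copyOf(head, filled)));
        }
    }
}
```

Because only the leading bytes are inspected, /tmp/putty, /tmp/putty.jpg and /tmp/putty.exe would all get the same answer from this sketch, which is the behaviour "file" shows above.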
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-913) MagicMime detection of msdos executables does not work
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272258#comment-13272258 ]
Nick Burch commented on TIKA-913:
---------------------------------
If anyone wanted to add a parser for PE(32/64) files, then this doc should be handy: <http://msdn.microsoft.com/en-us/windows/hardware/gg463119.aspx>. We should be able to get the odd common thing, like creation date, along with lots of other info too.
Based on this info, and the osdev page, I've added mime magic for what look to be the common variants in r1336610.
> MagicMime detection of msdos executables does not work
> ------------------------------------------------------
>
> Key: TIKA-913
> URL: https://issues.apache.org/jira/browse/TIKA-913
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.1
> Environment: Linux, JDK 1.6
> Reporter: Torsten Krah
> Labels: detection, magic, mime
[jira] [Resolved] (TIKA-913) MagicMime detection of msdos executables does not work
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-913.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.2
[jira] [Commented] (TIKA-913) MagicMime detection of msdos executables does not work
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272242#comment-13272242 ]
Nick Burch commented on TIKA-913:
---------------------------------
I believe that MS-DOS executables are not actually PE32 files. http://wiki.osdev.org/PE seems to have some good details on the PE32 and PE64 formats that should help for detection.
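The distinction drawn here can be sketched in code. The example below is not the magic that went into Tika; PeClassifier and its Kind names are hypothetical. It relies on the publicly documented PE layout: an "MZ" DOS header, a little-endian 32-bit offset at 0x3C pointing to the "PE\0\0" signature, and an optional-header magic of 0x10b (PE32) or 0x20b (PE32+, i.e. 64-bit) located 24 bytes after that signature.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: tells plain MZ files apart from PE32 / PE32+.
final class PeClassifier {
    enum Kind { NOT_MZ, MZ_ONLY, PE32, PE32_PLUS, PE_UNKNOWN }

    static Kind classify(byte[] data) {
        // Every DOS and PE executable starts with the two bytes 'M' 'Z'.
        if (data.length < 2 || data[0] != 'M' || data[1] != 'Z') {
            return Kind.NOT_MZ;
        }
        // The DOS header is 0x40 bytes; its last field (e_lfanew, at 0x3C)
        // holds the file offset of the PE signature.
        if (data.length < 0x40) {
            return Kind.MZ_ONLY;
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long peOffset = buf.getInt(0x3C) & 0xFFFFFFFFL;
        if (peOffset + 4 > data.length) {
            return Kind.MZ_ONLY;
        }
        int p = (int) peOffset;
        boolean hasSignature = data[p] == 'P' && data[p + 1] == 'E'
                && data[p + 2] == 0 && data[p + 3] == 0;
        if (!hasSignature) {
            // MS-DOS executable, or another MZ-based format this sketch ignores.
            return Kind.MZ_ONLY;
        }
        // Optional header follows the 4-byte signature and the 20-byte COFF header.
        long magicOffset = peOffset + 24;
        if (magicOffset + 2 > data.length) {
            return Kind.PE_UNKNOWN;
        }
        int magic = buf.getShort((int) magicOffset) & 0xFFFF;
        if (magic == 0x10B) {
            return Kind.PE32;
        }
        if (magic == 0x20B) {
            return Kind.PE32_PLUS;
        }
        return Kind.PE_UNKNOWN;
    }

    // Classifies each file named on the command line (reads the whole file for brevity).
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + classify(Files.readAllBytes(Paths.get(name))));
        }
    }
}
```

A file that starts with "MZ" but has no valid PE signature is reported as MZ_ONLY rather than as a definite MS-DOS program, since other MZ-based formats exist that this sketch does not try to recognise.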