Posted to dev@tika.apache.org by "Torsten Krah (JIRA)" <ji...@apache.org> on 2012/05/07 15:24:50 UTC

[jira] [Created] (TIKA-913) MagicMime detection of msdos executables does not work

Torsten Krah created TIKA-913:
---------------------------------

             Summary: MagicMime detection of msdos executables does not work
                 Key: TIKA-913
                 URL: https://issues.apache.org/jira/browse/TIKA-913
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.1
         Environment: Linux, JDK 1.6
            Reporter: Torsten Krah


MIME detection does not work as expected (at least by me), in contrast to, e.g., the SourceForge mime-util detection or the "file" utility.
For example, running it on the putty MS-DOS executable results in wrong detections:

krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty
application/octet-stream
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.jpg
image/jpeg
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/putty.exe
application/x-msdownload


It is the same binary resource every time, only with different names.
In contrast, "file" outputs:

krah@sf050:~$ file /tmp/putty
/tmp/putty: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
krah@sf050:~$ file /tmp/putty.jpg
/tmp/putty.jpg: PE32 executable for MS Windows (GUI) Intel 80386 32-bit
krah@sf050:~$ file /tmp/putty.exe
/tmp/putty.exe: PE32 executable for MS Windows (GUI) Intel 80386 32-bit

So magic mime detection should be able to detect that this is actually an executable.

For a PDF, for example, it does work:

krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.pdf
application/pdf
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print
application/pdf
krah@sf050:~$ java -jar /tmp/tika-app-1.1.jar --detect /tmp/print.jpg 
application/pdf

Here Tika detects what is expected.
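
The behaviour asked for above (the type follows the leading bytes, not the file name) can be illustrated with a small stand-alone sketch. This is not Tika code and does not use Tika's API; the class name MagicSniffer and the method sniff are hypothetical, and returning application/x-msdownload for an "MZ" prefix is only an assumption borrowed from the putty.exe result in this report.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical illustration only; this is not how Tika implements detection.
final class MagicSniffer {

    // Decide the type from the leading bytes alone; the file name is never consulted.
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, "MZ".getBytes(StandardCharsets.US_ASCII))) {
            // Assumed mapping, taken from the putty.exe output in the report.
            return "application/x-msdownload";
        }
        return "application/octet-stream";
    }

    // True if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MagicSniffer.java <file>");
            return;
        }
        byte[] head = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = in.read(head);
            head = Arrays.copyOf(head, Math.max(n, 0));
        }
        System.out.println(sniff(head));
    }
}
```

With such a check, /tmp/putty, /tmp/putty.jpg and /tmp/putty.exe would all get the same answer, because only the content is inspected.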


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-913) MagicMime detection of msdos executables does not work

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272258#comment-13272258 ] 

Nick Burch commented on TIKA-913:
---------------------------------

If anyone wants to add a parser for PE (32/64) files, then this doc should be handy: <http://msdn.microsoft.com/en-us/windows/hardware/gg463119.aspx>. We should be able to get the odd common thing, like the creation date, along with lots of other info too.

Based on this info, and the osdev page, I've added mime magic for what look to be the common variants in r1336610.
                
> MagicMime detection of msdos executables does not work
> ------------------------------------------------------
>
>                 Key: TIKA-913
>                 URL: https://issues.apache.org/jira/browse/TIKA-913
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Linux, JDK 1.6
>            Reporter: Torsten Krah
>              Labels: detection, magic, mime


[jira] [Resolved] (TIKA-913) MagicMime detection of msdos executables does not work

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-913.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
    


[jira] [Commented] (TIKA-913) MagicMime detection of msdos executables does not work

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272242#comment-13272242 ] 

Nick Burch commented on TIKA-913:
---------------------------------

I believe that MS-DOS executables are not actually PE32 files. http://wiki.osdev.org/PE seems to have some good details on the PE32 and PE64 formats that should help with detection.
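
As a rough sketch of the distinction drawn here: a PE file starts with the same "MZ" stub as a plain MS-DOS executable, but the 4-byte little-endian value at offset 0x3C points to a "PE\0\0" signature. The code below is an assumption-laden illustration, not the magic that was added to Tika; the class PeCheck, the enum Kind and the method classify are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical illustration only; not Tika's detector.
final class PeCheck {

    enum Kind { NOT_MZ, MZ_ONLY, PE }

    // Classify a byte array as non-MZ, plain MZ (e.g. MS-DOS), or PE.
    static Kind classify(byte[] data) {
        // Both MS-DOS executables and PE files begin with the ASCII bytes "MZ".
        if (data.length < 2 || data[0] != 'M' || data[1] != 'Z') {
            return Kind.NOT_MZ;
        }
        // The offset field at 0x3C needs a 64-byte DOS header to be present.
        if (data.length < 0x40) {
            return Kind.MZ_ONLY;
        }
        // Little-endian 32-bit offset of the PE signature.
        int peOffset = ByteBuffer.wrap(data, 0x3C, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        if (peOffset < 0 || peOffset > data.length - 4) {
            return Kind.MZ_ONLY;
        }
        boolean hasSignature = data[peOffset] == 'P'
                && data[peOffset + 1] == 'E'
                && data[peOffset + 2] == 0
                && data[peOffset + 3] == 0;
        return hasSignature ? Kind.PE : Kind.MZ_ONLY;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PeCheck.java <file>");
            return;
        }
        // Reads the whole file for simplicity; a real detector would read only a prefix.
        System.out.println(classify(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

A magic-table entry can usually match only fixed offsets, so a check like this, which follows a pointer inside the file, is closer to what a parser would do than to simple prefix magic.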
                
