Posted to dev@tika.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2010/12/11 20:39:01 UTC

[jira] Created: (TIKA-570) If this is a BMP, my name is horatio alger

If this is a BMP, my name is horatio alger
------------------------------------------

                 Key: TIKA-570
                 URL: https://issues.apache.org/jira/browse/TIKA-570
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
            Reporter: Benson Margulies


I am attaching a file which Tika is identifying as a bmp.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970627#action_12970627 ] 

Benjamin Douglas commented on TIKA-570:
---------------------------------------

Also looking at:

BITMAPCOREHEADER: http://msdn.microsoft.com/en-us/library/dd183372%28VS.85%29.aspx
BITMAPINFOHEADER: http://msdn.microsoft.com/en-us/library/dd183376%28VS.85%29.aspx

It looks as if the two-byte field at 0x1C (bit count) must hold one of the values 0, 1, 4, 8, 16, 24, or 32. Since all of these have 0x00 in their most significant byte, this check should again have very little overlap with text data that starts with ASCII.
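
As a rough illustration of the bit count check, here is a minimal Java sketch. The class and method names are hypothetical and not part of Tika. It assumes the BITMAPINFOHEADER layout, where the bit count is a little-endian 16-bit field at 0x1C, and it takes the first line of the attached file from the quote elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: validates the BMP bit count field.
final class BmpBitCountCheck {
    // Assumed offset of the bit count field (BITMAPINFOHEADER layout).
    static final int BIT_COUNT_OFFSET = 0x1C;

    static boolean hasValidBitCount(byte[] data) {
        if (data.length < BIT_COUNT_OFFSET + 2) {
            return false; // too short to contain the field
        }
        // Read the field as an unsigned little-endian 16-bit value.
        int bitCount = (data[BIT_COUNT_OFFSET] & 0xFF)
                | ((data[BIT_COUNT_OFFSET + 1] & 0xFF) << 8);
        switch (bitCount) {
            case 0: case 1: case 4: case 8: case 16: case 24: case 32:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasValidBitCount(text));
    }
}
```

For ASCII text the high byte of the field is a printable character or a line break rather than 0x00, so the value falls outside the allowed set.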

> If this is a BMP, my name is horatio alger
> ------------------------------------------
>
>                 Key: TIKA-570
>                 URL: https://issues.apache.org/jira/browse/TIKA-570
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>            Reporter: Benson Margulies
>         Attachments: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
>
>
> I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.
>  
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> 	at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
> 	at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
> 	at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
> 	at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
> 	at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
> 	at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
> 	at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
> 	at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
> 	at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	... 8 more



[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970579#action_12970579 ] 

Nick Burch commented on TIKA-570:
---------------------------------

Reading http://en.wikipedia.org/wiki/BMP_file_format I'm not sure what else we can rely on finding, but I'm tempted to say we should also require either "00 00" or "00 00 00" inside the first few KB. A text file shouldn't have that many nulls, but most bitmaps will.
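
A minimal Java sketch of that null-byte heuristic follows. The names and the 4096-byte limit are assumptions chosen for illustration, not anything taken from Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: looks for two consecutive 0x00 bytes.
final class NullRunCheck {
    // Returns true if "00 00" occurs within the first `limit` bytes.
    static boolean containsNullPair(byte[] data, int limit) {
        int end = Math.min(data.length, limit);
        for (int i = 0; i + 1 < end; i++) {
            if (data[i] == 0 && data[i + 1] == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n"
                .getBytes(StandardCharsets.US_ASCII);
        // 4096 stands in for "the first few KB"; the exact limit is an assumption.
        System.out.println(containsNullPair(text, 4096));
    }
}
```

A scan like this is position-independent, so it is a weaker signal than a fixed-offset header check, but it needs no knowledge of the header version.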



[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Douglas updated TIKA-570:
----------------------------------

    Attachment: TIKA-570.patch

I am attaching a patch that encodes the "BM" prefix, the color planes signature, and the possible bit count values in tika-mimetypes.xml. I believe that since we are checking for the "BM" magic, this should not conflict with any OS/2 variations, since they have different magic values, like "BA", "CI", etc.

This patch adds the original text file to the test document set and confirms in the unit test that it is not detected as a bitmap.



[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970578#action_12970578 ] 

Nick Burch commented on TIKA-570:
---------------------------------

First line of the file:
"BMW to Make Hybrid Sports Car
"

And the BMP matcher is:
    <magic priority="50">
      <match value="BM" type="string" offset="0" />
    </magic>

I think we'll need to add a second check to it as well.
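
To make the false positive concrete, here is a small Java sketch of a prefix-only test equivalent to the magic rule quoted above. The class and method names are hypothetical; this is not how Tika evaluates its magic rules internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: mimics a rule that only tests "BM" at offset 0.
final class BmPrefixOnlyCheck {
    static boolean startsWithBm(byte[] data) {
        return data.length >= 2 && data[0] == 'B' && data[1] == 'M';
    }

    public static void main(String[] args) {
        // First line of the attached text file, as quoted in this comment.
        byte[] text = "BMW to Make Hybrid Sports Car\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithBm(text)); // prints "true"
    }
}
```

Any text that happens to begin with the letters "BM" passes a two-byte prefix test, which is why a second condition is needed.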



[jira] Resolved: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-570.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.9

Thanks for the patch, I agree those look like the right header bytes to check for. I've applied it in r1045006.



[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-570:
----------------------------------

    Description: 
I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.

 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
	at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
	at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
	at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
	at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
	at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
	at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
	at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
	at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
	at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
	... 8 more


  was:I am attaching a file which Tika is identifying as a bmp.




[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970626#action_12970626 ] 

Benjamin Douglas commented on TIKA-570:
---------------------------------------

What about this (from the Wikipedia article):

Offset: 0x1A
Size: 2
Purpose: the number of color planes being used. Must be set to 1.

This means that there is always a two-byte 0x01 0x00 sequence at a fixed offset near the beginning of the file. It is in the header, and granted, there are different versions of the header, but the description in the article makes it look like the majority of headers have this field, possibly modulo OS/2 flavors. The pattern 0x01 0x00 is unlikely to appear in most plain text, especially text that begins with ASCII. The BMP file in the unit tests has this signature, for example.
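
A minimal Java sketch of the color planes check, for illustration only. The class and method names are hypothetical and not part of Tika, and the sample text assumes the attached file starts exactly with the first line quoted elsewhere in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tika code: tests the color planes field of a BMP header.
final class BmpPlanesCheck {
    // Offset of the two-byte color planes field, per the Wikipedia description.
    static final int PLANES_OFFSET = 0x1A;

    // True if the field holds the little-endian value 1, i.e. bytes 0x01 0x00.
    static boolean hasSinglePlane(byte[] data) {
        return data.length >= PLANES_OFFSET + 2
                && data[PLANES_OFFSET] == 0x01
                && data[PLANES_OFFSET + 1] == 0x00;
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasSinglePlane(text));
    }
}
```

Combined with the "BM" prefix, a fixed-offset test like this rejects the sample text because offsets 0x1A and 0x1B hold printable letters there.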



[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-570:
----------------------------------

    Attachment: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
                C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
