Posted to dev@tika.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2010/12/11 20:39:01 UTC
[jira] Created: (TIKA-570) If this is a BMP, my name is horatio alger
If this is a BMP, my name is horatio alger
------------------------------------------
Key: TIKA-570
URL: https://issues.apache.org/jira/browse/TIKA-570
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Reporter: Benson Margulies
I am attaching a file which Tika is identifying as a bmp.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970627#action_12970627 ]
Benjamin Douglas commented on TIKA-570:
---------------------------------------
Also looking at:
BITMAPCOREHEADER: http://msdn.microsoft.com/en-us/library/dd183372%28VS.85%29.aspx
BITMAPINFOHEADER: http://msdn.microsoft.com/en-us/library/dd183376%28VS.85%29.aspx
It looks as if the two-byte value at 0x1C (bit count) must be one of 0, 1, 4, 8, 16, 24, or 32. Since all of these have 0x00 in their most significant byte, this check should again have very little overlap with text data that starts with ASCII.
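A minimal sketch of that bit-count test in Java. The class and method names are hypothetical and this is not the Tika implementation; it assumes the BITMAPINFOHEADER layout, where the bit count is a 16-bit little-endian value at offset 0x1C.

```java
import java.nio.charset.StandardCharsets;

class BitCountCheck {
    // Offset of the bit count field, as given in the comment above.
    static final int BIT_COUNT_OFFSET = 0x1C;

    // Returns true if the 16-bit little-endian value at 0x1C is an allowed bit count.
    static boolean hasValidBitCount(byte[] header) {
        if (header.length < BIT_COUNT_OFFSET + 2) {
            return false; // too short to contain the field
        }
        int bitCount = (header[BIT_COUNT_OFFSET] & 0xFF)
                | ((header[BIT_COUNT_OFFSET + 1] & 0xFF) << 8);
        switch (bitCount) {
            case 0: case 1: case 4: case 8: case 16: case 24: case 32:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // The first line of the misdetected text file, as quoted in this thread.
        byte[] text = "BMW to Make Hybrid Sports Car\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasValidBitCount(text)); // prints "false"
    }
}
```

For ASCII text the high byte of that field is almost never 0x00, so the check fails quickly on files like the one attached here.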
> If this is a BMP, my name is horatio alger
> ------------------------------------------
>
> Key: TIKA-570
> URL: https://issues.apache.org/jira/browse/TIKA-570
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Reporter: Benson Margulies
> Attachments: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
>
>
> I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
> at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
> at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
> at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
> at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
> at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
> at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 8 more
[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970579#action_12970579 ]
Nick Burch commented on TIKA-570:
---------------------------------
Reading http://en.wikipedia.org/wiki/BMP_file_format I'm not sure what else we can be sure to find, but I'm tempted to say we also require either "00 00" or "00 00 00" inside the first few KB - a text file shouldn't have that many nulls, but most bitmaps will.
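A rough Java sketch of that null-run idea, under stated assumptions: the helper name, the 4096-byte scan limit standing in for "the first few KB", and the run length parameter are all illustrative choices, not anything taken from Tika.

```java
import java.nio.charset.StandardCharsets;

class NullRunCheck {
    // Returns true if the first `limit` bytes contain at least `runLength`
    // consecutive 0x00 bytes. `runLength` is expected to be positive.
    static boolean containsNullRun(byte[] data, int limit, int runLength) {
        int end = Math.min(data.length, limit);
        int run = 0;
        for (int i = 0; i < end; i++) {
            run = (data[i] == 0) ? run + 1 : 0; // extend or reset the current run
            if (run >= runLength) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {'B', 'M', 0x36, 0x00, 0x00, 0x00};
        System.out.println(containsNullRun(text, 4096, 2));   // prints "false"
        System.out.println(containsNullRun(binary, 4096, 2)); // prints "true"
    }
}
```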
[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Douglas updated TIKA-570:
----------------------------------
Attachment: TIKA-570.patch
I am attaching a patch that encodes the "BM" prefix, the color planes signature, and the possible bit count values in tika-mimetypes.xml. I believe that since we are checking for the "BM" magic, this should not conflict with any OS/2 variations, since they have different magic values, like "BA", "CI", etc.
This patch adds the original text file to the test document set and confirms in the unit test that it is not detected as a bitmap.
> If this is a BMP, my name is horatio alger
> ------------------------------------------
>
> Key: TIKA-570
> URL: https://issues.apache.org/jira/browse/TIKA-570
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Reporter: Benson Margulies
> Attachments: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, TIKA-570.patch
>
>
> I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
> at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
> at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
> at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
> at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
> at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
> at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 8 more
[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970578#action_12970578 ]
Nick Burch commented on TIKA-570:
---------------------------------
First line of the file:
"BMW to Make Hybrid Sports Car
"
And the BMP matcher is:
<magic priority="50">
  <match value="BM" type="string" offset="0" />
</magic>
I think we'll need to add a second check to it as well.
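To illustrate why that matcher fires on this file, here is a small Java sketch that mimics a two-byte string match at offset 0. The helper is hypothetical and only imitates the rule; it is not how Tika evaluates magic entries internally.

```java
import java.nio.charset.StandardCharsets;

class BmMagicDemo {
    // Hypothetical stand-in for <match value="BM" type="string" offset="0" />.
    static boolean matchesBmMagic(byte[] data) {
        return data.length >= 2 && data[0] == 'B' && data[1] == 'M';
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n".getBytes(StandardCharsets.US_ASCII);
        // Any text starting with "BM" satisfies the two-byte rule.
        System.out.println(matchesBmMagic(text)); // prints "true"
    }
}
```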
[jira] Resolved: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-570.
-----------------------------
Resolution: Fixed
Fix Version/s: 0.9
Thanks for the patch, I agree those look like the right header bytes to check for. I've applied it in r1045006.
> If this is a BMP, my name is horatio alger
> ------------------------------------------
>
> Key: TIKA-570
> URL: https://issues.apache.org/jira/browse/TIKA-570
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Reporter: Benson Margulies
> Fix For: 0.9
>
> Attachments: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, TIKA-570.patch
>
>
> I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.
>
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
> at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
> at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
> at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
> at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
> at java.lang.Thread.run(Thread.java:680)
> Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
> at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
> at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
> at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> ... 8 more
[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated TIKA-570:
----------------------------------
Description:
I am attaching a file which Tika is identifying as a bmp. It contains ordinary text.
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.image.ImageParser@20a19811
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at com.basistech.jug.FileHarvester.process(FileHarvester.java:204)
at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:165)
at com.basistech.jug.FileHarvester.harvestDir(FileHarvester.java:179)
at com.basistech.jug.FileHarvester.harvest(FileHarvester.java:135)
at com.basistech.jug.FileHarvester.run(FileHarvester.java:247)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.RuntimeException: New BMP version not implemented yet.
at com.sun.imageio.plugins.bmp.BMPImageReader.readHeader(BMPImageReader.java:462)
at com.sun.imageio.plugins.bmp.BMPImageReader.getWidth(BMPImageReader.java:174)
at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:75)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 8 more
was:I am attaching a file which Tika is identifying as a bmp.
[jira] Commented: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970626#action_12970626 ]
Benjamin Douglas commented on TIKA-570:
---------------------------------------
What about this (from the Wikipedia article):
Offset: 0x1A
Size: 2
Purpose: the number of color planes being used. Must be set to 1.
This means that there is always a two-byte 0x01 0x00 sequence at a fixed offset near the beginning of the file. This is in the header, and granted, there are different versions of the header, but the description in the article makes it look like the majority of headers have this, possibly modulo OS/2 flavors. The pattern 0x01 0x00 is not likely to appear in most plain text, especially text that begins with ASCII. The BMP file in the unit tests has this signature, for example.
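A minimal Java sketch of the color-planes test described above. The names are hypothetical, and the offset 0x1A is taken from the Wikipedia excerpt, so it assumes a header version that places the field there.

```java
import java.nio.charset.StandardCharsets;

class ColorPlanesCheck {
    // Offset of the two-byte color planes field, per the excerpt above.
    static final int PLANES_OFFSET = 0x1A;

    // Returns true if bytes 0x1A..0x1B are 0x01 0x00 (little-endian value 1).
    static boolean hasSinglePlane(byte[] header) {
        return header.length >= PLANES_OFFSET + 2
                && header[PLANES_OFFSET] == 0x01
                && header[PLANES_OFFSET + 1] == 0x00;
    }

    public static void main(String[] args) {
        byte[] text = "BMW to Make Hybrid Sports Car\n".getBytes(StandardCharsets.US_ASCII);
        // In this text, offsets 0x1A and 0x1B hold 'C' and 'a', not 0x01 0x00.
        System.out.println(hasSinglePlane(text)); // prints "false"
    }
}
```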
[jira] Updated: (TIKA-570) If this is a BMP, my name is horatio alger
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated TIKA-570:
----------------------------------
Attachment: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
> If this is a BMP, my name is horatio alger
> ------------------------------------------
>
> Key: TIKA-570
> URL: https://issues.apache.org/jira/browse/TIKA-570
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.8
> Reporter: Benson Margulies
> Attachments: C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt, C80A5295-EFC7-44DD-9A39-B882D1EC6F38.txt
>
>
> I am attaching a file which Tika is identifying as a bmp.