Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/09/07 05:03:03 UTC
Jpeg parsing issues
Hi devs,
I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a
number of documents now fail during parsing that previously passed.
Many of these failures seem related to image processing. For example:
Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:71)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
    at bixo.parser.TikaCallable.call(TikaCallable.java:63)
    at bixo.parser.TikaCallable.call(TikaCallable.java:1)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.lang.Thread.run(Thread.java:637)
Caused by: com.drew.imaging.jpeg.JpegProcessingException: not a jpeg file
    at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
    at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:67)
    ... 8 more
Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata,
and thus not run into these types of issues?
Thanks,
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Jpeg parsing issues
Posted by Ken Krugler <kk...@transpac.com>.
On Sep 7, 2010, at 5:58am, Staffan wrote:
> On Tue, Sep 7, 2010 at 10:43 AM, Nick Burch <ni...@alfresco.com> wrote:
>> On Mon, 6 Sep 2010, Ken Krugler wrote:
>>>
>>> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a number
>>> of documents now fail during parsing that previously passed.
>>
>> Any chance you could create a new jira issue, and upload one of the problem
>> documents?
>>
>>> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, and
>>> thus not run into these types of issues?
>>
>> The image metadata stuff has changed dramatically since 0.7, and we're now
>> processing a lot more of the files in search of useful metadata than we used
>> to.
>>
>
> The exception is thrown before we start to extract the metadata. It
> looks like the file is auto detected as a Jpeg but the EXIF parser
> (the same version that Tika has used for a long time) says it is not a
> Jpeg. Please attach one of the failing files to the issue.
I'm extracting these from a .arc web archive file (from the Heritrix
project). So I'll have to write some code to save these as individual
files - hopefully next week.
-- Ken
Re: Jpeg parsing issues
Posted by Staffan <so...@gmail.com>.
On Tue, Sep 7, 2010 at 10:43 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Mon, 6 Sep 2010, Ken Krugler wrote:
>>
>> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a number
>> of documents now fail during parsing that previously passed.
>
> Any chance you could create a new jira issue, and upload one of the problem
> documents?
>
>> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, and
>> thus not run into these types of issues?
>
> The image metadata stuff has changed dramatically since 0.7, and we're now
> processing a lot more of the files in search of useful metadata than we used
> to.
>
The exception is thrown before we start extracting the metadata. It
looks like the file is auto-detected as a JPEG, but the EXIF parser
(the same version that Tika has used for a long time) says it is not a
JPEG. Please attach one of the failing files to the issue.
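One quick way to triage such files before attaching them might be to check whether they begin with the JPEG start-of-image marker (the bytes FF D8). The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the "not a jpeg file" error relates to a missing marker at the start of the stream, which this thread does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or the EXIF library.
public class JpegSignatureCheck {

    // Returns true if the first two bytes are the JPEG start-of-image marker FF D8.
    // InputStream.read() yields 0..255 per byte, or -1 at end of stream.
    public static boolean startsWithJpegMarker(InputStream in) throws IOException {
        int first = in.read();
        int second = in.read();
        return first == 0xFF && second == 0xD8;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data: one buffer with the marker, one that is really HTML.
        byte[] jpegLike = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] htmlLike = "<html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithJpegMarker(new ByteArrayInputStream(jpegLike)));
        System.out.println(startsWithJpegMarker(new ByteArrayInputStream(htmlLike)));
    }
}
```

A file that fails this check but was still detected as a JPEG (for example, based on its name or a declared content type) would be a good candidate to attach to the issue.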
/Staffan
Re: Jpeg parsing issues
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 6 Sep 2010, Ken Krugler wrote:
> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a
> number of documents now fail during parsing that previously passed.
Any chance you could create a new JIRA issue and upload one of the
problem documents?
> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata,
> and thus not run into these types of issues?
The image metadata stuff has changed dramatically since 0.7, and we're now
processing a lot more of the files in search of useful metadata than we
used to.
Nick