Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/09/07 05:03:03 UTC

Jpeg parsing issues

Hi devs,

I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a  
number of documents now fail during parsing that previously passed.

Many of these failures seem related to image processing. For example:

Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
	at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:71)
	at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
	at bixo.parser.TikaCallable.call(TikaCallable.java:63)
	at bixo.parser.TikaCallable.call(TikaCallable.java:1)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.lang.Thread.run(Thread.java:637)
Caused by: com.drew.imaging.jpeg.JpegProcessingException: not a jpeg file
	at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
	at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
	at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
	at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:67)
	... 8 more

Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata,  
and thus not run into these types of issues?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Jpeg parsing issues

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 7, 2010, at 5:58am, Staffan wrote:

> On Tue, Sep 7, 2010 at 10:43 AM, Nick Burch <ni...@alfresco.com> wrote:
>> On Mon, 6 Sep 2010, Ken Krugler wrote:
>>>
>>> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a number
>>> of documents now fail during parsing that previously passed.
>>
>> Any chance you could create a new jira issue, and upload one of the problem
>> documents?
>>
>>> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, and
>>> thus not run into these types of issues?
>>
>> The image metadata stuff has changed dramatically since 0.7, and we're now
>> processing a lot more of the files in search of useful metadata than we used
>> to.
>>
>
> The exception is thrown before we start to extract the metadata. It
> looks like the file is auto-detected as a JPEG, but the EXIF parser
> (the same version that Tika has used for a long time) says it is not a
> JPEG. Please attach one of the failing files to the issue.

I'm extracting these from a .arc web archive file (from the Heritrix  
project). So I'll have to write some code to save these as individual  
files - hopefully next week.

-- Ken

Re: Jpeg parsing issues

Posted by Staffan <so...@gmail.com>.
On Tue, Sep 7, 2010 at 10:43 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Mon, 6 Sep 2010, Ken Krugler wrote:
>>
>> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a number
>> of documents now fail during parsing that previously passed.
>
> Any chance you could create a new jira issue, and upload one of the problem
> documents?
>
>> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, and
>> thus not run into these types of issues?
>
> The image metadata stuff has changed dramatically since 0.7, and we're now
> processing a lot more of the files in search of useful metadata than we used
> to.
>

The exception is thrown before we start to extract the metadata. It
looks like the file is auto-detected as a JPEG, but the EXIF parser
(the same version that Tika has used for a long time) says it is not a
JPEG. Please attach one of the failing files to the issue.
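
One quick way to narrow this down before filing the issue might be to
check whether the failing bytes really begin with the JPEG start-of-image
(SOI) marker, 0xFF 0xD8. The sketch below is only an illustration: the
class JpegSignatureCheck and its method startsWithSoi are hypothetical
names, not part of Tika or the EXIF library, and it assumes (without
having checked the library source) that a missing SOI marker is what
produces the "not a jpeg file" message.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for triage; not part of Tika or metadata-extractor. */
final class JpegSignatureCheck {

    private JpegSignatureCheck() {}

    /** Returns true if the data begins with the JPEG SOI marker 0xFF 0xD8. */
    static boolean startsWithSoi(byte[] data) {
        return data != null
                && data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /**
     * Same check on a stream. Note that this consumes the first two bytes,
     * so the caller needs a fresh or reset stream before parsing.
     */
    static boolean startsWithSoi(InputStream in) throws IOException {
        int first = in.read();   // -1 at end of stream, so an empty stream fails the check
        int second = in.read();
        return first == 0xFF && second == 0xD8;
    }
}
```

If startsWithSoi returns false for a document that was detected as
image/jpeg, the mismatch is probably on the detection side (or the bytes
were altered somewhere upstream); if it returns true, the file is more
likely truncated or damaged further in.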

/Staffan

Re: Jpeg parsing issues

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 6 Sep 2010, Ken Krugler wrote:
> I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a 
> number of documents now fail during parsing that previously passed.

Any chance you could create a new jira issue, and upload one of the 
problem documents?

> Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, 
> and thus not run into these types of issues?

The image metadata stuff has changed dramatically since 0.7, and we're now 
processing a lot more of the files in search of useful metadata than we 
used to.

Nick