Posted to dev@stanbol.apache.org by Reto Bachmann-Gmür <re...@apache.org> on 2012/10/06 14:48:10 UTC

Engine to extract XMP and problems with the tika engine

Hello

I thought that adding an engine that extracts XMP metadata and converts EXIF
data to XMP would be pretty straightforward (especially since Clerezza
provides a bundle with such utilities).

However, I've noticed that the Tika engine already processes JPEGs, but for
the JPEG I've been testing with I get:

Caused by:
org.apache.stanbol.enhancer.servicesapi.EngineException: Unable to convert ContentItem <urn:content-item-sha1-13b7a6ca2636d1e1e8d36b4bc69d623947a6acb7> with mimeType 'image/jpeg' to plain text!
    at org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:222)
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:259)
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.handleEvent(EnhancementJobHandler.java:181)
    at org.apache.felix.eventadmin.impl.tasks.HandlerTaskImpl.execute(HandlerTaskImpl.java:88)
    at org.apache.felix.eventadmin.impl.tasks.SyncDeliverTasks.execute(SyncDeliverTasks.java:221)
    at org.apache.felix.eventadmin.impl.tasks.AsyncDeliverTasks$TaskExecuter.run(AsyncDeliverTasks.java:110)
    at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.tika.exception.TikaException: Can't read JPEG metadata
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.stanbol.enhancer.engines.tika.TikaEngine.computeEnhancements(TikaEngine.java:220)
    ... 7 more
Caused by: com.drew.imaging.jpeg.JpegProcessingException: segment size would extend beyond file stream length
    at com.drew.imaging.jpeg.JpegSegmentReader.readSegments(Unknown Source)
    at com.drew.imaging.jpeg.JpegSegmentReader.<init>(Unknown Source)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:94)
    ... 13 more

Caused by:
org.apache.tika.exception.TikaException: Can't read JPEG metadata
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:104)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

Now, it's not surprising that a JPEG cannot be converted to plain text, but
why does Tika attempt this in the first place, and why can't the JPEG
metadata be read?

Any ideas?

Cheers,
Reto

Re: Engine to extract XMP and problems with the tika engine

Posted by Reto Bachmann-Gmür <re...@apache.org>.
Hi Rupert

Thanks for your hints. The problem was my incorrect usage of curl: instead
of --data @file.jpg I have to use --data-binary @file.jpg.
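
For anyone hitting the same error: when curl's --data option reads a
file, it strips carriage returns and newlines, whereas --data-binary
sends the bytes unchanged. The Java sketch below only imitates that
stripping on a made-up byte array. The class name, method name and sample
bytes are illustrative assumptions, not curl or Stanbol code. It shows why
a binary body ends up shorter than the lengths recorded inside it.

```java
import java.io.ByteArrayOutputStream;

class CurlDataDemo {
    /** Imitates the CR/LF stripping that curl applies to a file passed via --data. */
    static byte[] stripCrLf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (byte b : in) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for binary image data: 0x0A and 0x0D are
        // ordinary payload bytes here, not line endings.
        byte[] original = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE1,
                           0x00, 0x0A, 0x0D, 0x42};
        byte[] mangled = stripCrLf(original);
        System.out.println(original.length + " bytes before, "
                + mangled.length + " bytes after");
    }
}
```

Any length field inside the file still describes the original layout, so
a parser reading the shortened stream can run past its end.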

Reto


On Sun, Oct 7, 2012 at 10:38 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Reto,
>
> Normally it is not a problem if parsed content does not contain any
> plain text. There is even a unit test for the TikaEngine that tests
> EXIF metadata extraction for JPEG images (see
> TikaEngineTest#testExifMetadata).
>
> Because of that, I assume that the library used by Tika has some
> problem with your image. In fact, TIKA-609 mentions a similar exception,
> and the first comment suggests an illegal char encoding as the cause
> (which might make sense, because this could cause a different number of
> bytes to be read from the stream).
>
> I would suggest testing your image directly with Tika 1.2 to see whether
> you can reproduce the error.
>
> best
> Rupert
>
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Engine to extract XMP and problems with the tika engine

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Reto,

Normally it is not a problem if parsed content does not contain any
plain text. There is even a unit test for the TikaEngine that tests
EXIF metadata extraction for JPEG images (see
TikaEngineTest#testExifMetadata).

Because of that, I assume that the library used by Tika has some
problem with your image. In fact, TIKA-609 mentions a similar exception,
and the first comment suggests an illegal char encoding as the cause
(which might make sense, because this could cause a different number of
bytes to be read from the stream).
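
To illustrate the kind of check behind "segment size would extend beyond
file stream length", here is a minimal, standard-library-only Java sketch
that walks the header segments of a JPEG and compares each declared
length with the bytes actually available. It is not the code of the
metadata library that Tika uses. The class name, method name and messages
are assumptions, and it ignores details such as fill bytes and standalone
markers.

```java
import java.util.Arrays;

class JpegSegmentCheck {
    /** Returns null if every header segment fits in the data, else a description of the first problem. */
    static String check(byte[] data) {
        if (data.length < 2 || (data[0] & 0xFF) != 0xFF || (data[1] & 0xFF) != 0xD8) {
            return "missing SOI marker, not a JPEG";
        }
        int pos = 2;
        while (pos + 2 <= data.length) {
            if ((data[pos] & 0xFF) != 0xFF) {
                return "expected a marker at offset " + pos;
            }
            int marker = data[pos + 1] & 0xFF;
            if (marker == 0xDA || marker == 0xD9) {
                return null; // start of scan or end of image: header segments are done
            }
            if (pos + 4 > data.length) {
                return "segment header at offset " + pos + " is cut off";
            }
            // The two-byte big-endian length counts itself but not the marker.
            int length = ((data[pos + 2] & 0xFF) << 8) | (data[pos + 3] & 0xFF);
            if (length < 2) {
                return "invalid segment length at offset " + pos;
            }
            if (pos + 2 + length > data.length) {
                return "segment at offset " + pos + " would extend beyond the end of the data";
            }
            pos += 2 + length;
        }
        return "data ends before the start-of-scan marker";
    }

    public static void main(String[] args) {
        // Hypothetical minimal stream: SOI, one APP1 segment with a 2-byte payload, SOS.
        byte[] ok = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE1,
                     0x00, 0x04, 0x0A, 0x0D, (byte) 0xFF, (byte) 0xDA};
        byte[] shortened = Arrays.copyOf(ok, 7); // bytes missing from the stream
        System.out.println("complete:  " + check(ok));
        System.out.println("shortened: " + check(shortened));
    }
}
```

A stream that lost bytes somewhere on the way to the parser fails this
check even though the image on disk is fine.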

I would suggest testing your image directly with Tika 1.2 to see whether
you can reproduce the error.

best
Rupert




-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen