Posted to dev@jackrabbit.apache.org by "Philipp Koch (JIRA)" <ji...@apache.org> on 2009/11/17 08:54:39 UTC

[jira] Created: (JCR-2395) Text Extractor: Image parser throws exception (jpeg)

Text Extractor: Image parser throws exception (jpeg)
----------------------------------------------------

                 Key: JCR-2395
                 URL: https://issues.apache.org/jira/browse/JCR-2395
             Project: Jackrabbit Content Repository
          Issue Type: Bug
          Components: jackrabbit-text-extractors
    Affects Versions: 2.0-beta1
            Reporter: Philipp Koch


The exception below is thrown over and over while uploading JPEG images:
16.11.2009 17:20:42 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.image.ImageParser@c7bc3
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
	at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
	at java.util.concurrent.FutureTask.run(FutureTask.java:123)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
	at java.lang.Thread.run(Thread.java:613)
Caused by: javax.imageio.IIOException: Not a JPEG file: starts with 0x00 0x05
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.readImageHeader(Native Method)
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.readNativeHeader(JPEGImageReader.java:554)
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.checkTablesOnly(JPEGImageReader.java:309)
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.gotoImage(JPEGImageReader.java:431)
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.readHeader(JPEGImageReader.java:547)
	at com.sun.imageio.plugins.jpeg.JPEGImageReader.getHeight(JPEGImageReader.java:609)
	at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:47)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
	... 10 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (JCR-2395) Text Extractor: Image parser throws exception (jpeg)

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778859#action_12778859 ] 

Jukka Zitting commented on JCR-2395:
------------------------------------

Do you have an example image that triggers this behaviour? For some reason (.jpg extension?) the image is parsed as a JPEG, which causes the exception shown above.
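
To make the "Not a JPEG file: starts with 0x00 0x05" message concrete: a JPEG stream begins with the start-of-image marker 0xFF 0xD8, so the reported files fail at the very first two bytes. Below is a minimal sketch of such a signature check. The class and method names are hypothetical and are not part of Tika or Jackrabbit; it only illustrates the check, not how either project detects types.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class JpegSignatureCheck {

    /**
     * Returns true if the stream begins with the JPEG start-of-image
     * marker (0xFF 0xD8). Consumes up to two bytes from the stream.
     * Hypothetical helper for illustration only.
     */
    public static boolean startsWithJpegMarker(InputStream in) throws IOException {
        int first = in.read();   // 0..255, or -1 at end of stream
        int second = in.read();
        return first == 0xFF && second == 0xD8;
    }

    public static void main(String[] args) throws IOException {
        // Prefix of a typical JPEG file (SOI marker followed by an APP0 marker).
        byte[] jpegPrefix = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        // The two leading bytes reported in the exception above.
        byte[] reportedPrefix = {0x00, 0x05};

        System.out.println(startsWithJpegMarker(new ByteArrayInputStream(jpegPrefix)));
        System.out.println(startsWithJpegMarker(new ByteArrayInputStream(reportedPrefix)));
    }
}
```

Running a check like this on the uploaded files would show whether their content is really JPEG data or whether only the name or declared type suggests it.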

Since Tika currently only supports metadata extraction from images and we only care about the extracted text content, we can avoid this issue simply by disabling the ImageParser in the default configuration.




[jira] Resolved: (JCR-2395) Text Extractor: Image parser throws exception (jpeg)

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved JCR-2395.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.0

Fixed in revision 881272 by disabling the ImageParser in the default configuration.

See TIKA-326 for a related issue in Tika.

The root cause of this issue, i.e. why the images were incorrectly identified as JPEG, is still unknown.

