Posted to dev@tika.apache.org by "Andrzej Bialecki (Created) (JIRA)" <ji...@apache.org> on 2011/12/05 12:25:39 UTC

[jira] [Created] (TIKA-800) mark/reset not supported from POIFSContainerDetector

mark/reset not supported from POIFSContainerDetector
----------------------------------------------------

                 Key: TIKA-800
                 URL: https://issues.apache.org/jira/browse/TIKA-800
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.0, 1.1
            Reporter: Andrzej Bialecki 


{code}
bash-3.2$ touch test.txt
bash-3.2$ zip test.zip test.txt
  adding: test.txt (stored 0%)
bash-3.2$ java -jar tika-app-1.1-SNAPSHOT.jar -z test.zip
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@2d58f9d3
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.io.IOException: mark/reset not supported
	at java.io.InputStream.reset(InputStream.java:330)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:116)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:676)
	at org.apache.tika.parser.pkg.PackageExtractor.unpack(PackageExtractor.java:167)
	at org.apache.tika.parser.pkg.PackageExtractor.parse(PackageExtractor.java:96)
	at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:64)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
	... 5 more
bash-3.2$ 
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163238#comment-13163238 ] 

Nick Burch commented on TIKA-800:
---------------------------------

Fixed in r1210736 by wrapping the ArchiveInputStream; the -z call now works correctly.
                

        

[jira] [Resolved] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-800.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
    

        

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162730#comment-13162730 ] 

Nick Burch commented on TIKA-800:
---------------------------------

It looks like the issue is that ArchiveInputStream (from Commons Compress) doesn't support mark/reset.

My hunch is that two fixes are needed here:
 * If the POIFS detector (now run by default if the parser jar is available) can't mark/reset, it should decline to detect
 * The TikaCLI extractor should wrap the InputStreams it gets to ensure that all detectors can run

If no-one spots a snag with these, I'll make the changes in a little bit.
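
To illustrate the diagnosis above, here is a minimal sketch that uses only the JDK. It assumes java.util.zip.ZipInputStream as a stand-in for the Commons Compress ArchiveInputStream (both hand out entry data without mark/reset support), and BufferedInputStream as the wrapper. The class and method names are made up for this example and are not part of Tika.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveEntryMarkReset {

    /** Builds an in-memory zip with a single entry, similar to test.zip in the report. */
    public static byte[] zipWithOneEntry(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Opens the zip and positions the stream at the start of its first entry. */
    public static ZipInputStream openFirstEntry(byte[] zipBytes) throws IOException {
        ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        zip.getNextEntry();
        return zip;
    }

    /** Returns true if a byte read after mark() can be read again after reset(). */
    public static boolean canRewind(InputStream in) {
        try {
            in.mark(16);
            int first = in.read();
            in.reset(); // throws IOException when mark/reset is unsupported
            return in.read() == first;
        } catch (IOException e) {
            return false;
        }
    }

    /** Returns {result for the raw entry stream, result for the wrapped entry stream}. */
    public static boolean[] rawVersusWrapped() throws IOException {
        byte[] zipBytes = zipWithOneEntry("test.txt", "hello".getBytes(StandardCharsets.UTF_8));
        try (ZipInputStream raw = openFirstEntry(zipBytes);
             ZipInputStream inner = openFirstEntry(zipBytes)) {
            // Wrapping adds an in-memory buffer that can replay bytes after reset().
            InputStream wrapped = new BufferedInputStream(inner);
            return new boolean[] { canRewind(raw), canRewind(wrapped) };
        }
    }

    public static void main(String[] args) throws IOException {
        boolean[] results = rawVersusWrapped();
        System.out.println("raw entry stream rewinds: " + results[0]);
        System.out.println("wrapped entry stream rewinds: " + results[1]);
    }
}
```

The raw entry stream cannot rewind, while the wrapped one can, which is the effect the second proposed fix relies on. Note that the wrapper is deliberately not closed on its own here, since closing it would also close the underlying archive stream.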
                

        

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162734#comment-13162734 ] 

Jukka Zitting commented on TIKA-800:
------------------------------------

bq. If the POIFS detector (now run by default if the parser jar is available) can't mark/reset, it should decline to detect

The Detector interface explicitly asks for the given InputStream to support mark/reset, so I think it's fine for the detector to throw an IOException like it's doing in this case.
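
As a rough illustration of that contract, the sketch below shows a detector-style method that peeks at the start of a stream and then rewinds it. It uses only the JDK; looksLikeOle2 and withoutMarkSupport are hypothetical names for this example and are not the actual POIFSContainerDetector code. The only outside fact assumed is the well-known 8-byte OLE2 file signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class PeekingDetectorSketch {

    /** The 8-byte signature at the start of an OLE2 (POIFS) file. */
    public static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /**
     * Hypothetical detector: peeks at the first bytes, then rewinds so the
     * caller can still parse the stream from the beginning.
     */
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        byte[] prefix = new byte[OLE2_MAGIC.length];
        in.mark(prefix.length);
        try {
            int total = 0;
            while (total < prefix.length) {
                int n = in.read(prefix, total, prefix.length - total);
                if (n < 0) {
                    break; // stream ended before a full signature was read
                }
                total += n;
            }
            return total == prefix.length && Arrays.equals(prefix, OLE2_MAGIC);
        } finally {
            in.reset(); // throws IOException if the stream cannot mark/reset
        }
    }

    /** A stream that inherits InputStream's default mark/reset, i.e. no support. */
    public static InputStream withoutMarkSupport(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream good = new ByteArrayInputStream(OLE2_MAGIC);
        System.out.println("OLE2 detected: " + looksLikeOle2(good));
        System.out.println("bytes still available: " + good.available());
        try {
            looksLikeOle2(withoutMarkSupport(OLE2_MAGIC));
        } catch (IOException e) {
            System.out.println("detector failed: " + e.getMessage());
        }
    }
}
```

With a stream that supports mark/reset, the caller gets the stream back at its original position. With one that does not, the reset() call fails with an IOException, which matches the behaviour seen in the stack trace of this issue.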
                

        

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162735#comment-13162735 ] 

Nick Burch commented on TIKA-800:
---------------------------------

In that case, maybe it's best to have the wrapping done in PackageExtractor, if the shouldParseEmbedded call indicates we'll process it?

(Since most people getting the embedded resources back will be using them with an AutoDetectParser, it would seem to make sense for us to do the work once rather than everyone having to handle it themselves)
                

        

[jira] [Commented] (TIKA-800) mark/reset not supported from POIFSContainerDetector

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163701#comment-13163701 ] 

Jukka Zitting commented on TIKA-800:
------------------------------------

Note that calling TikaInputStream.get(InputStream) expects that you'll explicitly close() the returned stream.
In revision 1211027 I changed the code to use the TikaInputStream.get(InputStream, TemporaryResources) method that works better in this situation.
                