You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2011/04/21 18:04:05 UTC

[jira] [Created] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
-------------------------------------------------------------------------------------

                 Key: TIKA-645
                 URL: https://issues.apache.org/jira/browse/TIKA-645
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Nick Burch


Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.

However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036141#comment-13036141 ] 

Jukka Zitting commented on TIKA-645:
------------------------------------

Hmm, yes. I'll look into this.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034983#comment-13034983 ] 

Jukka Zitting commented on TIKA-645:
------------------------------------

In revision 1104472 I made TikaInputStream extend TaggedInputStream which allows us to avoid one of these levels of wrapping (and also makes it possible to separate exceptions from the original stream from those caused by temporary file handling, etc.).

I'll look at what we can do about the CountingInputStream wrapper.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025536#comment-13025536 ] 

Nick Burch commented on TIKA-645:
---------------------------------

Nope, when the parser is called from the CLI it gets a TaggedInputStream (wrapping a CountingInputStream wrapping a TikaInputStream)

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-645.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

And in revision 1124385 I modified SecureContentHandler to use TikaInputStream instead of CountingInputStream, which avoids the other stream wrapper.

Now a TikaInputStream given to AutoDetectParser will get passed as-is all the way down to the concrete Parser implementation, so I'm resolving this as fixed.

Note that this fix required a minor backwards-compatibility break in the SecureContentHandler class, but this shouldn't be a big issue since I don't think the class is widely used outside AutoDetectParser.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-645.
--------------------------------

    Resolution: Fixed

The problem was that when the parser was using TikaInputStream.getFile(), no bytes were recorded as being read from the stream and the SecureContentHandler couldn't figure out where the all the output is coming from.

In revision 1124788 I changed the logic a bit so that when the stream is based on a file, the SecureContentHandler class looks at the total size of the input file instead of the number of bytes read from the input stream.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Reopened] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch reopened TIKA-645:
-----------------------------


> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036214#comment-13036214 ] 

Nick Burch commented on TIKA-645:
---------------------------------

Thanks. With this change in, tika-app from svn trunk can parse the same 5mb excel file with "-Xms15m -Xmx15m", so the use of NPOIFS and access to the file has saved us a bit over the size of the office file in memory, which is good news!

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036137#comment-13036137 ] 

Nick Burch commented on TIKA-645:
---------------------------------

I've just gone to test this with a 5mb excel file. With 0.9, anything smaller than "-Xms21m -Xmx21m" gives an out of memory

With the latest trunk, something else breaks though:

java -Xms21m -Xmx21m -jar ~/java/apache-tika/tika-app/target/tika-app-1.0-SNAPSHOT.jar --metadata ex45698-22488.xls
Exception in thread "main" org.apache.tika.exception.TikaException: Zip bomb detected!
	at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:128)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:132)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:340)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 0 input bytes produced 1000020 output characters
	at org.apache.tika.sax.SecureContentHandler.advance(SecureContentHandler.java:144)

Looks like something else needs to be updated too?

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025203#comment-13025203 ] 

Jukka Zitting commented on TIKA-645:
------------------------------------

The InputStream passed to the parse() method is still TikaInputStream, isn't it? The parser should then already have access to the underlying file, even if it at some point wants to use wrappers with the given stream.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-645) Parsers can't get at an underlying TikaInputStream to get the file if they wanted one

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022803#comment-13022803 ] 

Nick Burch commented on TIKA-645:
---------------------------------

One solution that springs to mind is to place the hasFile and getFile methods onto an interface. TikaInputStream, TaggedInputStream and CountingInputStream could then all implement this. That way, if the underlying stream is a TikaInputStream, the parser can still find out and grab the file. If it isn't, then nothing changes.

> Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-645
>                 URL: https://issues.apache.org/jira/browse/TIKA-645
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>
> Spotted this with the office parser, but it should be general. The user creates a TikaInputStream, and passes that off to the parser framework. The Parser that is called may wish to spot that the input is a File backed TikaInputStream, and take a shortcut to use the file instead of the InputStream.
> However, what the parser gets is a TaggedInputStream wrapping a CountingInputStream wrapping the original TikaInputStream. As such, it can't get at the file.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira