You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/06/30 00:07:02 UTC

ZipContainerDetector and TikaInputStream.getFile()

Hi,

When working on TIKA-932 I encountered the following potential problem
in ZipContainerDetector:

The Detector contract says: "The given stream is guaranteed to support
the mark feature and the detector is expected to mark the stream
before reading any bytes from it, and to reset the stream before
returning." The rationale for this constraint is that it should be
possible for things like the AutoDetectParser class to send the same
InputStream to multiple different detectors for processing without
worrying about one of them consuming the stream and thus making it
impossible it to be parsed anymore after type detection. In practice
this means that a detector can't expect to be able to consume the
entire stream, just a prefix up to some predefined length (that the
detector may choose).

When given a TikaInputStream, a detector can get around this
limitation by using the getFile() method to access the underlying file
(or a spooled temporary one) without changing the state of that
stream. This is a pretty useful feature that the ZipContainerDetector
class leverages to great effect.

However, there's a potentially tricky interaction that worries me a
bit. When working without a pre-existing File, the getFile() method
will automatically spool the entire underlying stream to a temporary
file and replace the reference to the original stream with a fresh
FileInputStream based on the new temporary file. This preserves the
externally visible state of the TikaInputStream instance, but of
course changes that of the original stream. I'm wondering if there are
potential cases where a detector is given a normal InputStream that it
then converts into a TikaInputStream and passes to another detector.
In such a case the first detector would break the Detector contract if
the second detector ends up calling the getFile() method.

There are a couple of different ways how we could deal with this scenario:

1) Consider it illegal for the first detector to wrap a stream to a
TikaInputStream like in the above example unless it has some way to
guarantee that the underlying stream won't end up being unexpectedly
consumed. Then any Detector implementation can safely call the
getFile() method on a TikaInputStream acquired with
TikaInputStream.cast() from the stream passed to the detect() method.
This is what the ZipContainerDetector currently does.

2) Require detectors that want to use getFile() to use the hasFile()
method to check whether doing so would end up triggering the spooling
of the underlying stream to a temporary file. This is conceptually a
bit troublesome as it would make the detector responsible for properly
managing the state of a stream that it can't directly see and hasn't
explicitly been given. However, this approach does have the nice
side-effect of preventing the type detection process from
automatically spooling streams to temporary files (which can be a
performance problem). A client that wants to ensure that also more
advanced detectors like ZipContainerDetector can function properly
would need to explicitly call getFile() before the detection process
is started to trigger the spooling of the stream if it isn't already
based on an actual underlying file.

3) A combination of the above approaches where 1) is used to ensure
contractual correctness and 2) to prevent too eager spooling of
streams (and to act as a failsafe in case some code fails to honor
requirement 1).

WDYT?

BR,

Jukka Zitting