Posted to user@tika.apache.org by Satya Deep Maheshwari <co...@gmail.com> on 2015/06/18 14:28:42 UTC
MagicDetector does not enforce mark/reset support in inputstream
Please see [1] which comes into play when detecting the mime-type from
content. I think there is an assumption in Tika's MagicDetector that the
stream would always support mark/reset. Probably it should check it
explicitly and not proceed if that's not the case.
With the current handling, if an InputStream without mark/reset support
is passed to it, some content is read from the stream but the stream is
not reset. This can cause problems wherever that stream is used afterwards.
[1] -
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java#L352
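To illustrate the concern using JDK classes only, here is a minimal sketch. NoMarkStream and peek are hypothetical names, not Tika code; peek merely imitates the mark, read, reset pattern that a magic-based detector might follow, and it deliberately ignores a failed reset. On a stream without mark support, the bytes that were peeked at are gone for the next reader.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class MarkResetPitfall {

    /** Hypothetical stream that, like many raw streams, does not support mark/reset. */
    static class NoMarkStream extends FilterInputStream {
        NoMarkStream(InputStream in) { super(in); }
        @Override public boolean markSupported() { return false; }
        @Override public void mark(int readLimit) { /* silently ignored */ }
        @Override public void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /** Imitates a detector: mark, read a prefix, then try to rewind. */
    static String peek(InputStream in, int length) throws IOException {
        in.mark(length);
        byte[] prefix = new byte[length];
        int n = 0;
        while (n < length) {
            int r = in.read(prefix, n, length - n);
            if (r < 0) break;
            n += r;
        }
        try {
            in.reset();
        } catch (IOException e) {
            // Assumption for this sketch: the failure is ignored,
            // which leaves the stream advanced past the prefix.
        }
        return new String(prefix, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 body".getBytes(StandardCharsets.US_ASCII);
        InputStream raw = new NoMarkStream(new ByteArrayInputStream(data));
        System.out.println("peeked: " + peek(raw, 4));          // peeked: %PDF
        System.out.println("next byte: " + (char) raw.read()); // next byte: -
    }
}
```

The next consumer of raw starts at "-1.4 body" rather than at "%PDF", which is the kind of downstream surprise described above.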
Regards
Satya Deep
Re: MagicDetector does not enforce mark/reset support in inputstream
Posted by Satya Deep Maheshwari <co...@gmail.com>.
Thanks Jukka.
Yes, which basically means that Detector.detect should be passed a
mark-supported stream: either an inherently mark-supported stream or one
wrapped in a mark-supported stream such as a TikaInputStream or a
BufferedInputStream. This is explicitly stated in its API documentation as
well. But it's easy to miss, which can lead to hard-to-debug issues later.
I think this method should not proceed with processing if the passed stream
isn't mark-supported, maybe by just returning application/octet-stream at
the start of the method or by throwing an IllegalArgumentException.
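A rough sketch of the two options proposed above, using only the JDK. GuardedSniffer, detectLenient, detectStrict and the toy "%PDF" check are all hypothetical and stand in for a real detector; this is not how MagicDetector is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GuardedSniffer {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Lenient option: do not touch a stream that cannot be rewound. */
    static String detectLenient(InputStream in) throws IOException {
        if (!in.markSupported()) {
            return OCTET_STREAM; // nothing was read, so the caller's stream is intact
        }
        return sniff(in);
    }

    /** Strict option: make the contract violation visible immediately. */
    static String detectStrict(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        return sniff(in);
    }

    /** Toy magic check that only knows the "%PDF" prefix; always rewinds. */
    private static String sniff(InputStream in) throws IOException {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        in.mark(magic.length);
        try {
            byte[] head = new byte[magic.length];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
            return n == magic.length && Arrays.equals(head, magic)
                    ? "application/pdf" : OCTET_STREAM;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // PushbackInputStream reports markSupported() == false
        InputStream noMark = new PushbackInputStream(new ByteArrayInputStream(pdf));
        System.out.println(detectLenient(noMark));                        // application/octet-stream
        System.out.println(detectLenient(new ByteArrayInputStream(pdf))); // application/pdf
    }
}
```

The lenient option hides the mistake behind a generic type, while the strict option fails fast; which of the two is preferable depends on how much existing client code already passes unwrapped streams.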
On Fri, Jun 19, 2015 at 7:36 PM, Jukka Zitting <ju...@gmail.com>
wrote:
> You can make the test pass by changing the assertion to:
>
> assertTrue(IOUtils.contentEquals(stream, originalStream));
>
> Wrapping a stream with TikaInputStream doesn't magically add
> mark/reset support to the original stream; only the wrapper instance
> has this feature.
>
Re: MagicDetector does not enforce mark/reset support in inputstream
Posted by Jukka Zitting <ju...@gmail.com>.
You can make the test pass by changing the assertion to:
assertTrue(IOUtils.contentEquals(stream, originalStream));
Wrapping a stream with TikaInputStream doesn't magically add
mark/reset support to the original stream; only the wrapper instance
has this feature.
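The wrapper point can be sketched with JDK classes, using BufferedInputStream as a stand-in for TikaInputStream (an assumption made so the example runs without Tika on the classpath). WrapperVersusOriginal and drain are hypothetical names. After a mark/read/reset cycle on the wrapper, the wrapper replays the full content, while the original stream has been advanced by the wrapper's buffering and cannot be rewound.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class WrapperVersusOriginal {

    /** Reads everything that is left in the stream. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        for (int r; (r = in.read(buf)) != -1; ) {
            out.write(buf, 0, r);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        // PushbackInputStream reports markSupported() == false
        InputStream original = new PushbackInputStream(new ByteArrayInputStream(data));
        InputStream wrapper = new BufferedInputStream(original);

        // What a detector would do when handed the wrapper
        wrapper.mark(4);
        wrapper.read(new byte[4]);
        wrapper.reset();

        System.out.println(original.markSupported());             // false
        System.out.println(wrapper.markSupported());              // true
        System.out.println(drain(wrapper).length == data.length); // true
        System.out.println(drain(original).length);               // 0
    }
}
```

So after detection, client code should keep reading from the wrapper instance and not go back to the stream it wrapped.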
Re: MagicDetector does not enforce mark/reset support in inputstream
Posted by Satya Deep Maheshwari <co...@gmail.com>.
Attaching a (failing) unit test and a sample file that show the problem,
which occurs even though I am wrapping the stream in a TikaInputStream.
On Thu, Jun 18, 2015 at 8:55 PM, Jukka Zitting <ju...@gmail.com>
wrote:
> Hi,
>
> 2015-06-18 8:28 GMT-04:00 Satya Deep Maheshwari <co...@gmail.com>:
> > I think there is an assumption in Tika's MagicDetector that the stream would always support mark/reset.
>
> That assumption is based on the Detector API [1], which states that a
> stream passed to detect() should support the mark feature.
>
> As suggested by Nick, an easy way to meet that API contract is for the
> client to wrap a stream into TikaInputStream before passing it to the
> detector.
>
> [1]
> http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)
>
> BR,
>
> Jukka Zitting
>
Re: MagicDetector does not enforce mark/reset support in inputstream
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
2015-06-18 8:28 GMT-04:00 Satya Deep Maheshwari <co...@gmail.com>:
> I think there is an assumption in Tika's MagicDetector that the stream would always support mark/reset.
That assumption is based on the Detector API [1], which states that a
stream passed to detect() should support the mark feature.
As suggested by Nick, an easy way to meet that API contract is for the
client to wrap a stream into TikaInputStream before passing it to the
detector.
[1] http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)
BR,
Jukka Zitting
Re: MagicDetector does not enforce mark/reset support in inputstream
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 18 Jun 2015, Satya Deep Maheshwari wrote:
> Please see [1] which comes into play when detecting the mime-type from
> content. I think there is an assumption in Tika's MagicDetector that the
> stream would always support mark/reset. Probably it should check it
> explicitly and not proceed if that's not the case.
Are you able to produce a small junit unit test that shows the problem?
Generally though, we'd suggest wrapping things in a TikaInputStream, which
takes care of that sort of thing!
Nick