Posted to user@tika.apache.org by Satya Deep Maheshwari <co...@gmail.com> on 2015/06/18 14:28:42 UTC

MagicDetector does not enforce mark/reset support in inputstream

Please see [1] which comes into play when detecting the mime-type from
content. I think there is an assumption in Tika's MagicDetector that the
stream would always support mark/reset. Probably it should check it
explicitly and not proceed if that's not the case.

With the current handling, if an InputStream without mark/reset support
is passed to it, some content is read off the stream but the stream is
not reset. This can cause problems wherever the same stream is used
afterwards.
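
A minimal JDK-only sketch of the failure mode described above. It does not use Tika: sniffThenReadNext is a hypothetical stand-in for any routine that calls mark(), reads a few bytes and calls reset() without first checking markSupported(), and the class and helper names are made up for illustration. PushbackInputStream is used only because it is a standard stream that reports no mark support.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class UnmarkedSniffSketch {
    // Hypothetical sniffing routine that assumes mark/reset works.
    // Returns the first byte a later consumer of the stream would see.
    static int sniffThenReadNext(InputStream in, int sniffLength) throws IOException {
        in.mark(sniffLength); // silently does nothing without mark support
        for (int i = 0; i < sniffLength && in.read() != -1; i++) {
            // consume the bytes that are being sniffed
        }
        try {
            in.reset();
        } catch (IOException e) {
            // Deliberately ignored in this sketch: the stream stays
            // positioned after the sniffed bytes.
        }
        return in.read();
    }

    // A stream for which markSupported() returns false.
    static InputStream unmarkable(byte[] data) {
        return new PushbackInputStream(new ByteArrayInputStream(data));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // Mark-supported stream: the consumer still starts at '%'.
        System.out.println((char) sniffThenReadNext(new ByteArrayInputStream(data), 4));
        // No mark support: the consumer starts at '-', four bytes are gone.
        System.out.println((char) sniffThenReadNext(unmarkable(data), 4));
    }
}
```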

[1] -
https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java#L352

Regards
Satya Deep

Re: MagicDetector does not enforce mark/reset support in inputstream

Posted by Satya Deep Maheshwari <co...@gmail.com>.
Thanks Jukka.
Yes, which basically means that Detector.detect should be passed a
mark-supported stream, which could be either an inherently
mark-supported stream or one wrapped in a mark-supported stream
like a TikaInputStream or a BufferedInputStream. This is explicitly stated
in its API documentation as well. But it's easy to miss, which can lead to
hard-to-debug issues later. I think this method should not proceed
with processing if the passed stream isn't mark-supported, perhaps by
returning application/octet-stream at the start of the method or by
throwing an IllegalArgumentException.
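
A rough sketch of the guard proposed above, using only the JDK. GuardedDetectSketch and its detect method are hypothetical and are not Tika's MagicDetector; they only illustrate checking markSupported() before any byte is read, so that an unsuitable stream is left untouched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class GuardedDetectSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical prefix matcher: returns 'type' if the stream starts with
    // 'magic', otherwise the generic fallback type.
    static String detect(InputStream in, byte[] magic, String type) throws IOException {
        if (in == null || !in.markSupported()) {
            // Guard: refuse to read from a stream that cannot be rewound.
            // The alternative from the thread would be:
            // throw new IllegalArgumentException("mark/reset not supported");
            return OCTET_STREAM;
        }
        in.mark(magic.length);
        try {
            byte[] head = new byte[magic.length];
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n == -1) {
                    break; // stream is shorter than the magic prefix
                }
                total += n;
            }
            boolean matches = total == magic.length && Arrays.equals(head, magic);
            return matches ? type : OCTET_STREAM;
        } finally {
            in.reset(); // safe here: mark support was verified above
        }
    }
}
```

Returning the fallback type keeps callers working but hides the mistake; throwing makes the contract violation visible immediately. Which one is preferable depends on how strictly the API contract should be enforced.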

On Fri, Jun 19, 2015 at 7:36 PM, Jukka Zitting <ju...@gmail.com>
wrote:

> You can make the test pass by changing the assertion to:
>
>     assertTrue(IOUtils.contentEquals(stream, originalStream));
>
> Wrapping a stream with TikaInputStream doesn't magically add
> mark/reset support to the original stream; only the wrapper instance
> has this feature.
>

Re: MagicDetector does not enforce mark/reset support in inputstream

Posted by Jukka Zitting <ju...@gmail.com>.
You can make the test pass by changing the assertion to:

    assertTrue(IOUtils.contentEquals(stream, originalStream));

Wrapping a stream with TikaInputStream doesn't magically add
mark/reset support to the original stream; only the wrapper instance
has this feature.
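
The same point can be shown with JDK classes alone. This sketch uses a BufferedInputStream as the wrapper instead of TikaInputStream so that it runs without Tika on the classpath. The class name and the helper peekThenReadNext are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class WrapperOnlySketch {
    // Marks, reads and resets the given stream, then returns the next byte.
    static int peekThenReadNext(InputStream in, int peekLength) throws IOException {
        in.mark(peekLength);
        for (int i = 0; i < peekLength && in.read() != -1; i++) {
            // consume the peeked bytes
        }
        in.reset();
        return in.read();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // PushbackInputStream reports no mark support.
        InputStream original = new PushbackInputStream(new ByteArrayInputStream(data));
        InputStream wrapper = new BufferedInputStream(original);

        System.out.println(original.markSupported()); // false
        System.out.println(wrapper.markSupported());  // true

        // Peeking through the wrapper rewinds the wrapper only.
        System.out.println((char) peekThenReadNext(wrapper, 4)); // %

        // The wrapper has already pulled bytes out of the original stream,
        // so reading the original no longer starts at the first byte.
        System.out.println(original.read() == data[0]); // false
    }
}
```

After wrapping, all further reads should go through the wrapper. Comparing or reusing the original stream afterwards sees whatever the wrapper has not yet buffered.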

Re: MagicDetector does not enforce mark/reset support in inputstream

Posted by Satya Deep Maheshwari <co...@gmail.com>.
Attaching a (failing) unit test and a sample file that show the problem
occurs even though I am wrapping the stream in a TikaInputStream.

On Thu, Jun 18, 2015 at 8:55 PM, Jukka Zitting <ju...@gmail.com>
wrote:

> Hi,
>
> 2015-06-18 8:28 GMT-04:00 Satya Deep Maheshwari <co...@gmail.com>:
> > I think there is an assumption in Tika's MagicDetector that the stream
> > would always support mark/reset.
>
> That assumption is based on the Detector API [1], which states that a
> stream passed to detect() should support the mark feature.
>
> As suggested by Nick, an easy way to meet that API contract is for the
> client to wrap a stream into TikaInputStream before passing it to the
> detector.
>
> [1]
> http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)
>
> BR,
>
> Jukka Zitting
>

Re: MagicDetector does not enforce mark/reset support in inputstream

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

2015-06-18 8:28 GMT-04:00 Satya Deep Maheshwari <co...@gmail.com>:
> I think there is an assumption in Tika's MagicDetector that the stream would always support mark/reset.

That assumption is based on the Detector API [1], which states that a
stream passed to detect() should support the mark feature.

As suggested by Nick, an easy way to meet that API contract is for the
client to wrap a stream into TikaInputStream before passing it to the
detector.

[1] http://tika.apache.org/1.8/api/org/apache/tika/detect/Detector.html#detect(java.io.InputStream,%20org.apache.tika.metadata.Metadata)

BR,

Jukka Zitting

Re: MagicDetector does not enforce mark/reset support in inputstream

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 18 Jun 2015, Satya Deep Maheshwari wrote:
> Please see [1] which comes into play when detecting the mime-type from 
> content. I think there is an assumption in Tika's MagicDetector that the 
> stream would always support mark/reset. Probably it should check it 
> explicitly and not proceed if that's not the case.

Are you able to produce a small JUnit test that shows the problem?

Generally though, we'd suggest wrapping things in a TikaInputStream, which 
takes care of that sort of thing!

Nick