Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2015/01/29 07:24:51 UTC

Re: multiple detect call -> different results (tika 1.7)

Dear Gabriele,

Thanks for your question. It should be sent to dev@tika.apache.org
(moving dev-owner@tika.apache.org to BCC).

I’ll take a look tomorrow if someone else hasn’t answered yet.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Gabriele Guidi <ga...@eng.it>
Date: Wednesday, January 28, 2015 at 5:25 AM
To: "dev-owner@tika.apache.org" <de...@tika.apache.org>
Subject: multiple detect call -> different results (tika 1.7)

>Hi,
>
>I found a strange behavior. I have a p7m file and extract the file inside
>the signed one; after that I use Tika to discover its MIME type. The first
>call gives me "application/pdf" (that's correct), BUT every subsequent call
>to Tika's detect method on the same InputStream gives me
>"application/octet-stream". Why?
>I cannot understand this behavior or find a solution.
>
>Just a snippet of code:
>
>InputStream inputsbust = content.getContentStream();
>
>System.out.println(" 1 mime " + filepath + " : "
>+ tika.detect(inputsbust));
>System.out.println(" 2 mime " + filepath + " : "
>+ tika.detect(inputsbust));
>System.out.println(" 3 mime " + filepath + " : "
>+ tika.detect(inputsbust));
>
>Result:
>
> 1 mime /home/gguidi/01_file.pdf : application/pdf
> 2 mime /home/gguidi/01_file.pdf : application/octet-stream
> 3 mime /home/gguidi/01_file.pdf : application/octet-stream
>
>Thanks
>
>-- 
>
>
>Gabriele Guidi
>Direzione Pubblica Amministrazione
>gabriele.guidi@eng.it
>
>Engineering Ingegneria Informatica spa
>Via Marconi, 10 - 40122, Bologna
>Tel. +39-051.0435135
>www.eng.it <http://www.eng.it>
>
>Respect the environment. Please don't print this e-mail unless you really
>need to.
>The information transmitted is intended only for the person or entity to
>which it is addressed and may contain confidential and/or privileged
>material. Any review, retransmission, dissemination or other use of, or
>taking of any action in reliance upon, this information by persons or
>entities other than the intended recipient is prohibited. If you received
>this in error, please contact the sender and delete the material from any
>computer.


Re: multiple detect call -> different results (tika 1.7)

Posted by Gabriele Guidi <ga...@eng.it>.
Thanks for your answer.
I had the same behaviour with Tika 1.6 and 1.5.
I found a workaround: the problem seems to happen only with InputStream, so
now I use byte[] and it's OK.
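
The byte[] workaround can be sketched with the JDK alone. This is only an
illustration under stated assumptions, not Tika code: `sniff` is a
hypothetical stand-in for tika.detect(byte[]) that checks nothing but the
"%PDF-" magic, and `readAll` is a hypothetical helper. The point is that the
stream is drained exactly once, after which the array can be inspected any
number of times.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ByteArrayWorkaround {

    /** Hypothetical helper: drains the stream once into memory. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Hypothetical stand-in for tika.detect(byte[]): checks only the PDF magic. */
    static String sniff(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return "application/octet-stream";
            }
        }
        return "application/pdf";
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));
        byte[] bytes = readAll(in);        // consume the stream exactly once
        System.out.println(sniff(bytes));  // application/pdf
        System.out.println(sniff(bytes));  // application/pdf again: an array is not consumed
    }
}
```

The trade-off is memory: the whole document is held in the array, which may
matter for large files.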

Thanks again

Re: multiple detect call -> different results (tika 1.7)

Posted by Tyler Palsulich <tp...@gmail.com>.
Thanks Konstantin and Gabriele! Please feel free to email any other
questions or open an issue on the Tika JIRA.

Have a good day!
Tyler

Re: multiple detect call -> different results (tika 1.7)

Posted by Gabriele Guidi <ga...@eng.it>.
Ok, thank you for your support

Best regards




Re: multiple detect call -> different results (tika 1.7)

Posted by Konstantin Gribov <gr...@gmail.com>.
Hi, Gabriele.

If you're using an InputStream which doesn't support mark/reset, the Tika
facade (org.apache.tika.Tika) wraps it in a BufferedInputStream, which
consumes up to 8k of the original InputStream by default, so the Tika MIME
type detector can't find the PDF magic after the first call.
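
The effect described above can be reproduced with the JDK alone. The sketch
below is not Tika's implementation: `sniff` is a hypothetical stand-in for
the facade's detect call that only looks for the "%PDF-" magic, and
`withoutMark` is a hypothetical helper that builds a stream reporting
markSupported() == false. What it shows is that a throwaway
BufferedInputStream fills its buffer from the underlying stream, so the bytes
are gone from that stream once the wrapper is discarded.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ConsumedStreamDemo {

    /** Hypothetical helper: an in-memory stream that does not support mark/reset. */
    static InputStream withoutMark(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* not supported */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Hypothetical stand-in for detection: wraps non-markable streams, then peeks at the head. */
    static String sniff(InputStream in) throws IOException {
        // A fresh wrapper per call: its buffer (8192 bytes by default) is filled
        // from 'in' and then thrown away together with the wrapper.
        InputStream s = in.markSupported() ? in : new BufferedInputStream(in);
        byte[] head = new byte[5];
        s.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = s.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        s.reset(); // rewinds the wrapper only, not the underlying stream
        boolean pdf = n == head.length
                && new String(head, StandardCharsets.US_ASCII).equals("%PDF-");
        return pdf ? "application/pdf" : "application/octet-stream";
    }

    public static void main(String[] args) throws IOException {
        InputStream in = withoutMark("%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sniff(in)); // application/pdf
        System.out.println(sniff(in)); // application/octet-stream: the head was already consumed
    }
}
```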

The second case (copying to byte[]) is similar. If you make this copy before
calling tika.detect, you consume that input stream, and subsequent calls on
that stream return application/octet-stream as the default MIME type. But
everything works fine with the bytes, since they hold a full copy of the
original stream.

If you call tika.detect on the input stream before copying it to bytes, it
falls into the first case: you copy the input stream without its first 8k,
so the PDF magic is dropped.

You have to recreate the input stream, copy it to some temporary resource
(such as a byte array or a temp file), or wrap it in a BufferedInputStream
before passing it to tika.detect.
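
The last option, wrapping the stream once and reusing the wrapper, might look
like the following JDK-only sketch. As an assumption for illustration,
`sniff` is a hypothetical stand-in for tika.detect(InputStream) that marks
the stream, peeks at the "%PDF-" magic, and resets. A PushbackInputStream
serves here purely as a convenient stream that does not support mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class WrapOnceDemo {

    /** Hypothetical stand-in for detection on a mark-capable stream. */
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        byte[] head = new byte[5];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        in.reset(); // the caller's stream is back at its original position
        boolean pdf = n == head.length
                && new String(head, StandardCharsets.US_ASCII).equals("%PDF-");
        return pdf ? "application/pdf" : "application/octet-stream";
    }

    public static void main(String[] args) throws IOException {
        // PushbackInputStream reports markSupported() == false.
        InputStream raw = new PushbackInputStream(new ByteArrayInputStream(
                "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII)));

        // Wrap once and keep using the same wrapper for every call.
        InputStream buffered = new BufferedInputStream(raw);

        System.out.println(sniff(buffered)); // application/pdf
        System.out.println(sniff(buffered)); // application/pdf
    }
}
```

The wrapper has to be the object passed to every later call (including the
eventual read of the content); going back to the raw stream would skip
whatever the wrapper has already buffered.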

-- 
Best regards,
Konstantin Gribov

Thu Jan 29 2015 at 16:07:12, Gabriele Guidi <ga...@eng.it>:

> Hi
>
> No, I checked with the markSupported() method
> <http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#markSupported()>
> and the answer is "NO".
> The stream is not recreated between calls.
>
> The code test is very simple:
>
> InputStream inputsbust = content.getContentStream();
> System.out.println(" mark and reset inputStream ? "
> + (inputsbust.markSupported() ? "YES" : "NO"));
> System.out.println(" 1 mime : " + tika.detect(inputsbust));
> System.out.println(" 2 mime : " + tika.detect(inputsbust));
> byte[] bytes = IOUtils.toByteArray(inputsbust);
> System.out.println(" 3 mime : " + tika.detect(bytes));
> System.out.println(" 3.2 mime : " + tika.detect(bytes));
>
>
> The result:
>
> mark and reset of inputStream ? NO
>
>  1 mime : application/pdf
>  2 mime : application/octet-stream
>  3 mime : application/octet-stream
>  3.2 mime : application/octet-stream
>
>
> If I move the 5th line ("byte[] bytes = IOUtils.toByteArray(inputsbust);")
> up to the second line, the result is:
>
> mark and reset of inputStream ? NO
>
>  1 mime : application/octet-stream
>  2 mime : application/octet-stream
>  3 mime : application/pdf
>  3.2 mime : application/pdf
>
>
> I hope it helps
>
> Thanks
>
>
> 2015-01-29 10:49 GMT+01:00 Konstantin Gribov <gr...@gmail.com>:
>
>> Hi,
>>
>> Does this InputStream support mark/reset functionality? Is the InputStream
>> recreated before each subsequent call to tika.detect, or is it called on a
>> partially consumed stream (in case mark isn't supported)?
>>
>> --
>> Best regards,
>> Konstantin Gribov
>>
>
>
> --
>
>
>
> * Gabriele Guidi*
>
>
>  Direzione Pubblica Amministrazione
> gabriele.guidi@eng.it
>
> *Engineering Ingegneria Informatica spa*
> Via Marconi, 10 - 40122, Bologna
>
>
> Tel. +39-051.0435135
>  www.eng.it
>
>  Rispetta l'ambiente. Non stampare questa e-mail se non necessario.
> Respect the environment. Please don't print this e-mail unless you really
> need to.
>
> Le informazioni trasmesse sono destinate esclusivamente alla persona o
> alla società in indirizzo e sono da intendersi confidenziali e riservate.
> Ogni trasmissione, inoltro, diffusione o altro uso di queste informazioni a
> persone o società differenti dal destinatario è proibita. Se ricevete
> questa comunicazione per errore, contattate il mittente e cancellate le
> informazioni da ogni computer.
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer.
>

Re: multiple detect call -> different results (tika 1.7)

Posted by Gabriele Guidi <ga...@eng.it>.
Hi,

No: I checked with the markSupported() method
<http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#markSupported()>
and it returns false. The stream is not recreated between calls either.

The test code is very simple:

InputStream inputsbust = content.getContentStream();
System.out.println(" mark and reset inputStream ? " + (inputsbust.markSupported() ? "YES" : "NO"));
System.out.println(" 1 mime : " + tika.detect(inputsbust));
System.out.println(" 2 mime : " + tika.detect(inputsbust));
byte[] bytes = IOUtils.toByteArray(inputsbust);
System.out.println(" 3 mime : " + tika.detect(bytes));
System.out.println(" 3.2 mime : " + tika.detect(bytes));


The result:

mark and reset of inputStream ? NO

 1 mime : application/pdf
 2 mime : application/octet-stream
 3 mime : application/octet-stream
 3.2 mime : application/octet-stream


If I move the fifth line ("byte[] bytes = IOUtils.toByteArray(inputsbust);")
up to the second position, the result is:

mark and reset of inputStream ? NO

 1 mime : application/octet-stream
 2 mime : application/octet-stream
 3 mime : application/pdf
 3.2 mime : application/pdf
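
Both runs fit one explanation: the stream does not support mark/reset, so
whichever call reads it first consumes the PDF header, and anything read
afterwards starts further into the data. Below is a minimal sketch of the
approach the second run points to, assuming the content fits in memory: read
the stream into a byte array once and give each check a fresh stream over it.
The looksLikePdf helper is a hypothetical stand-in for detection (it only
compares the "%PDF" magic bytes) and is not Tika's API; the sample data is
made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferOnce {

    // Drain the stream into memory exactly once.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Hypothetical stand-in for a detector: true if the data starts with "%PDF".
    static boolean looksLikePdf(InputStream in) throws IOException {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        for (byte expected : magic) {
            if (in.read() != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content standing in for the extracted file.
        InputStream source = new ByteArrayInputStream(
                "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));
        byte[] bytes = readFully(source);

        // Each check gets its own stream, so none sees a half-consumed one.
        System.out.println(looksLikePdf(new ByteArrayInputStream(bytes)));
        System.out.println(looksLikePdf(new ByteArrayInputStream(bytes)));

        // The original stream is exhausted by now, so this check fails.
        System.out.println(looksLikePdf(source));
    }
}
```

With Tika, the matching step would be to call tika.detect(bytes) on the
buffered array, as the second run does. For very large files, holding the
whole content in memory may not be acceptable.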


I hope this helps.

Thanks


2015-01-29 10:49 GMT+01:00 Konstantin Gribov <gr...@gmail.com>:

> Hi,
>
> Does this InputStream support mark/reset functionality? Is the InputStream
> recreated before each subsequent call to tika.detect, or is detect called
> on a partially consumed stream (in case mark isn't supported)?


-- 
Gabriele Guidi
Direzione Pubblica Amministrazione
gabriele.guidi@eng.it

Engineering Ingegneria Informatica spa
Via Marconi, 10 - 40122, Bologna
Tel. +39-051.0435135
www.eng.it


Re: multiple detect call -> different results (tika 1.7)

Posted by Konstantin Gribov <gr...@gmail.com>.
Hi,

Does this InputStream support mark/reset functionality? Is the InputStream
recreated before each subsequent call to tika.detect, or is detect called
on a partially consumed stream (in case mark isn't supported)?
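
The question matters because code that peeks at the leading bytes can only
rewind a stream that supports mark/reset. Here is a small sketch of the
difference, under stated assumptions: sniff is a hypothetical stand-in for
magic-byte detection (it is not Tika's code), and nonMarkable is a made-up
wrapper that simulates a stream without mark support.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetDemo {

    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical stand-in for magic-byte detection: peeks at the first four
    // bytes and rewinds afterwards only if the stream supports mark/reset.
    static String sniff(InputStream in) throws IOException {
        boolean canRewind = in.markSupported();
        if (canRewind) {
            in.mark(PDF_MAGIC.length);
        }
        byte[] head = new byte[PDF_MAGIC.length];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        if (canRewind) {
            in.reset();
        }
        return Arrays.equals(head, PDF_MAGIC) ? "application/pdf" : "application/octet-stream";
    }

    // Simulates a stream that reports no mark/reset support.
    static InputStream nonMarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content.
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);

        InputStream raw = nonMarkable(data);
        System.out.println(sniff(raw)); // first call sees the header
        System.out.println(sniff(raw)); // second call starts after it

        // Wrapping once adds mark/reset, so every call starts at the header.
        InputStream buffered = new BufferedInputStream(nonMarkable(data));
        System.out.println(sniff(buffered));
        System.out.println(sniff(buffered));
    }
}
```

The wrapper has to be created once and reused; wrapping the same underlying
stream again for each call would not bring back bytes that were already read.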

-- 
Best regards,
Konstantin Gribov
