Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2016/07/04 15:45:41 UTC

Re: TIKA-1164

Hi Samuel, I am forwarding your email to dev@tika.a.o and moving
dev-owner@t.a.o to BCC.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 7/4/16, 8:41 AM, "scatherine.ext@gouv.mc" <sc...@gouv.mc> wrote:

>Hi,
>
>I use Tika to detect the MediaType and I have the same problem as in JIRA issue TIKA-1164:
>https://issues.apache.org/jira/browse/TIKA-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>However, I am using version 1.13. How can I solve this problem, please?
>
>        MediaType mediaType = null;
>        Metadata md = new Metadata();
>        md.set(Metadata.RESOURCE_NAME_KEY, fileName);
>        Detector detector = TikaConfig.getDefaultConfig().getDetector();
>
>        try {
>            mediaType = detector.detect(TikaInputStream.get(content), md);
>        } catch (IOException e) {
>            mediaType = null;
>        }
>
>The content size (content.available()) changes between before and after the detect call.
>
>Regards,
>
>Samuel Catherine
>
>

RE: TIKA-1164

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Right.  Use Path instead of File.
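
For what it's worth, getting from a File to a Path needs nothing beyond the JDK. Below is a minimal sketch, not Tika code: the class name FileToPathDemo and its two helpers are made up for illustration, and the file name is borrowed from the example elsewhere in this thread. The final step of handing the Path to a Path-based TikaInputStream.get(...) overload is an assumption based on the advice above, so it appears only as a comment.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileToPathDemo {

    // Convert an existing java.io.File to a java.nio.file.Path.
    static Path fromFile(File file) {
        return file.toPath();
    }

    // Or build the Path directly from a file name, skipping File entirely.
    static Path fromName(String name) {
        return Paths.get(name);
    }

    public static void main(String[] args) {
        Path viaFile = fromFile(new File("testPDFVarious.pdf"));
        Path direct = fromName("testPDFVarious.pdf");
        // Both routes describe the same relative path.
        System.out.println(viaFile.equals(direct));
        // Assumption: either Path would then be passed to the Path-based
        // TikaInputStream.get(...) overload suggested in the reply above.
    }
}
```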

From: scatherine.ext@gouv.mc [mailto:scatherine.ext@gouv.mc]
Sent: Monday, July 11, 2016 3:42 AM
To: Allison, Timothy B. <ta...@mitre.org>
Cc: dev@tika.apache.org
Subject: RE: TIKA-1164


Hi Timothy,

Thanks

When I use TikaInputStream.get() directly, it works, but this method is deprecated in Tika 1.13 and seems to have been removed in Tika 2.0.

Regards

Samuel Catherine
Acting on behalf of the IT Department (Direction Informatique)
scatherine.ext@gouv.mc
+377 98 98 48 93




RE: TIKA-1164

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y, this makes sense.

        Detector detector = TikaConfig.getDefaultConfig().getDetector();
        File file = new File("testPDFVarious.pdf");
        // is: the raw, unbuffered file stream; tis: a TikaInputStream wrapped around it
        try (FileInputStream is = new FileInputStream(file);
             InputStream tis = TikaInputStream.get(is)) {
            System.out.println("length: " + file.length());
            System.out.println("avail before: " + tis.available());
            System.out.println("DETECTED: " + detector.detect(tis, new Metadata()));
            // bytes still readable from the wrapper vs. from the raw stream
            System.out.println("avail after tis: " + tis.available());
            System.out.println("avail after is: " + is.available());
        }

length: 205491
avail before: 205491
DETECTED: application/pdf
avail after tis: 205491
avail after is: 139955

The original FileInputStream is not buffered, so there is no way to reset it, and yes, the detector has to read quite a few bytes to do detection; that is why is.available() drops from 205491 to 139955 above.

Note, though, that the TikaInputStream or even a BufferedInputStream will be correctly reset and will have all bytes available.
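
To illustrate the reset behaviour without Tika, here is a minimal JDK-only sketch. The class name MarkResetDemo and the peek helper are made up for illustration; peek only stands in for what a detector does when it reads a prefix of the stream and is not Tika's actual detection code. It compares how many bytes remain available on a raw FileInputStream and on a BufferedInputStream after the same peek.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MarkResetDemo {

    // Hypothetical stand-in for a detector: read up to `limit` bytes,
    // then rewind only if the stream supports mark/reset.
    static int peek(InputStream in, int limit) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(limit);
        }
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        if (canReset) {
            in.reset();
        }
        return total;
    }

    // Writes `size` zero bytes to a temp file, peeks `limit` bytes through
    // either a raw or a buffered stream, and reports available() afterwards.
    static int availableAfterPeek(boolean buffered, int size, int limit) {
        try {
            Path tmp = Files.createTempFile("mark-reset-demo", ".bin");
            try {
                Files.write(tmp, new byte[size]);
                try (InputStream raw = new FileInputStream(tmp.toFile());
                     InputStream in = buffered ? new BufferedInputStream(raw) : raw) {
                    peek(in, limit);
                    return in.available();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("raw:      " + availableAfterPeek(false, 1000, 100));
        System.out.println("buffered: " + availableAfterPeek(true, 1000, 100));
    }
}
```

With a 1000-byte temp file and a 100-byte peek, the raw stream should report 900 bytes left and the buffered one 1000, which mirrors the "avail after is" and "avail after tis" lines in the output above.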

Btw, it is better to call TikaInputStream.get() directly on the file.  If a parser would otherwise need to copy the original input stream to a temp file, it can avoid that copy if you've created your TikaInputStream directly from the file.

TikaInputStream tis = TikaInputStream.get(file);



Re: TIKA-1164

Posted by Chris Mattmann <ma...@apache.org>.
Hi Samuel,

I myself haven’t had a chance to look into this yet - maybe someone else
on the dev list?

Cheers,
Chris




On 7/8/16, 5:33 AM, "scatherine.ext@gouv.mc" <sc...@gouv.mc> wrote:

>Hi,
>
>Sorry for this extra mail, but have you seen my problem?
>
>Regards,
>
>Samuel Catherine
>
>
>
>From: Samuel CATHERINE/Monaco-Gouvernement/MC
>To: "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>@MCGOUV
>Cc: "dev@tika.apache.org" <de...@tika.apache.org>
>Date: 05/07/2016 10:31
>Subject: Re: TIKA-1164
>
>________________________________________
>
>
>Hi Chris,
>
>OK, thanks for the forward.
>To help you: when I work only with an InputStream (for example, from a REST service), I don't have the problem.
>It occurs when I use a File converted to a FileInputStream.
>
>FileInputStream content = new FileInputStream(file);
>
>content.available()
>// is OK right after creation but is no longer OK after the call to detector.detect(TikaInputStream.get(content), md)
>
>Regards,
>
>Samuel Catherine