Posted to user@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/06/12 05:27:20 UTC

Re: Question re installing Tika

Hi Richard,

Hope you are well, will try and answer below:


-----Original Message-----

From: Richard <rg...@hotmail.com>
Date: Friday, June 6, 2014 6:07 AM
To: "user@tika.apache.org" <us...@tika.apache.org>,
"dev-owner@tika.apache.org" <de...@tika.apache.org>
Subject: Question re installing Tika

>Hello
> 
>I am new to the Apache suite of products and dealing with text in pdfs,
>more generally. In particular I am trying to install Tika (the
>tika-app_1.5.jar) as well as Solr on my Windows 7 pc.
>
> 
>However I am confused about how to do the Tika installation.
>
> 
>From reading various webpages (e.g.
>http://tika.apache.org/1.5/gettingstarted.html) it seems I need to
> 
>1) Download the .jar from http://tika.apache.org/download.html (do I
>need to put it in a specific Windows folder?)

Nope, you don't have to put it in any specific folder; anywhere you are
comfortable calling the jar from is fine.

>2) Download Maven 2 (from http://maven.apache.org/ ) and follow the
>instructions for Windows on
>http://maven.apache.org/download.cgi#Installation

No need to do this unless you are building Tika from source.

>3) Also where do I set the base directory?

You just need to put the Tika app jar (tika-app-*version*.jar) into some
folder, and then call it with:
java -jar /path/to/tika-app-*version*.jar --help

> 
>4) Where do I run the command "mvn install" from? Is it the command line?

If you are building from source, then you would run this from the command
line in the top-level directory, the one containing pom.xml, tika-parent,
tika-parsers, etc.

>
>
>Any help would be most gratefully received.

Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






Re: Question re installing Tika

Posted by Tyler Palsulich <tp...@gmail.com>.
Hi Richard,

I forgot to mention, if you're not going to be contributing to Tika, you
don't need to install directly from source. You can just add the following
entry to your pom.xml file:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5</version>
</dependency>

Tyler


On Fri, Jun 27, 2014 at 2:44 AM, Tyler Palsulich <tp...@gmail.com>
wrote:

> Hi Richard,
>
> The code below is derived from Chris' search engine class at USC (
> http://www-scf.usc.edu/~csci572/). Hopefully it will point you in the
> right direction.
>
>         // Open all pdf files, process each one
>         File pdfdir = new File("./some/pdf/directory");
>         File[] pdfs = pdfdir.listFiles();
>         for (File pdf:pdfs) {
>             if (pdf.isFile()) processfile(pdf);
>         }
> //Your process method would look something like
>     private void processfile(File f) {
>         PDFParser parser = new PDFParser();
>         Metadata metadata = new Metadata();
>         FileInputStream fis = new FileInputStream(f);
>         try {
>             FileWriter writer = new FileWriter(f.getName() +
> ".content.txt");
>             parser.parse(fis,
>                     new BodyContentHandler(writer),
>                     metadata,
>                     new ParseContext());
>             writer.flush();
>             writer.close();
>         } finally {
>             fis.close();
>         }
>     }
>
> You can find the details of the parse call here:
> http://tika.apache.org/1.5/parser.html. Let me know if you have any
> questions!
>
> Hope that helps,
> Tyler

Re: Question re installing Tika

Posted by Tyler Palsulich <tp...@gmail.com>.
Hi Richard,

The code below is derived from Chris' search engine class at USC (
http://www-scf.usc.edu/~csci572/). Hopefully it will point you in the right
direction.

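import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// PdfToText is only a placeholder class name
public class PdfToText {
    public static void main(String[] args) throws Exception {
        // Open all pdf files, process each one
        File pdfdir = new File("./some/pdf/directory");
        File[] pdfs = pdfdir.listFiles();
        if (pdfs == null) throw new IOException("Not a readable directory: " + pdfdir);
        for (File pdf : pdfs) {
            if (pdf.isFile()) processfile(pdf);
        }
    }

    // Your process method would look something like
    private static void processfile(File f)
            throws IOException, SAXException, TikaException {
        PDFParser parser = new PDFParser();
        Metadata metadata = new Metadata();
        // try-with-resources (Java 7+) closes both streams, even if parsing fails
        try (InputStream fis = new FileInputStream(f);
             Writer writer = new FileWriter(f.getName() + ".content.txt")) {
            parser.parse(fis,
                    new BodyContentHandler(writer),
                    metadata,
                    new ParseContext());
        }
    }
}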

You can find the details of the parse call here:
http://tika.apache.org/1.5/parser.html. Let me know if you have any
questions!

Hope that helps,
Tyler



On Thu, Jun 26, 2014 at 4:50 AM, Richard <rg...@hotmail.com> wrote:

> Thanks very much Chris ... its all working now.
> You haven't by chance happen to have programmatically looped through a
> directory full of pdfs and used Tika to extract each of their pdf contents
> into separate text or xml files? If so, what do you recommend to do the
> extraction?
> Kind regards
> Richard

RE: Question re installing Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.
My plan is to add a tika-batch package as part of TIKA-1330.  One of the primary use cases will be input directory -> output directory.  There will be hooks for people to add db -> db, and maybe someone with Hadoop skills would be willing to contribute a tika-batch-hadoop package.

That should be ready by the end of this coming week.

But batch (.bat) scripting is far simpler.
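
The "input directory -> output directory" idea can be sketched with the JDK alone. This is not the tika-batch API (TIKA-1330 was still to define that); OutputMapper, target, inputRoot, outputRoot and suffix are hypothetical names used only to show how an output path might mirror the input tree.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika.
class OutputMapper {

    // Maps <inputRoot>/a/b/doc.pdf to <outputRoot>/a/b/doc.pdf<suffix>,
    // keeping the original file name so different inputs cannot collide.
    static Path target(Path inputRoot, Path outputRoot, Path inputFile, String suffix) {
        Path relative = inputRoot.relativize(inputFile);   // a/b/doc.pdf
        Path mirrored = outputRoot.resolve(relative);      // <outputRoot>/a/b/doc.pdf
        return mirrored.resolveSibling(mirrored.getFileName().toString() + suffix);
    }

    public static void main(String[] args) {
        Path in = Paths.get("in");
        Path out = Paths.get("out");
        // Prints the mirrored output path for one sample input file.
        System.out.println(target(in, out, Paths.get("in", "reports", "q1.pdf"), ".txt"));
    }
}
```

Appending the suffix rather than replacing the extension is a design choice of this sketch: it keeps report.pdf and report.doc from writing to the same output file.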

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Thursday, June 26, 2014 8:58 AM
To: user@tika.apache.org
Subject: Re: Question re installing Tika

+1000 I'm not the Windows guru, but will try and look it up






Re: Question re installing Tika

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
+1000 I'm not the Windows guru, but will try and look it up




-----Original Message-----
From: Nick Burch <ap...@gagravarr.org>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, June 26, 2014 5:55 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: Question re installing Tika

>On Thu, 26 Jun 2014, Chris Mattmann wrote:
>> looks like a great example to put on the website too ;)
>
>To be fair to all users, we probably ought to have an example that works
>on windows as well. Any powershell gurus around who care to take a stab
>at 
>the windows equivalent?
>
>Nick
>


Re: Question re installing Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 26 Jun 2014, Chris Mattmann wrote:
> looks like a great example to put on the website too ;)

To be fair to all users, we probably ought to have an example that works
on Windows as well. Any PowerShell gurus around who care to take a stab at
the Windows equivalent?

Nick


Re: Question re installing Tika

Posted by Chris Mattmann <ch...@gmail.com>.
Looks like a great example to put on the website too ;)

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Nick Burch <ap...@gagravarr.org>
Reply-To: <us...@tika.apache.org>
Date: Thursday, June 26, 2014 5:23 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: RE: Question re installing Tika

>On Thu, 26 Jun 2014, Richard wrote:
>> You haven't by chance happen to have programmatically looped through a
>> directory full of pdfs and used Tika to extract each of their pdf
>> contents into separate text or xml files? If so, what do you recommend
>> to do the extraction?
>
>For a proof of concept, how about something simple like a bash for loop
>and the tika app?
>
>for i in *.pdf; do j=`echo "$i" | sed 's/.pdf//'`; java -jar tika-app.jar
>   --text "$i" > "$j.txt"; java -jar tika-app.jar --xml "$i" > "$j.xml";
>done
>
>Nick



RE: Question re installing Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 26 Jun 2014, Richard wrote:
> You haven't by chance happen to have programmatically looped through a 
> directory full of pdfs and used Tika to extract each of their pdf 
> contents into separate text or xml files? If so, what do you recommend 
> to do the extraction?

For a proof of concept, how about something simple like a bash for loop 
and the tika app?

for i in *.pdf; do j="${i%.pdf}"; java -jar tika-app.jar --text "$i" > "$j.txt";
   java -jar tika-app.jar --xml "$i" > "$j.xml"; done

Nick
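
For anyone without bash (on Windows, say), the same loop could be sketched in plain Java with no dependencies beyond the JDK. This is a minimal sketch, not something from the Tika project: TikaAppLoop, outputName, command and extract are hypothetical names, and tika-app.jar is assumed to sit in the current working directory. Only the --text and --xml switches are taken from the loop above.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; assumes tika-app.jar is in the working directory.
class TikaAppLoop {

    // Replaces a trailing ".pdf" (any case) with newExt; other names just get newExt appended.
    static String outputName(String pdfName, String newExt) {
        String base = pdfName.toLowerCase().endsWith(".pdf")
                ? pdfName.substring(0, pdfName.length() - 4)
                : pdfName;
        return base + newExt;
    }

    // Builds the argument list: java -jar <jar> <mode> <pdf>
    static List<String> command(String jar, String mode, File pdf) {
        return new ArrayList<String>(Arrays.asList("java", "-jar", jar, mode, pdf.getPath()));
    }

    // Runs tika-app once and sends its standard output to the given file.
    static void extract(String jar, String mode, File pdf, File out)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command(jar, mode, pdf));
        pb.redirectOutput(out);                            // like "> out" in the shell
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // keep warnings on the console
        int exit = pb.start().waitFor();
        if (exit != 0) {
            System.err.println("tika-app exited with " + exit + " for " + pdf);
        }
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (File f : files) {
            if (f.isFile() && f.getName().toLowerCase().endsWith(".pdf")) {
                extract("tika-app.jar", "--text", f, new File(dir, outputName(f.getName(), ".txt")));
                extract("tika-app.jar", "--xml", f, new File(dir, outputName(f.getName(), ".xml")));
            }
        }
    }
}
```

Once compiled with javac, it can be run as java TikaAppLoop C:\path\to\pdfs; the .txt and .xml files are written next to the PDFs. Starting a new JVM per file is slow, so this is only for a proof of concept, like the bash loop.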

RE: Question re installing Tika

Posted by Richard <rg...@hotmail.com>.
Thanks very much Chris ... it's all working now.
Have you by any chance programmatically looped through a directory full of PDFs and used Tika to extract the contents of each into separate text or XML files? If so, what do you recommend for the extraction?
Kind regards
Richard 
> Date: Mon, 16 Jun 2014 23:03:49 -0700
> Subject: Re: Question re installing Tika
> From: mattmann@apache.org
> To: rgwlawson@hotmail.com; user@tika.apache.org
> CC: dev@tika.apache.org
> 
> Hi Richard,
> 
> No problem at all, my attempted answers below:
> 
> 
> -----Original Message-----
> From: Richard <rg...@hotmail.com>
> Date: Monday, June 16, 2014 3:47 PM
> To: Chris Mattmann <Ch...@jpl.nasa.gov>, "user@tika.apache.org"
> <us...@tika.apache.org>
> Cc: "dev@tika.apache.org" <de...@tika.apache.org>
> Subject: RE: Question re installing Tika
> 
> >Thanks very much for responding to me, Chris. I hope you don't mind if I
> >ask a few more questions about the setup process which I have done to
> >date as follows (and by way of
> > background I have a Windows 7 64 bit pc):
> >
> >
> >1) I downloaded the tika-app-1.5.jar
> ><http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.5.jar> from
> >http://tika.apache.org/download.html
> >2) I was recommended by a friend to rename it to tika-app.jar, which I
> >have done, and placed it in my c:\Users\Myusername directory
> >3) I added the environment variable JAVA_HOME (as a system variable).
> >4) I then brought up the cmd window, changed directory to
> >c:\Users\Myusername and typed in  "java -jar tika-app.jar"
> >
> >
> >However the gui does not appear.
> 
> Yep, if you type java -jar tika-app.jar --help, you'll see the command
> line output and the switches.
> I believe to pull the GUI up you need to do:
> 
> java -jar tika-app.jar --gui
> 
> > 
> >
> >
> >I have the latest version of Java: Version 7 Update 60 but I was
> >wondering if I needed the Java SDK to run this?
> >
> >
> >Many thanks again for your help
> 
> No problem, see above :)
> 
> Cheers,
> Chris
> 
> >
> >
> >Richard
> >
> >

Re: Question re installing Tika

Posted by Chris Mattmann <ma...@apache.org>.
Hi Richard,

No problem at all, my attempted answers below:


-----Original Message-----
From: Richard <rg...@hotmail.com>
Date: Monday, June 16, 2014 3:47 PM
To: Chris Mattmann <Ch...@jpl.nasa.gov>, "user@tika.apache.org"
<us...@tika.apache.org>
Cc: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: Question re installing Tika

>Thanks very much for responding to me, Chris. I hope you don't mind if I
>ask a few more questions about the setup process which I have done to
>date as follows (and by way of
> background I have a Windows 7 64 bit pc):
>
>
>1) I downloaded the tika-app-1.5.jar
><http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.5.jar> from
>http://tika.apache.org/download.html
>2) I was recommended by a friend to rename it to tika-app.jar, which I
>have done, and placed it in my c:\Users\Myusername directory
>3) I added the environment variable JAVA_HOME (as a system variable).
>4) I then brought up the cmd window, changed directory to
>c:\Users\Myusername and typed in  "java -jar tika-app.jar"
>
>
>However the gui does not appear.

Yep, if you type java -jar tika-app.jar --help, you'll see the command
line output and the switches.
I believe to pull the GUI up you need to do:

java -jar tika-app.jar --gui

> 
>
>
>I have the latest version of Java: Version 7 Update 60 but I was
>wondering if I needed the Java SDK to run this?
>
>
>Many thanks again for your help

No problem, see above :)

Cheers,
Chris

>
>
>Richard
>
>



Re: Question re installing Tika

Posted by Chris Mattmann <ma...@apache.org>.
Hi Richard,

No problem at all, my attempted answers below:


-----Original Message-----
From: Richard <rg...@hotmail.com>
Date: Monday, June 16, 2014 3:47 PM
To: Chris Mattmann <Ch...@jpl.nasa.gov>, "user@tika.apache.org"
<us...@tika.apache.org>
Cc: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: RE: Question re installing Tika

>Thanks very much for responding to me, Chris. I hope you don't mind if I
>ask a few more questions about the setup process which I have done to
>date as follows (and by way of
> background I have a Windows 7 64 bit pc):
>
>
>1) I downloaded the tika-app-1.5.jar
><http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.5.jar> from
>http://tika.apache.org/download.html
>2) I was recommended by a friend to rename it to tika-app.jar, which I
>have done, and placed it in my c:\Users\Myusername directory
>3) I added the environment variable JAVA_HOME (as a system variable).
>4) I then brought up the cmd window, changed directory to
>c:\Users\Myusername and typed in  "java -jar tika-app.jar"
>
>
>However the gui does not appear.

Yep, if you type java -jar tika-app.jar --help, you'll see the command
line output and the switches.
I believe to pull the GUI up you need to do:

java -jar tika-app.jar --gui
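
For anyone who would rather start the app from a small Java program than from the cmd window, the sketch below assembles the same command line and runs it. It is only an illustration: the class name TikaLauncher, the helper buildCommand and the default jar name tika-app.jar are made up for this example and are not part of Tika. The --gui switch is the one from the command above. A plain Java runtime (JRE) should be enough to run the jar itself; the JDK is only needed to compile code such as this sketch or to build Tika from source.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Apache Tika.
final class TikaLauncher {

    // Builds the same command you would type at the prompt:
    // java -jar <jarPath> <switches...>
    static List<String> buildCommand(String jarPath, String... switches) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.addAll(Arrays.asList(switches));
        return command;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed default: the jar sits in the current directory under this name.
        String jarPath = args.length > 0 ? args[0] : "tika-app.jar";
        List<String> command = buildCommand(jarPath, "--gui");
        System.out.println("Command: " + String.join(" ", command));

        // Do not try to start anything if the jar is not where we expect it.
        if (!new File(jarPath).isFile()) {
            System.out.println("Jar not found at " + jarPath + ", nothing started.");
            return;
        }

        // Run the command and share this console's input and output with it.
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        System.out.println("Tika exited with code " + exitCode);
    }
}
```

Passing "--help" instead of "--gui" to buildCommand would give the command that prints the list of switches.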

> 
>
>
>I have the latest version of Java: Version 7 Update 60 but I was
>wondering if I needed the Java SDK to run this?
>
>
>Many thanks again for your help

No problem, see above :)

Cheers,
Chris

>
>
>Richard
>
>



RE: Question re installing Tika

Posted by Richard <rg...@hotmail.com>.


Thanks very much for responding to me, Chris. I hope you don't mind if I ask a few more questions about the setup process which I have done to date as follows (and by way of background I have a Windows 7 64 bit pc):

1) I downloaded the tika-app-1.5.jar from http://tika.apache.org/download.html
2) I was recommended by a friend to rename it to tika-app.jar, which I have done, and placed it in my c:\Users\Myusername directory
3) I added the environment variable JAVA_HOME (as a system variable).
4) I then brought up the cmd window, changed directory to c:\Users\Myusername and typed in "java -jar tika-app.jar"

However the gui does not appear.

I have the latest version of Java: Version 7 Update 60 but I was wondering if I needed the Java SDK to run this?

Many thanks again for your help

Richard
