Posted to user@tika.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2012/07/01 03:49:23 UTC

Re: Server mode documentation?

Hi Jason,

Try this out:

http://wiki.apache.org/tika/TikaJAXRS

We'd be totally happy for feedback and thanks for checking this out!

Cheers,
Chris

On Jun 30, 2012, at 12:26 PM, Jason Judge wrote:

> Is there any documentation on running tika in server mode? I bought the book, hoping for some hints, but there was nothing of use in there.
> 
> I realise that it is "under development", but most OS projects are, and some clues towards how to get started would be appreciated.
> 
> I would like to push documents through tika from a PHP application, so I am assuming server mode would be the best way to do this (less memory requirement, so long as the documents can be queued for one process running constantly on the server). I would need to get metadata, text content, and identify the language and mime type. Doing this through the command line I understand, but through a server mode connection is a total mystery.
> 
> Thanks,
> 
> -- Jason Judge
> 
> 
> -- 
> Jason Judge, Technical Director
> Consilience Media Ltd
> 
> Direct: 0191 303 8492 (with voice mail)
> Office: 0191 251 1104
> Skype: jason_judge
> Mobile: 07771 656 448
> 12 Norham Rd, Whitley Bay, Tyne and Wear NE26 2SB
> 
> jason.judge@consil.co.uk
> www.consil.co.uk


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Server mode documentation?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Jul 1, 2012 at 12:49 PM, Nick Burch <ni...@alfresco.com> wrote:
> However, something seems to have gone wrong and no new copies have been
> built recently (last was in March!). Hopefully someone who knows tika-server
> and/or the maven snapshot repo better can take a look!

The tika-server module was still commented out in the top-level POM
because of the custom repositories it was depending on earlier. That
issue has already been resolved, so in revision 1355868 I uncommented
tika-server in the top-level POM. A snapshot version should become
available on the snapshot repository within an hour or so as the CI
build picks up that revision.

BR,

Jukka Zitting

Re: Server mode documentation?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
I'm totally interested in participating but not sure I could do it in person.

If you guys do it, could you set up a Google Hangout for me?

Thanks Nick!

Cheers,
Chris

On Jul 1, 2012, at 9:37 AM, Nick Burch wrote:

> On Sun, 1 Jul 2012, Jason Judge wrote:
>> So, feature requests, command line, or...learn java. It is going to be a busy Summer :-)
> 
> Anyone up for a Tika hackathon weekend in Oxford later this summer? Jason could hop on a direct train down from Newcastle, and we're only an hour from Heathrow for everyone else... :) Anyone interested?
> 
> Nick




Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
On 01/07/2012 17:37, Nick Burch wrote:
> On Sun, 1 Jul 2012, Jason Judge wrote:
>> So, feature requests, command line, or...learn java. It is going to be a busy
>> Summer :-)
>
> Anyone up for a Tika hackathon weekend in Oxford later this summer? Jason
> could hop on a direct train down from Newcastle, and we're only an hour from
> Heathrow for everyone else... :) Anyone interested?
>
> Nick

I'm not sure I would be able to make it down to Oxford this Summer (but you
never know).

One thing that would be awesome would be a PSR-0 compliant PHP library that
could drive Tika through a PHP-to-Java bridge such as the one at
http://php-java-bridge.sourceforge.net/pjb/ (PHP/Java Bridge). If a PHP
application had access to all the features of Tika, that would be incredibly
useful, and functionality would not be limited by what happens to have been
added to the command-line or server builds for testing.

I'm trying to get this working anyway, but I'm not getting a lot of help from
the java-bridge project; the lists appear to be only for people who already
know what they are doing, which I certainly don't ;-)

-- Jason

Re: Server mode documentation?

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Jason Judge wrote:
> So, feature requests, command line, or...learn java. It is going to be a 
> busy Summer :-)

Anyone up for a Tika hackathon weekend in Oxford later this summer? Jason 
could hop on a direct train down from Newcastle, and we're only an hour 
from Heathrow for everyone else... :) Anyone interested?

Nick

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
On 01/07/2012 16:51, Nick Burch wrote:
> On Sun, 1 Jul 2012, Jason Judge wrote:
>> Am I understanding it correctly that tika-server and tika-app are just two
>> examples of the way tika can be used, and are just thrown together as a
>> quick-start demo rather than core functionality of the main part of the
>> project, which is a collection of libraries and tools to be used by other
>> java applications.
>
> They should be more than a quick-start, but neither are how most people use
> Tika. Most Tika users are Java programmers, so call either the Tika facade
> class (simple use cases), or the Parser/Detector/etc directly (advanced uses).
>
> The tika-app has tended to be used for testing and debugging, but is
> increasingly also being used for non-Java integrations. The tika server is
> quite new, so finding areas where core Tika functionality isn't exposed is to
> be expected. The Tika API is pretty simple and easy to use, so it's generally
> pretty easy for a (Java) programmer to expose extra bits of it in the app or
> server when they have the need. Sadly, this does tend to mean that non Java
> users need to raise enhancement requests when they hit things that aren't
> exposed....
>
> Nick

Thanks Nick. That bit of background helps a lot.

Tika is pretty unique across the whole of the open source landscape in terms of
its flexibility, wide range of inputs and ease of use. I could see its use
expanding into many other areas and platforms to fit a niche. I've already
noticed it is used by some PHP applications, such as Knowledge Tree, but they
tend to use it as a plugin to solr, with solr providing the API into Tika.

So, feature requests, command line, or...learn java. It is going to be a busy
Summer :-)

-- Jason


Re: Server mode documentation?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Guys,

Jason, I'd say your comments are well taken, and Nick's replies are spot on.

I got involved with tika-server after Maxim Valyanskiy built a simple JAX-RS
layer in his $dayjob and was willing to contribute it back in TIKA-593. His original
contribution used the Jersey JAX-RS libraries and I was keenly interested in 
converting it to use Apache CXF since I had such a great experience with
CXF on the Apache OODT project using it to expose our data curation 
web services.

So, I spent a lot of time some months back trying to get this working, with 
Maxim Valyanskiy and with Sergey Beryozkin who lent a hand from the 
CXF project. My focus was on getting the tika-server module working in my
own environment and documenting what *was* there, rather than on putting
on my architecture hat and trying to line things up and improve the APIs
to make them more consistent with tika-app.

That is not to say that we don't want to do that, but just saying that I don't
think it's been done yet. Jason: If you'd like to help us propose something that
helps us get more consistent from your perspective as a newcomer to the 
project, we would love to hear your ideas!

Cheers,
Chris

On Jul 1, 2012, at 8:51 AM, Nick Burch wrote:

> On Sun, 1 Jul 2012, Jason Judge wrote:
>> Am I understanding it correctly that tika-server and tika-app are just two examples of the way tika can be used, and are just thrown together as a quick-start demo rather than core functionality of the main part of the project, which is a collection of libraries and tools to be used by other java applications.
> 
> They should be more than a quick-start, but neither are how most people use Tika. Most Tika users are Java programmers, so call either the Tika facade class (simple use cases), or the Parser/Detector/etc directly (advanced uses).
> 
> The tika-app has tended to be used for testing and debugging, but is increasingly also being used for non-Java integrations. The tika server is quite new, so finding areas where core Tika functionality isn't exposed is to be expected. The Tika API is pretty simple and easy to use, so it's generally pretty easy for a (Java) programmer to expose extra bits of it in the app or server when they have the need. Sadly, this does tend to mean that non Java users need to raise enhancement requests when they hit things that aren't exposed....
> 
> Nick




Re: Server mode documentation?

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Jason Judge wrote:
> Am I understanding it correctly that tika-server and tika-app are just 
> two examples of the way tika can be used, and are just thrown together 
> as a quick-start demo rather than core functionality of the main part of 
> the project, which is a collection of libraries and tools to be used by 
> other java applications.

They should be more than a quick-start, but neither is how most people
use Tika. Most Tika users are Java programmers, so they call either the Tika
facade class (simple use cases), or the Parser/Detector/etc directly
(advanced uses).

The tika-app has tended to be used for testing and debugging, but is 
increasingly also being used for non-Java integrations. The tika server is 
quite new, so finding areas where core Tika functionality isn't exposed is 
to be expected. The Tika API is pretty simple and easy to use, so it's 
generally pretty easy for a (Java) programmer to expose extra bits of it 
in the app or server when they have the need. Sadly, this does tend to
mean that non-Java users need to raise enhancement requests when they hit
things that aren't exposed....

Nick

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
I've looked at the code, and not being a Java programmer, I may be
misunderstanding it. But from what I can see, the tika server will only return
either the metadata as CSV, or the content as plain text. There are no other
formats supported - no metadata as JSON, or content as XHTML.

Is that correct?

So if I want to access all the same features through the server as I can get
through the command line (tika-app), I would need to extend the tika-server
resources to add in new paths to serve as parameters? Extending the current
paths (/meta/whatever and /tika/foobar) is kind of blocked by treating anything
that follows as a keyword for the log files. Being able to use
/meta/{output-format} would have been nice.
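
A minimal sketch of what a non-Java client for the two resources mentioned
above (/tika for plain text, /meta for CSV metadata) might look like, in Python
using only the standard library. Several things here are assumptions rather
than confirmed tika-server behaviour: that the server listens on
localhost:9998 (the address used in the curl example elsewhere in this thread),
that documents are sent with an HTTP PUT, and that each CSV metadata row is a
name followed by one or more values. The names put_document and
parse_metadata_csv are hypothetical helpers, not part of Tika.

```python
import csv
import io
import urllib.request

# Assumption: tika-server is listening here (address taken from the curl example in this thread).
TIKA_URL = "http://localhost:9998"


def put_document(path, data, base_url=TIKA_URL, timeout=30):
    """PUT raw document bytes to a server path and return the response body as text."""
    request = urllib.request.Request(base_url + path, data=data, method="PUT")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_metadata_csv(text):
    """Turn CSV rows of name,value[,value...] into a dict mapping name to a list of values."""
    metadata = {}
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            metadata.setdefault(row[0], []).extend(row[1:])
    return metadata
```

With a server running, put_document("/tika", data) would be expected to return
the plain text and parse_metadata_csv(put_document("/meta", data)) a dictionary
of metadata. Neither call has been checked against a real tika-server build.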

Am I understanding it correctly that tika-server and tika-app are just two
examples of the way tika can be used, and are just thrown together as a
quick-start demo rather than core functionality of the main part of the project,
which is a collection of libraries and tools to be used by other Java applications?

It just feels strange that every way of accessing the functionality (server, app
CLI, GUI, app in server mode) has a wildly different interface, with access to
a different range of functionality, so I am guessing they have all been developed
independently as separate "demo interaction" layers rather than as different
ways to access a common set of functionality.

Would that be a fair appraisal? I'm just trying to get a grip, as an outsider,
on how the project is structured and the mindset behind how it all fits
together, so I have a better idea where to find answers and the best approaches
to use for integration.

Regards,

-- Jason



On 01/07/2012 14:05, Jason Judge wrote:
>
> The one thing I can't see how to do, is how to detect the language. The
> language is neither in the text nor in the metadata. Would I need to fetch the
> XHTML version of the document and get the language out of the header section?
> Not sure how to fetch the XHTML TBH - the documentation only covers plain text.
>
> -- Jason
>
> On 01/07/2012 13:34, Jukka Zitting wrote:
>
...

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
On 01/07/2012 17:31, Mattmann, Chris A (388J) wrote:
> Hi Jason,
>
> On Jul 1, 2012, at 6:05 AM, Jason Judge wrote:
>
>> I see, so tika-app in server mode and tika-server are not the same thing. tika-app in server mode is just a way of providing an alternative input stream, but offers no control through that stream over what it actually does.
>>
>> I have downloaded the tika-server and it works like a charm.
> Glad to hear it's working for ya!
>
>> The one thing I can't see how to do, is how to detect the language. The language is neither in the text nor in the metadata. Would I need to fetch the XHTML version of the document and get the language out of the header section? Not sure how to fetch the XHTML TBH - the documentation only covers plain text.
> I don't think we added a language detection end point yet, but it's certainly
> something we should do.
>
> In case we don't get to it as soon as you get a chance to, feel free to 
> contribute it back by:
>
> 1. filing an issue in our JIRA system at: https://issues.apache.org/jira/browse/TIKA to record the desire for the language detection end point
> 2. submitting a patch and/or working with the committers on that issue you create in #1.
>
> HTH!
>
> Cheers,
> Chris
>
>
Chris,

I've raised https://issues.apache.org/jira/browse/TIKA-944 - hopefully not
scoped too wide.

-- Jason


Re: Server mode documentation?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jason,

On Jul 1, 2012, at 6:05 AM, Jason Judge wrote:

> I see, so tika-app in server mode and tika-server are not the same thing. tika-app in server mode is just a way of providing an alternative input stream, but offers no control through that stream over what it actually does.
> 
> I have downloaded the tika-server and it works like a charm.

Glad to hear it's working for ya!

> 
> The one thing I can't see how to do, is how to detect the language. The language is neither in the text nor in the metadata. Would I need to fetch the XHTML version of the document and get the language out of the header section? Not sure how to fetch the XHTML TBH - the documentation only covers plain text.

I don't think we added a language detection end point yet, but it's certainly
something we should do.

If you get a chance to work on it before we do, feel free to
contribute it back by:

1. filing an issue in our JIRA system at: https://issues.apache.org/jira/browse/TIKA to record the desire for the language detection end point
2. submitting a patch and/or working with the committers on that issue you create in #1.

HTH!

Cheers,
Chris



Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
I see, so tika-app in server mode and tika-server are not the same thing.
tika-app in server mode is just a way of providing an alternative input stream,
but offers no control through that stream over what it actually does.

I have downloaded the tika-server and it works like a charm.

The one thing I can't see how to do, is how to detect the language. The language
is neither in the text nor in the metadata. Would I need to fetch the XHTML
version of the document and get the language out of the header section? Not sure
how to fetch the XHTML TBH - the documentation only covers plain text.

-- Jason

On 01/07/2012 13:34, Jukka Zitting wrote:
> Hi,
>
> On Sun, Jul 1, 2012 at 1:28 PM, Jason Judge <ja...@consil.co.uk> wrote:
>> Is it not tika-app running in server mode that I need? tika-server is only
>> about 800kbytes in size, so could not possibly contain all the functionality
>> that the 25Mbyte tika-app contains.
> There's a more recent tika-server 1.2-SNAPSHOT version now that's 33MB
> in size. That's the one you want.
>
> Note that currently both the tika-server jar and the --server option
> of tika-app provide somewhat similar functionality. The tika-server
> features are described in the wiki page Chris pointed to, while the
> server mode of the tika-app simply parses documents sent through a
> network connection programmatically or with a tool like netcat [1] and
> responds with the parse output as governed by the rest of the tika-app
> command line options.
>
> [1] http://netcat.sourceforge.net/
>
> BR,
>
> Jukka Zitting



Re: Server mode documentation?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Jul 1, 2012 at 1:28 PM, Jason Judge <ja...@consil.co.uk> wrote:
> Is it not tika-app running in server mode that I need? tika-server is only
> about 800kbytes in size, so could not possibly contain all the functionality
> that the 25Mbyte tika-app contains.

There's a more recent tika-server 1.2-SNAPSHOT version now that's 33MB
in size. That's the one you want.

Note that currently both the tika-server jar and the --server option
of tika-app provide somewhat similar functionality. The tika-server
features are described in the wiki page Chris pointed to, while the
server mode of the tika-app simply parses documents sent through a
network connection programmatically or with a tool like netcat [1] and
responds with the parse output as governed by the rest of the tika-app
command line options.

[1] http://netcat.sourceforge.net/

BR,

Jukka Zitting
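
For readers who would rather script the tika-app server mode than use netcat,
here is a minimal Python sketch of the same idea: open a TCP connection, send
the document bytes, and read back whatever the server writes. It assumes the
server reads until end of input and then replies on the same connection. The
port number 12345 and the function name parse_via_socket are hypothetical; use
whatever port was passed to tika-app when it was started.

```python
import socket


def parse_via_socket(data, host="localhost", port=12345, timeout=30.0):
    """Send document bytes over TCP, signal end of input, and return everything sent back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(data)
        # Half-close the connection so the server sees end of input
        # while we can still read its reply.
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # server closed the connection: reply is complete
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

The output format is not chosen by the client here; as described above, it is
governed by the command line options the server was started with.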

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
> Hi,
>
> On Sun, Jul 1, 2012 at 2:37 PM, Mark Kerzner <ma...@shmsoft.com> wrote:
>> Do you mean that every call to Tika is a JVM startup? But it looks like a
>> straight Java call to me, if your application is already running inside of a
>> JVM?
> In that case there's no significant startup cost, at least once you've
> already loaded the Tika classes to memory. The server mode is
> typically more interesting for non-Java clients that face the question
> of either executing tika-app separately for each document or accessing
> an already running server process.
>
> BR,
>
> Jukka Zitting

This is my situation. I would like to treat the document text and metadata (and
language) extraction as a web service. tika will run on its own, in its own
environment, and other non-java applications will send it requests and get back
results.

-- Jason


Re: Server mode documentation?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Jul 1, 2012 at 2:37 PM, Mark Kerzner <ma...@shmsoft.com> wrote:
> Do you mean that every call to Tika is a JVM startup? But it looks like a
> straight Java call to me, if your application is already running inside of a
> JVM?

In that case there's no significant startup cost, at least once you've
already loaded the Tika classes to memory. The server mode is
typically more interesting for non-Java clients that face the question
of either executing tika-app separately for each document or accessing
an already running server process.

BR,

Jukka Zitting

Re: Server mode documentation?

Posted by Mark Kerzner <ma...@shmsoft.com>.
Do you mean that every call to Tika is a JVM startup? But it looks like a
straight Java call to me, if your application is already running inside of
a JVM?

PS. I am using Tika inside a Hadoop mapper, so if there are significant
initialisations, I could do them in the mapper setup() call.

Thank you,
Mark

On Sun, Jul 1, 2012 at 7:28 AM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Sun, Jul 1, 2012 at 2:17 PM, Mark Kerzner <ma...@shmsoft.com>
> wrote:
> > Out of curiosity, what would be the performance benefit of server vs
> > initialising every time?
>
> You replace JVM startup overhead with that of a transmitting the
> document over a network connection. How that affects overall system
> performance depends on your deployment details.
>
> BR,
>
> Jukka Zitting
>

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
In my case - initially at least - the tika server would be on the same physical
server as the application needing to extract text from the documents that are
uploaded to it. So network traffic is not so much an issue.

The main advantages I can see are:

1. Speed - the server is up and running all the time, so it can process a
document immediately. Obviously, with many requests coming in fast, they could
get backed up in a queue, but I'm hoping that queue would clear faster than
launching a separate process for each document would.

2. Memory usage - by running the server, the memory usage can be more easily
controlled. It would use memory all the time it was running, but that would be
in a process completely independent of the web application that needs the
documents processed. If the web application needed to run a command line script
every time, with a 25MB JAR file (before it is decompressed) and a Java run-time,
and the document being processed in memory, then I can see all sorts of memory
issues getting in the way of its operation.

-- Jason


On 01/07/2012 13:28, Jukka Zitting wrote:
> Hi,
>
> On Sun, Jul 1, 2012 at 2:17 PM, Mark Kerzner <ma...@shmsoft.com> wrote:
>> Out of curiosity, what would be the performance benefit of server vs
>> initialising every time?
> You replace JVM startup overhead with that of a transmitting the
> document over a network connection. How that affects overall system
> performance depends on your deployment details.
>
> BR,
>
> Jukka Zitting



Re: Server mode documentation?

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Jul 1, 2012 at 2:17 PM, Mark Kerzner <ma...@shmsoft.com> wrote:
> Out of curiosity, what would be the performance benefit of server vs
> initialising every time?

You replace JVM startup overhead with that of transmitting the
document over a network connection. How that affects overall system
performance depends on your deployment details.

BR,

Jukka Zitting

Re: Server mode documentation?

Posted by Mark Kerzner <ma...@shmsoft.com>.
Out of curiosity, what would be the performance benefit of server vs
initialising every time?

Mark

On Sun, Jul 1, 2012 at 6:28 AM, Jason Judge <ja...@consil.co.uk> wrote:

>  Nick,
>
> Is it not tika-app running in server mode that I need? tika-server is only
> about 800kbytes in size, so could not possibly contain all the
> functionality that the 25Mbyte tika-app contains.
>
> Unless, of course, the size is a consequence of something going wrong in
> the snapshot builds?
>
> -- Jason
>
>
> On 01/07/2012 11:49, Nick Burch wrote:
>> On Sun, 1 Jul 2012, Nick Burch wrote:
>>> Tika snapshots are available in the Snapshot Repository:
>>>   http://repository.apache.org/snapshots/org/apache/tika/
>>> The current latest tika-app snapshot is:
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika-app-1.2-20120630.180509-70.jar
>>> (Look in the parent directory to see what's the latest)
>>
>> Ooops, just realised you'll want tika-server not tika-app for what you're
>> doing
>>
>> The latest tika-server ought to be available from:
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server/1.2-SNAPSHOT/
>>
>> However, something seems to have gone wrong and no new copies have been
>> built recently (last was in March!). Hopefully someone who knows
>> tika-server and/or the maven snapshot repo better can take a look!
>>
>> Nick

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
Nick,

Is it not tika-app running in server mode that I need? tika-server is only about
800kbytes in size, so could not possibly contain all the functionality that the
25Mbyte tika-app contains.

Unless, of course, the size is a consequence of something going wrong in the
snapshot builds?

-- Jason

On 01/07/2012 11:49, Nick Burch wrote:
> On Sun, 1 Jul 2012, Nick Burch wrote:
>> Tika snapshots are available in the Snapshot Repository:
>>   http://repository.apache.org/snapshots/org/apache/tika/
>> The current latest tika-app snapshot is:
>>  
>> https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika-app-1.2-20120630.180509-70.jar
>> (Look in the parent directory to see what's the latest)
>
> Ooops, just realised you'll want tika-server not tika-app for what you're doing
>
> The latest tika-server ought to be available from:
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server/1.2-SNAPSHOT/
>
>
> However, something seems to have gone wrong and no new copies have been built
> recently (last was in March!). Hopefully someone who knows tika-server and/or
> the maven snapshot repo better can take a look!
>
> Nick



Re: Server mode documentation?

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Nick Burch wrote:
> Tika snapshots are available in the Snapshot Repository:
>   http://repository.apache.org/snapshots/org/apache/tika/
> The current latest tika-app snapshot is:
>   https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika-app-1.2-20120630.180509-70.jar
> (Look in the parent directory to see what's the latest)

Ooops, just realised you'll want tika-server not tika-app for what you're 
doing

The latest tika-server ought to be available from:
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-server/1.2-SNAPSHOT/

However, something seems to have gone wrong and no new copies have been 
built recently (last was in March!). Hopefully someone who knows 
tika-server and/or the maven snapshot repo better can take a look!

Nick

Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
Thanks. I just need to get something minimal up that works. If I need to tweak
its functionality later, then I can certainly get into the building of a custom
version, but for now it would be a massive diversion.

I'm still getting the hang (e.g. "curl -X GET http://localhost:9998/tika"
just never comes back, and interrupting curl gives a "Connection reset" error
on the tika server). The GUI mode works great under Windows, but I'll try this
under Linux to see if I get any different results.
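
When a request hangs like this, a client-side timeout at least turns the hang
into a reportable failure. The following is a minimal Python sketch that issues
the same GET as the curl command above and gives up after a few seconds. The
URL is the one quoted above; the function name probe and the 5-second default
are hypothetical choices, and nothing here says anything about why the server
fails to answer.

```python
import urllib.error
import urllib.request


def probe(url="http://localhost:9998/tika", timeout=5.0):
    """Return (ok, detail) for a GET on url, giving up after timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status.
        return False, "HTTP %d" % error.code
    except OSError as error:
        # No usable answer: connection refused, reset, or timed out.
        return False, "no response: %s" % error
```

A result of (False, "no response: ...") after the timeout would match the
hanging behaviour described above, while an HTTP status of any kind would show
that the server is at least accepting and answering requests.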

-- Jason


On 01/07/2012 11:46, Nick Burch wrote:
> On Sun, 1 Jul 2012, Jason Judge wrote:
>> Thank you. I don't even know how I would have begun finding that page - I
>> searched everywhere for documentation, and never came up with that.
>
> I think the docs probably started on the wiki when the feature was new and
> experimental. I wonder if it's now time for someone to promote that to the
> main site?
>
>> Do I need to build my own 1.2 snapshot from source, or is there a nightly
>> snapshot that is built and can be downloaded? Sorry if these seem daft
>> questions - I really am searching hard for these answers, but navigating
>> around the Apache sites as a newcomer to these sites, is not easy.
>
> I'd normally recommend someone just grabs the source and builds with maven, so
> you'd have the source tree available for changes. However, as you're not a
> Java programmer, the chance to dig in / change things is likely to be less of
> interest...
>
> Tika snapshots are available in the Snapshot Repository:
>    http://repository.apache.org/snapshots/org/apache/tika/
> The current latest tika-app snapshot is:
>   
> https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika-app-1.2-20120630.180509-70.jar
> (Look in the parent directory to see what's the latest)
>
> Nick



Re: Server mode documentation?

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Jason Judge wrote:
> Thank you. I don't even know how I would have begun finding that page - 
> I searched everywhere for documentation, and never came up with that.

I think the docs probably started on the wiki when the feature was new and 
experimental. I wonder if it's now time for someone to promote that to the 
main site?

> Do I need to build my own 1.2 snapshot from source, or is there a 
> nightly snapshot that is built and can be downloaded? Sorry if these 
> seem daft questions - I really am searching hard for these answers, but 
> navigating around the Apache sites as a newcomer to these sites, is not 
> easy.

I'd normally recommend someone just grabs the source and builds with 
maven, so you'd have the source tree available for changes. However, as 
you're not a Java programmer, the chance to dig in / change things is 
likely to be less of interest...

Tika snapshots are available in the Snapshot Repository:
    http://repository.apache.org/snapshots/org/apache/tika/
The current latest tika-app snapshot is:
    https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.2-SNAPSHOT/tika-app-1.2-20120630.180509-70.jar
(Look in the parent directory to see what's the latest)

Nick

Re: Server mode documentation?

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Jason,

Seems like you got this figured out.

Thanks!

Cheers,
Chris

On Jul 1, 2012, at 3:14 AM, Jason Judge wrote:

> Thank you. I don't even know how I would have begun finding that page - I searched everywhere for documentation, and never came up with that.
> 
> I have tried these examples with tika 1.1 on my Windows machine, but the curl requests just hang, and the tika server spits out the occasional fatal error. The example usage page uses tika 1.2 (snapshot), so maybe 1.1 is simply not ready for this kind of use yet?
> 
> Do I need to build my own 1.2 snapshot from source, or is there a nightly snapshot that is built and can be downloaded? Sorry if these seem daft questions - I really am searching hard for these answers, but navigating around the Apache sites as a newcomer to these sites, is not easy.
> 
> -- Jason
> 
> 
> On 01/07/2012 02:49, Mattmann, Chris A (388J) wrote:
>> Hi Jason,
>> 
>> Try this out:
>> 
>> 
>> http://wiki.apache.org/tika/TikaJAXRS
>> 
>> 
>> We'd be totally happy for feedback and thanks for checking this out!
>> 
>> Cheers,
>> Chris
>> 
>> On Jun 30, 2012, at 12:26 PM, Jason Judge wrote:
>> 
>> 
>>> Is there any documentation on running tika in server mode? I bought the book, hoping for some hints, but there was nothing of use in there.
>>> 
>>> I realise that it is "under development", but most OS projects are, and some clues towards how to get started would be appreciated.
>>> 
>>> I would like to push documents through tika from a PHP application, so I am assuming server mode would be the best way to do this (less memory requirement, so long as the documents can be queued for one process running constantly on the server). I would need to get metadata, text content, and identify the language and mime type. Doing this through the command line I understand, but through a server mode connection is a total mystery.
>>> 
>>> Thanks,
>>> 
>>> -- Jason Judge
>>> 
>>> 
>> 
>> 
> 
> 




Re: Server mode documentation?

Posted by Jason Judge <ja...@consil.co.uk>.
Thank you. I don't even know how I would have begun finding that page - I
searched everywhere for documentation, and never came up with that.

I have tried these examples with tika 1.1 on my Windows machine, but the curl
requests just hang, and the tika server spits out the occasional fatal error.
The example usage page uses tika 1.2 (snapshot), so maybe 1.1 is simply not
ready for this kind of use yet?

Do I need to build my own 1.2 snapshot from source, or is there a nightly
snapshot that is built and can be downloaded? Sorry if these seem daft questions
- I really am searching hard for these answers, but navigating around the Apache
sites as a newcomer is not easy.

-- Jason
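
For readers following along, here is a minimal sketch of what the curl
examples amount to: an HTTP PUT of the raw document to a resource on the
running server. It assumes the server listens on http://localhost:9998 and
exposes "tika" (extracted text) and "meta" (metadata) resources as described
on the wiki page; check both against the version you actually run. The class
name TikaServerClient and the 60-second timeout are hypothetical choices, and
the code uses the JDK 11+ HttpClient rather than anything from Tika itself.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;

class TikaServerClient {
    /** Builds a PUT request that uploads the document bytes to one server resource. */
    static HttpRequest put(String baseUrl, String resource, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + resource))
                .timeout(Duration.ofSeconds(60)) // fail instead of hanging indefinitely
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 1) {
            System.out.println("usage: java TikaServerClient.java <file> [baseUrl]");
            return;
        }
        // Assumption: the server was started locally on port 9998.
        String baseUrl = args.length > 1 ? args[1] : "http://localhost:9998";
        byte[] document = Files.readAllBytes(Paths.get(args[0]));
        HttpClient client = HttpClient.newHttpClient();
        // Assumption: "meta" returns metadata and "tika" returns extracted text.
        for (String resource : new String[] {"meta", "tika"}) {
            HttpResponse<String> response = client.send(
                    put(baseUrl, resource, document),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(resource + " -> HTTP " + response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

The same two requests can be issued from PHP or any other language with an
HTTP client; nothing here is specific to Java. The request timeout is there
so that a server that never answers produces an error instead of a hang.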


On 01/07/2012 02:49, Mattmann, Chris A (388J) wrote:
> Hi Jason,
>
> Try this out:
>
> http://wiki.apache.org/tika/TikaJAXRS
>
> We'd be totally happy for feedback and thanks for checking this out!
>
> Cheers,
> Chris
>
> On Jun 30, 2012, at 12:26 PM, Jason Judge wrote:
>
>> Is there any documentation on running tika in server mode? I bought the book, hoping for some hints, but there was nothing of use in there.
>>
>> I realise that it is "under development", but most OS projects are, and some clues towards how to get started would be appreciated.
>>
>> I would like to push documents through tika from a PHP application, so I am assuming server mode would be the best way to do this (less memory requirement, so long as the documents can be queued for one process running constantly on the server). I would need to get metadata, text content, and identify the language and mime type. Doing this through the command line I understand, but through a server mode connection is a total mystery.
>>
>> Thanks,
>>
>> -- Jason Judge
>>
>>
>



Re: Text extraction from large PDF files

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Zabrane Mickael wrote:
> So there's no way to avoid loading the file into RAM? That's sad.
>
> Any other advice, guys?

You could try asking on the Apache PDFBox list for advice - Tika's PDF 
extraction is all powered by PDFBox.

Nick

Re: Text extraction from large PDF files

Posted by Zabrane Mickael <za...@gmail.com>.
Hi Nick,

So there's no way to avoid loading the file into RAM? That's sad.

Any other advice, guys?

Regards,
Zabrane

On Jul 1, 2012, at 12:42 PM, Nick Burch wrote:

> On Sun, 1 Jul 2012, Zabrane Mickael wrote:
>> I've a couple of big PDF files (between 100 and 200 MB).
>> 
>> Can someone show me a way to extract text from them chunk by chunk (i.e. without loading the whole file in RAM)?
> 
> I believe that the PDF format doesn't support processing in that way - I think you have to process the whole thing before you can make sense of it
> 
> Nick



Re: Text extraction from large PDF files

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 1 Jul 2012, Zabrane Mickael wrote:
> I've a couple of big PDF files (between 100 and 200 MB).
>
> Can someone show me a way to extract text from them chunk by chunk (i.e. 
> without loading the whole file in RAM)?

I believe that the PDF format doesn't support processing in that way - I 
think you have to process the whole thing before you can make sense of it

Nick

Re: Text extraction from large PDF files

Posted by Zabrane Mickael <za...@gmail.com>.
Thanks Kumar for sharing this code.
I'll test it tomorrow.

Regards,
Zabrane

On Jul 1, 2012, at 11:05 AM, Anuj Kumar wrote:

> Hi Zab,
> 
> Have you tried TikaInputStream?
> 
> Here is a snippet with AutoDetectParser-
> 
> Initialize
> -------------------
> this.context = new ParseContext();
> this.parser = new AutoDetectParser();
> this.context.set(Parser.class, parser);
> 
> Parse
> ---------
> // create a string writer object
> StringWriter textBuffer = new StringWriter();
> // get the Tika Input Stream from the input stream to target file
> InputStream stream = TikaInputStream.get(inStream);
> // create a content handler
> ContentHandler handler = new TeeContentHandler(
> 		getTextContentHandler(textBuffer));
> // create metadata object
> Metadata metadata = new Metadata();
> try {
> 	// parse the document
> 	parser.parse(stream, handler, metadata, context);
> 	// return the parsed text
> 	return textBuffer.toString();
> } catch (SAXException ex) {
> 	// log the exception
> 	LOG.error(ex.getMessage());
> 	// throw as IOException
> 	throw new IOException(ex);
> } catch (TikaException ex) {
> 	// log the exception
> 	LOG.error(ex.getMessage());
> 	ex.printStackTrace();
> 	// throw as IOException
> 	throw new IOException(ex);
> } finally {
> 	if (stream != null) {
> 		// close the stream
> 		stream.close();
> 	}
> }
> 
> Regards,
> Anuj
> 
> On Sun, Jul 1, 2012 at 2:21 PM, Zabrane Mickael <za...@gmail.com> wrote:
> Hi guys,
> 
> I've a couple of big PDF files (between 100 and 200 MB).
> 
> Can someone show me a way to extract text from them chunk by chunk (i.e. without
> loading the whole file in RAM)?
> 
> Is there a simple way to do it? Code to share?
> 
> Thanks
> Zab
> 




Re: Text extraction from large PDF files

Posted by Anuj Kumar <an...@gmail.com>.
Hi Zab,

Have you tried TikaInputStream?

Here is a snippet with AutoDetectParser-

Initialize
-------------------
this.context = new ParseContext();
this.parser = new AutoDetectParser();
this.context.set(Parser.class, parser);

Parse
---------
// create a string writer object
StringWriter textBuffer = new StringWriter();
// get the Tika Input Stream from the input stream to target file
InputStream stream = TikaInputStream.get(inStream);
// create a content handler
ContentHandler handler = new TeeContentHandler(
		getTextContentHandler(textBuffer));
// create metadata object
Metadata metadata = new Metadata();
try {
	// parse the document
	parser.parse(stream, handler, metadata, context);
	// return the parsed text
	return textBuffer.toString();
} catch (SAXException ex) {
	// log the exception
	LOG.error(ex.getMessage());
	// throw as IOException
	throw new IOException(ex);
} catch (TikaException ex) {
	// log the exception
	LOG.error(ex.getMessage());
	ex.printStackTrace();
	// throw as IOException
	throw new IOException(ex);
} finally {
	if (stream != null) {
		// close the stream
		stream.close();
	}
}

Regards,
Anuj
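
One observation on the snippet above: it collects all extracted text in a
StringWriter, so the complete text is held in memory before it is returned.
A possible variation, sketched here under assumptions, is a SAX handler that
forwards each character event to a Writer (for example a file) as it arrives.
StreamingTextHandler is a hypothetical name, and the sketch uses only the
JDK's org.xml.sax classes so that it runs on its own. Tika's
BodyContentHandler also has a constructor that takes a Writer and may be the
more natural choice in real code. This limits memory used for the extracted
text only; it does not change how the PDF parser itself reads the file.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Forwards SAX character events to a Writer as they arrive instead of buffering them. */
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length); // write this chunk straight through
        } catch (IOException e) {
            throw new SAXException("could not write extracted text", e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        characters(ch, start, length); // keep whitespace so words do not run together
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush();
        } catch (IOException e) {
            throw new SAXException("could not flush extracted text", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parser: feed two chunks by hand. In real use the handler
        // would be passed to parser.parse(stream, handler, metadata, context) and
        // the Writer would wrap a file rather than a StringWriter.
        StringWriter sink = new StringWriter();
        StreamingTextHandler handler = new StreamingTextHandler(sink);
        char[] first = "first chunk, ".toCharArray();
        char[] second = "second chunk".toCharArray();
        handler.startDocument();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        handler.endDocument();
        System.out.println(sink);
    }
}
```

Unlike a body-only handler, this sketch writes every character event it
receives, including any text inside the document head, so the output may
contain the title as well as the body text.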

On Sun, Jul 1, 2012 at 2:21 PM, Zabrane Mickael <za...@gmail.com> wrote:

> Hi guys,
>
> I've a couple of big PDF files (between 100 and 200 MB).
>
> Can someone show me a way to extract text from them chunk by chunk (i.e.
> without loading the whole file in RAM)?
>
> Is there a simple way to do it? Code to share?
>
> Thanks
> Zab

Text extraction from large PDF files

Posted by Zabrane Mickael <za...@gmail.com>.
Hi guys,

I've a couple of big PDF files (between 100 and 200 MB).

Can someone show me a way to extract text from them chunk by chunk (i.e. without
loading the whole file in RAM)?

Is there a simple way to do it? Code to share?

Thanks
Zab