Posted to user@tika.apache.org by Ben Turner <be...@pobox.com> on 2013/05/02 08:54:27 UTC

Problem parsing large (15MB) text files on Ubuntu 10.10

We have been using Tika to process a large variety of files, one at a time,
running it in server mode as follows on an Ubuntu 10.10 machine, with Java
1.7.0_b21 :

java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100

This seems to process all PDFs we throw at it, occasionally bombs out on
PNGs (that's a separate thread) and otherwise processes JPGs (albeit as
"blank text") and other document types without concern.

However, when we threw a larger set of documents at it yesterday, we
noticed our process hanging intermittently, and not always at the same
document after each restart and retry.

The file causing this was a 15MB plain text log file (from our rails
application) - regrettably this means I can't share it, but if I find
another good example, I will. This file seemed to spin through several
"chunks" of the file (we are downloading them from AWS) and then pause.

We tried taking AWS out of the question, by downloading the file locally,
and running in Ruby (1.8.7):

require 'socket'; s = TCPSocket.new('localhost', 9100);
File.open("/tmp/big.log", "r") { |f| s.write(f.read); s.close_write;
  puts s.read }; s.close

This file still hung, failing to process. This was also the case trying to
scan the file running Tika in "GUI mode".
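
One thing that may be worth ruling out (an assumption, not something
confirmed in this thread) is a socket-buffer stall: the one-liner above
writes the whole file before it reads anything back, so if the server starts
sending extracted text while it is still receiving input, both sides could
block once their buffers fill. This would not explain the hang in GUI mode,
so treat it only as a way to eliminate one variable. A minimal sketch of a
client that drains the response in a second thread while sending; the
send_and_collect helper and the 64 KB chunk size are hypothetical and not
part of Tika:

```ruby
require 'socket'

# Hypothetical helper (not part of Tika): streams a file to a server socket
# while a second thread reads the reply, so neither side has to wait on a
# full socket buffer.
def send_and_collect(host, port, path, chunk_size = 64 * 1024)
  socket = TCPSocket.new(host, port)
  # Start draining the response before any input is sent.
  reader = Thread.new { socket.read }
  File.open(path, 'rb') do |file|
    # File#read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      socket.write(chunk)
    end
  end
  socket.close_write # signal end of input to the server
  reader.value       # wait for the reader thread and return the full reply
ensure
  socket.close if socket
end

# Example call, assuming the server from the command above is on port 9100:
#   puts send_and_collect('localhost', 9100, '/tmp/big.log')
```

If this variant completes where the one-liner hangs, the problem is more
likely in how the client and server exchange data than in the parser itself.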

We have also tried using netcat (both nc and ncat, which are different tools
on Ubuntu), although this doesn't seem to work for ANY file on Ubuntu 10.10.
It does seem to work on Ubuntu 12.04, but the Ruby sample above doesn't,
so that's both a clue and a bit confusing. I've sidelined this as "an
oddity of netcat on Ubuntu 10", but it might be important.

Could there be an underlying OS library / package / behaviour causing Tika
to fail to parse this plain text file? It happily reports back the
metadata when run with the -m switch.

That's the extent of our investigation. Are there any other things we might
look into, or anything else we might be able to provide to assist with
diagnosing the issue?

Regards,
Ben

Re: Problem parsing large (15MB) text files on Ubuntu 10.10

Posted by Dave Meikle <lo...@gmail.com>.
Thanks Ben. I have raised a JIRA ticket[1] so we can track work on this
issue.

It seems to work fine on my Mac, but I can replicate your issue on various
versions of Ubuntu (10.04, 10.10 and 12.04) in my VM lab.

Will do some straces to see what is going on.

Cheers,
Dave

[1] https://issues.apache.org/jira/browse/TIKA-1121

Re: Problem parsing large (15MB) text files on Ubuntu 10.10

Posted by Ben Turner <be...@pobox.com>.
I've created another text file (1.2MB) that fails to scan, as per my
previous post - a copy of it is available here:

https://www.dropbox.com/s/96iw12mrufovmql/gibberish.txt

Regards,
Ben

