Posted to user@tika.apache.org by Ben Turner <be...@pobox.com> on 2013/05/02 08:54:27 UTC
Problem parsing large (15MB) text files on Ubuntu 10.10
We have been using Tika to process a large variety of files, one at a time,
running it in server mode as follows on an Ubuntu 10.10 machine, with Java
1.7.0_b21 :
java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100
This seems to process all PDFs we throw at it, occasionally bombs out on
PNGs (that's a separate thread), and otherwise processes JPGs (albeit as
"blank text") and other document types without concern.
However, when we threw a larger set of documents at it yesterday, we
noticed our process hanging intermittently, and not always at the same
document after each restart and retry.
The file causing this was a 15MB plain text log file (from our Rails
application) - regrettably this means I can't share it, but if I find
another good example, I will. Processing seemed to spin through several
"chunks" of the file (we are downloading them from AWS) and then pause.
We tried taking AWS out of the question, by downloading the file locally,
and running in Ruby (1.8.7):
require 'socket'
s = TCPSocket.new('localhost', 9100)
File.open("/tmp/big.log", "r") { |f| s.write(f.read) }
s.close_write
puts s.read
s.close
The request still hung and the file was never processed. The same happened
when scanning the file with Tika in "GUI mode".
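One possibility worth ruling out on the client side (an assumption, not
something confirmed in this thread): the Ruby test above writes the entire
file before it reads anything back. If the server streams extracted text
while it is still receiving input, both sides could end up blocked on full
socket buffers. This would not explain the hang in GUI mode. Below is a
minimal sketch that drains the response on a second thread while writing in
chunks. The helper name stream_and_collect and the 64 KB chunk size are
hypothetical, not part of Tika, and the code is written for a current Ruby
rather than tested on 1.8.7:

```ruby
require 'socket'

# Hypothetical helper (not part of Tika): streams a file to a TCP server
# in chunks while a second thread reads the response at the same time.
def stream_and_collect(host, port, path, chunk_size = 64 * 1024)
  socket = TCPSocket.new(host, port)

  # Read the response concurrently so the server is never blocked
  # waiting for the client to consume its output.
  reader = Thread.new { socket.read }

  File.open(path, 'rb') do |file|
    while (chunk = file.read(chunk_size))
      socket.write(chunk)
    end
  end

  socket.close_write # tell the server that the input is complete
  reader.value       # wait for the server to close; returns the full response
ensure
  socket.close if socket
end
```

Against the server started above, this would be called as
stream_and_collect('localhost', 9100, '/tmp/big.log'). If the hang persists
with this client, the cause is more likely on the server side.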
We have also tried using netcat (both nc and ncat, which are different tools
on Ubuntu) although this doesn't seem to work for ANY file on Ubuntu 10.10
- it does seem to work on Ubuntu 12.04, but the Ruby sample above doesn't,
so that's both a clue, and a bit confusing. I've sidelined this as "an
oddity of netcat on Ubuntu 10" but it might be important.
Could there be an underlying OS library / package / behaviour causing Tika
to fail to parse this plain text file? It happily reports back the
metadata when run with the -m switch.
That's the extent of our investigation. Are there any other things we might
look into, or anything else we might be able to provide to assist with
diagnosing the issue?
Regards,
Ben
Re: Problem parsing large (15MB) text files on Ubuntu 10.10
Posted by Dave Meikle <lo...@gmail.com>.
Thanks Ben. I have raised a JIRA ticket[1] so we can track work on this
issue.
It seems to work fine on my Mac, but I can replicate your issue on various
versions of Ubuntu (10.04, 10.10 and 12.04) in my VM lab.
I will run some straces to see what is going on.
Cheers,
Dave
[1] https://issues.apache.org/jira/browse/TIKA-1121
Re: Problem parsing large (15MB) text files on Ubuntu 10.10
Posted by Ben Turner <be...@pobox.com>.
I've created another text file (1.2MB) that fails to scan, as per my
previous post - a copy of it is available here:
https://www.dropbox.com/s/96iw12mrufovmql/gibberish.txt
Regards,
Ben