Posted to user@tika.apache.org by Ben Turner <be...@pobox.com> on 2013/03/12 06:33:55 UTC

Fwd: Broken socket pipe when writing a PNG to Tika (server mode)

We are trying to use Tika in a Ruby environment, aiming to read data from
Amazon S3 and stream it into ElasticSearch.

* Currently this means running Tika in server mode:

java -jar tika-app-1.3.jar -t --server --port 12345

* We then talk to it via ruby sockets (for non-rubyists, this streams a
document from the file system into our local tika server over a simple
socket) :

#!/usr/bin/env ruby
require 'socket'
TCPSocket.open('127.0.0.1', 12345) do |socket|
   File.open('/tmp/test.png', 'r') do |chunk|
     socket.write(chunk)
   end
   socket.close_write
   puts socket.read
end

* This works for PDF and JPEG files - outputting the text content for PDFs
and nothing for JPEGs. However, whenever I stream a PNG to tika, the ruby
code bombs out with a 'broken pipe' error during one of the writes.

I have added logging and seen a number of chunks do get written, but
somewhere in the file it fails. There is no output from the tika server
when this happens. I have also looked at the packets with Wireshark and
cannot see any obvious "null character" being written to cause the problem,
but it is binary data, so may not be so obvious to my limited knowledge at
this level.

In GUI mode, tika has no problem opening the PNG files in question.

So whilst I accept it could be something Ruby-side, PNGs fail fairly
consistently to transmit in server mode, so I wondered if anyone might
know why, and whether this is a known issue with Tika?

Regards,
Ben

Re: Broken socket pipe when writing a PNG to Tika (server mode)

Posted by Ben Turner <be...@pobox.com>.
Dave,

My environment is Ubuntu 10.10. My colleague has reproduced it on Ubuntu
10.04 too.

Ben


On 1 May 2013 04:38, Dave Meikle <lo...@gmail.com> wrote:

> Hi Ben,
>
> On 23 Apr 2013, at 08:22, Ben Turner <be...@pobox.com> wrote:
>
> > Hi Dave,
> >
> > Apologies to come back to this over a month later, but we had worked
> > around / not seen the issue for a while, but as we start to ramp up our
> > testing it's come back.
> > Investigating it from several angles today, the problem seems to be that
> > SOME PNG files are failing when being parsed by Tika, but only when the -T
> > or -t switch is applied.
>
> No problem at all.
>
> > Errno::ECONNRESET: (eval):19:in `read': Connection reset by peer
> >       from /home/bturner/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/interactive_editor-0.0.10/lib/interactive_editor.rb:55:in `eval'
> >       from (eval):19:in `do_it'
> >       from (eval):15:in `open'
> >       from (eval):15:in `do_it'
> >       from (eval):14:in `open'
> >       from (eval):14:in `do_it'
> >       from (eval):26
>
> Quick question - what operating system are you running on?
>
> I cannot get this to fail locally on my MacBook, even across a range of
> tests with different files, but a quick test on a Linux environment before
> jumping on the train does seem to reproduce the issue.
>
> I will try to confirm this and narrow things down later on but would be
> interested in your environment too.
>
> Cheers,
> Dave
>
>

Re: Broken socket pipe when writing a PNG to Tika (server mode)

Posted by Dave Meikle <lo...@gmail.com>.
Hi Ben,

On 23 Apr 2013, at 08:22, Ben Turner <be...@pobox.com> wrote:

> Hi Dave,
> 
> Apologies to come back to this over a month later, but we had worked around / not seen the issue for a while, but as we start to ramp up our testing it's come back.
> Investigating it from several angles today, the problem seems to be that SOME PNG files are failing when being parsed by Tika, but only when the -T or -t switch is applied.

No problem at all.

> Errno::ECONNRESET: (eval):19:in `read': Connection reset by peer
> 	from /home/bturner/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/interactive_editor-0.0.10/lib/interactive_editor.rb:55:in `eval'
> 	from (eval):19:in `do_it'
> 	from (eval):15:in `open'
> 	from (eval):15:in `do_it'
> 	from (eval):14:in `open'
> 	from (eval):14:in `do_it'
> 	from (eval):26

Quick question - what operating system are you running on?

I cannot get this to fail locally on my MacBook, even across a range of tests with different files, but a quick test on a Linux environment before jumping on the train does seem to reproduce the issue.

I will try to confirm this and narrow things down later on but would be interested in your environment too.

Cheers,
Dave

  

Re: Broken socket pipe when writing a PNG to Tika (server mode)

Posted by Ben Turner <be...@pobox.com>.
Hi Dave,

Apologies to come back to this over a month later, but we had worked around
/ not seen the issue for a while, but as we start to ramp up our testing
it's come back.
Investigating it from several angles today, the problem seems to be that
SOME PNG files are failing when being parsed by Tika, but only when the -T
or -t switch is applied.

So I am currently running tika locally (under Java 1.6.0_26) using the
following command:  java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100

And then running the following Ruby code (under ruby 1.8.7 patch 371,
although I think this would work on all releases)

#!/usr/bin/env ruby
require 'socket'

class FileStreamer
  attr_reader :filename

  def initialize(filename)
    @filename = filename
  end

  def do_it
    TCPSocket.open('127.0.0.1', 9100) do |socket|
      File.open(filename) do |file|
        content = file.read
        socket.write(content)
        socket.close_write
        puts socket.read
      end
    end
  end
end

file_streamer = FileStreamer.new('./Pictures/test.png').do_it

--> This then throws the following error:

Errno::ECONNRESET: (eval):19:in `read': Connection reset by peer
        from /home/bturner/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/interactive_editor-0.0.10/lib/interactive_editor.rb:55:in `eval'
        from (eval):19:in `do_it'
        from (eval):15:in `open'
        from (eval):15:in `do_it'
        from (eval):14:in `open'
        from (eval):14:in `do_it'
        from (eval):26

The file I am using to cause this error can be downloaded from
http://imgur.com/r/quotesporn/hUGXn using the "Download Full Resolution"
link - or this direct link: http://bit.ly/ZLT9Xs

Our process is trying to extract content only (and not metadata) from all
files that are thrown at it - we realise this means PNG and JPEG files will
return nothing, but we're trying to handle all files the same, where
possible, as we can't be 100% sure of the file types before processing.
Hence we use the -t flag, and NOT the -m flag. It should be noted that
changing the -t flag to -m flag causes the PNG to be correctly processed
with a blank return value. Also it should be noted that we've not
experienced this behaviour from JPEGs or other "no textual content" formats
so far.
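
Since the goal described above is to treat every file the same way and accept a blank result for formats with no text, one possible stopgap is to map a dropped connection to an empty string. The sketch below is only an illustration: the helper name blank_on_disconnect is hypothetical, and it hides the symptom (ECONNRESET, or EPIPE for the earlier 'broken pipe' error) rather than addressing whatever makes the server drop the connection.

```ruby
require 'socket'

# Hypothetical helper (the name is illustrative, not from Tika or this thread).
# Runs the given block and returns its value; if the peer drops the connection
# while the block is running, logs a warning and returns an empty string.
def blank_on_disconnect
  yield
rescue Errno::ECONNRESET, Errno::EPIPE => e
  warn "extraction failed: #{e.class}"
  ''
end

# Possible usage, assuming a Tika server listening on port 9100 as above:
#
#   text = blank_on_disconnect do
#     TCPSocket.open('127.0.0.1', 9100) do |socket|
#       socket.write(File.read('./Pictures/test.png'))
#       socket.close_write
#       socket.read
#     end
#   end
```

Any other exception (a refused connection, a missing file) still propagates, so genuine configuration problems are not silently turned into blank results.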

Thanks and regards,
Ben




On 13 March 2013 11:12, Dave Meikle <lo...@gmail.com> wrote:

> Hi Ben,
>
> On 12 Mar 2013, at 05:33, Ben Turner <be...@pobox.com> wrote:
>
> > * We then talk to it via ruby sockets (for non-rubyists, this streams a
> document from the file system into our local tika server over a simple
> socket) :
> >
> > #!/usr/bin/env ruby
> > require 'socket'
> > TCPSocket.open('127.0.0.1', 12345) do |socket|
> >    File.open('/tmp/test.png', 'r') do |chunk|
> >      socket.write(chunk)
> >    end
> >    socket.close_write
> >    puts socket.read
> > end
>
> There is no known fault around this, so I tried this locally, and with a wee
> tweak to the Ruby code to use socket.write(chunk.read), it works for me
> with all document types.  I also used -m on the server to make sure the PNG
> was being processed and it dumps back the metadata.
>
> Is there anything else in the way over the network (firewall, IDS, etc)?
>
> Cheers,
> Dave
>
>
>

Re: Broken socket pipe when writing a PNG to Tika (server mode)

Posted by Dave Meikle <lo...@gmail.com>.
Hi Ben,

On 12 Mar 2013, at 05:33, Ben Turner <be...@pobox.com> wrote:

> * We then talk to it via ruby sockets (for non-rubyists, this streams a document from the file system into our local tika server over a simple socket) :
> 
> #!/usr/bin/env ruby
> require 'socket'
> TCPSocket.open('127.0.0.1', 12345) do |socket|
>    File.open('/tmp/test.png', 'r') do |chunk|
>      socket.write(chunk)
>    end
>    socket.close_write
>    puts socket.read
> end

There is no known fault around this, so I tried this locally, and with a wee tweak to the Ruby code to use socket.write(chunk.read), it works for me with all document types.  I also used -m on the server to make sure the PNG was being processed and it dumps back the metadata.

Is there anything else in the way over the network (firewall, IDS, etc)?

Cheers,
Dave
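
The tweak above, socket.write(chunk.read), sends the file's bytes, whereas the original socket.write(chunk) passes the File object itself to the socket. It also reads the whole file into memory first. A minimal sketch of the same idea that sends the file in fixed-size pieces follows; the helper name stream_file, its parameters, and the 16 KB chunk size are illustrative assumptions, not anything defined by Tika or by this thread.

```ruby
require 'socket'

# Hypothetical helper: streams a file to a TCP service in fixed-size chunks,
# half-closes the socket to signal end of input, and returns the response.
def stream_file(path, host, port, chunk_size = 16 * 1024)
  TCPSocket.open(host, port) do |socket|
    # 'rb' opens the file in binary mode so the bytes are sent unmodified.
    File.open(path, 'rb') do |file|
      # File#read(n) returns nil at end of file, which ends the loop.
      while (chunk = file.read(chunk_size))
        socket.write(chunk)
      end
    end
    socket.close_write # tell the server that no more data is coming
    socket.read        # read the reply until the server closes its side
  end
end

# Possible usage against the server from the original post:
#   puts stream_file('/tmp/test.png', '127.0.0.1', 12345)
```

Reading in chunks keeps memory use flat for large documents, which may matter when the input comes from S3 rather than a small local file.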