Posted to dev@tika.apache.org by "tranquillo (JIRA)" <ji...@apache.org> on 2015/10/20 09:03:27 UTC
[jira] [Created] (TIKA-1776) tika stop converting at this pdf document

tranquillo created TIKA-1776:
--------------------------------

             Summary: tika stop converting at this pdf document
                 Key: TIKA-1776
                 URL: https://issues.apache.org/jira/browse/TIKA-1776
             Project: Tika
          Issue Type: Bug
          Components: batch
    Affects Versions: 1.10
         Environment: Intel Core I5 4GB Ram, Notebook
OS: debian8, x64, Gnome
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
            Reporter: tranquillo


Hi and thank you all for this great project,

I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them from PDF to XML. That works pretty well and takes at most 1-2 minutes, even for big files. But for over 15 hours now the process has been hanging with CPU load = 0% at one file:
http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
which is only 5 MB in size, but contains text, scans and CAD plans.

I run "get_xml()" from the following class (located in tika_app.rb):
-----------------------------
require 'rubygems'
require 'stringio'
require 'open4'

class TikaApp
    def initialize(document)
        filename = File.basename(document)
        t = Time.now
        puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
        @document = document
        java_cmd = 'java'
        java_args = '-server -Djava.awt.headless=true'
        tika_path = "tika-app.jar"
        @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
    end

    def get_xml
        run_tika('--xml')
    end

    def get_metadata
        run_tika('--metadata --json')
    end


    private

    def run_tika(option)
        final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
        pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
        stdout_result = stdout.read.strip
        stderr_result = stderr.read.strip
        unless strip_stderr(stderr_result).empty?
        end

        stdout_result
    ensure
        stdin.close
        stdout.close
        stderr.close
    end

    def strip_stderr(s)
        s.gsub(/^(info|warn) - .*$/i, '').strip
    end
end
----------

The Tika command built by this class looks like this:
java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
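
One thing that may be worth ruling out on the Ruby side: run_tika reads stdout to the end before it touches stderr. If the child process writes more to stderr than the pipe buffer holds, it blocks until someone reads stderr, while the parent is still waiting on stdout, and both sides sit idle at 0% CPU. Whether that is what happens with this PDF is not confirmed here. The following is only a minimal sketch, assuming Ruby's standard Open3 library is used instead of open4; the helper name run_with_timeout and the time limit are hypothetical and not part of tika_app.rb. It drains both pipes concurrently and kills the child if it runs too long:

```ruby
require 'open3'
require 'timeout'

# Hypothetical helper (not part of tika_app.rb).
# cmd is an array such as ['java', '-jar', 'tika-app.jar', '--xml', path],
# so no shell is involved and no quoting is needed.
# Returns [stdout, stderr, Process::Status] or raises Timeout::Error.
def run_with_timeout(cmd, seconds)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Read both pipes in their own threads so that a full stderr pipe
    # can never block the child while we are still waiting on stdout.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join(limit) returns nil if the child is still running.
    if wait_thr.join(seconds)
      [out_reader.value, err_reader.value, wait_thr.value]
    else
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited between the timeout and the kill.
      end
      wait_thr.join
      raise Timeout::Error, "command did not finish within #{seconds}s"
    end
  end
end
```

With such a helper, a single stuck document would raise Timeout::Error after the chosen limit instead of stalling the whole batch, and the stderr text collected up to that point could be attached to the report.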



