Posted to dev@tika.apache.org by "tranquillo (JIRA)" <ji...@apache.org> on 2015/10/20 09:03:27 UTC
[jira] [Created] (TIKA-1776) tika stop converting at this pdf document
tranquillo created TIKA-1776:
--------------------------------
Summary: tika stop converting at this pdf document
Key: TIKA-1776
URL: https://issues.apache.org/jira/browse/TIKA-1776
Project: Tika
Issue Type: Bug
Components: batch
Affects Versions: 1.10
Environment: Intel Core I5 4GB Ram, Notebook
OS: debian8, x64, Gnome
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
Reporter: tranquillo
Hi and thank you all for this great project,
I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them to XML. That works pretty well and takes at most 1-2 minutes, even for big files. But for more than 15 hours now the process has been hanging at 0% CPU load on one file:
http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
which is only 5 MB, but contains text, scans and CAD plans.
I run "get_xml()" from the following class (located in tika_app.rb):
-----------------------------
require 'rubygems'
require 'stringio'
require 'open4'

class TikaApp
  def initialize(document)
    filename = File.basename(document)
    t = Time.now
    puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
    @document = document
    java_cmd = 'java'
    java_args = '-server -Djava.awt.headless=true'
    tika_path = "tika-app.jar"
    @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
  end

  def get_xml
    run_tika('--xml')
  end

  def get_metadata
    run_tika('--metadata --json')
  end

  private

  def run_tika(option)
    final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
    pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
    stdout_result = stdout.read.strip
    stderr_result = stderr.read.strip
    unless strip_stderr(stderr_result).empty?
    end
    stdout_result
  ensure
    stdin.close
    stdout.close
    stderr.close
  end

  def strip_stderr(s)
    s.gsub(/^(info|warn) - .*$/i, '').strip
  end
end
----------
The Tika command that this function builds looks like this:
java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
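For anyone reproducing this from Ruby: a hang at 0% CPU is also what a blocked pipe looks like, because run_tika reads stdout to the end before it touches stderr, and a child that fills the stderr pipe in the meantime would stall. Whether that is what happens with this file is not confirmed. The following is only a minimal sketch, assuming Ruby's standard Open3 in place of open4; it reads both streams in parallel and kills the child after a deadline. The helper name run_with_timeout and the 300-second limit are hypothetical and not part of Tika or ratsinfo-scraper.

```ruby
require 'open3'

# Hypothetical helper (an assumption for illustration, not part of Tika or
# ratsinfo-scraper). Runs the command given as an argv array, drains stdout
# and stderr concurrently, and kills the child if it exceeds `timeout` seconds.
def run_with_timeout(argv, timeout)
  Open3.popen3(*argv) do |stdin, stdout, stderr, wait_thr|
    stdin.close

    # Each pipe gets its own reader thread, so a full stderr pipe cannot
    # block the child while stdout is still being read (and vice versa).
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    timed_out = false
    if wait_thr.join(timeout).nil?
      # The child is still running after the deadline: kill it and reap it.
      timed_out = true
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just after the deadline.
      end
      wait_thr.join
    end

    {
      stdout: out_reader.value,
      stderr: err_reader.value,
      status: wait_thr.value,
      timed_out: timed_out
    }
  end
end

# Possible use with the command from this report (the path is a placeholder):
#
#   argv = ['java', '-server', '-Djava.awt.headless=true',
#           '-jar', 'tika-app.jar', '--xml', '/path/to/00149624.pdf']
#   result = run_with_timeout(argv, 300)
#   warn "tika timed out" if result[:timed_out]
```

Passing the command as an array avoids the shell, so the file name needs no quoting and the kill signal reaches the java process directly.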