Posted to dev@tika.apache.org by "tranquillo (JIRA)" <ji...@apache.org> on 2015/10/20 19:07:27 UTC
[jira] [Comment Edited] (TIKA-1776) tika stop converting at this pdf document
[ https://issues.apache.org/jira/browse/TIKA-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965374#comment-14965374 ]
tranquillo edited comment on TIKA-1776 at 10/20/15 5:06 PM:
------------------------------------------------------------
Thank you, Tim, for your quick response.
You are right: Tika does not seem to be the problem, because running it on the console converts the file. It also prints a lot of errors when all output goes to the console, but I think that's OK.
This issue can be closed; I will try to call Tika another way.
Sorry for the effort.
was (Author: tranquillo):
Thank you, Tim, for your quick response.
You are right: Tika does not seem to be the problem, because running it on the console converts the file. It also prints a lot of errors when all output goes to the console, but I think that's OK.
This issue can be closed; I will try to call Tika another way.
Sorry for the circumstances.
> tika stop converting at this pdf document
> -----------------------------------------
>
> Key: TIKA-1776
> URL: https://issues.apache.org/jira/browse/TIKA-1776
> Project: Tika
> Issue Type: Bug
> Components: batch
> Affects Versions: 1.10
> Environment: Intel Core I5 4GB Ram, Notebook
> OS: debian8, x64, Gnome
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
> Reporter: tranquillo
>
> Hi, and thank you all for this great project.
> I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them from PDF to XML. That works pretty well and needs at most 1-2 minutes even for big files. But for over 15 hours now the process has been hanging with CPU load = 0% on one file:
> http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
> which is just 5 MB in size but contains text, scans, and CAD plans.
> I run "get_xml()" from the following class (located in tika_app.rb):
> -----------------------------
> require 'rubygems'
> require 'stringio'
> require 'open4'
> class TikaApp
>   def initialize(document)
>     filename = File.basename(document)
>     t = Time.now
>     puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
>     @document = document
>     java_cmd = 'java'
>     java_args = '-server -Djava.awt.headless=true'
>     tika_path = "tika-app.jar"
>     @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
>   end
>   def get_xml
>     run_tika('--xml')
>   end
>   def get_metadata
>     run_tika('--metadata --json')
>   end
>   private
>   def run_tika(option)
>     final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
>     pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
>     stdout_result = stdout.read.strip
>     stderr_result = stderr.read.strip
>     unless strip_stderr(stderr_result).empty?
>     end
>     stdout_result
>   ensure
>     stdin.close
>     stdout.close
>     stderr.close
>   end
>   def strip_stderr(s)
>     s.gsub(/^(info|warn) - .*$/i, '').strip
>   end
> end
> ----------
> The Tika command built by this function looks like this:
> java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
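
A note on the run_tika method quoted above: it reads stdout to the end before it touches stderr. If the child process writes more to stderr than the OS pipe buffer holds, the child blocks on that write while the parent is still waiting on stdout, and both sit idle. Whether that is what happened with this PDF is an assumption; nothing in this thread confirms it, though it would be consistent with a hang at 0% CPU and a console run that prints many errors. Below is a minimal sketch of one other way to call Tika, using Open3.capture3 from Ruby's standard library, which drains both streams concurrently. The helper name run_and_capture is hypothetical and not part of tika_app.rb.

```ruby
require 'open3'

# Hypothetical helper (not part of tika_app.rb): run a command and collect
# stdout and stderr without blocking on either pipe.
def run_and_capture(*argv)
  # Open3.capture3 reads stdout and stderr in parallel, so a child that
  # writes a lot to stderr cannot stall while we are waiting on stdout.
  # Passing the arguments as separate strings bypasses the shell, so no
  # quoting is needed and "~" is NOT expanded automatically.
  stdout, stderr, status = Open3.capture3(*argv)
  [stdout.strip, stderr.strip, status]
end

# Assumed usage, mirroring the command from the report:
#   out, err, status = run_and_capture(
#     'java', '-server', '-Djava.awt.headless=true',
#     '-jar', 'tika-app.jar', '--xml',
#     File.expand_path('~/data/00149624.pdf'))
```

The returned status is a Process::Status, so the caller can check status.success? before trusting the output. Because no shell is involved here, a path starting with "~" has to be expanded explicitly, for example with File.expand_path as in the commented usage.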
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)