Posted to dev@tika.apache.org by "vivek joshi (JIRA)" <ji...@apache.org> on 2014/02/03 10:58:09 UTC
[jira] [Created] (TIKA-1227) Apache Tika 1.4 Duplicate extract data
vivek joshi created TIKA-1227:
---------------------------------
Summary: Apache Tika 1.4 Duplicate extract data
Key: TIKA-1227
URL: https://issues.apache.org/jira/browse/TIKA-1227
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.4
Environment: Ubuntu 12.04, Python 2.7, Apache Tika 1.4
Reporter: vivek joshi
When extracting text with Apache Tika 1.4, the extracted text is duplicated in the output.
import os
import subprocess
APACHE_TIKA_PATH = os.path.abspath(os.path.join(PROJECT_ROOT, 'apache_tika/tika-app-1.4.jar'))
sout = subprocess.check_output("java -jar %s -t %s" % (APACHE_TIKA_PATH, document), shell=True)
The value of sout contains the duplicated text.
The issue occurs with both DOC and PDF files.
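To help narrow the problem down, here is a minimal sketch of the same call written with an argument list instead of shell=True, plus a small check that lists lines occurring more than once in the extracted text. The helper names (build_tika_command, extract_text, repeated_lines), the min_length threshold, and the UTF-8 decoding are assumptions made for illustration; they are not part of Tika or of the original setup.

```python
import subprocess
from collections import OrderedDict


def build_tika_command(jar_path, document):
    """Return the argument list for the 'java -jar <jar> -t <document>' call."""
    return ["java", "-jar", jar_path, "-t", document]


def extract_text(jar_path, document):
    """Run tika-app and return its standard output as text.

    Passing a list (no shell=True) keeps the shell from re-splitting
    paths that contain spaces or other special characters.
    Assumption: the output can be decoded as UTF-8; undecodable bytes
    are replaced rather than raising an error.
    """
    raw = subprocess.check_output(build_tika_command(jar_path, document))
    return raw.decode("utf-8", "replace")


def repeated_lines(text, min_length=20):
    """Return lines of at least min_length characters that occur more than once.

    Short lines are ignored because they repeat legitimately
    (blank lines, bullets, page numbers). Order of first appearance is kept.
    """
    counts = OrderedDict()
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= min_length:
            counts[line] = counts.get(line, 0) + 1
    return [line for line, n in counts.items() if n > 1]
```

Calling repeated_lines(extract_text(APACHE_TIKA_PATH, document)) would then show which lines are duplicated, which may help tell whether the whole document is emitted twice or only certain parts of it.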
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)