Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/22 14:34:18 UTC
[jira] [Comment Edited] (TIKA-1863) --text-main content missing in output file
[ https://issues.apache.org/jira/browse/TIKA-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156876#comment-15156876 ]
Tim Allison edited comment on TIKA-1863 at 2/22/16 1:33 PM:
------------------------------------------------------------
Ah, ok. The pdfbox [app|http://mirror.sdunix.com/apache/pdfbox/1.8.11/pdfbox-app-1.8.11.jar] is here.
I'll take a look at the file you attached. Any chance you could share a doc file?
was (Author: tallison@mitre.org):
Ah, ok. The pdfbox [http://mirror.sdunix.com/apache/pdfbox/1.8.11/pdfbox-app-1.8.11.jar|app] is here.
I'll take a look at the file you attached. Any chance you could share a doc file?
> --text-main content missing in output file
> ------------------------------------------
>
> Key: TIKA-1863
> URL: https://issues.apache.org/jira/browse/TIKA-1863
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.12
> Environment: Windows 10 64
> Reporter: Marcin Gil
>
> When converting both PDF and DOC files to text with the following command:
> java -jar tika.jar --text-main --encoding=UTF-8 input.pdf > output.txt
> The output file is missing a random number of the LAST and FIRST lines of the input file.
> Example file:
> https://dl.dropboxusercontent.com/u/11435743/tika-issue-1.pdf
> Text starting from "15 Akt oskarżenia" is missing (at the bottom of the file).