Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/22 15:00:19 UTC

[jira] [Resolved] (TIKA-1863) --text-main content missing in output file

     [ https://issues.apache.org/jira/browse/TIKA-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1863.
-------------------------------
    Resolution: Won't Fix

{{--text-main}} uses the {{BoilerpipeContentHandler}}, which tries to determine the "main content" of a document. It was designed mainly to remove advertising, links, and other noise from HTML documents.

I confirmed that Boilerpipe categorizes some of the first paragraphs and the last paragraph (note 15 through "624 KPK") as "not content."  
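
To illustrate how a "main content" filter can drop short paragraphs at the start and end of a document, here is a toy sketch. It is NOT Boilerpipe's actual algorithm: the class name, the word-count threshold, and the sample text are all invented for illustration. It only shows the general idea that a heuristic keyed on shallow text features can classify short blocks as "not content."

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: this is NOT how Boilerpipe classifies text.
final class ToyMainContentFilter {

    // Hypothetical threshold: blocks with fewer words count as "not content".
    static final int MIN_WORDS = 5;

    // Counts whitespace-separated words; blank input counts as zero.
    static int wordCount(String block) {
        String trimmed = block.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Keeps only the blocks that meet the word-count threshold.
    static List<String> keepMainContent(List<String> blocks) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (wordCount(block) >= MIN_WORDS) {
                kept.add(block);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample blocks: a short header, a long body, a short footnote.
        List<String> blocks = Arrays.asList(
            "Short header",
            "This is a longer paragraph with plenty of words in it.",
            "15 Short footnote");

        // Only the long paragraph survives; the short first and last
        // blocks are silently dropped from the output.
        for (String block : keepMainContent(blocks)) {
            System.out.println(block);
        }
    }
}
```

With a filter of this kind, legitimate short lines in a PDF or DOC file are indistinguishable from page noise, which is consistent with the missing first and last lines reported here.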

At the general Tika level, we don't control what Boilerpipe does, and I'm not aware of a method to alter its algorithm for determining content vs. not content.

In short, I don't think we can fix this.  Instead, we can recommend using the other extraction options mentioned in an earlier comment.


> --text-main content missing in output file
> ------------------------------------------
>
>                 Key: TIKA-1863
>                 URL: https://issues.apache.org/jira/browse/TIKA-1863
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.12
>         Environment: Windows 10 64
>            Reporter: Marcin Gil
>
> When converting both PDF and DOC files to text with the following command:
> java -jar tika.jar --text-main --encoding=UTF-8 input.pdf > output.txt
> the output file is missing a random number of the first and last lines of the input file.
> Example file:
> https://dl.dropboxusercontent.com/u/11435743/tika-issue-1.pdf
> Text starting from "15 Akt oskarżenia" is missing (at the bottom of the file).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)