Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/06/19 13:52:00 UTC

[jira] [Commented] (TIKA-3118) PDFParser: totalCharsPerPage vs. actual chars per page after parsing

    [ https://issues.apache.org/jira/browse/TIKA-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140566#comment-17140566 ] 

Tim Allison commented on TIKA-3118:
-----------------------------------

I'm not against this addition, but I'm not sure I understand the benefit.

The "totalCharsPerPage" value was originally intended to help decide whether or not to run OCR.

How would you use parsedCharsPerPage in a ContentHandler?

> PDFParser: totalCharsPerPage vs. actual chars per page after parsing
> --------------------------------------------------------------------
>
>                 Key: TIKA-3118
>                 URL: https://issues.apache.org/jira/browse/TIKA-3118
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.24
>            Reporter: Jeroen Steggink
>            Priority: Minor
>
> While parsing a PDF document, I'd like to know the number of characters actually produced for each page, not the number present in the document itself. The totalCharsPerPage value (as defined in the class AbstractPDF2XHTML) is interesting for knowing how many characters a page contains, but for actually using the extracted text it is more useful to know the real output count. Currently the only thing missing for a real count is the added word spacing and line separators.
> I propose creating another attribute (parsedCharsPerPage or extracted) and incrementing it in the following PDF2XHTML methods: writeCharacters, writeWordSeparator and writeLineSeparator.
> One use case is splitting the content written to a ContentHandler, because you then have a ground truth for the number of characters written for each page.
> What do you think?
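
To make the proposal concrete, here is a minimal, self-contained sketch of the counting idea. It is not Tika or PDFBox code: the class PageCharCounter, its startPage/endPage hooks, the String-based method signatures and the splitByPage helper are all hypothetical and only mirror the three method names mentioned above. The assumption is that every piece of text emitted for a page, including word and line separators, passes through one of these three methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed counting; not part of Tika. */
class PageCharCounter {
    private final StringBuilder out = new StringBuilder();
    private final List<Integer> parsedCharsPerPage = new ArrayList<>();
    private int currentPageChars = 0;

    // Assumed page hooks: reset the counter, then record it when the page ends.
    void startPage() { currentPageChars = 0; }
    void endPage() { parsedCharsPerPage.add(currentPageChars); }

    // The three write methods named in the proposal, simplified to Strings.
    void writeCharacters(String text) { emit(text); }
    void writeWordSeparator() { emit(" "); }
    void writeLineSeparator() { emit("\n"); }

    // Every emitted character is counted, separators included.
    private void emit(String s) {
        out.append(s);
        currentPageChars += s.length();
    }

    String text() { return out.toString(); }
    List<Integer> parsedCharsPerPage() { return parsedCharsPerPage; }

    /** Splits the concatenated output back into pages using the counts. */
    static List<String> splitByPage(String text, List<Integer> counts) {
        List<String> pages = new ArrayList<>();
        int offset = 0;
        for (int n : counts) {
            pages.add(text.substring(offset, offset + n));
            offset += n;
        }
        return pages;
    }
}
```

With counts that include separators, a consumer holding only the concatenated text can recover page boundaries by offset. A count that omits separators would drift by one character for each separator written, which is the gap the proposal describes.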



--
This message was sent by Atlassian Jira
(v8.3.4#803005)