Posted to dev@tika.apache.org by "Jeroen Steggink (Jira)" <ji...@apache.org> on 2020/06/19 07:11:00 UTC

[jira] [Created] (TIKA-3118) PDFParser: totalCharsPerPage vs. actual chars per page after parsing

Jeroen Steggink created TIKA-3118:
-------------------------------------

             Summary: PDFParser: totalCharsPerPage vs. actual chars per page after parsing
                 Key: TIKA-3118
                 URL: https://issues.apache.org/jira/browse/TIKA-3118
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.24
            Reporter: Jeroen Steggink


While parsing a PDF document, I'd like to know the number of characters actually produced per page, not just the number present in the document itself. The totalCharsPerPage attribute (defined in the class AbstractPDF2XHTML) shows how many characters a page contains, but for working with the extracted text it would be more useful to know how many characters were actually written. Currently the only thing missing from a real count is the added word separators and line separators.

I propose adding another attribute (parsedCharsPerPage or extracted) and incrementing it in the following methods in PDF2XHTML:
writeCharacters, writeWordSeparator and writeLineSeparator.
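
To make the idea concrete, here is a rough, standalone sketch of the counting. It is not the actual PDF2XHTML code: the class PageCharCounter, the startPage/endPage hooks and the plain String arguments are assumptions for illustration only (the real methods work with PDFBox types and write to an XHTML handler). The point is only that every write path bumps the same per-page counter.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for the proposed counting, not Tika code.
 * Each write method appends to the output and increments one counter.
 */
class PageCharCounter {
    private final StringBuilder out = new StringBuilder();
    private final List<Integer> parsedCharsPerPage = new ArrayList<>();
    private int parsedChars = 0;

    /** Reset the counter when a new page begins. */
    void startPage() {
        parsedChars = 0;
    }

    /** Record the count for the page that just ended. */
    void endPage() {
        parsedCharsPerPage.add(parsedChars);
    }

    void writeCharacters(String text) {
        write(text);
    }

    void writeWordSeparator() {
        write(" "); // assumed separator; the real one is configurable
    }

    void writeLineSeparator() {
        write("\n"); // assumed separator
    }

    /** Single choke point so no written character escapes the count. */
    private void write(String s) {
        out.append(s);
        parsedChars += s.length(); // counts UTF-16 code units
    }

    List<Integer> getParsedCharsPerPage() {
        return parsedCharsPerPage;
    }

    String getText() {
        return out.toString();
    }
}
```

With this, a page consisting of "Hello", a word separator, "world" and a line separator would be counted as 12 characters rather than 10.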

One use case is splitting the content written to a ContentHandler into pages, since you would then have a reliable count of the characters written for each page.
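
As a sketch of that use case, assuming the per-page counts were exposed as a list of integers and the handler collected exactly the characters that were counted (both are assumptions, and PageSplitter is a hypothetical helper, not part of Tika), the split could look roughly like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts collected text into pages by character count. */
class PageSplitter {
    static List<String> splitByPage(String text, List<Integer> charsPerPage) {
        List<String> pages = new ArrayList<>();
        int offset = 0;
        for (int count : charsPerPage) {
            // Clamp so a count larger than the remaining text cannot overrun.
            int end = Math.min(offset + count, text.length());
            pages.add(text.substring(offset, end));
            offset = end;
        }
        return pages;
    }
}
```

This only works if the counts include the word and line separators; with the current totalCharsPerPage the offsets would drift after the first separator.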

What do you think?


