Posted to dev@pdfbox.apache.org by "Praveer (JIRA)" <ji...@apache.org> on 2015/12/30 14:24:49 UTC

[jira] [Created] (PDFBOX-3177) Change some modifiers from private to protected in PDFTextStripper Class

Praveer created PDFBOX-3177:
-------------------------------

             Summary: Change some modifiers from private to protected in PDFTextStripper Class
                 Key: PDFBOX-3177
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3177
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 1.8.10
         Environment: All
            Reporter: Praveer
             Fix For: 1.8.10


Hi,

I am parsing a very complicated PDF whose text is not extracted in the proper sequence, so I had to enable sorting by calling setSortByPosition(true).

Now I want to access each TextPosition element and do some processing on it. Normally I would override the processTextPosition method and do my work there. However, with sort-by-position enabled, the code that sorts the text runs after processTextPosition has been called, so overriding processTextPosition does not give me the text in position order.

I did some research and found that overriding the writeLine method of PDFTextStripper would be useful for me,
because it processes the TextPosition elements after they have been sorted by position.

So I did a proof of concept on my own machine by making the following changes to the PDFTextStripper class:
1 - 'private' void writeLine() changed to 'protected'
2 - 'private' static final class WordWithTextPositions changed to 'protected'
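
To illustrate why a protected hook that runs after sorting helps, here is a minimal sketch. It does not use PDFBox at all: MiniStripper, Glyph, processGlyph and RecordingStripper are hypothetical stand-ins for PDFTextStripper, TextPosition, processTextPosition and a user subclass, and the sort shown (top to bottom, then left to right) is only an assumption made for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedHookSketch {

    /** Hypothetical stand-in for TextPosition: a piece of text with page coordinates. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical stand-in for the stripper. This is NOT the PDFBox class. */
    static class MiniStripper {
        private final List<Glyph> collected = new ArrayList<>();
        private boolean sortByPosition;

        void setSortByPosition(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }

        /** Called once per glyph in content-stream order, before any sorting. */
        protected void processGlyph(Glyph g) {
            collected.add(g);
        }

        /** Called once per line after the optional sort; protected so subclasses can hook in. */
        protected void writeLine(List<Glyph> line, StringBuilder out) {
            for (Glyph g : line) {
                out.append(g.text);
            }
            out.append('\n');
        }

        String extract(List<Glyph> stream) {
            collected.clear();
            for (Glyph g : stream) {
                processGlyph(g);
            }
            List<Glyph> ordered = new ArrayList<>(collected);
            if (sortByPosition) {
                // Assumed ordering for this demo: smaller y first, then smaller x.
                ordered.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                        .thenComparingDouble(g -> g.x));
            }
            StringBuilder out = new StringBuilder();
            List<Glyph> line = new ArrayList<>();
            for (Glyph g : ordered) {
                // Start a new line whenever the y coordinate changes.
                if (!line.isEmpty() && line.get(0).y != g.y) {
                    writeLine(line, out);
                    line = new ArrayList<>();
                }
                line.add(g);
            }
            if (!line.isEmpty()) {
                writeLine(line, out);
            }
            return out.toString();
        }
    }

    /** Subclass that records the order in which each hook sees the glyphs. */
    static class RecordingStripper extends MiniStripper {
        final List<String> seenBeforeSort = new ArrayList<>();
        final List<String> seenAfterSort = new ArrayList<>();

        @Override
        protected void processGlyph(Glyph g) {
            seenBeforeSort.add(g.text);
            super.processGlyph(g);
        }

        @Override
        protected void writeLine(List<Glyph> line, StringBuilder out) {
            for (Glyph g : line) {
                seenAfterSort.add(g.text);
            }
            super.writeLine(line, out);
        }
    }

    /** Glyphs in a deliberately scrambled content-stream order. */
    static List<Glyph> sampleStream() {
        return Arrays.asList(
                new Glyph("C", 10f, 20f),
                new Glyph("B", 20f, 10f),
                new Glyph("A", 10f, 10f),
                new Glyph("D", 20f, 20f));
    }

    public static void main(String[] args) {
        RecordingStripper stripper = new RecordingStripper();
        stripper.setSortByPosition(true);
        System.out.print(stripper.extract(sampleStream()));
        System.out.println("per-glyph hook saw: " + stripper.seenBeforeSort);
        System.out.println("per-line hook saw:  " + stripper.seenAfterSort);
    }
}
```

In this sketch the per-glyph hook sees the glyphs in stream order (C, B, A, D), while the per-line hook sees them in sorted order (A, B, C, D). That is the behaviour I am after with a protected writeLine.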

After this, everything works as I expect. I think these changes will also help other people who use this library.

I can contribute this code myself if you would like; please let me know.

Thanks and regards,
Praveer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
