Posted to dev@pdfbox.apache.org by "Praveer (JIRA)" <ji...@apache.org> on 2015/12/30 14:02:49 UTC

[jira] [Created] (PDFBOX-3176) Add a removeRegion method in PDFTextStripperByArea class

Praveer created PDFBOX-3176:
-------------------------------

             Summary: Add a removeRegion method in PDFTextStripperByArea class
                 Key: PDFBOX-3176
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3176
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 1.8.10
         Environment: All
            Reporter: Praveer
             Fix For: 1.8.10


Hi,

I am parsing a very complicated PDF, for which I had to enable sorting by position (setSortByPosition(true)); otherwise the parser cannot extract the text in sequential order.

So I decided to use the PDFTextStripperByArea class and define rectangles from which to extract text. The problem is that if I define many rectangles on a single page, the extracted text again has no logical sequence. To get around this, it would be very helpful to have a method that removes regions: we could add a region, extract its text, remove that region, then add a new region, and so on.

I have already done a proof of concept on my local machine: I added the following method and tested it, and it works fine.

public void removeRegion(String regionName) {
    this.regions.remove(regionName);
    this.regionArea.remove(regionName);
}
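
The add, extract, remove cycle described above can be sketched in isolation. This is only an illustration, not PDFBox code: RegionRegistry is a hypothetical stand-in class, and it assumes that regions is a list of region names and regionArea is a map from name to rectangle, which the snippet above suggests but does not confirm. The actual text extraction step is left as a comment.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the region bookkeeping; not the PDFBox class itself.
class RegionRegistry {
    // Assumed field types: names in insertion order, plus a name-to-rectangle map.
    private final List<String> regions = new ArrayList<>();
    private final Map<String, Rectangle2D> regionArea = new HashMap<>();

    void addRegion(String regionName, Rectangle2D rect) {
        regions.add(regionName);
        regionArea.put(regionName, rect);
    }

    // Mirrors the proposed method: drop the name from both collections.
    // Removing a name that was never added is a harmless no-op.
    void removeRegion(String regionName) {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }

    // Returns a copy so callers cannot modify the internal list.
    List<String> getRegions() {
        return new ArrayList<>(regions);
    }

    public static void main(String[] args) {
        RegionRegistry registry = new RegionRegistry();

        registry.addRegion("header", new Rectangle2D.Double(0, 0, 600, 80));
        // ... extract the text for "header" here ...
        registry.removeRegion("header");

        registry.addRegion("body", new Rectangle2D.Double(0, 80, 600, 600));
        // ... extract the text for "body" here ...

        System.out.println(registry.getRegions()); // prints [body]
    }
}
```

With only one region registered at a time, each extraction pass sees a single rectangle, so the order in which text comes back is controlled by the caller rather than by the order of the regions.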

I can contribute this code myself if you think it is suitable; please let me know. Thanks and regards,
Praveer



