Posted to dev@pdfbox.apache.org by "Praveer (JIRA)" <ji...@apache.org> on 2015/12/30 14:02:49 UTC
[jira] [Created] (PDFBOX-3176) Add a removeRegion method in PDFTextStripperByArea class
Praveer created PDFBOX-3176:
-------------------------------
Summary: Add a removeRegion method in PDFTextStripperByArea class
Key: PDFBOX-3176
URL: https://issues.apache.org/jira/browse/PDFBOX-3176
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 1.8.10
Environment: All
Reporter: Praveer
Fix For: 1.8.10
Hi,
I am parsing a very complicated PDF for which I had to enable position sorting (setSortByPosition(true)); otherwise the parser cannot extract the text sequentially.
So I decided to use the PDFTextStripperByArea class and define rectangles to extract text from. The problem is that if I define many rectangles on a single page, the extracted text again has no logical sequence. To get around this, it would be great to have a method to remove regions: we could add a region, extract its text, remove that region, then add a new region, and so on.
I have already done a proof of concept on my local machine and it works fine. I added and tested this method:
public void removeRegion(String regionName) {
    this.regions.remove(regionName);
    this.regionArea.remove(regionName);
}
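To illustrate the bookkeeping this method relies on, here is a minimal sketch that does not use PDFBox itself. RegionRegistry and hasArea are hypothetical names for this illustration only; the sketch assumes that regions is a list of region names and regionArea is a map from name to rectangle, as the two remove calls above suggest.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for the region bookkeeping; not the PDFBox class.
 * Assumes a name list plus a name-to-rectangle map.
 */
class RegionRegistry {
    private final List<String> regions = new ArrayList<>();
    private final Map<String, Rectangle2D> regionArea = new HashMap<>();

    /** Registers a named rectangle. */
    public void addRegion(String regionName, Rectangle2D rect) {
        regions.add(regionName);
        regionArea.put(regionName, rect);
    }

    /** Removes the name and its rectangle; a no-op for unknown names. */
    public void removeRegion(String regionName) {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }

    /** Returns a copy of the currently registered region names. */
    public List<String> getRegions() {
        return new ArrayList<>(regions);
    }

    /** Hypothetical helper: true if a rectangle is stored for the name. */
    public boolean hasArea(String regionName) {
        return regionArea.containsKey(regionName);
    }
}
```

With this in place, the intended loop is: add one region, extract its text, remove it, then add the next, so that only one region is registered at any time and the output order is fully under the caller's control.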
I can contribute this code myself if you would like; let me know. Thanks and regards,
Praveer