Posted to dev@pdfbox.apache.org by "Ilija Pavlic (Created) (JIRA)" <ji...@apache.org> on 2012/01/04 23:39:39 UTC

[jira] [Created] (PDFBOX-1201) PDFTextStripperByArea y coordinate shifted "up"

PDFTextStripperByArea y coordinate shifted "up"
-----------------------------------------------

                 Key: PDFBOX-1201
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1201
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
            Reporter: Ilija Pavlic
            Priority: Minor


The text stripper region seems to be shifted up from the given coordinates, causing lines at the bottom of the region to be missed and lines above the defined region to be included.

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height);
contentStream.close();

stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...);
...

When viewing the saved output document, the cyan rectangle overlays the desired region exactly. The stripper, however, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it.
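
One possible explanation, offered as an assumption rather than a confirmed diagnosis: PDPageContentStream.fillRect takes coordinates in PDF user space, whose origin is the bottom-left corner of the page, while PDFTextStripperByArea appears to compare its regions against text positions measured from the top-left corner. If that is the case, the same (x, y, width, height) describes two different areas on the page. The sketch below uses only java.awt.geom; the class RegionFlip, the helper toTopLeftOrigin, and the 792-point page height are hypothetical names and values chosen for illustration.

```java
import java.awt.geom.Rectangle2D;

public class RegionFlip {
    // Convert a rectangle expressed with a bottom-left origin (y grows upward)
    // into the same area expressed with a top-left origin (y grows downward).
    public static Rectangle2D.Float toTopLeftOrigin(Rectangle2D.Float r, float pageHeight) {
        float topY = pageHeight - r.y - r.height;
        return new Rectangle2D.Float(r.x, topY, r.width, r.height);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumption: a US Letter page, in points

        // Rectangle as it would be passed to fillRect (bottom-left origin).
        Rectangle2D.Float drawn = new Rectangle2D.Float(50f, 600f, 200f, 100f);

        // Same area, expressed for a consumer that measures y from the top.
        Rectangle2D.Float forStripper = toTopLeftOrigin(drawn, pageHeight);
        System.out.println(forStripper);
    }
}
```

With a real page, the height would come from the page's media box rather than a constant, and the converted rectangle would be the one handed to addRegion.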

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (PDFBOX-1201) PDFTextStripperByArea doesn't capture text that flows inside the capture region

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1201:
---------------------------------

    Comment: was deleted

(was: It seems like the missed text is part of the larger text box that starts and ends outside the capture region but the text itself is located inside the capture region. )
    
> PDFTextStripperByArea doesn't capture text that flows inside the capture region
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Ilija Pavlic
>
> The text stripper region doesn't capture text starting and finishing outside the capture region but flowing through the capture region.


        

[jira] [Updated] (PDFBOX-1201) PDFTextStripperByArea y coordinate shifted "up"

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1201:
---------------------------------


Unfortunately, I cannot share the sample PDF.
                


        

[jira] [Updated] (PDFBOX-1201) PDFTextStripperByArea doesn't capture text that flows inside the capture region

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1201:
---------------------------------

    Comment: was deleted

(was: Manually shifting the region considerably lower still misses the lines even though they are generously covered by the cyan rectangle. Defining the region to start (x, 0f, width, height) captures the wanted lines.)
    


        

[jira] [Updated] (PDFBOX-1201) PDFTextStripperByArea doesn't capture text that flows inside the capture region

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1201:
---------------------------------

    Comment: was deleted

(was: Unfortunately, I cannot share the sample pdf.)
    


        

[jira] [Commented] (PDFBOX-1201) PDFTextStripperByArea y coordinate shifted "up"

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179954#comment-13179954 ] 

Ilija Pavlic commented on PDFBOX-1201:
--------------------------------------

Manually shifting the region considerably lower still misses the lines even though they are generously covered by the cyan rectangle. Defining the region to start (x, 0f, width, height) captures the region.
                


        

[jira] [Commented] (PDFBOX-1201) PDFTextStripperByArea y coordinate shifted "up"

Posted by "Ilija Pavlic (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179965#comment-13179965 ] 

Ilija Pavlic commented on PDFBOX-1201:
--------------------------------------

It seems that the missed text is part of a larger text box that starts and ends outside the capture region, even though the text itself lies inside the capture region.
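
The hypothesis above can be pictured with plain geometry. The sketch below does not reproduce PDFTextStripperByArea's internals; it only shows, using java.awt.geom, how a test on a text box's bounds can disagree with a test on the position of a single line inside that box. The class BoxVersusLine and all coordinates are hypothetical.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

public class BoxVersusLine {
    // True when the whole box lies inside the region.
    public static boolean boxInside(Rectangle2D region, Rectangle2D box) {
        return region.contains(box);
    }

    // True when the box and the region overlap at all.
    public static boolean boxOverlaps(Rectangle2D region, Rectangle2D box) {
        return region.intersects(box);
    }

    // True when a single text position lies inside the region.
    public static boolean positionInside(Rectangle2D region, Point2D position) {
        return region.contains(position);
    }

    public static void main(String[] args) {
        // Capture region (hypothetical values).
        Rectangle2D region = new Rectangle2D.Float(50f, 300f, 400f, 100f);

        // A tall text box that starts above the region and ends below it.
        Rectangle2D textBox = new Rectangle2D.Float(50f, 100f, 400f, 500f);

        // Position of one line of that box, located inside the region.
        Point2D linePosition = new Point2D.Float(60f, 350f);

        System.out.println("box inside region:      " + boxInside(region, textBox));
        System.out.println("box overlaps region:    " + boxOverlaps(region, textBox));
        System.out.println("position inside region: " + positionInside(region, linePosition));
    }
}
```

Here the whole-box test fails while the overlap and position tests succeed, which is the kind of mismatch the comment describes: a decision made per text box would drop lines that a per-position decision would keep.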
                


        

[jira] [Issue Comment Edited] (PDFBOX-1201) PDFTextStripperByArea y coordinate shifted "up"

Posted by "Ilija Pavlic (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179954#comment-13179954 ] 

Ilija Pavlic edited comment on PDFBOX-1201 at 1/4/12 10:54 PM:
---------------------------------------------------------------

Manually shifting the region considerably lower still misses the lines, even though the cyan rectangle generously covers them. Defining the region as (x, 0f, width, height) captures the wanted lines.
                
      was (Author: ipavlic):
    Manually shifting the region considerably lower still misses the lines even though they are generously covered by the cyan rectangle. Defining the region to start (x, 0f, width, height) captures the region.
                  


        

[jira] [Updated] (PDFBOX-1201) PDFTextStripperByArea doesn't capture text that flows inside the capture region

Posted by "Ilija Pavlic (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilija Pavlic updated PDFBOX-1201:
---------------------------------

       Priority: Major  (was: Minor)
    Description: The text stripper region doesn't capture text starting and finishing outside the capture region but flowing through the capture region.  (was: The text stripper region seems to be shifted up from the given coordinates, causing lines at the bottom of the region to be missed and lines above the defined region to be included.

...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();

Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);

// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height);
contentStream.close();

stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...);
...

When viewing the saved output document, the cyan rectangle overlays the desired region exactly. The stripper, however, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it.)
        Summary: PDFTextStripperByArea doesn't capture text that flows inside the capture region  (was: PDFTextStripperByArea y coordinate shifted "up")
    
