Posted to dev@pdfbox.apache.org by "Navnath Kumbhar (JIRA)" <ji...@apache.org> on 2017/10/25 07:25:00 UTC
[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

    [ https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218205#comment-16218205 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-----------------------------------------

Hello Tilman,
I also checked the values with your example class *DrawPrintTextLocations*. The text coordinate values are the same.
*However, I am more concerned with the cell in which that text is located.*

To process the vertical and horizontal lines of the cell, as I understand it, I need to handle the path operators such as *re*, *m*, *l*, etc.
I have overridden the method *processOperator* in my own project to handle these operators.

Here is the method *processStreamOperators()* from the PDFBox class *PDFStreamEngine*.

{code:java}
private void processStreamOperators(PDContentStream contentStream) throws IOException
{
    List<COSBase> arguments = new ArrayList<COSBase>();
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    while (token != null)
    {
        if (token instanceof COSObject)
        {
            // indirect object: collect the object it refers to as an operand
            arguments.add(((COSObject) token).getObject());
        }
        else if (token instanceof Operator)
        {
            // operator reached: dispatch it with the operands collected so far
            processOperator((Operator) token, arguments);
            arguments = new ArrayList<COSBase>();
        }
        else
        {
            // any other token is an operand
            arguments.add((COSBase) token);
        }
        token = parser.parseNextToken();
    }
}
{code}

As the code above shows, the coordinate values that I get for the cell's vertical and horizontal paths arrive in the variable *arguments*.
These arguments are then processed in my overridden *processOperator()* method.
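
To make the collect-then-dispatch pattern concrete, here is a minimal sketch that does not depend on PDFBox. The classes *Op* and *Call* and the method *dispatch* are hypothetical stand-ins that I made up for illustration; they are not PDFBox types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OperandLoopSketch {
    /** Hypothetical stand-in for a parsed operator token such as "re". */
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
    }

    /** One dispatched operator together with the operands that preceded it. */
    static final class Call {
        final String name;
        final List<Object> operands;
        Call(String name, List<Object> operands) {
            this.name = name;
            this.operands = operands;
        }
    }

    /** Collects operands until an operator token appears, then records the call. */
    static List<Call> dispatch(List<Object> tokens) {
        List<Call> calls = new ArrayList<>();
        List<Object> operands = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Op) {
                calls.add(new Call(((Op) token).name, operands));
                operands = new ArrayList<>(); // start fresh for the next operator
            } else {
                operands.add(token);
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(100, 672, 120, 32, new Op("re"));
        for (Call call : dispatch(tokens)) {
            System.out.println(call.name + " " + call.operands); // prints "re [100, 672, 120, 32]"
        }
    }
}
```

The point is only that every operator sees exactly the operands that were read since the previous operator.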
For example, here is my operator handling (I am showing only the *re* operator here):
{code:java}
		String operation = operator.getName();
		
		// operands of "re": x, y, width, height (lower-left corner plus size)
		if ("re".equals(operation)) {
			if (configuration.needsExtractTables()) {
				// lower-left corner (x, y)
				Point2D point1 = createPoint(page,
						getTransformation(this),
						PdfHelper.toDouble(arguments.get(0)),
						PdfHelper.toDouble(arguments.get(1)));
				// opposite corner (x + width, y + height)
				Point2D point2 = createPoint(page,
						getTransformation(this),
						PdfHelper.toDouble(arguments.get(0)) + PdfHelper.toDouble(arguments.get(2)),
						PdfHelper.toDouble(arguments.get(1)) + PdfHelper.toDouble(arguments.get(3)));
				
				// walk the four corners and close the subpath
				graphicHandler.start(page, point1);
				graphicHandler.add(new Point2D.Double(point2.getX(), point1.getY()));
				graphicHandler.add(point2);
				graphicHandler.add(new Point2D.Double(point1.getX(), point2.getY()));
				graphicHandler.close();
			}
		}
{code}

So, by definition, the *re* operator appends a rectangle to the current path as a complete subpath, with lower-left corner (x, y) and dimensions width and height in user space.
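
As a small illustration of that definition: the PDF reference describes *x y width height re* as equivalent to *x y m*, *(x+width) y l*, *(x+width) (y+height) l*, *x (y+height) l*, *h*. The sketch below only lists the four corners in that order; the class and method names are made up for this example.

```java
import java.util.Arrays;

final class RectangleSubpath {
    /** Returns the corners visited by `x y w h re`, in user space, in path order. */
    static double[][] corners(double x, double y, double w, double h) {
        return new double[][] {
            {x, y},         // m: start at the lower-left corner
            {x + w, y},     // l: lower-right
            {x + w, y + h}, // l: upper-right
            {x, y + h}      // l: upper-left, after which h closes the subpath
        };
    }

    public static void main(String[] args) {
        // prints [[100.0, 672.0], [220.0, 672.0], [220.0, 704.0], [100.0, 704.0]]
        System.out.println(Arrays.deepToString(corners(100, 672, 120, 32)));
    }
}
```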

The values in the variable *arguments* that I received from PDFBox are: [COSInt{100}, COSInt{672}, COSInt{120}, COSInt{32}]. These values are simply the operands of the *re* operator.

According to the PDF reference, the syntax of the *re* operator is:
*x y width height re* [four operands precede the operator re]
672 is the Y value as PDFBox reports it, measured from the bottom of the page.
When I subtract it from the page height, I get the Y value measured from the top of the page, which is 120. [The page height is 792.]
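
The same arithmetic for the whole rectangle, as a minimal sketch. It assumes that no transformation matrix is in effect, that the page origin is at (0, 0) and that the height operand is positive; the class and method names are made up for this example.

```java
import java.awt.geom.Rectangle2D;

final class ReOperandFlip {
    /**
     * Converts the operands of `x y width height re` (origin at the bottom-left
     * of the page) into a rectangle whose Y axis starts at the top of the page.
     */
    static Rectangle2D.Double toTopLeftOrigin(double x, double y, double width,
                                              double height, double pageHeight) {
        // the upper edge in PDF space (y + height) becomes the smaller Y after flipping
        double top = pageHeight - (y + height);
        return new Rectangle2D.Double(x, top, width, height);
    }

    public static void main(String[] args) {
        Rectangle2D.Double cell = toTopLeftOrigin(100, 672, 120, 32, 792);
        System.out.println(cell.getMinY() + " .. " + cell.getMaxY()); // prints "88.0 .. 120.0"
    }
}
```

With the operands above this gives Y values 88 and 120 from the top of the page, the same values as the horizontal lines listed in the issue description.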

I hope this helps.

Thank you in advance!
Regards,
Navnath Kumbhar.


> x,y co-ordinates of the text inside the cell are not getting correctly.
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-3970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Operating system: Windows 7 (64 bit).
>            Reporter: Navnath Kumbhar
>         Attachments: paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project that parses a whole PDF document and stores the extracted text in a .txt file that can be read by another product.
> My issue concerns extracting the text inside the cell of a table:
> *x,y co-ordinates of the text inside the cell are not getting correctly.*
> The Y value of the last text line in the cell comes out larger than the cell's max-Y value.
> I have attached a test file that shows this bug.
> As you can see in the test document, there is one cell with text in it and a text paragraph next to that cell.
> The x,y coordinates that I get from PDFBox for all the paths (two vertical and two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1 : [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting the value given by PDFBox from the page height, since for paths the Y values are measured from the bottom up.)
> The bounding box of the last line in that cell is [102,114,59,7], so the max-Y of that line is 121 (min-Y + height).
>  
> So, comparing the max-Y of the cell (120) with the max-Y of the last line in it (121), the line clearly extends outside the cell.
> What could be the reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar


