Posted to users@pdfbox.apache.org by John Walker <jo...@newconceptsdev.com> on 2015/08/15 02:06:28 UTC

Problems Using PDFBox To Manually Track TextPosition

Hello,

I'm using PDFBox to parse the content stream for a page in a PDF. Based on
the list of operations, there are two lines of text that I expect to be in
very different places on the page vertically. However, when the page is
displayed in Sumatra or Acrobat, this text is vertically aligned.

The method I'm using to predict text position has been accurate in the past.
I'm not sure if the method is faulty, or if I'm misunderstanding the
operation list I'm getting from PDFBox.

Here is the list of operations, with annotations explaining how I think they
should impact the vertical position of the text cursor:

http://pastebin.com/GUWWX3Kv

As you can see, I'm basically only moving my model of the cursor in reaction
to Tm's and Td's. (TJ's aren't relevant because the text is horizontal and the
y position is the one I'm tracking.) I also ignored the cm, because
there's a Tm right after it.

Am I misinterpreting the PDF operators (as I suspect)? Is there any
potential that this is a PDFBox issue?

Thanks in advance!

-John


Re: Problems Using PDFBox To Manually Track TextPosition

Posted by John Hewson <jo...@jahewson.com>.
> On 15 Aug 2015, at 20:43, johnw@newconceptsdev.com wrote:
> 
> John,
> 
> Thanks for the response.
> 
> PDFStreamEngine looks promising.  My use case is a bit weird.  Just on the off chance that I can't extract the info I need with PDFStreamEngine, I had some follow up questions about the operations:

PDFStreamEngine powers all of the various text extraction and rendering classes in PDFBox so it should do everything you need. Take a look at PageDrawer to see how it handles text using (only a subset of) PDFStreamEngine’s APIs.

> I thought that Tm replaces the current text matrix completely (unlike cm), and that therefore, if I'm only concerned about text position, I could just treat the Tx and Ty members of the new matrix as the new text position.  Is this not accurate? Or is it just that I have to watch for cm's and other operations after Tm that transform (not replace) the current text matrix?

Sorry, yes that’s right, Tm replaces the entire matrix. It’s cm which multiplies against the existing matrix. The text position depends on both of those matrices though. Both matrices are also part of the graphics state.

Note that the tx and ty don’t give you an x and y position, but specify the x and y translation of the matrix. The scale and rotation elements will also affect the final x and y position, which is why you need to perform the proper matrix operation instead of extracting just those elements.
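
To make the matrix point concrete, here is a minimal sketch in plain Java with no PDFBox dependency. The class names Affine and TextPositionSketch and all of their methods are made up for illustration and are not PDFBox APIs. The sketch assumes the PDF row-vector convention [a b c d e f], that Tm replaces the text matrix, that Td pre-multiplies a translation onto the text line matrix, that cm pre-multiplies onto the CTM, and that the text origin is (0, 0) transformed by Tm x CTM. Font size, horizontal scaling and text rise are left out.

```java
/** Hypothetical 2D affine matrix [a b c d e f], PDF row-vector convention; not a PDFBox class. */
final class Affine {
    final double a, b, c, d, e, f;

    Affine(double a, double b, double c, double d, double e, double f) {
        this.a = a; this.b = b; this.c = c;
        this.d = d; this.e = e; this.f = f;
    }

    static Affine identity() { return new Affine(1, 0, 0, 1, 0, 0); }

    static Affine translation(double tx, double ty) { return new Affine(1, 0, 0, 1, tx, ty); }

    /** Returns this x o: a point is transformed by this first, then by o. */
    Affine times(Affine o) {
        return new Affine(
            a * o.a + b * o.c,
            a * o.b + b * o.d,
            c * o.a + d * o.c,
            c * o.b + d * o.d,
            e * o.a + f * o.c + o.e,
            e * o.b + f * o.d + o.f);
    }

    /** Transforms the point (x, y) and returns {x', y'}. */
    double[] apply(double x, double y) {
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }
}

/** Hypothetical tracker for the text origin; an illustration, not how PDFBox is implemented. */
final class TextPositionSketch {
    private Affine ctm = Affine.identity();
    private Affine textMatrix = Affine.identity();
    private Affine textLineMatrix = Affine.identity();

    /** cm: concatenate onto the CTM (CTM' = M x CTM). */
    void cm(Affine m) { ctm = m.times(ctm); }

    /** BT: reset both text matrices to identity. */
    void beginText() {
        textMatrix = Affine.identity();
        textLineMatrix = Affine.identity();
    }

    /** Tm: replace both text matrices with the operand. */
    void tm(Affine m) {
        textMatrix = m;
        textLineMatrix = m;
    }

    /** Td: translate relative to the start of the current line. */
    void td(double tx, double ty) {
        textLineMatrix = Affine.translation(tx, ty).times(textLineMatrix);
        textMatrix = textLineMatrix;
    }

    /** Where the text-space origin lands after applying Tm and then the CTM. */
    double[] origin() { return textMatrix.times(ctm).apply(0, 0); }

    public static void main(String[] args) {
        TextPositionSketch s = new TextPositionSketch();
        s.cm(new Affine(2, 0, 0, 2, 0, 0));      // CTM scales by 2
        s.beginText();
        s.tm(new Affine(1, 0, 0, 1, 10, 100));   // Tm operand has e=10, f=100
        double[] p = s.origin();
        System.out.println(p[0] + ", " + p[1]);  // prints "20.0, 200.0"
        s.td(0, -12);                            // move down one 12-unit line
        p = s.origin();
        System.out.println(p[0] + ", " + p[1]);  // prints "20.0, 176.0"
    }
}
```

In this example the Tm operand says (10, 100) but the origin ends up at (20, 200), and a Td of -12 moves it by 24, because the CTM scale applies to both. Reading e and f straight off the operands only works while every other matrix is the identity.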

— John

> With the q operations, does graphics state include text position?  What about path clipping?  

Yes, it includes the text matrix and the CTM, as well as the clipping path. See PDGraphicsState.
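
As a rough illustration of why a cm cannot be skipped, here is a small plain-Java sketch of q and Q as a stack. The class GraphicsStateStackSketch and its methods are hypothetical and are not PDFBox's PDGraphicsState. For brevity it snapshots only the CTM; a fuller model would save and restore the rest of the state in the same way.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of q/Q that snapshots only the CTM; not a PDFBox class. */
final class GraphicsStateStackSketch {
    // CTM stored as [a b c d e f], starting at the identity.
    private double[] ctm = {1, 0, 0, 1, 0, 0};
    private final Deque<double[]> saved = new ArrayDeque<>();

    /** q: push a copy of the current state. */
    void save() { saved.push(ctm.clone()); }

    /** Q: pop the most recently saved state; ignored here if nothing was saved. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /** cm: CTM' = M x CTM, using the row-vector convention. */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        double[] m = ctm;
        ctm = new double[] {
            a * m[0] + b * m[2],
            a * m[1] + b * m[3],
            c * m[0] + d * m[2],
            c * m[1] + d * m[3],
            e * m[0] + f * m[2] + m[4],
            e * m[1] + f * m[3] + m[5]
        };
    }

    /** Returns a copy of the current CTM. */
    double[] currentCtm() { return ctm.clone(); }

    public static void main(String[] args) {
        GraphicsStateStackSketch gs = new GraphicsStateStackSketch();
        gs.save();                               // q
        gs.concatenate(1, 0, 0, 1, 0, 500);      // 1 0 0 1 0 500 cm
        System.out.println(gs.currentCtm()[5]);  // prints "500.0"
        gs.restore();                            // Q
        System.out.println(gs.currentCtm()[5]);  // prints "0.0"
    }
}
```

Text drawn between the q and the Q is shifted up by 500 units; text drawn after the Q is not, even if its Tm operands are identical. A model that ignores cm, q and Q will put both at the same height.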

> 
> Sorry for the dense-ness, I'm in a bit over my head on this one.  (And I realize that PDFStreamEngine is the cleaner way to go if I can -- thank you for that recommendation!)
> 
> -John


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Problems Using PDFBox To Manually Track TextPosition

Posted by jo...@newconceptsdev.com.
John,

Thanks for the response.

PDFStreamEngine looks promising.  My use case is a bit weird.  Just on the off chance that I can't extract the info I need with PDFStreamEngine, I had some follow up questions about the operations:

I thought that Tm replaces the current text matrix completely (unlike cm), and that therefore, if I'm only concerned about text position, I could just treat the Tx and Ty members of the new matrix as the new text position.  Is this not accurate? Or is it just that I have to watch for cm's and other operations after Tm that transform (not replace) the current text matrix?

With the q operations, does graphics state include text position?  What about path clipping?  

Sorry for the dense-ness, I'm in a bit over my head on this one.  (And I realize that PDFStreamEngine is the cleaner way to go if I can -- thank you for that recommendation!)

-John



-----Original Message-----
From: "John Hewson" <jo...@jahewson.com>
Sent: Saturday, August 15, 2015 11:29pm
To: users@pdfbox.apache.org
Subject: Re: Problems Using PDFBox To Manually Track TextPosition


> On 14 Aug 2015, at 17:06, John Walker <jo...@newconceptsdev.com> wrote:
> 
> Hello,
> 
> 
> 
> I'm using PDFBox to parse the contentstream for a page in a PDF.   Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically.  However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF operators, specifically showTextString(s) and associated methods, such as showGlyph.

Parsing the stream yourself brings many challenges.

> 
> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm mis-understanding the
> operation list I'm getting from PDFBox.
> 
> 
> 
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor: 
> 
> 
> 
> http://pastebin.com/GUWWX3Kv
> 
> 
> 
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's.  (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.)   I also ignored the cm, because
> there's a Tm right after it.

You’re definitely misunderstanding the operators. Tm doesn’t set the x and y values, it specifies a matrix which is multiplied with the current Tm matrix in the graphics state. In addition, the graphics state itself can be saved/restored via the q and Q operators. You’ll also need to take the CTM into account (that’s the cm operator).

Anyway, don’t do that, use PDFStreamEngine instead.

— John

> 
> Am I mis-interpreting the PDF Operators (as I suspect)?  Is there any
> potential that this is a PDFBox issue?  
> 
> 
> 
> Thanks in advance!
> 
> 
> 
> -John 
> 


Re: Problems Using PDFBox To Manually Track TextPosition

Posted by John Hewson <jo...@jahewson.com>.
> On 14 Aug 2015, at 17:06, John Walker <jo...@newconceptsdev.com> wrote:
> 
> Hello,
> 
> 
> 
> I'm using PDFBox to parse the contentstream for a page in a PDF.   Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically.  However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF operators, specifically showTextString(s) and associated methods, such as showGlyph.

Parsing the stream yourself brings many challenges.

> 
> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm mis-understanding the
> operation list I'm getting from PDFBox.
> 
> 
> 
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor: 
> 
> 
> 
> http://pastebin.com/GUWWX3Kv
> 
> 
> 
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's.  (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.)   I also ignored the cm, because
> there's a Tm right after it.

You’re definitely misunderstanding the operators. Tm doesn’t set the x and y values, it specifies a matrix which is multiplied with the current Tm matrix in the graphics state. In addition, the graphics state itself can be saved/restored via the q and Q operators. You’ll also need to take the CTM into account (that’s the cm operator).

Anyway, don’t do that, use PDFStreamEngine instead.

— John

> 
> Am I mis-interpreting the PDF Operators (as I suspect)?  Is there any
> potential that this is a PDFBox issue?  
> 
> 
> 
> Thanks in advance!
> 
> 
> 
> -John 
> 

