Posted to dev@pdfbox.apache.org by Brian Carrier <ca...@digital-evidence.org> on 2008/11/11 20:45:32 UTC

when to rotate text

I'm trying to find a solution to the text rotation problems that
satisfies the current regression tests and that works on some files
that I found bugs with. I've hit a point though where I need insight
from people who know more about the graphics and non-text part of
PDFBox.

Background:
- Text in PDF files is stored in chunks of one or more characters.
The "text matrix" defines whether the text goes to the right, left,
up, or down.
- PDFStreamEngine.showString() takes the text stored in the PDF file  
and decodes it and saves each chunk in a TextPosition object.  The  
TextPosition object has an X,Y coordinate for the text chunk.
- PDFTextStripper.flushText() prints the raw text (which requires  
sorting the text chunks into the correct order and determining how  
far apart they are so that extra spaces are added if needed).

As an example of this, I have a page that is a normal landscape  
document.  Internally, the text starts at the lower left, the text  
direction is "up", and the page is rotated 90.
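
As a rough illustration of the text direction in this example (not PDFBox code), the direction of a chunk can be read off the first two entries of its text matrix [a b c d e f], because (a, b) is where the unit vector along the baseline ends up. The class and method names below are made up for this sketch, and it assumes the text is rotated by roughly a multiple of 90 degrees.

```java
final class TextDirectionSketch {
    enum Direction { RIGHT, UP, LEFT, DOWN }

    // Classifies the baseline direction from the a and b entries of a PDF
    // text matrix [a b c d e f]. In PDF user space, y increases upward.
    static Direction fromMatrix(double a, double b) {
        double degrees = Math.toDegrees(Math.atan2(b, a));
        // Snap to the nearest multiple of 90 degrees and wrap into 0..3.
        int quadrant = (int) Math.floorMod(Math.round(degrees / 90.0), 4L);
        return Direction.values()[quadrant];
    }

    public static void main(String[] args) {
        System.out.println(fromMatrix(1, 0)); // unrotated, horizontal text
        System.out.println(fromMatrix(0, 1)); // text matrix rotated 90 degrees counterclockwise
    }
}
```

For the landscape page described above, the text matrix would have a = 0 and b > 0, which this sketch classifies as UP.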

Problem:
- The step of sorting and outputting text requires knowledge about  
the page rotation because PDFTextStripper needs to know where the  
"upper left corner" of the page is and how to sort (via  
TextPositionComparator). For example, in the previous example the  
"upper left" presentation corner is really the lower left corner in  
coordinate space.

There seem to be (at least) three ways to solve this:
1) Store the coordinates in TextPosition in coordinates that are  
adjusted for the page rotation (this is the original way of doing it).
2) Store the coordinates in TextPosition in the native PDF  
coordinates and then each user of TextPosition can adjust for the  
rotation (this is what one of the patches does).
3) Store the native coordinates in TextPosition along with the page  
rotation value and provide an alternate API that gives the adjusted  
coordinates.
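
Option 3 could look roughly like the sketch below. It is not the existing TextPosition class; the class name and getters are hypothetical. It assumes the lower left corner of the page is at (0, 0), that the rotation is the page's /Rotate value (clockwise, a multiple of 90), and that adjusted coordinates are measured from the upper left corner of the page as displayed.

```java
final class RotatedPositionSketch {
    private final double x;          // native PDF user-space x
    private final double y;          // native PDF user-space y
    private final double pageWidth;  // width of the unrotated page
    private final double pageHeight; // height of the unrotated page
    private final int rotation;      // normalized to 0, 90, 180 or 270

    RotatedPositionSketch(double x, double y, double pageWidth, double pageHeight, int rotation) {
        int normalized = Math.floorMod(rotation, 360);
        if (normalized % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90");
        }
        this.x = x;
        this.y = y;
        this.pageWidth = pageWidth;
        this.pageHeight = pageHeight;
        this.rotation = normalized;
    }

    // Native coordinates, exactly as stored in the file.
    double getNativeX() { return x; }
    double getNativeY() { return y; }

    // Distance from the left edge of the page as displayed.
    double getAdjustedX() {
        switch (rotation) {
            case 90:  return y;
            case 180: return pageWidth - x;
            case 270: return pageHeight - y;
            default:  return x;
        }
    }

    // Distance from the top edge of the page as displayed.
    double getAdjustedY() {
        switch (rotation) {
            case 90:  return x;
            case 180: return y;
            case 270: return pageWidth - x;
            default:  return pageHeight - y;
        }
    }
}
```

With rotation 90, a chunk at native (10, 20) gets adjusted coordinates (20, 10), so text near the native lower left corner ends up near the displayed upper left corner, as in the landscape example.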

The only other area where TextPosition.getX() is called is in  
PageDrawer.showCharacter() in a call to PDFont.drawString(). I can't  
find many references in PDFBox to page rotation and don't know how  
the graphics code takes rotation into account and I can't figure out  
if drawString() is assuming a page rotation adjusted coordinate or not.

Is there an overall design approach in PDFBox with respect to when
the rotation should be taken into account? Are any of the above
proposed solutions most in line with the rest of the PDFBox code?

thanks,
brian



AW: AW: when to rotate text

Posted by An...@rwe.com.
>> Back to our problem. Looking for answers about the rotation
>> behaviour of all non-text elements, I realized that all these
>> elements are drawn differently from text elements. I created a
>> simple Word document with two boxes and some text and generated a
>> PDF using Adobe PDFMaker. If you try to show the PDF with the
>> PDFReader from PDFBox, the boxes are rotated but not the text.
>
>Was this from the trunk version (i.e. store adjusted coordinates) or  
>your patched version (i.e. store non-adjusted coordinates)?
I'm using a patched trunk version. The text matrix should do the whole thing.
There is only one conversion left: we still have to move the reference for 0,0
from lower left to upper left.

>> One conclusion seems to be: of course the whole coords-flipping and
>> moving thing in the PDFStreamEngine doesn't rotate the text.
>> We have to find out how the rotation is handled by non-text
>> elements. After solving this puzzle, we perhaps know how to proceed.

>Yes.  Your other post that references the text matrix is the missing
>part of the puzzle that is not currently in PDFBox. The current code
>assumes that all text is "left to right / horizontal".  The notion of
>width and height in TextPosition also needs to be reconsidered
>because those assume some form of direction.  Having starting and
>ending coordinates may make more sense.

>Because the direction of the text is dependent on the matrix and the  
>page rotation, I'm more inclined to store the adjusted coordinates in  
>TextPosition so that every user of the object does not need to adjust  
>the coordinates themselves.
I agree. The adjustments are only needed for the text stripper code, and
therefore it makes much more sense to place them within TextPosition.

Andreas


----------------------------------------------------------------
- Geschaeftsfuehrung: Chittur Ramakrishnan (Vorsitzender), 
Stefan Niehusmann - 
- Sitz der Gesellschaft: Dortmund - 
- Eingetragen beim Amtsgericht Dortmund - 
- Handelsregister-Nr. HR B 21222 - 
- USt.-IdNr. DE 2588 96 719 - 

Re: AW: when to rotate text

Posted by Brian Carrier <ca...@digital-evidence.org>.
On Nov 12, 2008, at 9:47 AM, <An...@rwe.com> wrote:

>
>
>> I'm trying to find a solution to the text rotation problems that
>> satisfies the current regression tests and that works on some files
>> that I found bugs with. I've hit a point though where I need insight
>> from people who know more about the graphics and non-text part of
>> PDFBox.
>> ..... snip
>
> That's a good summary of the whole problem. Well done.

thanks.

> But first of all, I realized that there is perhaps a
> misunderstanding on my side.
>
> There is one key question for me. What do I have to do to get a
> DINA4 page with a landscape orientation?
> Do I have to rotate every element on the page, which always has the
> same dimensions as in portrait orientation?
> Or do I just have to flip the x-y dimensions of the page, and then
> every element is placed as if the page had a portrait orientation?
>
> For example:
> 1.
> - PageSize = DINA4-PORTRAIT  = PDRectangle(PDPage.PAGE_SIZE_A4)
> - rotation = 90
> - element orientation = vertical??
> - Textpositioning 0<x<596 and 0<y<843
> 2.
> - PageSize = DINA4-LANDSCAPE = DINA4-PORTRAIT with flipped  
> dimensions = PDRectangle(843,596)
> - rotation = 0
> - element orientation = horizontal
> - Textpositioning 0<x<843 and 0<y<596
>
> I fear the first case will be the right one. Especially if we talk  
> about the NoRotate-flag, which is part of pdf since 1.3
> I'm using the second one and it works, but it seems to be wrong.  
> Your rotation.pdf has portrait-dimensions and a rotation of 90.

The examples that I have been debugging with are the first case.   In  
theory though, both could occur.
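
Both cases lead to the same displayed page size, which a small sketch makes concrete. The helper below is hypothetical (not a PDFBox method); it uses the rounded A4 dimensions 596 x 843 from the examples above and assumes the rotation is a multiple of 90.

```java
final class DisplayedPageSizeSketch {
    // Returns {width, height} of the page as displayed, given the page
    // box dimensions and the page rotation in degrees.
    static double[] of(double boxWidth, double boxHeight, int rotation) {
        int normalized = Math.floorMod(rotation, 360);
        boolean swapped = normalized == 90 || normalized == 270;
        return swapped
                ? new double[] { boxHeight, boxWidth }
                : new double[] { boxWidth, boxHeight };
    }

    public static void main(String[] args) {
        double[] case1 = of(596, 843, 90); // portrait box, rotated 90
        double[] case2 = of(843, 596, 0);  // landscape box, no rotation
        System.out.println(case1[0] + " x " + case1[1]);
        System.out.println(case2[0] + " x " + case2[1]);
    }
}
```

The displayed size is the same in both cases; what differs is the coordinate space the text positions are expressed in, which is why the text extraction code has to handle case 1 explicitly.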

> Back to our problem. Looking for answers about the rotation
> behaviour of all non-text elements, I realized that all these
> elements are drawn differently from text elements. I created a
> simple Word document with two boxes and some text and generated a
> PDF using Adobe PDFMaker. If you try to show the PDF with the
> PDFReader from PDFBox, the boxes are rotated but not the text.

Was this from the trunk version (i.e. store adjusted coordinates) or  
your patched version (i.e. store non-adjusted coordinates)?

> One conclusion seems to be: of course the whole coords-flipping and
> moving thing in the PDFStreamEngine doesn't rotate the text.
> We have to find out how the rotation is handled by non-text
> elements. After solving this puzzle, we perhaps know how to proceed.

Yes.  Your other post that references the text matrix is the missing
part of the puzzle that is not currently in PDFBox. The current code
assumes that all text is "left to right / horizontal".  The notion of
width and height in TextPosition also needs to be reconsidered
because those assume some form of direction.  Having starting and
ending coordinates may make more sense.

Because the direction of the text is dependent on the matrix and the  
page rotation, I'm more inclined to store the adjusted coordinates in  
TextPosition so that every user of the object does not need to adjust  
the coordinates themselves.

thanks,
brian


Re: AW: when to rotate text

Posted by Brian Carrier <ca...@digital-evidence.org>.
On Nov 14, 2008, at 2:44 AM, <An...@rwe.com>  
<An...@rwe.com> wrote:

>>> One conclusion seems to be: of course the whole coords-flipping and
>>> moving thing in the PDFStreamEngine doesn't rotate the text. We have
>>> to find out how the rotation is handled by non-text-elements.  
>>> After solving this puzzle, we perhaps know how to proceed.
>> I guess I found something. The whole translation, scaling and
>> rotation stuff is done by using a Matrix. All non-text elements take
>> care of the shearing values, but not the text element.
>> PDFStreamEngine.showString() only uses the scaling and the position
>> when calling PageDrawer.showCharacter(). The only thing we have to
>> do is to add the shearing values or the whole transformation matrix
>> to the parameters for that method. A quick hack doesn't work 100%,
>> but shows that I'm on the right path.
> I spent some more time on this problem and I've found a working
> solution for the text rotation problem. My rotation example
> (test-landscape2.pdf attached to PDFBOX-363) works 100% and the
> rotation.pdf works 100% concerning the rotation, but there are
> still some other problems with the positioning: some letters
> overlap others. This has nothing to do with the rotation issue,
> because the overlapping occurs on both pages, portrait and landscape.

Hi Andreas,

It seems like we are both working on the same thing.  To prevent the
previous situation where we both submitted different patches to the
same problem and neither made much progress towards getting committed,
perhaps we should discuss our approaches so that we can more easily
figure out the one that will be most easily incorporated.

Here is the overview of my approach (which currently passes all  
regression tests or has improved results in the tests).  Obviously a  
few minor tweaks and fixes were added, but here are the major changes:
- All knowledge about the text matrix rotation and page rotation has  
been moved to TextPosition.
- TextPosition takes the starting and ending text matrix of each text  
chunk in its constructor so that it knows the layout of the text.
- TextPosition has new methods that allow you to get either the  
adjusted or the original X, Y, length, and width coordinates.
- TextPositionComparator gets the adjusted coordinates so that it can  
compare locations without knowing about rotation / text direction.
- PDFTextStripper also uses the new APIs to get the adjusted  
coordinates so that it does not need to know about rotation / text  
direction.
- PDFStreamEngine doesn't need to know about rotation and text  
direction either.  It just passes the matrix.

The effect of this approach is that none of the callers need to know  
about rotation and text orientation, just TextPosition.  Is this  
similar to your approach?
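
To make the shape of this approach concrete, here is a minimal sketch rather than the actual patch. The class, its fields and the comparator are hypothetical. It assumes the start and end points have already been adjusted for the text matrix and page rotation (measured from the displayed upper left corner), and it compares Y values exactly, whereas a real comparator would presumably need some tolerance for chunks on the same line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkSketch {
    final String text;
    final double adjustedX; // start of the chunk, from the displayed left edge
    final double adjustedY; // start of the chunk, from the displayed top edge
    final double length;    // distance between the start and end points

    ChunkSketch(String text, double startX, double startY, double endX, double endY) {
        this.text = text;
        this.adjustedX = startX;
        this.adjustedY = startY;
        // Using start and end points avoids assuming a text direction.
        this.length = Math.hypot(endX - startX, endY - startY);
    }

    // Top to bottom, then left to right, with no knowledge of rotation.
    static final Comparator<ChunkSketch> READING_ORDER =
            Comparator.comparingDouble((ChunkSketch c) -> c.adjustedY)
                      .thenComparingDouble(c -> c.adjustedX);

    // Concatenates the chunk texts in reading order.
    static String join(List<ChunkSketch> chunks) {
        List<ChunkSketch> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        StringBuilder out = new StringBuilder();
        for (ChunkSketch c : sorted) {
            out.append(c.text);
        }
        return out.toString();
    }
}
```

The point of the design is visible in join(): the caller sorts and concatenates without ever looking at a rotation value or a matrix.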

thanks,
brian


AW: when to rotate text

Posted by An...@rwe.com.
>> One conclusion seems to be: of course the whole coords-flipping and 
>> moving thing in the PDFStreamEngine doesn't rotate the text. We have 
>> to find out how the rotation is handled by non-text-elements. After solving this puzzle, we perhaps know how to proceed.
>I guess I found something. The whole translation, scaling and rotation stuff is done by using a Matrix. All non-text elements take 
>care of the shearing values, but not the text element. PDFStreamEngine.showString() only uses the scaling and the position when 
>calling PageDrawer.showCharacter(). The only thing we have to do, is to add the shearing values or the whole Transformation-Matrix to the 
>parameters for that method. A quick hack works not 100%, but shows that I'm on the right path.
I spent some more time on this problem and I've found a working solution
for the text rotation problem. My rotation example (test-landscape2.pdf
attached to PDFBOX-363) works 100% and the rotation.pdf works 100%
concerning the rotation, but there are still some other problems with the
positioning: some letters overlap others. This has nothing to do with the
rotation issue, because the overlapping occurs on both pages, portrait and
landscape.

Due to my holiday in Berlin, I will provide a patch at the end of next week.


CU,
Andreas

RWE IT GmbH


AW: when to rotate text

Posted by An...@rwe.com.
> One conclusion seems to be: of course the whole coords-flipping and moving thing in the PDFStreamEngine doesn't rotate the text.
> We have to find out how the rotation is handled by non-text-elements. After solving this puzzle, we perhaps know how to proceed.
I guess I found something. The whole translation, scaling and rotation stuff is done by using a Matrix. All non-text elements take
care of the shearing values, but not the text element. PDFStreamEngine.showString() only uses the scaling and the position when
calling PageDrawer.showCharacter(). The only thing we have to do is to add the shearing values or the whole transformation matrix
to the parameters for that method. A quick hack doesn't work 100%, but shows that I'm on the right path.

But this has to wait, because I will be out of town for a week or so. But everyone is invited to solve the problem ...
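
The effect of dropping the shearing values can be shown with java.awt.geom.AffineTransform, whose constructor takes the six entries in the same order as a PDF matrix [a b c d e f]. This is only a sketch of the idea, not the PDFBox drawing code: the class and method names are hypothetical, and how PDFBox actually derives the scaling it passes on is not shown in this thread.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class ShearSketch {
    // Full matrix: keeps b and c (the shear entries), which carry the rotation.
    static AffineTransform full(double a, double b, double c, double d, double e, double f) {
        return new AffineTransform(a, b, c, d, e, f);
    }

    // Scaling and position only: b and c are dropped, so rotation is lost.
    static AffineTransform scaleAndPositionOnly(double a, double d, double e, double f) {
        return new AffineTransform(a, 0, 0, d, e, f);
    }

    // Where a point one unit along the glyph baseline ends up.
    static Point2D baselineEnd(AffineTransform t) {
        return t.transform(new Point2D.Double(1, 0), null);
    }

    public static void main(String[] args) {
        // Matrix for text rotated 90 degrees counterclockwise, placed at (100, 200).
        System.out.println(baselineEnd(full(0, 1, -1, 0, 100, 200)));
        System.out.println(baselineEnd(scaleAndPositionOnly(0, 0, 100, 200)));
    }
}
```

With the full matrix the baseline runs upward from (100, 200) to (100, 201). With only a, d, e and f, the baseline end stays at (100, 200), so nothing about the direction survives, which is consistent with boxes being rotated while text is not.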


AW: when to rotate text

Posted by An...@rwe.com.
>Back to our problem. Looking for answers about the rotation behaviour of all non-text elements,
>I realized that all these elements are drawn differently from text elements. I created a simple
>Word document with two boxes and some text and generated a PDF using Adobe PDFMaker.
>If you try to show the PDF with the PDFReader from PDFBox, the boxes are rotated but not the text.
I've tried to attach the file to my posting, but obviously it isn't allowed. :-o
You can find the test document as an attachment to PDFBOX-363.



AW: when to rotate text

Posted by An...@rwe.com.

>I'm trying to find a solution to the text rotation problems that
>satisfies the current regression tests and that works on some files
>that I found bugs with. I've hit a point though where I need insight
>from people who know more about the graphics and non-text part of
>PDFBox.
> ..... snip

That's a good summary of the whole problem. Well done.

But first of all, I realized that there is perhaps a misunderstanding on my side.

There is one key question for me. What do I have to do to get a DINA4 page with a landscape orientation?
Do I have to rotate every element on the page, which always has the same dimensions as in portrait orientation?
Or do I just have to flip the x-y dimensions of the page, and then every element is placed as if the page had a portrait orientation?

For example:
1.
- PageSize = DINA4-PORTRAIT  = PDRectangle(PDPage.PAGE_SIZE_A4)
- rotation = 90
- element orientation = vertical??
- Textpositioning 0<x<596 and 0<y<843
2.
- PageSize = DINA4-LANDSCAPE = DINA4-PORTRAIT with flipped dimensions = PDRectangle(843,596)
- rotation = 0
- element orientation = horizontal
- Textpositioning 0<x<843 and 0<y<596

I fear the first case will be the right one, especially if we talk about the NoRotate flag, which has been part of PDF since 1.3.
I'm using the second one and it works, but it seems to be wrong. Your rotation.pdf has portrait dimensions and a rotation of 90.

Back to our problem. Looking for answers about the rotation behaviour of all non-text elements,
I realized that all these elements are drawn differently from text elements. I created a simple
Word document with two boxes and some text and generated a PDF using Adobe PDFMaker.
If you try to show the PDF with the PDFReader from PDFBox, the boxes are rotated but not the text.

One conclusion seems to be: of course the whole coords-flipping and moving thing in the PDFStreamEngine doesn't rotate the text.
We have to find out how the rotation is handled by non-text-elements. After solving this puzzle, we perhaps know how to proceed.


