Posted to dev@pdfbox.apache.org by Brian Carrier <ca...@digital-evidence.org> on 2008/11/11 20:45:32 UTC
when to rotate text
I'm trying to find a solution to the text rotation problems that
satisfies the current regression tests and works on some files that
I found bugs with. I've hit a point, though, where I need insight
from people who know more about the graphics and non-text parts of
PDFBox.
Background:
- Text in PDF files is stored in chunks of one or more characters.
The "text matrix" can define if the text goes to the right, left, up,
or down.
- PDFStreamEngine.showString() takes the text stored in the PDF file
and decodes it and saves each chunk in a TextPosition object. The
TextPosition object has an X,Y coordinate for the text chunk.
- PDFTextStripper.flushText() prints the raw text (which requires
sorting the text chunks into the correct order and determining how
far apart they are so that extra spaces are added if needed).
As an example of this, I have a page that is a normal landscape
document. Internally, the text starts at the lower left, the text
direction is "up", and the page is rotated 90.
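To make the landscape example concrete, here is a minimal sketch of how the text direction could be read off a text matrix [a b c d e f] and combined with the page rotation. It assumes the baseline direction is the vector (a, b) and that the page rotation is applied clockwise, as the PDF specification defines /Rotate. The class and method names are hypothetical and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: classifies text direction from the
// a and b entries of a text matrix [a b c d e f] and a page /Rotate value.
final class TextDirectionSketch {

    /** Baseline angle in degrees, counter-clockwise, normalized to 0..359. */
    static int baselineAngle(double a, double b) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(b, a)));
        return ((angle % 360) + 360) % 360;
    }

    /** Angle the reader sees once the page is rotated clockwise by pageRotation degrees. */
    static int displayedAngle(double a, double b, int pageRotation) {
        return (((baselineAngle(a, b) - pageRotation) % 360) + 360) % 360;
    }

    /** Names the four axis-aligned directions; anything else is reported as "other". */
    static String describe(int angle) {
        switch (angle) {
            case 0:   return "right";
            case 90:  return "up";
            case 180: return "left";
            case 270: return "down";
            default:  return "other";
        }
    }

    public static void main(String[] args) {
        // The landscape example: text runs "up" in PDF space, page is rotated 90.
        System.out.println(describe(baselineAngle(0, 1)));      // prints "up"
        System.out.println(describe(displayedAngle(0, 1, 90))); // prints "right"
    }
}
```

With text that runs "up" and a page rotation of 90, the displayed direction comes out as ordinary left-to-right text, which is what a reader of the landscape page sees.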
Problem:
- The step of sorting and outputting text requires knowledge about
the page rotation because PDFTextStripper needs to know where the
"upper left corner" of the page is and how to sort (via
TextPositionComparator). For example, in the previous example the
"upper left" presentation corner is really the lower left corner in
coordinate space.
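The corner relationship can be written down directly. This is only a sketch, assuming rotations are multiples of 90 applied clockwise and that the page's native origin is its lower-left corner; the helper name is made up for illustration.

```java
// Hypothetical helper: which native PDF-space corner ends up as the
// "upper left" of the displayed page for a given clockwise rotation.
final class UpperLeftSketch {

    /** Returns {x, y} in native PDF coordinates (origin lower left, y pointing up). */
    static double[] upperLeftInPdfSpace(double pageWidth, double pageHeight, int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        switch (r) {
            case 0:   return new double[] {0, pageHeight};          // native upper left
            case 90:  return new double[] {0, 0};                   // native lower left
            case 180: return new double[] {pageWidth, 0};           // native lower right
            case 270: return new double[] {pageWidth, pageHeight};  // native upper right
            default:
                throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
    }
}
```

For a rotation of 90 the result is (0, 0), which matches the example: the presentation's upper-left corner is the lower-left corner in coordinate space.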
There seem to be (at least) three ways to solve this:
1) Store the coordinates in TextPosition in coordinates that are
adjusted for the page rotation (this is the original way of doing it).
2) Store the coordinates in TextPosition in the native PDF
coordinates and then each user of TextPosition can adjust for the
rotation (this is what one of the patches does).
3) Store the native coordinates in TextPosition along with the page
rotation value and provide an alternate API that gives the adjusted
coordinates.
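As a rough illustration of option 3, the class below keeps the native coordinates and the page rotation and offers a second pair of getters for the adjusted values. It is a sketch under assumptions, not the actual TextPosition: the class name, the method names, and the choice of "adjusted" meaning "origin at the displayed upper-left corner, y growing downward" are all hypothetical.

```java
// Hypothetical sketch of option 3; not the PDFBox TextPosition class.
final class RotationAwarePosition {
    private final double x, y, pageWidth, pageHeight;
    private final int rotation; // clockwise, normalized to 0, 90, 180 or 270

    RotationAwarePosition(double x, double y, double pageWidth, double pageHeight, int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        if (r % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
        this.x = x;
        this.y = y;
        this.pageWidth = pageWidth;
        this.pageHeight = pageHeight;
        this.rotation = r;
    }

    /** Native PDF coordinates: origin at the lower left of the unrotated page. */
    double nativeX() { return x; }
    double nativeY() { return y; }

    /** Distance from the left edge of the page as displayed. */
    double adjustedX() {
        switch (rotation) {
            case 90:  return y;
            case 180: return pageWidth - x;
            case 270: return pageHeight - y;
            default:  return x;
        }
    }

    /** Distance from the top edge of the page as displayed. */
    double adjustedY() {
        switch (rotation) {
            case 90:  return x;
            case 180: return y;
            case 270: return pageWidth - x;
            default:  return pageHeight - y;
        }
    }
}
```

A caller that draws would use the native getters, while a caller that sorts text would use the adjusted ones, so neither has to know about the rotation itself.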
The only other area where TextPosition.getX() is called is in
PageDrawer.showCharacter() in a call to PDFont.drawString(). I can't
find many references in PDFBox to page rotation and don't know how
the graphics code takes rotation into account and I can't figure out
if drawString() is assuming a page rotation adjusted coordinate or not.
Is there an overall design approach in PDFBox with respect to when
the rotation should be taken into account? Are any of the above
proposed solutions most in line with the rest of the PDFBox code?
thanks,
brian
AW: AW: when to rotate text
Posted by An...@rwe.com.
>> Back to our problem. Looking for answers about the rotation-
>> behaviour of all non-text-elements,
>> I realized that all these elements are drawn different from text-
>> elements. I created a simple
>> word-doc with two boxes and some text and generated a pdf-doc using
>> Adobe PDFMaker.
>> If you try to show the pdf with the PDFReader from pdfbox the boxes
>> are rotated but not the text.
>
>Was this from the trunk version (i.e. store adjusted coordinates) or
>your patched version (i.e. store non-adjusted coordinates)?
I'm using a patched trunk version. The text matrix should do the whole thing.
There is only one conversion left: we still have to move the reference for 0,0
from lower left to upper left.
>> One conclusion seems to be: of course the whole coords-flipping and
>> moving thing in the PDFStreamEngine doesn't rotate the text.
>> We have to find out how the rotation is handled by non-text-
>> elements. After solving this puzzle, we perhaps know how to proceed.
>Yes. Your other post that references the text matrix is the missing
>part of the puzzle that is not currently in PDFBox. The current code
>assumes that all text is "right to left / horizontal". The notion of
>width and height in TextPosition also needs to be reconsidered
>because those assume some form of direction. Having starting and
>ending coordinates may make more sense.
>Because the direction of the text is dependent on the matrix and the
>page rotation, I'm more inclined to store the adjusted coordinates in
>TextPosition so that every user of the object does not need to adjust
>the coordinates themselves.
I agree. The adjustments are only needed for the TextStripper-stuff and therefore
it makes much more sense to place them within the TextPosition.
Andreas
----------------------------------------------------------------
- Management: Chittur Ramakrishnan (Chairman),
Stefan Niehusmann -
- Registered office: Dortmund -
- Registered at the Dortmund District Court -
- Commercial register no. HR B 21222 -
- VAT ID no. DE 2588 96 719 -
Re: AW: when to rotate text
Posted by Brian Carrier <ca...@digital-evidence.org>.
On Nov 12, 2008, at 9:47 AM, <An...@rwe.com> wrote:
>
>
>> I'm trying to find a solution to the text rotation problems that
>> satisfy the current regression tests and that work on some files that
>> I found bugs with. I've hit a point though where I need insight
>> from people who know more about the graphics and non-text part of
>> PDFBox.
>> ..... snip
>
> That's a good summary of the whole problem. Well done.
thanks.
> But first of all, I realized that there is perhaps a
> misunderstanding on my side.
>
> There is one key question for me. What do I have to do to get a
> DINA4-page with a landscape orientation?
> Do I have to rotate every element on the page, which has always the
> same dimension with portrait orientation?
> Or do I just have to flip the x-y dimensions of the page and then
> every element is placed as if the page had a portrait orientation?
>
> For example:
> 1.
> - PageSize = DINA4-PORTRAIT = PDRectangle(PDPage.PAGE_SIZE_A4)
> - rotation = 90
> - element orientation = vertical??
> - Textpositioning 0<x<596 and 0<y<843
> 2.
> - PageSize = DINA4-LANDSCAPE = DINA4-PORTRAIT with flipped
> dimensions = PDRectangle(843,596)
> - rotation = 0
> - element orientation = horizontal
> - Textpositioning 0<x<843 and 0<y<596
>
> I fear the first case will be the right one. Especially if we talk
> about the NoRotate-flag, which is part of pdf since 1.3
> I'm using the second one and it works, but it seems to be wrong.
> Your rotation.pdf has portrait-dimensions and a rotation of 90.
The examples that I have been debugging with are the first case. In
theory though, both could occur.
> Back to our problem. Looking for answers about the rotation-
> behaviour of all non-text-elements,
> I realized that all these elements are drawn different from text-
> elements. I created a simple
> word-doc with two boxes and some text and generated a pdf-doc using
> Adobe PDFMaker.
> If you try to show the pdf with the PDFReader from pdfbox the boxes
> are rotated but not the text.
Was this from the trunk version (i.e. store adjusted coordinates) or
your patched version (i.e. store non-adjusted coordinates)?
> One conclusion seems to be: of course the whole coords-flipping and
> moving thing in the PDFStreamEngine doesn't rotate the text.
> We have to find out how the rotation is handled by non-text-
> elements. After solving this puzzle, we perhaps know how to proceed.
Yes. Your other post that references the text matrix is the missing
part of the puzzle that is not currently in PDFBox. The current code
assumes that all text is "right to left / horizontal". The notion of
width and height in TextPosition also needs to be reconsidered
because those assume some form of direction. Having starting and
ending coordinates may make more sense.
Because the direction of the text is dependent on the matrix and the
page rotation, I'm more inclined to store the adjusted coordinates in
TextPosition so that every user of the object does not need to adjust
the coordinates themselves.
thanks,
brian
Re: AW: when to rotate text
Posted by Brian Carrier <ca...@digital-evidence.org>.
On Nov 14, 2008, at 2:44 AM, <An...@rwe.com>
<An...@rwe.com> wrote:
>>> One conclusion seems to be: of course the whole coords-flipping and
>>> moving thing in the PDFStreamEngine doesn't rotate the text. We have
>>> to find out how the rotation is handled by non-text-elements.
>>> After solving this puzzle, we perhaps know how to proceed.
>> I guess I found something. The whole translation, scaling and
>> rotation stuff is done by using a Matrix. All non-text elements take
>> care of the shearing values, but not the text element.
>> PDFStreamEngine.showString() only uses the scaling and the
>> position when
>> calling PageDrawer.showCharacter(). The only thing we have to do,
>> is to add the shearing values or the whole Transformation-Matrix
>> to the
>> parameters for that method. A quick hack works not 100%, but shows
>> that I'm on the right path.
> I spent some more time on this problem and I've found a working
> solution for the text rotation problem. My rotation example
> (test-landscape2.pdf, attached to PDFBOX-363) works 100%, and
> rotation.pdf works 100% as far as rotation is concerned, but there
> are still some other positioning problems: some letters overlap
> others. This has nothing to do with the rotation issue, because the
> overlap occurs on both pages, portrait and landscape.
Hi Andreas,
It seems like we are both working on the same thing. To prevent the
previous situation where we both submitted different patches for the
same problem and neither made much progress towards getting committed,
perhaps we should discuss our approaches so that we can more easily
figure out which one will be most easily incorporated.
Here is the overview of my approach (which currently passes all
regression tests or has improved results in the tests). Obviously a
few minor tweaks and fixes were added, but here are the major changes:
- All knowledge about the text matrix rotation and page rotation has
been moved to TextPosition.
- TextPosition takes the starting and ending text matrix of each text
chunk in its constructor so that it knows the layout of the text.
- TextPosition has new methods that allow you to get either the
adjusted or the original X, Y, length, and width coordinates.
- TextPositionComparator gets the adjusted coordinates so that it can
compare locations without knowing about rotation / text direction.
- PDFTextStripper also uses the new APIs to get the adjusted
coordinates so that it does not need to know about rotation / text
direction.
- PDFStreamEngine doesn't need to know about rotation and text
direction either. It just passes the matrix.
The effect of this approach is that none of the callers need to know
about rotation and text orientation, just TextPosition. Is this
similar to your approach?
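To illustrate the idea (this is not the actual patch), here is a minimal sketch in which a chunk is built from a start point and an end point, standing in for the translation parts of the starting and ending text matrices. The chunk derives its own direction and length, and a comparator sorts on adjusted coordinates only. All names are hypothetical. The sketch also simplifies: it treats the baseline direction as the reading direction and assumes that all chunks being compared share that direction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a direction-aware text chunk; not the PDFBox TextPosition.
final class ChunkSketch {
    final String text;
    private final double startX, startY, dirX, dirY, length;

    ChunkSketch(String text, double startX, double startY, double endX, double endY) {
        this.text = text;
        this.startX = startX;
        this.startY = startY;
        this.length = Math.hypot(endX - startX, endY - startY);
        // Unit vector along the baseline; fall back to left-to-right for zero-length chunks.
        this.dirX = length == 0 ? 1 : (endX - startX) / length;
        this.dirY = length == 0 ? 0 : (endY - startY) / length;
    }

    /** Extent of the chunk along its own baseline, whatever its direction. */
    double length() { return length; }

    /** Position along the reading direction. */
    double adjustedX() { return startX * dirX + startY * dirY; }

    /** Position across lines; grows from the first line towards the last. */
    double adjustedY() { return startX * dirY - startY * dirX; }

    /** Orders chunks line by line, then along the line, using adjusted values only. */
    static final Comparator<ChunkSketch> READING_ORDER =
        Comparator.comparingDouble(ChunkSketch::adjustedY)
                  .thenComparingDouble(ChunkSketch::adjustedX);

    static String join(List<ChunkSketch> chunks) {
        List<ChunkSketch> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        StringBuilder sb = new StringBuilder();
        for (ChunkSketch c : sorted) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text running "up" in native space, as on the rotated landscape page.
        System.out.println(join(Arrays.asList(
            new ChunkSketch("World", 100, 100, 100, 140),
            new ChunkSketch("Next", 120, 50, 120, 90),
            new ChunkSketch("Hello", 100, 50, 100, 90)))); // prints "Hello World Next"
    }
}
```

The comparator never looks at a rotation value or a matrix, which is the point of keeping that knowledge inside the position object.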
thanks,
brian
AW: when to rotate text
Posted by An...@rwe.com.
>> One conclusion seems to be: of course the whole coords-flipping and
>> moving thing in the PDFStreamEngine doesn't rotate the text. We have
>> to find out how the rotation is handled by non-text-elements. After solving this puzzle, we perhaps know how to proceed.
>I guess I found something. The whole translation, scaling and rotation stuff is done by using a Matrix. All non-text elements take
>care of the shearing values, but not the text element. PDFStreamEngine.showString() only uses the scaling and the position when
>calling PageDrawer.showCharacter(). The only thing we have to do, is to add the shearing values or the whole Transformation-Matrix to the
>parameters for that method. A quick hack works not 100%, but shows that I'm on the right path.
I spent some more time on this problem and I've found a working solution for the text rotation problem. My rotation example (test-landscape2.pdf, attached to PDFBOX-363) works 100%, and rotation.pdf works 100% as far as rotation is concerned, but there are still some other positioning problems: some letters overlap others. This has nothing to do with the rotation issue, because the overlap occurs on both pages, portrait and landscape.
Because of my holiday in Berlin, I will provide a patch at the end of next week.
CU,
Andreas
RWE IT GmbH
AW: when to rotate text
Posted by An...@rwe.com.
> One conclusion seems to be: of course the whole coords-flipping and moving thing in the PDFStreamEngine doesn't rotate the text.
> We have to find out how the rotation is handled by non-text-elements. After solving this puzzle, we perhaps know how to proceed.
I guess I found something. The whole translation, scaling and rotation stuff is done by using a Matrix. All non-text elements take
care of the shearing values, but not the text element. PDFStreamEngine.showString() only uses the scaling and the position when
calling PageDrawer.showCharacter(). The only thing we have to do is add the shearing values or the whole transformation matrix
to the parameters for that method. A quick hack doesn't work 100%, but it shows that I'm on the right path.
But this has to wait, because I will be out of town for a week or so. But everyone is invited to solve the problem ...
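A small sketch of why the shearing values matter, using java.awt.geom.AffineTransform as a stand-in for the PDFBox matrix (that substitution, and all names below, are assumptions for illustration). For text rotated by 90 degrees the scale entries are zero and the rotation lives entirely in the shear entries, so a caller that passes along only scaling and position loses the direction of the text.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical illustration; AffineTransform stands in for the PDFBox matrix.
final class ShearSketch {

    /** Matrix for text rotated 90 degrees counter-clockwise and placed at (tx, ty). */
    static AffineTransform rotatedTextMatrix(double tx, double ty) {
        // Constructor argument order: m00, m10, m01, m11, m02, m12.
        return new AffineTransform(0, 1, -1, 0, tx, ty);
    }

    /** Applies the whole matrix, including the shear entries. */
    static Point2D full(AffineTransform m, double x, double y) {
        return m.transform(new Point2D.Double(x, y), null);
    }

    /** Applies only scaling and position, ignoring the shear entries. */
    static Point2D scaleAndPositionOnly(AffineTransform m, double x, double y) {
        return new Point2D.Double(
            m.getScaleX() * x + m.getTranslateX(),
            m.getScaleY() * y + m.getTranslateY());
    }

    public static void main(String[] args) {
        AffineTransform m = rotatedTextMatrix(100, 200);
        // End of a 10-unit glyph advance along the baseline.
        Point2D withShear = full(m, 10, 0);
        Point2D withoutShear = scaleAndPositionOnly(m, 10, 0);
        System.out.println(withShear.getX() + ", " + withShear.getY());
        System.out.println(withoutShear.getX() + ", " + withoutShear.getY());
    }
}
```

With the full matrix the advance moves 10 units up from (100, 200). With scaling and position only, the end point collapses onto the start point, because both scale entries are zero for a 90-degree rotation.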
AW: when to rotate text
Posted by An...@rwe.com.
>Back to our problem. Looking for answers about the rotation-behaviour of all non-text-elements,
>I realized that all these elements are drawn different from text-elements. I created a simple
>word-doc with two boxes and some text and generated a pdf-doc using Adobe PDFMaker.
>If you try to show the pdf with the PDFReader from pdfbox the boxes are rotated but not the text.
I've tried to attach the file to my posting, but obviously it isn't allowed. :-o
You can find the test document as an attachment to PDFBOX-363.
AW: when to rotate text
Posted by An...@rwe.com.
>I'm trying to find a solution to the text rotation problems that
>satisfy the current regression tests and that work on some files that
>I found bugs with. I've hit a point though where I need insight
>from people who know more about the graphics and non-text part of
>PDFBox.
> ..... snip
That's a good summary of the whole problem. Well done.
But first of all, I realized that there is perhaps a misunderstanding on my side.
There is one key question for me. What do I have to do to get a DIN A4 page with a landscape orientation?
Do I have to rotate every element on the page, which always keeps its portrait dimensions?
Or do I just have to flip the x-y dimensions of the page, so that every element is then placed as if the page had a portrait orientation?
For example:
1.
- PageSize = DINA4-PORTRAIT = PDRectangle(PDPage.PAGE_SIZE_A4)
- rotation = 90
- element orientation = vertical??
- Textpositioning 0<x<596 and 0<y<843
2.
- PageSize = DINA4-LANDSCAPE = DINA4-PORTRAIT with flipped dimensions = PDRectangle(843,596)
- rotation = 0
- element orientation = horizontal
- Textpositioning 0<x<843 and 0<y<596
I fear the first case is the right one, especially if we talk about the NoRotate flag, which has been part of PDF since 1.3.
I'm using the second one and it works, but it seems to be wrong. Your rotation.pdf has portrait dimensions and a rotation of 90.
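The two cases can be compared with a few lines of code. This is a sketch with a made-up helper name, using the page dimensions quoted above: both cases display as the same landscape page, but the coordinate ranges in which the content is written differ.

```java
// Hypothetical helper: size of the page as displayed, given its native
// width and height and a clockwise rotation that is a multiple of 90.
final class PageOrientationSketch {

    /** Returns {displayedWidth, displayedHeight}. */
    static double[] displayedSize(double nativeWidth, double nativeHeight, int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        if (r % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
        boolean swapped = (r == 90 || r == 270);
        return swapped
            ? new double[] {nativeHeight, nativeWidth}
            : new double[] {nativeWidth, nativeHeight};
    }

    public static void main(String[] args) {
        double[] case1 = displayedSize(596, 843, 90); // portrait box, rotated
        double[] case2 = displayedSize(843, 596, 0);  // landscape box, not rotated
        System.out.println(case1[0] + " x " + case1[1]);
        System.out.println(case2[0] + " x " + case2[1]);
    }
}
```

In case 1 the text still has to be positioned within 0<x<596 and 0<y<843 and rotated by its own matrix, while in case 2 it is positioned within 0<x<843 and 0<y<596 with no rotation at all.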
Back to our problem. Looking for answers about the rotation behaviour of all non-text elements,
I realized that all these elements are drawn differently from text elements. I created a simple
Word document with two boxes and some text and generated a PDF document using Adobe PDFMaker.
If you try to display the PDF with the PDFReader from PDFBox, the boxes are rotated but the text is not.
One conclusion seems to be: of course the whole coordinate flipping and moving in the PDFStreamEngine doesn't rotate the text.
We have to find out how rotation is handled for non-text elements. After solving this puzzle, we will perhaps know how to proceed.