You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/10/31 20:57:44 UTC

[jira] Commented: (PDFBOX-363) Fixed Page rotation

    [ https://issues.apache.org/jira/browse/PDFBOX-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644414#action_12644414 ] 

Brian Carrier commented on PDFBOX-363:
--------------------------------------

What additional information is needed for a commiter to make a decision on which approach will be used?  There are some additional text rotation problems that are not addressed by either this patch or the one associated with PDFBOX-374 and the fix requires taking the text matrix into account. The current fix that I am working on will be impacted by which approach is used. 

FWIW, the approach of not  rotating the text in PDFTextStripper and taking the rotation into account in TextPositionComparator will be easiest for the next fix as well (which will allow page rotation as well as having text on a page going in a different direction - such as on the y-axis of a graph). 


> Fixed Page rotation
> -------------------
>
>                 Key: PDFBOX-363
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-363
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Jukka Zitting
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1977429&group_id=78314&atid=552834
> Hi all,
> Daniel asked me for my patch for the rotation-issue described in
> https://sourceforge.net/forum/message.php?msg_id=4992032
> Attention, I didn't apply the newest patches to the classes PDFStreamEngine
> and PageDrawer.
> There are 4 more probably affected classes calling the page.findRotation
> method which I didn't change, because I'm didn't have to use them (until
> now).
> org.pdfbox.util.operator.pagedrawer.Invoke
> org.pdfbox.util.TextPositionComparator
> org.pdfbox.examples.pdmodel.PrintURLs
> org.pdfbox.examples.util.PrintImageLocations
> I've attached a pdf in DINA4-landscape. The text is missplaced whenever I
> try to print or display (using the pdfbox-PDFReader and convertToImage
> within my application) it with pdfbox. The acrobat reader has no problems
> with my documents.
> After my patch everything works fine. Perhaps it is a point of discussion,
> if the convertToImage method has to rotate the image or if the user has to
> do it. The PDFPagePanel didn't do it (yet).
> Andreas
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279404&aid=1977429
> [Comment from SourceForge]
> Date: 2008-05-29 12:42
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> I've just tried your sample PDF w/ the latest code -- prior to application
> of your patch.  It doesn't work.
> I'll work on incorporating your change for a full regression test in the
> next hour or so.
> [Comment from SourceForge]
> Date: 2008-05-29 15:16
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> Hi Daniel,
> I've just added my patch to the newest sources you send me earlier this
> day. I guess it works. During testing I've found another problem concernign
> graphics within landscape-docs. I found the solution in patching the class
> org.pdfbox.util.operator.pagedrawer.Invoke in the same way I've patched the
> others. And consequently to be strict I've also patched the new methods in
> org.pdfbox.pdfviewer.PageDrawer
> For my everthings works fine inlc. the 4PP-pdf.
> I've attached the patched files and another testpdf with a embedded
> graphic.
> Andreas
> File Added: pdfbox_rotation_patch_2.zip
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279471&aid=1977429
> [Comment from SourceForge]
> Date: 2008-05-29 18:12
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> Your code works w/ the 4PP test ... and with the other rendering stuff
> I've tried so far.
> However ... the text extraction test fails with it.  I can't figure that
> one out ... ideas?
> [Comment from SourceForge]
> Date: 2008-05-29 18:19
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> Can you give me some more details? I never do any textextractions with
> pdfbox. Perhaps you'll provide with the code for test program, or is it
> part of pdfbox, so that I can find it in the cvs?
> However, it has to wait until tomorrow
> [Comment from SourceForge]
> Date: 2008-05-29 18:39
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> If you've got the whole project set up, try
> ant testextract
> I'll see if I can narrow it down some.
> [Comment from SourceForge]
> Date: 2008-05-29 21:00
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> The extraction problem seems to have to do w/ the changes to
> PDFStreamEngine.
> If I revert that file, extraction succeeds.  Unfortunately ... with that
> reverted but your other changes in place, image rendering hangs.
> Will work on it more ... probably tomorrow.
> [Comment from SourceForge]
> Date: 2008-05-29 21:12
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> Correction ... it doesn't hang ... it's just slow on the first PDF to
> render ... maybe just due to the first one I'm sending it.
> Will look more tomorrow.
> [Comment from SourceForge]
> Date: 2008-05-30 07:11
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> I've found one bug. While deleting the if rules for the rotation, I've
> deleted line 394 which is still needed.
> I've attached the corrected file
> File Added: PDFStreamEngine.java
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279559&aid=1977429
> [Comment from SourceForge]
> Date: 2008-05-30 07:43
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> I forgot to mention that I can't run the test suite. When I try to get the
> whole project, I realized that I'm behind a firewall here in my office.
> Consequently my cvs-client doesn't work. I've to do it from home. :-(
> I've only tested one file: 601501018.pdf
> There are additional blanks and they disapper after adding the missing
> line. But starting at page 21, when the document orientation changes from
> portrait to landscape, there are additional cr or lf. Hmmmm ??
> [Comment from SourceForge]
> Date: 2008-05-30 08:25
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> I've continued testing and I guess the problem is somewhere starting in
> org.pdfbox.util.PDFTextStripper.showCharacter(..). Obviously it handles the
> coordinates for rotated pages somehow in an other way than the
> implementation of the showCharacter() in org.pdfbox.pdfviewer.PageDrawer.
> But for the moment I don't understand what's happening in the
> TextStripper, perhaps I'll find out later. 
> I hope this hint helps ...
> [Comment from SourceForge]
> Date: 2008-05-30 16:20
> Sender: danielwilson
> Logged In: YES 
> user_id=1737686
> Originator: NO
> I've put a couple more hours into this, and I don't know the answer.
> I do know the text extraction is the more mature side of this library.
> For the moment, I'll be skipping over your changes to PDFStreamEngine.
> Thanks for the other changes!
> [Comment from SourceForge]
> Date: 2008-06-02 09:21
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> Hi Daniel,
> I guess I've solved the problem. The textposition-handling has to be
> adjusted within the method PDFTextStripper.flushText(). Of course my former
> changes to the class PDFStreamEngine are needed. During debugging I found a
> bug in the class TextPositionComparator (line 82). I solved it by removing
> the rotation if-clauses. Whenever you compare two Textpositions, it is
> needless to look at the rotation because they are on the same page so that
> the comparison is independent of the rotation.
> Furthermore my PDFTextStripper-patch seems to correct some minor problems,
> which are described in
> https://sourceforge.net/forum/message.php?msg_id=4976730.
> I've tested the following cases:
> Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf works 100%
> test_rotate_270.txt doesn't work 100%, but my patch corrected a bug in
> lines 251-257, 278/279, 502/503, 574/575 and the other differences are some
> kind of special-character-issues. I guess you have to correct the input at
> first.
> I've attached my changes based on the newest versions of both classes.
> [Comment from SourceForge]
> Date: 2008-06-02 09:22
> Sender: lehmialk
> Logged In: YES 
> user_id=2069622
> Originator: YES
> File Added: pdfbox_rotation_patch_3.zip
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279842&aid=1977429

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (PDFBOX-363) Fixed Page rotation

Posted by Brian Carrier <ca...@digital-evidence.org>.
On Nov 3, 2008, at 8:30 AM, Daniel Wilson wrote:

> I would say that the approach that fixes the most items w/o  
> breaking those
> that work is the better approach.
>
> What I have done is run the regression tests before committing.   
> If, though
> I've fixed something, I've broken a regression test, I haven't  
> committed
> that change.


The regression tests are failing for me (without any changes) because  
of the change that I previously submitted that replaced "null" with  
"?" for characters that could not be mapped. The files that are  
generating the failures are 10101-AR.pdf and  
Garcia2003b__Correlative...pdf.  Can someone update these?

thanks,
brian


Re: [jira] Commented: (PDFBOX-363) Fixed Page rotation

Posted by Daniel Wilson <wi...@gmail.com>.
I would say that the approach that fixes the most items w/o breaking those
that work is the better approach.

What I have done is run the regression tests before committing.  If, though
I've fixed something, I've broken a regression test, I haven't committed
that change.

We do have a regression test for the image rendering now (have since May,
IIRC) and 1 or 2 of those tests involve rotation.

Daniel Wilson

On 10/31/08, Brian Carrier (JIRA) <ji...@apache.org> wrote:
>
>
>     [
> https://issues.apache.org/jira/browse/PDFBOX-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644414#action_12644414]
>
>
> Brian Carrier commented on PDFBOX-363:
> --------------------------------------
>
>
> What additional information is needed for a commiter to make a decision on
> which approach will be used?  There are some additional text rotation
> problems that are not addressed by either this patch or the one associated
> with PDFBOX-374 and the fix requires taking the text matrix into account.
> The current fix that I am working on will be impacted by which approach is
> used.
>
> FWIW, the approach of not  rotating the text in PDFTextStripper and taking
> the rotation into account in TextPositionComparator will be easiest for the
> next fix as well (which will allow page rotation as well as having text on a
> page going in a different direction - such as on the y-axis of a graph).
>
>
>
> > Fixed Page rotation
> > -------------------
> >
> >                 Key: PDFBOX-363
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-363
> >             Project: PDFBox
> >          Issue Type: Improvement
> >            Reporter: Jukka Zitting
> >
> > [Issue from SourceForge]
> >
> http://sourceforge.net/tracker/index.php?func=detail&aid=1977429&group_id=78314&atid=552834
> > Hi all,
> > Daniel asked me for my patch for the rotation-issue described in
> > https://sourceforge.net/forum/message.php?msg_id=4992032
> > Attention, I didn't apply the newest patches to the classes
> PDFStreamEngine
> > and PageDrawer.
> > There are 4 more probably affected classes calling the page.findRotation
> > method which I didn't change, because I'm didn't have to use them (until
> > now).
> > org.pdfbox.util.operator.pagedrawer.Invoke
> > org.pdfbox.util.TextPositionComparator
> > org.pdfbox.examples.pdmodel.PrintURLs
> > org.pdfbox.examples.util.PrintImageLocations
> > I've attached a pdf in DINA4-landscape. The text is missplaced whenever I
> > try to print or display (using the pdfbox-PDFReader and convertToImage
> > within my application) it with pdfbox. The acrobat reader has no problems
> > with my documents.
> > After my patch everything works fine. Perhaps it is a point of
> discussion,
> > if the convertToImage method has to rotate the image or if the user has
> to
> > do it. The PDFPagePanel didn't do it (yet).
> > Andreas
> >
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279404&aid=1977429
> > [Comment from SourceForge]
> > Date: 2008-05-29 12:42
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > I've just tried your sample PDF w/ the latest code -- prior to
> application
> > of your patch.  It doesn't work.
> > I'll work on incorporating your change for a full regression test in the
> > next hour or so.
> > [Comment from SourceForge]
> > Date: 2008-05-29 15:16
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > Hi Daniel,
> > I've just added my patch to the newest sources you send me earlier this
> > day. I guess it works. During testing I've found another problem
> concernign
> > graphics within landscape-docs. I found the solution in patching the
> class
> > org.pdfbox.util.operator.pagedrawer.Invoke in the same way I've patched
> the
> > others. And consequently to be strict I've also patched the new methods
> in
> > org.pdfbox.pdfviewer.PageDrawer
> > For my everthings works fine inlc. the 4PP-pdf.
> > I've attached the patched files and another testpdf with a embedded
> > graphic.
> > Andreas
> > File Added: pdfbox_rotation_patch_2.zip
> >
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279471&aid=1977429
> > [Comment from SourceForge]
> > Date: 2008-05-29 18:12
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > Your code works w/ the 4PP test ... and with the other rendering stuff
> > I've tried so far.
> > However ... the text extraction test fails with it.  I can't figure that
> > one out ... ideas?
> > [Comment from SourceForge]
> > Date: 2008-05-29 18:19
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > Can you give me some more details? I never do any textextractions with
> > pdfbox. Perhaps you'll provide with the code for test program, or is it
> > part of pdfbox, so that I can find it in the cvs?
> > However, it has to wait until tomorrow
> > [Comment from SourceForge]
> > Date: 2008-05-29 18:39
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > If you've got the whole project set up, try
> > ant testextract
> > I'll see if I can narrow it down some.
> > [Comment from SourceForge]
> > Date: 2008-05-29 21:00
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > The extraction problem seems to have to do w/ the changes to
> > PDFStreamEngine.
> > If I revert that file, extraction succeeds.  Unfortunately ... with that
> > reverted but your other changes in place, image rendering hangs.
> > Will work on it more ... probably tomorrow.
> > [Comment from SourceForge]
> > Date: 2008-05-29 21:12
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > Correction ... it doesn't hang ... it's just slow on the first PDF to
> > render ... maybe just due to the first one I'm sending it.
> > Will look more tomorrow.
> > [Comment from SourceForge]
> > Date: 2008-05-30 07:11
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > I've found one bug. While deleting the if rules for the rotation, I've
> > deleted line 394 which is still needed.
> > I've attached the corrected file
> > File Added: PDFStreamEngine.java
> >
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279559&aid=1977429
> > [Comment from SourceForge]
> > Date: 2008-05-30 07:43
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > I forgot to mention that I can't run the test suite. When I try to get
> the
> > whole project, I realized that I'm behind a firewall here in my office.
> > Consequently my cvs-client doesn't work. I've to do it from home. :-(
> > I've only tested one file: 601501018.pdf
> > There are additional blanks and they disapper after adding the missing
> > line. But starting at page 21, when the document orientation changes from
> > portrait to landscape, there are additional cr or lf. Hmmmm ??
> > [Comment from SourceForge]
> > Date: 2008-05-30 08:25
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > I've continued testing and I guess the problem is somewhere starting in
> > org.pdfbox.util.PDFTextStripper.showCharacter(..). Obviously it handles
> the
> > coordinates for rotated pages somehow in an other way than the
> > implementation of the showCharacter() in org.pdfbox.pdfviewer.PageDrawer.
> > But for the moment I don't understand what's happening in the
> > TextStripper, perhaps I'll find out later.
> > I hope this hint helps ...
> > [Comment from SourceForge]
> > Date: 2008-05-30 16:20
> > Sender: danielwilson
> > Logged In: YES
> > user_id=1737686
> > Originator: NO
> > I've put a couple more hours into this, and I don't know the answer.
> > I do know the text extraction is the more mature side of this library.
> > For the moment, I'll be skipping over your changes to PDFStreamEngine.
> > Thanks for the other changes!
> > [Comment from SourceForge]
> > Date: 2008-06-02 09:21
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > Hi Daniel,
> > I guess I've solved the problem. The textposition-handling has to be
> > adjusted within the method PDFTextStripper.flushText(). Of course my
> former
> > changes to the class PDFStreamEngine are needed. During debugging I found
> a
> > bug in the class TextPositionComparator (line 82). I solved it by
> removing
> > the rotation if-clauses. Whenever you compare two Textpositions, it is
> > needless to look at the rotation because they are on the same page so
> that
> > the comparison is independent of the rotation.
> > Furthermore my PDFTextStripper-patch seems to correct some minor
> problems,
> > which are described in
> > https://sourceforge.net/forum/message.php?msg_id=4976730.
> > I've tested the following cases:
> > Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf works 100%
> > test_rotate_270.txt doesn't work 100%, but my patch corrected a bug in
> > lines 251-257, 278/279, 502/503, 574/575 and the other differences are
> some
> > kind of special-character-issues. I guess you have to correct the input
> at
> > first.
> > I've attached my changes based on the newest versions of both classes.
> > [Comment from SourceForge]
> > Date: 2008-06-02 09:22
> > Sender: lehmialk
> > Logged In: YES
> > user_id=2069622
> > Originator: YES
> > File Added: pdfbox_rotation_patch_3.zip
> >
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552834&file_id=279842&aid=1977429
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>