Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/08/04 20:20:44 UTC

[jira] Created: (PDFBOX-358) Vertical text extraction splitting text

Vertical text extraction splitting text
---------------------------------------

                 Key: PDFBOX-358
                 URL: https://issues.apache.org/jira/browse/PDFBOX-358
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
            Reporter: Jukka Zitting


[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832

Vertical text gets split during extraction using PDFTextStripper.

"Specification" gives:
Spécif
ic
ations

This is made worse when sorted by position, as it gets mixed up with the
horizontal text:
ic
ations
[CLASSIFIED INFO]
[CLASSIFIED INFO]
Spécif [CLASSIFIED INFO]
[CLASSIFIED INFO]

I'm afraid I can't provide the PDF in question due to confidentiality
requirements. It was produced by converting a Microsoft Word document
to PDF. According to the forums I'm not the only one with this
problem.
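
To illustrate why sorting by position can mix vertical and horizontal text,
here is a minimal Java sketch. It is not PDFBox code: the Chunk class, the
coordinates and the "body line" strings are made up for illustration, and the
ordering (top to bottom, then left to right) is only an assumed stand-in for a
position-based sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NaivePositionSort {
    /** Hypothetical text chunk; position in display space, y grows downward. */
    static final class Chunk {
        final double x, y;
        final String text;
        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Orders chunks top to bottom, then left to right, ignoring text direction. */
    static List<String> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        List<String> out = new ArrayList<>();
        for (Chunk c : sorted) {
            out.add(c.text);
        }
        return out;
    }

    /** Made-up sample: one vertical word next to two horizontal lines. */
    static List<Chunk> sample() {
        List<Chunk> chunks = new ArrayList<>();
        // A word rotated 90 degrees counter-clockwise reads bottom to top,
        // so its first piece has the largest y.
        chunks.add(new Chunk(10, 300, "Specif"));
        chunks.add(new Chunk(10, 200, "ic"));
        chunks.add(new Chunk(10, 100, "ations"));
        // Horizontal body text to the right of the vertical word.
        chunks.add(new Chunk(50, 150, "body line 1"));
        chunks.add(new Chunk(50, 250, "body line 2"));
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(sortByPosition(sample()));
    }
}
```

With this sample data the pieces of the vertical word come out in reverse
reading order, with the horizontal lines in between.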

[Comment on SourceForge]
Date: 2008-06-02 09:11
Sender: totoll
Logged In: YES 
user_id=2096423
Originator: YES

To clarify, the text in question is rotated by 90° counter-clockwise.

[Comment on SourceForge]
Date: 2008-06-02 10:30
Sender: totoll
Logged In: YES 
user_id=2096423
Originator: YES

I have attached an admittedly very complicated PDF document which (as far
as I can tell) features 90° and 135° rotated text on a 90° rotated page.

Position-ordered text extraction gives horrible results.

Normal text extraction is also very messy, although in this second case
the results are almost understandable.

This is not the document I need to process, but I think that if text can be
correctly extracted from that PDF, it should work for almost every other
existing PDF.
File Added: Flyer2.pdf
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851
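
One possible way to handle text at several angles is to group chunks by their
writing direction, rotate each group's positions back to horizontal, and only
then sort by position. The Java sketch below shows that idea on made-up data.
It is an assumption about how such a step could look, not a description of
what PDFTextStripper does internally; Chunk, Placed, extract and sample are
hypothetical names, and coordinates are taken to be in PDF user space with y
growing upward.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DirectionAwareSort {
    /** Hypothetical text chunk in PDF user space (y grows upward). */
    static final class Chunk {
        final double x, y;
        final int angle; // baseline direction, degrees counter-clockwise
        final String text;
        Chunk(double x, double y, int angle, String text) {
            this.x = x;
            this.y = y;
            this.angle = angle;
            this.text = text;
        }
    }

    /** A chunk's position in a frame where its baseline runs along +x. */
    static final class Placed {
        final long line, column;
        final String text;
        Placed(Chunk c) {
            Point2D p = AffineTransform.getRotateInstance(Math.toRadians(-c.angle))
                    .transform(new Point2D.Double(c.x, c.y), null);
            // Round to whole units so floating-point noise cannot split a line.
            this.line = Math.round(p.getY());
            this.column = Math.round(p.getX());
            this.text = c.text;
        }
    }

    /** Returns the text of each direction, keyed by angle, in reading order. */
    static Map<Integer, String> extract(List<Chunk> chunks) {
        Map<Integer, List<Placed>> byAngle = new TreeMap<>();
        for (Chunk c : chunks) {
            byAngle.computeIfAbsent(c.angle, k -> new ArrayList<>()).add(new Placed(c));
        }
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Placed>> e : byAngle.entrySet()) {
            List<Placed> group = e.getValue();
            // Top line first (largest y), then left to right within a line.
            group.sort(Comparator.comparingLong((Placed p) -> -p.line)
                    .thenComparingLong(p -> p.column));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < group.size(); i++) {
                if (i > 0 && group.get(i).line != group.get(i - 1).line) {
                    sb.append('\n');
                }
                sb.append(group.get(i).text);
            }
            result.put(e.getKey(), sb.toString());
        }
        return result;
    }

    /** Made-up sample: a 90 degree word (reads upward) and two horizontal lines. */
    static List<Chunk> sample() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(10, 300, 90, "ations"));
        chunks.add(new Chunk(50, 150, 0, "body line 2"));
        chunks.add(new Chunk(10, 100, 90, "Specif"));
        chunks.add(new Chunk(50, 250, 0, "body line 1"));
        chunks.add(new Chunk(10, 200, 90, "ic"));
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(extract(sample()));
    }
}
```

In this sample the three pieces of the vertical word end up on one line and
are joined in reading order, separately from the horizontal lines.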

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707780#action_12707780 ] 

Andreas Lehmkühler edited comment on PDFBOX-358 at 5/10/09 4:55 AM:
--------------------------------------------------------------------

Hi Daniel,

I see the same effect when converting Flyer2.pdf and mtxFidelity.pdf from PDFBOX-51. The problem was the AffineTransform that was used to rotate the page. I replaced that code in revision 773325, and converting now works for both documents.


[jira] Commented: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662094#action_12662094 ] 

Andreas Lehmkühler commented on PDFBOX-358:
-------------------------------------------

I've tested the text stripping part too, and I think it's OK. Because of the complicated layout of Flyer2.pdf, it was a little difficult to check the test output. Finally, I also tested mtxFidelity.pdf from PDFBOX-51, and it works.


[jira] Commented: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661242#action_12661242 ] 

Andreas Lehmkühler commented on PDFBOX-358:
-------------------------------------------

Revision 732038 contains a patch that fixes some display issues when the rotation angle is not a multiple of 90 degrees.
I'll try the text stripping part later.


[jira] Commented: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707780#action_12707780 ] 

Andreas Lehmkühler commented on PDFBOX-358:
-------------------------------------------

Hi Daniel,

I see the same effect when converting Flyer2.pdf and mtxFidelity.pdf from PDFBOX-51. The problem was the AffineTransform that was used to rotate the page. I replaced that code in revision 773325, and converting now works for both documents.


[jira] Commented: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Daniel Wilson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707730#action_12707730 ] 

Daniel Wilson commented on PDFBOX-358:
--------------------------------------

Andreas,
In my testing, Flyer2.pdf fails with a RasterFormatException when the rotation is attempted.

Are you seeing the same thing?

I can trap the error and continue (PDPage line 677), but the document remains sideways in that case. Again, are you seeing the same?

Thanks!
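
For readers following the AffineTransform discussion: as far as I know, one
common way to hit a RasterFormatException when rotating an image with
java.awt.image.AffineTransformOp is to rotate about the origin without
shifting the result back to non-negative coordinates. Whether that is what
happened at the PDPage line mentioned above is an assumption; the thread does
not say. The sketch below shows a quarter turn with the translation included.
QuadrantRotationSketch and its methods are hypothetical names, not PDFBox
code.

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

class QuadrantRotationSketch {
    /** Quarter-turn transform that keeps the rotated image at non-negative coordinates. */
    static AffineTransform rotate90Transform(int srcHeight) {
        AffineTransform at = new AffineTransform();
        // Concatenated transforms apply last-added first: rotate, then shift right.
        at.translate(srcHeight, 0);
        at.quadrantRotate(1); // exact 90 degree step: (x, y) -> (-y, x)
        return at;
    }

    /** Rotates an image by a quarter turn into a destination of swapped size. */
    static BufferedImage rotate90(BufferedImage src) {
        // Width and height swap for a quarter turn.
        BufferedImage dst = new BufferedImage(
                src.getHeight(), src.getWidth(), BufferedImage.TYPE_INT_ARGB);
        AffineTransformOp op = new AffineTransformOp(
                rotate90Transform(src.getHeight()),
                AffineTransformOp.TYPE_NEAREST_NEIGHBOR);
        return op.filter(src, dst);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(300, 200, BufferedImage.TYPE_INT_ARGB);
        BufferedImage turned = rotate90(page);
        System.out.println(turned.getWidth() + " x " + turned.getHeight());
    }
}
```

The destination image is created explicitly here, so its size does not depend
on the bounds that AffineTransformOp would compute on its own.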


[jira] Resolved: (PDFBOX-358) Vertical text extraction splitting text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-358.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator
