Posted to dev@pdfbox.apache.org by "Chris Chadwick (JIRA)" <ji...@apache.org> on 2010/06/16 15:45:23 UTC

[jira] Created: (PDFBOX-751) Text Extraction truncates last character when image page has sideways text

Text Extraction truncates last character when image page has sideways text
--------------------------------------------------------------------------

                 Key: PDFBOX-751
                 URL: https://issues.apache.org/jira/browse/PDFBOX-751
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0
         Environment: HP UX 11iV1
            Reporter: Chris Chadwick


When using unsorted text extraction on a PDF that contains a horizontal page (normally oriented text) followed by a page where all the text is rotated 90 degrees (landscape), the last character of each word is forced onto a new line. For example:

Thi
s
erro
r
wa
s
logge
d
toda
y

It is only the last letter of each phrase that is affected, and it is only affected on the rotated page.
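
To help confirm the pattern on other documents, here is a minimal sketch in plain Java that inspects already-extracted text. It does not call PDFBox and is not a fix for the extraction itself. The class and method names (TrailingCharCheck, countOrphanedLetters, rejoin) are hypothetical and made up for illustration. The heuristic assumes that a line holding a single letter directly after a non-empty line is a split-off last character, so it would also wrongly join genuine one-letter words such as "a" or "I".

```java
import java.util.ArrayList;
import java.util.List;

public class TrailingCharCheck {

    // True if the (trimmed) line consists of exactly one letter.
    private static boolean isSingleLetter(String line) {
        return line.length() == 1 && Character.isLetter(line.charAt(0));
    }

    // Counts single-letter lines that directly follow a non-empty line.
    public static int countOrphanedLetters(String extracted) {
        String[] lines = extracted.split("\\r?\\n");
        int count = 0;
        for (int i = 1; i < lines.length; i++) {
            if (isSingleLetter(lines[i].trim()) && !lines[i - 1].trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Appends each orphaned letter to the line before it (heuristic workaround).
    public static String rejoin(String extracted) {
        List<String> out = new ArrayList<>();
        boolean lastWasJoined = false; // avoid chaining several letters onto one line
        for (String raw : extracted.split("\\r?\\n")) {
            String line = raw.trim();
            int last = out.size() - 1;
            if (isSingleLetter(line) && last >= 0
                    && !out.get(last).isEmpty() && !lastWasJoined) {
                out.set(last, out.get(last) + line);
                lastWasJoined = true;
            } else {
                out.add(line);
                lastWasJoined = false;
            }
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // Sample taken from the output shown in this report.
        String sample = "Thi\ns\nerro\nr\nwa\ns\nlogge\nd\ntoda\ny";
        System.out.println("Orphaned letters: " + countOrphanedLetters(sample));
        System.out.println(rejoin(sample));
    }
}
```

Run against the sample above, the check reports five orphaned letters and the rejoined text reads This, error, was, logged, today on separate lines. A non-zero count only on the rotated page would match the behaviour described here.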

Selecting the text from the image directly (in Adobe, do 'Select All', then cut) produces the correct results, as do other tools, so the text layer in the PDF file appears to be correct.

Also, could you please let us know when V1.2 will be ready, as it may resolve this issue? Is it available as a beta?
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-751) Text Extraction truncates last character when image page has sideways text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879673#action_12879673 ] 

Andreas Lehmkühler commented on PDFBOX-751:
-------------------------------------------

Can you provide us with a sample pdf?


[jira] Commented: (PDFBOX-751) Text Extraction truncates last character when image page has sideways text

Posted by "Chris Chadwick (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879791#action_12879791 ] 

Chris Chadwick commented on PDFBOX-751:
---------------------------------------

Hi, I have asked our customer whether we can include the image or not. In the meantime, can you comment on whether this issue has been seen before?
