
Posted to dev@pdfbox.apache.org by "Mario Sangiorgio (JIRA)" <ji...@apache.org> on 2010/03/12 01:14:27 UTC

[jira] Created: (PDFBOX-659) Newlines added in the middle of words

Newlines added in the middle of words
-------------------------------------

                 Key: PDFBOX-659
                 URL: https://issues.apache.org/jira/browse/PDFBOX-659
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 0.8.0-incubator
            Reporter: Mario Sangiorgio


I am experiencing issues getting the text from a PDF document. The document I want to extract the text from is a scientific paper.
The tool works fine, but in the title there are some problems: in the middle of some words I get a newline.
An example of what I am getting is the following text:

An
Asp
e
ct-Orien
ted
F
ramew
o
rk
for
S
ervice
A
d
aptation

rather than the expected "An Aspect-Oriented Framework for Service Adaptation".

Please let me know if I can help find the bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mario Sangiorgio (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844298#action_12844298 ] 

Mario Sangiorgio commented on PDFBOX-659:
-----------------------------------------

I also tried the 1.0 version of PDFBox, but unfortunately the results I am getting are even worse.
The title is still garbled, and most of the words in the body that were previously converted correctly now have defects as well.


[jira] Updated: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mel Martinez updated PDFBOX-659:
--------------------------------

    Attachment: patch_pdfbox_659.txt

The attached patch fixes the problem of incorrectly inserted newlines.

The problem was (as described in my earlier comment) due to TextPosition coords using negative space and the code incorrectly using a reset comparison value of '-1.0'.

This patch does not fix some additional problems that surface with the example .pdf file that include the following:

Missing space characters (words are arbitrarily concatenated together) and missing characters.

The missing space characters can be recovered by setting the value of:

PDFTextStripper.setSpacingTolerance(float tolerance)

to a value smaller than the default (0.5).  I had to drop it quite a bit with this document and still did not recover all the spaces.
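
To illustrate why a smaller tolerance recovers more spaces, here is a minimal self-contained sketch. It assumes a simplified model in which a space is inserted whenever the horizontal gap between two glyphs is at least the width of a space multiplied by the tolerance. The `Glyph` class, the `join` method, and the sample coordinates are hypothetical and are not PDFBox code; the real stripper applies additional heuristics.

```java
import java.util.Arrays;
import java.util.List;

final class SpacingToleranceSketch {
    /** Hypothetical stand-in for a positioned run of text: its text, left edge, and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs on one line, inserting a space whenever the horizontal gap
     * to the previous glyph is at least spaceWidth * tolerance (assumed rule).
     */
    static String join(List<Glyph> glyphs, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        float threshold = spaceWidth * tolerance;
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (gap >= threshold) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two words separated by a narrow gap of 1.5 units (made-up numbers).
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("for", 0.0f, 15.0f),
                new Glyph("Service", 16.5f, 35.0f));
        System.out.println(join(glyphs, 4.0f, 0.5f)); // threshold 2.0: gap is too small
        System.out.println(join(glyphs, 4.0f, 0.3f)); // threshold about 1.2: gap counts as a space
    }
}
```

In this model the gap of 1.5 units falls under the threshold at a tolerance of 0.5, so the words run together, while a tolerance of 0.3 lets the same gap count as a space.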

The missing characters are caused by the default mode of suppressing what the code believes to be duplicate, overlapping characters.  This can occur with MS Word-generated PDFs.  You can stop that behavior by setting the attribute:

PDFTextStripper.setSuppressDuplicateOverlappingText(boolean suppress);

to false.
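
As a rough illustration of what duplicate suppression does, the sketch here drops a glyph when the same text has already been seen at nearly the same position. Everything in it is an assumption made for the example: the `Glyph` class, the tolerance of one third of the glyph width, and the sample coordinates. It is not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DuplicateGlyphFilterSketch {
    /** Hypothetical stand-in for a positioned glyph. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Concatenates glyph text. When suppressDuplicates is true, a glyph is
     * skipped if an already kept glyph has the same text and lies within
     * width / 3 of it in both X and Y (assumed tolerance).
     */
    static String extract(List<Glyph> glyphs, boolean suppressDuplicates) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            if (suppressDuplicates) {
                float tolerance = g.width / 3.0f;
                for (Glyph k : kept) {
                    if (k.text.equals(g.text)
                            && Math.abs(k.x - g.x) < tolerance
                            && Math.abs(k.y - g.y) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "A" drawn twice with a tiny offset (a common way to fake bold), then "B".
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("A", 10.0f, 0.0f, 6.0f),
                new Glyph("A", 10.3f, 0.0f, 6.0f),
                new Glyph("B", 16.0f, 0.0f, 6.0f));
        System.out.println(extract(glyphs, true));  // second "A" is treated as a duplicate
        System.out.println(extract(glyphs, false)); // every glyph is kept
    }
}
```

The same rule also shows how real characters can go missing: if two genuinely distinct occurrences of a character are reported closer together than the tolerance, the second one is discarded.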

That said, the logic used when this is set to 'true' looks flawed.  I will open a separate bug for that.


[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844725#action_12844725 ] 

Mel Martinez commented on PDFBOX-659:
-------------------------------------

Okay - Villu's comment plus some things I'm seeing suggest it's a combination of effects.

First off, the document renders perfectly in Acrobat Reader, which suggests that whatever is going on in the document is probably 'legal', or at least something we should be able to handle.

Villu's comment indicates that coordinates are shifted.

When I step through the 'rendering' of the individual TextPosition objects I note that for the messed up text, the Y coordinates are shifted into negative space.

This shouldn't be a problem - we should be rendering them against an offset origin - all that matters is their relative positions.  That's why Acrobat Reader renders them correctly.

However, when we do our text extraction, our 'text rendering' process includes a step to determine if a TextPosition object is still on the same line as the prior TextPosition object.  To do this, it compares the current Y position and Y height to the prior Y position and height.  This is fine, except that the first time through it needs some sort of default to compare against.  The code uses -1.0 as the default 'last' Y position.  From that point, as it iterates through, if the current position is above the last Y position, it resets the last Y position variable to the current position.

Do you see the problem?  If all the text is being rendered in negative Y space, then none of the Y values is ever 'above' the -1.0 value used as the default to start the iteration.  So it never properly resets the 'last Y position'.  This causes it to incorrectly think it is on a new line when it really isn't.  Hence it inserts the newline characters.
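
The effect can be reproduced with a small self-contained sketch. It is a simplified model of the loop described above, not the PDFBox source: the `Glyph` class, the `overlaps` rule, the fixed glyph height of 10, and the sample coordinates are all assumptions made for illustration. The only thing that varies between the two runs is the starting value of the 'last Y position'.

```java
import java.util.ArrayList;
import java.util.List;

final class LineBreakSentinelSketch {
    /** Hypothetical stand-in for a positioned glyph: its text, Y position, and height. */
    static final class Glyph {
        final String text;
        final float y;
        final float height;

        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Places each character of text at the same Y position with a height of 10. */
    static List<Glyph> onLine(String text, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (char c : text.toCharArray()) {
            glyphs.add(new Glyph(String.valueOf(c), y, 10.0f));
        }
        return glyphs;
    }

    /** Simplified vertical-overlap test (an assumption, not PDFBox's exact rule). */
    static boolean overlaps(float y1, float h1, float y2, float h2) {
        return Math.abs(y1 - y2) < 0.1f
                || (y2 <= y1 && y2 >= y1 - h1)
                || (y1 <= y2 && y1 >= y2 - h2);
    }

    /** Emits glyph text, starting a new line when a glyph does not overlap the tracked line. */
    static String extract(List<Glyph> glyphs, float initialLastY) {
        StringBuilder out = new StringBuilder();
        float lastY = initialLastY;
        float lastHeight = 0.0f;
        boolean first = true;
        for (Glyph g : glyphs) {
            if (!first && !overlaps(g.y, g.height, lastY, lastHeight)) {
                out.append('\n');
                lastY = initialLastY; // reset to the default for the new line
                lastHeight = 0.0f;
            }
            // The tracked position only updates when the glyph is at or above it.
            if (g.y >= lastY) {
                lastY = g.y;
                lastHeight = g.height;
            }
            out.append(g.text);
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> title = onLine("An", -700.0f); // all glyphs in negative Y space
        System.out.println(extract(title, -1.0f));            // default never replaced: spurious newline
        System.out.println(extract(title, -Float.MAX_VALUE)); // default always replaced: one line
    }
}
```

With -1.0 as the default, the glyphs at Y = -700 never update the tracked position, so each one fails the overlap test and is pushed onto its own line. Starting from `-Float.MAX_VALUE` lets the first glyph set the tracked position, and the rest of the line then overlaps it as expected.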

I'll have to think this through a bit to make sure the solution is a bit more robust.  But I should be able to post a patch early next week.

This also affects my PDFTextStripper2 class ( PDFBOX-521 ) so I will patch that at the same time.


[jira] Resolved: (PDFBOX-659) Newlines added in the middle of words

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-659.
---------------------------------------

    Fix Version/s: 1.2.0
       Resolution: Fixed

I've applied the patch in revision 944887.

Thanks to Mel for the contribution.

> Newlines added in the middle of words
> -------------------------------------
>
>                 Key: PDFBOX-659
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-659
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 0.8.0-incubator
>            Reporter: Mario Sangiorgio
>             Fix For: 1.2.0
>
>         Attachments: fulltext.pdf, page.png, patch_pdfbox_659.txt

[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844750#action_12844750 ] 

Ted Dunning commented on PDFBOX-659:
------------------------------------


Sounds like (- Double.MAX_VALUE) is a reasonable candidate in place of -1.

And remember, I have a patent on thinking that Double.MIN_VALUE is the most negative double precision number available.  You guys aren't allowed to make that mistake.
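
For readers who have not run into this before, a short sketch of the difference: in Java, `Double.MIN_VALUE` is the smallest positive double, while the most negative finite double is `-Double.MAX_VALUE`. The class and method names here are hypothetical and exist only to make the comparison concrete.

```java
final class MostNegativeSentinelSketch {
    /** Smallest positive double; it is greater than zero. */
    static final double SMALLEST_POSITIVE = Double.MIN_VALUE;

    /** Most negative finite double. */
    static final double MOST_NEGATIVE_FINITE = -Double.MAX_VALUE;

    /** Returns true if no candidate is below the sentinel, i.e. the sentinel is a safe starting floor. */
    static boolean isSafeLowerBound(double sentinel, double... candidates) {
        for (double c : candidates) {
            if (c < sentinel) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] yValues = {-700.0, -720.0, 35.5}; // made-up Y coordinates
        System.out.println(isSafeLowerBound(-1.0, yValues));                 // false
        System.out.println(isSafeLowerBound(SMALLEST_POSITIVE, yValues));    // false
        System.out.println(isSafeLowerBound(MOST_NEGATIVE_FINITE, yValues)); // true
    }
}
```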




[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844518#action_12844518 ] 

Mel Martinez commented on PDFBOX-659:
-------------------------------------

Answering my own question:

creator: LaTeX with hyperref package
creation-date: 2006-10-23T04:28:18-0400
modification-date: 2006-11-22T22:29:18-0500
producer: Acrobat Distiller 7.0 (Windows)

Nothing unusual there.  




[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844512#action_12844512 ] 

Mel Martinez commented on PDFBOX-659:
-------------------------------------

It's worse than just the insertion of newlines.  Characters are getting dropped and/or replaced with wrong characters.

I haven't seen this before with any other PDFs.  How was this one created?


[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mario Sangiorgio (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844680#action_12844680 ] 

Mario Sangiorgio commented on PDFBOX-659:
-----------------------------------------

Does that mean it is actually an issue with the document rather than with the library?
Do you have any suggestions on how I could fix the document?


[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844510#action_12844510 ] 

Mel Martinez commented on PDFBOX-659:
-------------------------------------

Thanks for uploading the example file.

I've reproduced the behavior with it.

I should be able to spend some time today looking at this one.


[jira] Commented: (PDFBOX-659) Newlines added in the middle of words

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844388#action_12844388 ] 

Andreas Lehmkühler commented on PDFBOX-659:
-------------------------------------------

Is it possible to provide us with a sample PDF showing that behaviour?


[jira] Updated: (PDFBOX-659) Newlines added in the middle of words

Posted by "Villu Ruusmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Villu Ruusmann updated PDFBOX-659:
----------------------------------

    Attachment: page.png

I ran this document through my PDF debugging utility and noticed that the page contents are shifted up and to the right. The text that falls "off the page" becomes misplaced.
