Posted to dev@pdfbox.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2014/12/21 16:37:13 UTC
[jira] [Commented] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)

    [ https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255190#comment-14255190 ] 

ASF subversion and git services commented on PDFBOX-1874:
---------------------------------------------------------

Commit 1647158 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1647158 ]

PDFBOX-1874: adjust precision to avoid false results when comparing floats as proposed by Yuri Burrows

> PDFTextStripper.isParagraphSeparation(...)
> ------------------------------------------
>
>                 Key: PDFBOX-1874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1874
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>         Environment: Eclipse
>            Reporter: Yuri Burrows
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: patch
>
> PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it finds Y text indentation.
> PROBLEM:
> I believe the issue is due to precision in the following logic:
> {code}
>             float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
>                     lastPosition.getTextPosition().getYDirAdj());
>             float xGap = (position.getTextPosition().getXDirAdj()-
>                     lastLineStartPosition.getTextPosition().getXDirAdj());
>             if(yGap > (getDropThreshold()*maxHeightForLine))
>             {
>                         result = true;
> {code}
> yGap has a precision to the thousandths place or beyond, while (getDropThreshold()*maxHeightForLine) has a precision to the hundred-thousandths place, which results in comparisons such as the following (example):
> 16.018 > 16.018005
> which evaluates to "True". However 16.018 > 16.018 would evaluate to "False".
> EFFECT OF THE PROBLEM:
> Every line in the output is marked as "isParagraphStart = true" and "writeParagraphEnd() ... = true".
> For example:
> |||NEW_LINE|||
> |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data|||NEW_LINE|||
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the|||NEW_LINE|||
> COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> In the source PDF these lines appear as such:
> "PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,
> strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the
> COS Model). While it's possible to create any desired interactions with a PDF document using only these"
> MY WORKAROUND:
> NOTE: there is a small performance hit with this workaround.
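> {code}
> 	 // requires: import java.text.DecimalFormat;
> 	 float yGap = Math.abs(position.getTextPosition().getYDirAdj()
> 	 - lastPosition.getTextPosition().getYDirAdj());
> 	
> 	 // format to two decimal places (DecimalFormat rounds, HALF_EVEN by default)
> 	 DecimalFormat df = new DecimalFormat("#.00");
> 	 float yGapTruncated = Float.valueOf(df.format(yGap));
> 	
> 	 // apply the same two-decimal precision to the threshold before comparing
> 	 float newYVal = Float.valueOf(df.format(getDropThreshold()
> 	 * maxHeightForLine));
> {code}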
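
For illustration only, the comparison described in the report can also be brought to a common precision with plain arithmetic rather than a DecimalFormat round trip. The sketch below is not the change committed in r1647158. The class YGapComparison and the helpers roundToHundredths and isDrop are hypothetical names, and the two-decimal precision is an assumption carried over from the reporter's workaround. Avoiding string formatting also sidesteps the default-locale decimal separator, which Float.valueOf cannot parse when it is a comma.

```java
public class YGapComparison {

    /** Hypothetical helper: rounds to two decimal places without string formatting. */
    public static float roundToHundredths(float value) {
        // Math.round(float) returns an int; dividing by 100f yields a float again.
        return Math.round(value * 100f) / 100f;
    }

    /**
     * Hypothetical helper: compares the vertical gap against the threshold
     * after rounding both sides to the same precision.
     */
    public static boolean isDrop(float yGap, float threshold) {
        return roundToHundredths(yGap) > roundToHundredths(threshold);
    }

    public static void main(String[] args) {
        // The two values from the report's example; which side carried the
        // extra digits in the real run is an assumption made here.
        float yGap = 16.018005f;
        float threshold = 16.018f;

        // The raw comparison is decided by the trailing digits.
        System.out.println("raw:     " + (yGap > threshold));       // true
        // After rounding, both sides are 16.02f and the gap is not a drop.
        System.out.println("rounded: " + isDrop(yGap, threshold));  // false
    }
}
```

Rounding does not remove the problem entirely: two values that straddle a rounding boundary (for example just below and just above x.xx5) can still land on different sides, so the chosen precision has to be coarser than the noise expected in the inputs.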



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)