Posted to dev@pdfbox.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2014/12/21 16:37:13 UTC
[jira] [Commented] (PDFBOX-1874) PDFTextStripper.isParagraphSeparation(...)
[ https://issues.apache.org/jira/browse/PDFBOX-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255190#comment-14255190 ]
ASF subversion and git services commented on PDFBOX-1874:
---------------------------------------------------------
Commit 1647158 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1647158 ]
PDFBOX-1874: adjust precision to avoid false results when comparing floats as proposed by Yuri Burrows
> PDFTextStripper.isParagraphSeparation(...)
> ------------------------------------------
>
> Key: PDFBOX-1874
> URL: https://issues.apache.org/jira/browse/PDFBOX-1874
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Environment: Eclipse
> Reporter: Yuri Burrows
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Labels: patch
>
> PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it finds Y text indentation.
> PROBLEM:
> I believe the issue is due to precision in the following logic:
> {code}
> float yGap = Math.abs(position.getTextPosition().getYDirAdj()
>         - lastPosition.getTextPosition().getYDirAdj());
> float xGap = (position.getTextPosition().getXDirAdj()
>         - lastLineStartPosition.getTextPosition().getXDirAdj());
> if (yGap > (getDropThreshold() * maxHeightForLine))
> {
>     result = true;
> }
> {code}
> yGap has a precision to the thousandths place or finer, while (getDropThreshold()*maxHeightForLine) has a precision to the hundred-thousandths place. This results in the following comparison (example):
> 16.018 > 16.018005
> which evaluates to "True". However 16.018 > 16.018 would evaluate to "False".
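
To make the precision point concrete, here is a minimal, self-contained sketch (not PDFBox code; the class and method names are hypothetical). It shows that two float values which both print as 16.018 at three decimal places are still distinct to a strict ">" comparison. Which direction comes out true depends on which operand carries the extra low-order digits.

```java
import java.util.Locale;

// Hypothetical demo class, not part of PDFBox.
final class FloatGapDemo {

    // Same shape as the check in the report: strict greater-than on raw floats.
    static boolean exceeds(float gap, float threshold) {
        return gap > threshold;
    }

    // Formats a float to three decimal places, independent of the default locale.
    static String threeDecimals(float value) {
        return String.format(Locale.ROOT, "%.3f", value);
    }

    public static void main(String[] args) {
        float a = 16.018f;     // example value from the report
        float b = 16.018005f;  // example value from the report
        System.out.println(threeDecimals(a) + " vs " + threeDecimals(b));
        System.out.println("a > b: " + exceeds(a, b));
        System.out.println("b > a: " + exceeds(b, a));
    }
}
```

Because the two operands look identical at the precision a reader would normally print, a strict comparison on the raw floats can flip on digits that are not visible in the output.
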
> EFFECT OF THE PROBLEM:
> Every line in the output is marked as "isParagraphStart = true" and "writeParagraphEnd() ... = true".
> I.E.
> |||NEW_LINE|||
> |||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data|||NEW_LINE|||
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> |||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the|||NEW_LINE|||
> COS Model). While it's possible to create any desired interactions with a PDF document using only these|||NEW_LINE|||
> |||PARAGRAPH_END||||||NEW_LINE|||
> In the source PDF these lines appear as such:
> "PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data
> contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers,
> strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the
> COS Model). While it's possible to create any desired interactions with a PDF document using only these"
> MY WORKAROUND:
> NOTE: there is a small performance hit with this workaround.
> {code}
> float yGap = Math.abs(position.getTextPosition().getYDirAdj()
>         - lastPosition.getTextPosition().getYDirAdj());
>
> DecimalFormat df = new DecimalFormat("#.00");
> float yGapTruncated = Float.valueOf(df.format(yGap));
>
> float newYVal = Float.valueOf(df.format(getDropThreshold()
>         * maxHeightForLine));
> {code}
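
As an alternative to the string round-trip through DecimalFormat, both sides can be rounded arithmetically before comparing. The sketch below is an assumption for illustration and is not necessarily what commit 1647158 does; the class, method, and constant names are hypothetical. It rounds to three decimal places, which is enough to make the example values from the report compare as equal.

```java
// Hypothetical helper, not part of PDFBox.
final class RoundedGapCheck {

    // 1000 keeps three decimal places; an assumed precision for this sketch.
    private static final float SCALE = 1000f;

    // Rounds a float to three decimal places without going through a String.
    static float roundToThousandths(float value) {
        return Math.round(value * SCALE) / SCALE;
    }

    // Strict greater-than on the rounded values.
    static boolean exceeds(float gap, float threshold) {
        return roundToThousandths(gap) > roundToThousandths(threshold);
    }

    public static void main(String[] args) {
        System.out.println(exceeds(16.018f, 16.018005f));
        System.out.println(exceeds(16.018005f, 16.018f));
        System.out.println(exceeds(16.5f, 16.018f));
    }
}
```

One reason to prefer arithmetic rounding: new DecimalFormat("#.00") uses the default locale's symbols, so in a locale that uses a comma as the decimal separator the formatted string cannot be parsed back by Float.valueOf.
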
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)