Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 03:16:34 UTC
[jira] [Updated] (PDFBOX-731) Inconsistencies in TextPositionComparator and sortByPosition
[ https://issues.apache.org/jira/browse/PDFBOX-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-731:
-------------------------------
Fix Version/s: 2.0.0
> Inconsistencies in TextPositionComparator and sortByPosition
> ------------------------------------------------------------
>
> Key: PDFBOX-731
> URL: https://issues.apache.org/jira/browse/PDFBOX-731
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 1.1.0, 2.0.0
> Environment: Any / all
> Reporter: Michael van Rooyen
> Fix For: 2.0.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Specifying sortByPosition on PDFTextStripper can scramble the extracted text. The problem is caused largely by inconsistencies in TextPositionComparator, which does not always satisfy the required comparator constraint of transitivity: if a < b and b < c, then a < c. As a result, a true sort is sometimes not achievable. The root cause is that the comparator is too lenient about what it regards as being on the same "line".
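
To illustrate the kind of inconsistency described above, here is a minimal, self-contained sketch. It is not the PDFBox comparator: `FuzzyLineDemo`, `Glyph`, and the tolerance of 2.0 are hypothetical stand-ins chosen only to show how a "same line if the baselines are close" rule can produce a cycle.

```java
import java.util.Comparator;

// Hypothetical demo class; Glyph is a stand-in for TextPosition, not a PDFBox type.
final class FuzzyLineDemo {
    static final class Glyph {
        final float x;
        final float y;

        Glyph(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance: baselines closer than this are treated as the same line.
    static final float TOLERANCE = 2.0f;

    // Same line: order by x. Different lines: order by y.
    static final Comparator<Glyph> FUZZY = (p, q) ->
            Math.abs(p.y - q.y) < TOLERANCE
                    ? Float.compare(p.x, q.x)
                    : Float.compare(p.y, q.y);

    // True when a < b and b < c but c < a, i.e. the ordering is not transitive.
    static boolean isCycle(Glyph a, Glyph b, Glyph c) {
        return FUZZY.compare(a, b) < 0
                && FUZZY.compare(b, c) < 0
                && FUZZY.compare(c, a) < 0;
    }

    public static void main(String[] args) {
        Glyph a = new Glyph(10f, 103.0f);
        Glyph b = new Glyph(20f, 101.5f);
        Glyph c = new Glyph(30f, 100.0f);
        System.out.println("cycle: " + isCycle(a, b, c));
    }
}
```

Here b is within the tolerance of both a and c, so a < b and b < c by x position, but a and c are far enough apart to be ordered by y, which gives c < a. A sort using such a comparator has no consistent answer for these three items.
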
> I modified the comparator to be more strict when deciding which characters are on the same line, specifically:
> 1. Two pieces of text can't be on the same line if one's font is double or more the size of the other's.
> 2. Two pieces of text can't be on the same line if one's baseline is more than half the smaller font point size from the other's baseline.
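
The two rules above can be read as a single "same line" predicate. The following is a minimal sketch of that reading, not the patch itself: `SameLineRule` and `sameLine` are hypothetical names, and the method takes raw font sizes and baseline positions instead of TextPosition objects so that it runs without PDFBox. The comparisons mirror the ones in the revised comparator source further down.

```java
// Hypothetical helper; in the real comparator the inputs come from
// TextPosition.getFontSize() and TextPosition.getYDirAdj().
final class SameLineRule {
    static boolean sameLine(float size1, float y1, float size2, float y2) {
        // Rule 1: not the same line if one font is double or more the size of the other.
        if (size1 <= size2 / 2 || size1 >= size2 * 2) {
            return false;
        }
        // Rule 2: not the same line if the baselines are half the smaller
        // font size or more apart.
        float smaller = Math.min(size1, size2);
        if (y1 <= y2 - smaller / 2 || y1 >= y2 + smaller / 2) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameLine(12f, 100f, 12f, 103f)); // baselines 3 apart, limit is 6
        System.out.println(sameLine(12f, 100f, 24f, 100f)); // second font is double the first
    }
}
```

When the predicate is false, the revised comparator falls back to a strict ordering by y and then x; when it is true, it orders by x alone.
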
> There are probably cases (superscripts, for example) where these two conditions are too strict, but they should at least result in a < b < c (I think so, but haven't tried to prove it :). The comparator source I have used is below; feel free to use or modify it in any way.
> Finally, PDFTextStripper needs to be more discriminating when inserting line breaks. Specifically, if the x position of a text segment is less than the x position of the previous text segment, then there is an implicit line break. To fix this, I changed:
> if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
> to:
> if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) || (sortByPosition && positionX < lastPosition.getXDirAdj()))
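
As a rough illustration of the changed condition, the sketch below evaluates it on plain numbers. Everything here is an assumption made for the example: `LineBreakSketch` and `startsNewLine` are hypothetical names, `lastX` stands in for `lastPosition.getXDirAdj()`, and `overlap` is a simplified vertical-overlap test, not the method from PDFTextStripper.

```java
// Hypothetical sketch; not PDFTextStripper code.
final class LineBreakSketch {
    // Simplified stand-in: true when the two vertical extents
    // [y - height, y] intersect.
    static boolean overlap(float y1, float height1, float y2, float height2) {
        return (y2 <= y1 && y2 >= y1 - height1)
                || (y1 <= y2 && y1 >= y2 - height2);
    }

    // The revised condition: break the line when the text no longer overlaps
    // the current line vertically, or (only when sorting by position) when x
    // moves backwards relative to the previous segment.
    static boolean startsNewLine(boolean sortByPosition,
                                 float positionX, float positionY, float positionHeight,
                                 float lastX, float maxYForLine, float maxHeightForLine) {
        return !overlap(positionY, positionHeight, maxYForLine, maxHeightForLine)
                || (sortByPosition && positionX < lastX);
    }

    public static void main(String[] args) {
        // Same vertical band, but x jumps back from 200 to 50.
        System.out.println(startsNewLine(true, 50f, 100f, 10f, 200f, 100f, 10f));
        System.out.println(startsNewLine(false, 50f, 100f, 10f, 200f, 100f, 10f));
    }
}
```

With sorting enabled, a backwards jump in x is treated as a line break even though the text still overlaps the current line vertically; with sorting disabled, the original behaviour is unchanged.
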
> Revised comparator source:
> public class TextPositionComparator implements Comparator
> {
>     private int strictCompare(Object o1, Object o2)
>     {
>         TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>
>         // Get the text direction adjusted coordinates
>
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom < pos2YBottom)
>             return -1;
>         else if (pos1YBottom > pos2YBottom)
>             return 1;
>
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>
>         if (x1 < x2)
>             return -1;
>         else if (x1 > x2)
>             return 1;
>
>         return 0;
>     }
>
>     public int compare(Object o1, Object o2)
>     {
>         TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>         /* Only compare text that is in the same direction. */
>         if (pos1.getDir() < pos2.getDir())
>             return -1;
>         else if (pos1.getDir() > pos2.getDir())
>             return 1;
>         float size1 = pos1.getFontSize();
>         float size2 = pos2.getFontSize();
>
>         if (size1 <= size2/2 || size1 >= size2*2)
>             return strictCompare(o1, o2);
>         float fontsize = size1;
>
>         if (size2 < size1)
>             fontsize = size2;
>
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >= pos2YBottom + fontsize/2)
>             return strictCompare(o1, o2);
>
>         // Get the text direction adjusted coordinates
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>         if (x1 < x2)
>             return -1;
>         else if (x1 > x2)
>             return 1;
>
>         return 0;
>     }
> }
> YMMV.