You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 03:16:34 UTC

[jira] [Updated] (PDFBOX-731) Inconsistencies in TextPositionComparator and sortByPosition

     [ https://issues.apache.org/jira/browse/PDFBOX-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-731:
-------------------------------
    Fix Version/s: 2.0.0

> Inconsistencies in TextPositionComparator and sortByPosition
> ------------------------------------------------------------
>
>                 Key: PDFBOX-731
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-731
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.1.0, 2.0.0
>         Environment: Any / all
>            Reporter: Michael van Rooyen
>             Fix For: 2.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Specifying sortByPosition on PDFTextStripper can result in scrambling of text.  The problem is caused largely by inconsistencies in TextPositionComparator, which does not always satisfy the required comparator constraint that if a < b and b < c, then a < c.  As a result, a true sort is sometimes not achievable.  This is caused by the comparator being too flexible with what is regarded as being on the same "line".
> I modified the comparator to be more strict when deciding which characters are on the same line, specifically:
> 1. Two pieces of text can't be on the same line if one's font is double or more the size of the other's.
> 2. Two pieces of text can't be on the same line if one's baseline is more than half the smaller font point size from the other's baseline.
> I'm sure there are probably (superscript?) cases where these two conditions may be too strict, but at least they should (I think but haven't tried to prove :) result in a < b < c.  The comparator source I have used is below, feel free to use or modify it in any way.
> Finally, PDFTextStripper needs to be more discriminating in inserting line breaks.  Specifically, if the x position of a text segment is < the x position of the last text segment, the there is an implicit line-break.  To fix this, I changed:
>      if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
> to:
>      if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) || (sortByPosition && positionX < lastPosition.getXDirAdj()))
> Revised comparator source:
> public class TextPositionComparator implements Comparator
> {
>         private int strictCompare(Object o1, Object o2)
> 	{
> 		TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>         
>         // Get the text direction adjusted coordinates
>         
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom < pos2YBottom)
>         	return -1;
>         else if (pos1YBottom > pos2YBottom)
>         	return 1;
>         
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>         
>         if (x1 < x2)
>         	return -1;
>         else if (x1 > x2)
>         	return 1;
>         
>         return 0;
> 	}
> 	
> 	public int compare(Object o1, Object o2)
> 	{
> 		TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>         /* Only compare text that is in the same direction. */
>         if (pos1.getDir() < pos2.getDir())
>             return -1;
>         else if (pos1.getDir() > pos2.getDir())
>             return 1;
>         float size1 = pos1.getFontSize();
>         float size2 = pos2.getFontSize();
>         
>         if (size1 <= size2/2 || size1 >= size2*2)
>         	return strictCompare(o1, o2);
>         float fontsize = size1;
>         
>         if (size2 < size1)
>         	fontsize = size2;
>         
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >= pos2YBottom + fontsize/2)
>         	return strictCompare(o1, o2);
>         
>         // Get the text direction adjusted coordinates
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>         if (x1 < x2)
>         	return -1;
>         else if (x1 > x2)
>         	return 1;
>         
>         return 0;
> 	}
> }
> YMMV.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)