You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Michael van Rooyen (JIRA)" <ji...@apache.org> on 2010/05/21 15:16:17 UTC

[jira] Created: (PDFBOX-731) Inconsistencies in TextPositionComparator and sortByPosition

Inconsistencies in TextPositionComparator and sortByPosition
------------------------------------------------------------

                 Key: PDFBOX-731
                 URL: https://issues.apache.org/jira/browse/PDFBOX-731
             Project: PDFBox
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 1.1.0
         Environment: Any / all
            Reporter: Michael van Rooyen


Specifying sortByPosition on PDFTextStripper can result in scrambling of text.  The problem is caused largely by inconsistencies in TextPositionComparator, which does not always satisfy the required comparator constraint that if a < b and b < c, then a < c.  As a result, a true sort is sometimes not achievable.  This is caused by the comparator being too flexible with what is regarded as being on the same "line".

I modified the comparator to be more strict when deciding which characters are on the same line, specifically:

1. Two pieces of text can't be on the same line if one's font is double or more the size of the other's.
2. Two pieces of text can't be on the same line if one's baseline is more than half the smaller font point size from the other's baseline.

I'm sure there are probably (superscript?) cases where these two conditions may be too strict, but at least they should (I think but haven't tried to prove :) result in a < b < c.  The comparator source I have used is below, feel free to use or modify it in any way.

Finally, PDFTextStripper needs to be more discriminating in inserting line breaks.  Specifically, if the x position of a text segment is < the x position of the last text segment, the there is an implicit line-break.  To fix this, I changed:

     if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))

to:

     if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) || (sortByPosition && positionX < lastPosition.getXDirAdj()))

Revised comparator source:

public class TextPositionComparator implements Comparator
{
        private int strictCompare(Object o1, Object o2)
	{
		TextPosition pos1 = (TextPosition)o1;
        TextPosition pos2 = (TextPosition)o2;
        
        // Get the text direction adjusted coordinates
        
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        if (pos1YBottom < pos2YBottom)
        	return -1;
        else if (pos1YBottom > pos2YBottom)
        	return 1;
        
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();
        
        if (x1 < x2)
        	return -1;
        else if (x1 > x2)
        	return 1;
        
        return 0;
	}
	
	public int compare(Object o1, Object o2)
	{
		TextPosition pos1 = (TextPosition)o1;
        TextPosition pos2 = (TextPosition)o2;

        /* Only compare text that is in the same direction. */
        if (pos1.getDir() < pos2.getDir())
            return -1;
        else if (pos1.getDir() > pos2.getDir())
            return 1;

        float size1 = pos1.getFontSize();
        float size2 = pos2.getFontSize();
        
        if (size1 <= size2/2 || size1 >= size2*2)
        	return strictCompare(o1, o2);

        float fontsize = size1;
        
        if (size2 < size1)
        	fontsize = size2;
        
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj();

        if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >= pos2YBottom + fontsize/2)
        	return strictCompare(o1, o2);
        
        // Get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();

        if (x1 < x2)
        	return -1;
        else if (x1 > x2)
        	return 1;
        
        return 0;
	}
}

YMMV.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.