Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/10/12 14:29:34 UTC

[jira] [Commented] (PDFBOX-731) Inconsistencies in TextPositionComparator and sortByPosition

    [ https://issues.apache.org/jira/browse/PDFBOX-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168625#comment-14168625 ] 

Andreas Lehmkühler commented on PDFBOX-731:
-------------------------------------------

{quote}
1. Two pieces of text can't be on the same line if one's font is double or more the size of the other's.
2. Two pieces of text can't be on the same line if one's baseline is more than half the smaller font point size from the other's baseline.
{quote}
Both rules may work in some cases, but they don't work in general: e.g. the cweb.pdf file from our testarena fails with the proposed patch.
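
Separately from the cweb.pdf failure (whose cause is not identified in this comment), a small sketch can show that a "same line within a tolerance" rule like the second quoted rule does not by itself yield a transitive ordering. The class `ToleranceComparatorDemo` and its `Pos` type are hypothetical stand-ins for `TextPosition`; the comparison logic follows the quoted `compare()` without the direction check, and uses `Float.compare` in place of the explicit `<` / `>` tests.

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {
    // Hypothetical stand-in for TextPosition: direction-adjusted x/y plus font size.
    static final class Pos {
        final float x, y, fontSize;
        Pos(float x, float y, float fontSize) {
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    // Strict ordering: by y first, then by x.
    static int strict(Pos p1, Pos p2) {
        int byY = Float.compare(p1.y, p2.y);
        return byY != 0 ? byY : Float.compare(p1.x, p2.x);
    }

    // Follows the quoted compare(): positions count as "same line" only if the
    // font sizes are within a factor of two and the baselines are closer than
    // half the smaller font size; same-line positions are ordered by x alone.
    static final Comparator<Pos> QUOTED = (p1, p2) -> {
        float s1 = p1.fontSize, s2 = p2.fontSize;
        if (s1 <= s2 / 2 || s1 >= s2 * 2) {
            return strict(p1, p2);
        }
        float tol = Math.min(s1, s2) / 2;
        if (p1.y <= p2.y - tol || p1.y >= p2.y + tol) {
            return strict(p1, p2);
        }
        return Float.compare(p1.x, p2.x);
    };

    public static void main(String[] args) {
        // Same font size (10), baselines 4 units apart, x decreasing as y increases.
        Pos a = new Pos(30f, 0f, 10f);
        Pos b = new Pos(20f, 4f, 10f);
        Pos c = new Pos(10f, 8f, 10f);

        System.out.println(Integer.signum(QUOTED.compare(a, b))); // 1: same line, a is right of b
        System.out.println(Integer.signum(QUOTED.compare(b, c))); // 1: same line, b is right of c
        System.out.println(Integer.signum(QUOTED.compare(a, c))); // -1: different lines, a is above c
    }
}
```

Here c < b and b < a, yet a < c, so the three positions form a cycle. The "same line" relation chains from a to b and from b to c but does not hold between a and c, which is the kind of inconsistency that makes a true sort unachievable.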

> Inconsistencies in TextPositionComparator and sortByPosition
> ------------------------------------------------------------
>
>                 Key: PDFBOX-731
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-731
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.1.0, 2.0.0
>         Environment: Any / all
>            Reporter: Michael van Rooyen
>            Assignee: Andreas Lehmkühler
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Specifying sortByPosition on PDFTextStripper can result in scrambling of text.  The problem is caused largely by inconsistencies in TextPositionComparator, which does not always satisfy the required comparator constraint that if a < b and b < c, then a < c.  As a result, a true sort is sometimes not achievable.  This is caused by the comparator being too flexible with what is regarded as being on the same "line".
> I modified the comparator to be more strict when deciding which characters are on the same line, specifically:
> 1. Two pieces of text can't be on the same line if one's font is double or more the size of the other's.
> 2. Two pieces of text can't be on the same line if one's baseline is more than half the smaller font point size from the other's baseline.
> I'm sure there are probably (superscript?) cases where these two conditions may be too strict, but at least they should (I think but haven't tried to prove :) result in a < b < c.  The comparator source I have used is below, feel free to use or modify it in any way.
> Finally, PDFTextStripper needs to be more discriminating in inserting line breaks.  Specifically, if the x position of a text segment is < the x position of the last text segment, then there is an implicit line-break.  To fix this, I changed:
> {code}
>      if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
> {code}
> to:
> {code}
>      if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine) || (sortByPosition && positionX < lastPosition.getXDirAdj()))
> {code}
> Revised comparator source:
> {code}
> public class TextPositionComparator implements Comparator
> {
>     private int strictCompare(Object o1, Object o2)
>     {
>         TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>
>         // Get the text direction adjusted coordinates
>
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom < pos2YBottom)
>             return -1;
>         else if (pos1YBottom > pos2YBottom)
>             return 1;
>
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>
>         if (x1 < x2)
>             return -1;
>         else if (x1 > x2)
>             return 1;
>
>         return 0;
>     }
>
>     public int compare(Object o1, Object o2)
>     {
>         TextPosition pos1 = (TextPosition)o1;
>         TextPosition pos2 = (TextPosition)o2;
>         /* Only compare text that is in the same direction. */
>         if (pos1.getDir() < pos2.getDir())
>             return -1;
>         else if (pos1.getDir() > pos2.getDir())
>             return 1;
>         float size1 = pos1.getFontSize();
>         float size2 = pos2.getFontSize();
>
>         if (size1 <= size2/2 || size1 >= size2*2)
>             return strictCompare(o1, o2);
>         float fontsize = size1;
>
>         if (size2 < size1)
>             fontsize = size2;
>
>         float pos1YBottom = pos1.getYDirAdj();
>         float pos2YBottom = pos2.getYDirAdj();
>         if (pos1YBottom <= pos2YBottom - fontsize/2 || pos1YBottom >= pos2YBottom + fontsize/2)
>             return strictCompare(o1, o2);
>
>         // Get the text direction adjusted coordinates
>         float x1 = pos1.getXDirAdj();
>         float x2 = pos2.getXDirAdj();
>         if (x1 < x2)
>             return -1;
>         else if (x1 > x2)
>             return 1;
>
>         return 0;
>     }
> }
> {code}
> YMMV.
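
The line-break change quoted above can be read as a standalone predicate. The following is only a sketch under assumptions: `LineBreakSketch`, `startsNewLine`, and this `overlap` are hypothetical names and implementations written for illustration, not the PDFTextStripper code. It treats each text segment as a vertical interval from `y - height` to `y`.

```java
public class LineBreakSketch {
    // Hypothetical overlap test (not PDFBox's): true if the vertical intervals
    // [y1 - h1, y1] and [y2 - h2, y2] intersect.
    static boolean overlap(float y1, float h1, float y2, float h2) {
        return y1 - h1 <= y2 && y2 - h2 <= y1;
    }

    // Proposed rule: start a new line if the segment does not vertically
    // overlap the current line, or, when sorting by position, if its x
    // position lies to the left of the previous segment's x position.
    static boolean startsNewLine(float y, float height,
                                 float maxYForLine, float maxHeightForLine,
                                 boolean sortByPosition, float x, float lastX) {
        return !overlap(y, height, maxYForLine, maxHeightForLine)
                || (sortByPosition && x < lastX);
    }

    public static void main(String[] args) {
        // Same vertical band, but x jumps back from 200 to 50.
        System.out.println(startsNewLine(100f, 10f, 100f, 10f, true, 50f, 200f));   // true
        System.out.println(startsNewLine(100f, 10f, 100f, 10f, false, 50f, 200f));  // false
        // No vertical overlap at all.
        System.out.println(startsNewLine(100f, 10f, 130f, 10f, false, 250f, 200f)); // true
    }
}
```

The extra clause only takes effect when `sortByPosition` is set, so unsorted extraction keeps the original overlap-only behaviour. Whether the heuristic holds for all documents is not established here.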


