Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/06/12 12:48:03 UTC

[jira] [Commented] (PDFBOX-1512) TextPositionComparator is not compatible with Java 7

    [ https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029015#comment-14029015 ] 

Andreas Lehmkühler commented on PDFBOX-1512:
--------------------------------------------

To avoid misunderstandings: IMHO the comparison itself isn't broken. It works well, but it violates the contract that the sort algorithm of the Collections framework expects.

The issue is that PDFBox doesn't use only the x,y values of a text position. In some cases, when two neighboring positions are compared, their context is taken into account as well. As a result, the same combination of x,y values may lead to a different result if the sorting is done in a different order.
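
A simplified illustration of how such a comparison can be reasonable pairwise and still break the contract. This is not the PDFBox comparator; the class name, the 1.0 tolerance and the sample values are made up for the example. It only assumes that two y values closer than a tolerance are treated as "same line":

```java
import java.util.Comparator;

class ToleranceDemo {
    // Hypothetical rule: y values closer than 1.0 count as the same line (result 0),
    // otherwise the smaller y sorts first.
    static final Comparator<Double> SAME_LINE =
        (a, b) -> Math.abs(a - b) < 1.0 ? 0 : Double.compare(a, b);

    public static void main(String[] args) {
        double x = 0.0, y = 0.6, z = 1.2;
        // x "equals" y and y "equals" z, yet x sorts before z.
        System.out.println(Integer.signum(SAME_LINE.compare(x, y)));
        System.out.println(Integer.signum(SAME_LINE.compare(y, z)));
        System.out.println(Integer.signum(SAME_LINE.compare(x, z)));
    }
}
```

Here compare(x, y)==0, but sgn(compare(x, z)) differs from sgn(compare(y, z)), which is exactly the kind of inconsistency the Java 7 sort may detect and report.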

So it should be possible to replace the Collections.sort() call with our own sort implementation (e.g. based on quicksort) using the very same TextPositionComparator.
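
A minimal sketch of such a replacement, under some assumptions: the class name LenientSort is hypothetical, and it uses a plain top-down merge sort instead of quicksort, because a merge sort is stable and always terminates without index problems even if the comparator is inconsistent. It accepts any Comparator, so the existing comparator could be passed in unchanged:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LenientSort {
    // Sorts the list in place with a top-down merge sort. It never checks the
    // comparator for consistency, so it does not throw
    // "Comparison method violates its general contract!".
    static <T> void sort(List<T> list, Comparator<? super T> cmp) {
        if (list.size() < 2) {
            return;
        }
        int mid = list.size() / 2;
        List<T> left = new ArrayList<>(list.subList(0, mid));
        List<T> right = new ArrayList<>(list.subList(mid, list.size()));
        sort(left, cmp);
        sort(right, cmp);

        int i = 0, j = 0, k = 0;
        while (i < left.size() && j < right.size()) {
            // "<=" keeps elements that compare as equal in their original order.
            if (cmp.compare(left.get(i), right.get(j)) <= 0) {
                list.set(k++, left.get(i++));
            } else {
                list.set(k++, right.get(j++));
            }
        }
        while (i < left.size()) {
            list.set(k++, left.get(i++));
        }
        while (j < right.size()) {
            list.set(k++, right.get(j++));
        }
    }
}
```

Note that this only avoids the exception. With an inconsistent comparator the result still depends on the input order, as before.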

Maybe there is some room for improvement:
The whole text is split into text positions, one for each character, so we have to sort every single character. The information about text chunks, whole words and lines of text is lost. We could preserve that information within the TextPosition (number of the chunk, index within the chunk) to simplify the comparison.
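
A rough sketch of that idea, not the real TextPosition API: ChunkedPosition and its fields chunk and indexInChunk are hypothetical names. Comparing two integers gives a total order, so the comparison can't become inconsistent for characters whose chunk information is known:

```java
import java.util.Comparator;

class ChunkedPosition {
    final int chunk;        // hypothetical: running number of the text chunk
    final int indexInChunk; // hypothetical: character index within that chunk
    final String unicode;   // the character(s) this position represents

    ChunkedPosition(int chunk, int indexInChunk, String unicode) {
        this.chunk = chunk;
        this.indexInChunk = indexInChunk;
        this.unicode = unicode;
    }

    // Orders by chunk number first, then by index within the chunk.
    static final Comparator<ChunkedPosition> BY_CHUNK =
        Comparator.comparingInt((ChunkedPosition p) -> p.chunk)
                  .thenComparingInt(p -> p.indexInChunk);
}
```

This only covers the order of characters relative to their chunks. Putting the chunks themselves into reading order would still need a position-based comparison, but on far fewer elements than one per character.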


> TextPositionComparator is not compatible with Java 7
> ----------------------------------------------------
>
>                 Key: PDFBOX-1512
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1512
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Java 7
>            Reporter: Benjamin Papez
>            Assignee: Andreas Lehmkühler
>         Attachments: FOP-2252.pdf, TextPositionComparator.java, Topo.pdf, Topo.txt, TopoContained.pdf, TopoContained.txt, TopoOverlap.pdf, TopoOverlap.txt, WFI_PDFParser_TextPostionComparator.txt, illustration-of-inconsistent-sorting.png, immo-kurier_arsenal_93x62.pdf
>
>
> The TextPositionComparator causes the following exception when running on Java 7: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison method violates its general contract!
> I think the problem is with this check:
> if ( yDifference < .1 ||
>     (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
>     (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> as it violates the contract requirement:
> The implementor must also ensure that the relation is transitive: ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.
> Finally, the implementor must ensure that compare(x, y)==0 implies that sgn(compare(x, z))==sgn(compare(y, z)) for all z.
> Java 7 now is strict and throws exceptions when the contract is violated.


