You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/11/14 09:44:00 UTC
[jira] [Comment Edited] (PDFBOX-5308) Prefer MergeSort over QuickSort and try native TimSort first (with explanation)
[ https://issues.apache.org/jira/browse/PDFBOX-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443266#comment-17443266 ]
Tilman Hausherr edited comment on PDFBOX-5308 at 11/14/21, 9:43 AM:
--------------------------------------------------------------------
Commit 1895012 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1895012 ]
PDFBOX-5308: try TimSort first, use QuickSort as fallback, as suggested by Alistair Oldfield
was (Author: jira-bot):
Commit 1895012 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1895012 ]
PDFBOX-5308: try TimSort first, use QuickSort as fallback
> Prefer MergeSort over QuickSort and try native TimSort first (with explanation)
> -------------------------------------------------------------------------------
>
> Key: PDFBOX-5308
> URL: https://issues.apache.org/jira/browse/PDFBOX-5308
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.24
> Reporter: Alistair Oldfield
> Priority: Minor
> Fix For: 2.0.25, 3.0.0 PDFBox
>
> Attachments: sort-benchmark-02.txt, sort-benchmark.txt
>
>
> Hello!
> I have 2 related proposals for improvement (so I am logging as one ticket, I am happy to split them if requested, however).
> 1)
> Propose to use an iterative MergeSort implementation over an iterative QuickSort implementation in TextPosition sorting during text extraction.
> The reason: while QuickSort and MergeSort generally perform O(NlogN), and while MergeSort uses slightly more space (O(N)), QuickSort's worst-case performance is O(n2) and this case will, unfortunately, happen in (arguably) very common cases during text extraction: when the list is already sorted (or nearly sorted). Many PDFs will already stream text in the order of sorting. In many use-cases, QuickSort is among the worst-performing sorting algorithms for this particular task.
> This is why java's native sort on Collections (using TimSort - which assumes a high frequency of pre-sorted lists in real-world data) is ideal for many PDF text extraction sorting scenarios, and is a shame it is not being used. This brings me to the next part of the improvement proposal:
> 2)
> As the TextPositionComparator is not transitive, the current implementation to avoid JDK7+ enforcing transitivity, and throwing the "Comparison method violates its general contract" IllegalArgumentException is to check for java version, and if lower than 7, then ALWAYS use QuickSort.
> As TimSort often "approaches" O(N) for many PDF extraction use-cases, and as TextPositionComparator will not always cause an IllegalArgumentException during many sorting tasks, I propose to use java's native sort (which is relatively cheap), and catch the IllegalArgumentException (in many cases will be thrown early - not late), and then use whichever preferred "legacy" sorting algorithm on that failure.
> Here is my implementation to show what I mean:
>
>
> {code:java}
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> public class SortUtils {
> public static <T> void sort(List<T> list, Comparator<T> comparator) {
> try{
> Collections.sort(list, comparator);
> }catch(java.lang.IllegalArgumentException e) {
> MergeSort.sort(list, comparator);
> }
> }
> }
>
> {code}
> This approach (even when used on jdk7+) has a significant impact on sorting costs on real-world PDFs (even with the occasional "double-attempt").
> I am also happy to share the MergeSort implementation I am using (which follows the sort contract for generics). It is iterative (not recursive). It is borrowed from here:
> [https://github.com/vlab-cs-ucsb/cashew/blob/master/jpf-security/src/examples/benchmarks/MergeSortIterative.java]
>
> It has some slight tweaks (uses System.arraycopy, etc,) and is "generic-ified". Please just let me know.
> I am also curious to know any reasons (which I may have missed), why QuickSort is the preferred sort (if the fact that many lists are mostly pre-sorted is already known and considered)?
> Thanks!
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org