
Posted to dev@jena.apache.org by GitBox <gi...@apache.org> on 2019/02/15 14:03:43 UTC

[GitHub] grahamtriggs opened a new pull request #533: Feature/sdborder

URL: https://github.com/apache/jena/pull/533

This patch combines a few enhancements to the way SQL queries are formed:

1) The order of variables is retained in the ScopeBase (a HashSet is replaced with a LinkedHashSet). When selecting a subject / predicate / object from the Triples/Quads table, the order in which the columns are selected can make a difference as to whether an SQL engine uses an index or not, and this at least makes it predictable. In practice, the order will be subject - predicate - object, which allows the SubjPredObj index to be used. Previously, with the HashSet losing the order in which the scopes were added, you might get predicate - subject - object, which doesn't match an existing index.

2) Replace some DISTINCT clauses with GROUP BY clauses, and fix the order in which columns are grouped to match an index. SQL engines typically execute GROUP BY faster than DISTINCT (e.g. in MySQL the optimizer explicitly rewrites DISTINCT to GROUP BY when it knows that it can). Again, fixing the order of the columns in the GROUP BY improves the likelihood that an index will be used for that part of the query.

3) Allow simple ORDER BY clauses to be pushed into the SQL - by simple, I mean ordering by a bound variable and without the use of a function.

This is DISABLED by default, because the resulting order will differ from the order that the ordering iterator returns; it can be enabled with the "optimizeOrderClause" option. Whilst the database ordering does not apply the same comparisons as the iterator, the order it generates should still be consistent with the SPARQL definition of ordering.

If you are returning the entire set of results from a query, there is generally a negligible performance difference between the Java iterator and the SQL clause - it just shifts some of the execution time from the JVM to SQL.

However, passing the ORDER BY into SQL means that any LIMIT / OFFSET can also be passed into SQL. So in cases where you are LIMITing the number of rows returned by SPARQL, the SQL version will be substantially faster. e.g.

SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o LIMIT 20

On a 500,000-triple store, this takes 12 seconds using the Java iterator, and only 1.5 seconds using the SQL ordering.
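
To illustrate why ordering in the JVM prevents the LIMIT from helping, here is a rough, standalone Java sketch. It is not Jena code: the class name, the method and the integer "rows" are made up for illustration. It only shows that a client-side sort has to read every row before it can hand back the first 20, which is the work that pushing ORDER BY and LIMIT into SQL may let the database avoid.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SortThenLimitDemo {
    // Counts how many rows the sort had to pull from the (simulated) result set.
    public static final AtomicInteger rowsRead = new AtomicInteger();

    // Simulates a result set of totalRows rows arriving in descending order,
    // sorts it on the client, and keeps only the first `limit` rows.
    public static List<Integer> firstSorted(int totalRows, int limit) {
        rowsRead.set(0);
        return IntStream.range(0, totalRows)
                .map(i -> totalRows - i)               // hypothetical unsorted rows
                .peek(i -> rowsRead.incrementAndGet()) // count every row that is read
                .boxed()
                .sorted()                              // must consume all rows first
                .limit(limit)                          // only applied after the sort
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> page = firstSorted(500_000, 20);
        System.out.println("rows returned: " + page.size());
        System.out.println("rows read:     " + rowsRead.get());
    }
}
```

In this sketch the number of rows read equals the size of the whole result set, regardless of the limit.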
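
As a small illustration of points 1 and 2, the following standalone Java sketch shows how a LinkedHashSet keeps the order in which columns were added, so that a generated GROUP BY lists them predictably. It is not the patch itself: the class, the helper methods, the column names s / p / o and the table name Triples are placeholders chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ScopeOrderDemo {
    // Adds the column names to the given set and returns them in the set's iteration order.
    public static List<String> columnsInScopeOrder(Set<String> scope, String... columns) {
        for (String column : columns) {
            scope.add(column);
        }
        return new ArrayList<>(scope);
    }

    // Builds a GROUP BY query that lists the columns in the same order in both clauses.
    public static String groupByQuery(String table, List<String> columns) {
        String columnList = String.join(", ", columns);
        return "SELECT " + columnList + " FROM " + table + " GROUP BY " + columnList;
    }

    public static void main(String[] args) {
        String[] added = {"s", "p", "o"}; // hypothetical column names
        // LinkedHashSet iterates in insertion order, so the SQL is predictable.
        List<String> ordered = columnsInScopeOrder(new LinkedHashSet<>(), added);
        // HashSet makes no ordering guarantee, so the column order may differ.
        List<String> unordered = columnsInScopeOrder(new HashSet<>(), added);
        System.out.println(groupByQuery("Triples", ordered));
        System.out.println(groupByQuery("Triples", unordered));
    }
}
```

Only the LinkedHashSet variant is guaranteed to produce the columns in the order they were added; the HashSet variant may or may not, depending on the hash codes involved.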
