Posted to users@jena.apache.org by anuj kumar <an...@gmail.com> on 2017/11/29 12:07:48 UTC

ARQ Sparql Algebra Extension

Hi,
I am working on a performance issue with our triple store (which is
based on HBase).
To give some background, the query I am executing looks like this:

SELECT ?s
WHERE {
    ?s a file:File .
    ?s ex:modified ?modified .
    FILTER(?modified >= "2017-11-05T00:00:00.00000"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}


Looking at the ARQ Execution plan, it is like this:

(slice 0 1000
    (project (?s)
      (filter (>= ?modified "2017-11-05T00:00:00.00000"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
        (bgp
          (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.example.com/File#File>)
          (triple ?s <http://www.example.com/common#modified> ?modified)
        ))))


I have around 45000 File objects in my triple store.

As you can see from the above execution plan, I first get the subject IDs
for these 45000 File objects and then fire one query per File ID to get its
modified date. This is clearly not performant.

My Questions:

1. Is there a better way to write the SELECT query so that it gets a good
execution plan?
2. If not, can I somehow change how the execution plan is generated?
3. Is it advisable to rewrite the ARQ execution plan to suit our needs, and
how complicated might this be?

Thanks and please let me know if you need more information.

Thanks,
Anuj Kumar
-- 
*Anuj Kumar*

Re: ARQ Sparql Algebra Extension

Posted by Andy Seaborne <an...@apache.org>.


On 29/11/17 12:07, anuj kumar wrote:
> Hi,
>   So I am working on a performance issue with our Triple Store (which is
> based on HBase)
> To give a background, the query I am executing looks like:
> 
> SELECT ?s
>> WHERE {
>>      ?s a file:File .
>>      ?s ex:modified ?modified .
>>      FILTER(?modified >="2017-11-05T00:00:00.00000"^^<http://
>> www.w3.org/2001/XMLSchema#dateTime>)
>> }
> 
> 
> Looking at the ARQ Execution plan, it is like this:

It's an algebra expression - it may or may not have been through the
optimizer. In this case the high-level (algebra) optimizer doesn't do
much with this query.

This does not stop your system doing some more optimization in its own
OpExecutor.

> 
> (slice 0 1000

Not in your query.

>>      (project (?s)
>>        (filter (>= ?modified "2017-11-05T00:00:00.00000"^^<http://
>> www.w3.org/2001/XMLSchema#dateTime>)
>>          (bgp
>>            (triple ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
>> http://www.example.com/File#File>)
>>            (triple ?s <http://www.example.com/common#modified> ?modified)
>>          ))))
> 
> 
> AND I have around 45000 File Objects in my Triple Store.
> 
> As you can see from the above execution plan, I first get the Subject ID
> for these 45000 File objects and then I fire a query per File Id to get the
> modified date for the same. This clearly is not performant.

Not good for two reasons:

All the round trips to get the "ex:modified" value, when that work should be
done server side (OK - that means putting something on the HBase machine).

And also, it could do a range scan instead
(think of that as a physical execution plan and the algebra as a logical
execution plan).
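
To make the two costs concrete, here is a minimal sketch in plain Java. It uses no Jena or HBase API: a sorted TreeMap stands in for a table whose keys sort by the modified timestamp, and every class and method name (RangeScanSketch, buildIndex, rangeScan, lookupPerSubject) is hypothetical. It assumes timestamps are stored as strings in one fixed ISO-8601 layout, so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Plain-Java illustration only: no Jena or HBase API is used here.
final class RangeScanSketch {

    // Hypothetical index: modified timestamp -> subjects, kept in sorted order.
    static NavigableMap<String, List<String>> buildIndex(Map<String, String> modifiedBySubject) {
        NavigableMap<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : modifiedBySubject.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return index;
    }

    // One range scan: every subject whose modified value is >= from.
    static List<String> rangeScan(NavigableMap<String, List<String>> index, String from) {
        List<String> result = new ArrayList<>();
        for (List<String> subjects : index.tailMap(from, true).values()) {
            result.addAll(subjects);
        }
        return result;
    }

    // The slow shape: one lookup per subject. Matches are added to 'result';
    // the return value is the number of lookups (simulated round trips).
    static int lookupPerSubject(Map<String, String> modifiedBySubject, String from,
                                List<String> result) {
        int lookups = 0;
        for (String subject : modifiedBySubject.keySet()) {
            lookups++;
            String modified = modifiedBySubject.get(subject);
            if (modified.compareTo(from) >= 0) {
                result.add(subject);
            }
        }
        return lookups;
    }

    public static void main(String[] args) {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("file1", "2017-10-01T00:00:00");
        data.put("file2", "2017-11-06T09:30:00");
        data.put("file3", "2017-11-20T12:00:00");
        String from = "2017-11-05T00:00:00";

        List<String> slow = new ArrayList<>();
        int lookups = lookupPerSubject(data, from, slow);
        List<String> fast = rangeScan(buildIndex(data), from);

        System.out.println("per-subject lookups: " + lookups + ", matches: " + slow);
        System.out.println("single range scan matches: " + fast);
    }
}
```

The per-subject version does one lookup for every File object whether or not it matches, while the range scan touches only the entries at or after the threshold. How the keys would really be laid out in HBase is a design decision this sketch does not address.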




> 
> My Questions:
> 
> 1. Is there a better way to create a SELECT query to have a good execution
> plan.

Ideally, no, but try this:

  SELECT ?s
  WHERE {
       ?s ex:modified ?modified .
       FILTER(?modified >= "2017-11-05T00:00:00.00000"^^xsd:dateTime)
       ?s a file:File .
  }


changing the BGP order and doing filter placement to get:

(project (?s)
   (sequence
     (filter (>= ?modified "2017-11-05T00:00:00.00000"^^xsd:dateTime)
       (bgp (triple ?s ex:modified ?modified)))
     (bgp (triple ?s rdf:type file:File))))


then in your code do:

     (filter (>= ?modified "2017-11-05T00:00:00.00000"^^xsd:dateTime)
       (bgp (triple ?s ex:modified ?modified)))


all in HBase (it's a single range scan).

Subclass OpExecutor and override its OpFilter handling to spot such cases.
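
As a rough illustration of what "spotting such cases" could look like, here is a sketch over a toy algebra in plain Java. None of these types are Jena classes: Op, SingleTripleBgp, Filter, RangeScan and asRangeScan are all hypothetical stand-ins, and the filter is assumed to be a simple (>= ?var constant) comparison. In a real OpExecutor subclass the same check would sit in the code that handles OpFilter, with anything unrecognised passed on to the default behaviour.

```java
import java.util.Optional;

// Toy algebra for illustration only; these are NOT Jena's Op classes.
final class FilterPushdownSketch {

    interface Op {}

    // A BGP holding a single triple pattern: ?subjectVar <predicate> ?objectVar
    static final class SingleTripleBgp implements Op {
        final String subjectVar;
        final String predicate;
        final String objectVar;

        SingleTripleBgp(String subjectVar, String predicate, String objectVar) {
            this.subjectVar = subjectVar;
            this.predicate = predicate;
            this.objectVar = objectVar;
        }
    }

    // A filter of the form (operator ?var constant) over a sub-operation.
    static final class Filter implements Op {
        final String operator;
        final String var;
        final String constant;
        final Op subOp;

        Filter(String operator, String var, String constant, Op subOp) {
            this.operator = operator;
            this.var = var;
            this.constant = constant;
            this.subOp = subOp;
        }
    }

    // What store-specific code might hand to the backend as a single scan.
    static final class RangeScan {
        final String predicate;
        final String lowerBoundInclusive;

        RangeScan(String predicate, String lowerBoundInclusive) {
            this.predicate = predicate;
            this.lowerBoundInclusive = lowerBoundInclusive;
        }
    }

    // Recognise (filter (>= ?o constant) (bgp (triple ?s <p> ?o))).
    // Returns empty when the shape does not match, so the caller can fall back.
    static Optional<RangeScan> asRangeScan(Op op) {
        if (!(op instanceof Filter)) {
            return Optional.empty();
        }
        Filter filter = (Filter) op;
        if (!(filter.subOp instanceof SingleTripleBgp)) {
            return Optional.empty();
        }
        SingleTripleBgp bgp = (SingleTripleBgp) filter.subOp;
        boolean isLowerBound = ">=".equals(filter.operator);
        boolean filtersObject = filter.var.equals(bgp.objectVar);
        if (!isLowerBound || !filtersObject) {
            return Optional.empty();
        }
        return Optional.of(new RangeScan(bgp.predicate, filter.constant));
    }

    public static void main(String[] args) {
        Op bgp = new SingleTripleBgp("s", "ex:modified", "modified");
        Op op = new Filter(">=", "modified", "2017-11-05T00:00:00", bgp);
        Optional<RangeScan> scan = asRangeScan(op);
        System.out.println(scan.isPresent()
                ? "range scan on " + scan.get().predicate + " from " + scan.get().lowerBoundInclusive
                : "no match, use default execution");
    }
}
```

Returning an empty Optional for every unrecognised shape keeps the special case narrow: only the exact filter-over-single-triple pattern is rewritten, and everything else runs as before.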

> 2. If not, then can I somehow change the generation of execution plan?
> 3. Is it advisable to re-write the ARQ Execution Plan to suit our needs and
> how complicated this might be.

How sophisticated do you want it to be?!

It's an open-ended question - more work, better optimization!

> 
> Thanks and please let me know if you need more information.
> 
> Thanks,
> Anuj Kumar
>