Posted to users@jena.apache.org by Mike Welch <mj...@verizonmedia.com.INVALID> on 2020/03/28 23:30:45 UTC

Best practice to combine simple queries

Hi everyone,

We have a use case like the following: given a set of M known URIs
(typically tens to a few hundred), fetch the same N (typically 2-10)
properties for each of them via Fuseki (TDB2).

Does anyone have any benchmarks, rules of thumb, hunches, etc. on what
would be the best approach for packaging and/or parallelizing requests?
We currently send 1 query per subject URI, so M parallel requests, each
request containing a UNION of N individual simple patterns:

{ ns:subj1 ns:prop1 ?a }
UNION
{ ns:subj1 ns:prop2 ?b }
...

This seems to work pretty well: better than M parallel queries with all
OPTIONAL patterns, M*N parallel single-pattern requests, or 1 giant UNION
for all M*N things at once. Still, it was a somewhat arbitrary choice
among many other possibilities. Does anyone have any suggestions for what
might work better or be more friendly to the internal optimizers, with the
main goal of minimizing overall latency to fetch all M*N things?

Thanks!
- Mike
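
As a minimal sketch of the per-subject UNION query described above, the
Java below assembles the query text for one subject and N properties. The
class name UnionQueryBuilder, the method build, and the variable names
?v0, ?v1, ... are illustrative assumptions (the post uses ?a, ?b). Terms
are passed in already in SPARQL syntax, and a PREFIX declaration for ns:
would still have to be prepended before sending the query to Fuseki.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Jena: builds plain SPARQL text.
final class UnionQueryBuilder {

    // One UNION branch per property, all for the same subject.
    // Each branch binds its own variable (?v0, ?v1, ...).
    static String build(String subject, List<String> properties) {
        if (properties.isEmpty()) {
            throw new IllegalArgumentException("at least one property is required");
        }
        StringJoiner branches = new StringJoiner("\n  UNION\n");
        for (int i = 0; i < properties.size(); i++) {
            branches.add("  { " + subject + " " + properties.get(i) + " ?v" + i + " }");
        }
        return "SELECT * WHERE {\n" + branches + "\n}";
    }

    public static void main(String[] args) {
        // Prints the query for one subject and two properties.
        System.out.println(build("ns:subj1", List.of("ns:prop1", "ns:prop2")));
    }
}
```

Building one such string per subject URI gives the M parallel requests
described in the post.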

Re: Best practice to combine simple queries

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Looking into the code [1], I don't think it's possible to have multiple
VALUES clauses, so it might be a limitation of the SelectBuilder.

Claude Warren and others should know better than me though.


[1]
https://github.com/apache/jena/blob/master/jena-extras/jena-querybuilder/src/main/java/org/apache/jena/arq/querybuilder/handlers/ValuesHandler.java#L62
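
For comparison, two separate VALUES clauses in the same group join to the
cross product of their rows, so a single two-column table listing every
(subject, property) pair yields the same bindings. The Java below is only
a sketch of that expansion as plain query text. The class and method names
are hypothetical, and whether SelectBuilder can be made to emit such a
table is not confirmed here.

```java
import java.util.List;

// Hypothetical helper, not part of Jena: builds plain SPARQL text.
final class ValuesTable {

    // Expands subjects x properties into the rows of one two-column
    // VALUES table, so both columns always have the same number of rows.
    static String crossProduct(List<String> subjects, List<String> properties) {
        StringBuilder sb = new StringBuilder("VALUES (?s ?p) {\n");
        for (String s : subjects) {
            for (String p : properties) {
                sb.append("  (").append(s).append(' ').append(p).append(")\n");
            }
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(crossProduct(
                List.of("ns:subj1", "ns:subj2"),
                List.of("ns:prop1", "ns:prop2")));
    }
}
```

The table grows to M*N rows, so this trades a longer query text for a
single uniform table.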

On 14.04.20 17:10, Jim Balhoff wrote:
> I use ParameterizedSparqlString to build my SPARQL queries.

Re: Best practice to combine simple queries

Posted by Jim Balhoff <ba...@gmail.com>.
I use ParameterizedSparqlString to build my SPARQL queries.

> On Apr 12, 2020, at 5:54 PM, Mike Welch <mj...@verizonmedia.com.INVALID> wrote:
> 
> Thanks Jim, I did a bit of perf comparison (details below). Now my question
> is: how do I use the Jena SelectBuilder interfaces to formulate such a
> query? Calling addWhereValueVar(..) twice, once for ?s and once for ?p,
> fails since the two lists are not the same length. It's trying to build a
> single table like VALUES (?s ?p) { ... } rather than two separate VALUES
> clauses.
>
> Thanks,
> - Mike
>
> Simple perf test on a 48-core machine with 512 GB RAM and SSDs.
>
> Empirically, when batching requests by subject, the VALUES approach and
> UNION show negligible performance differences for a reasonable number of
> props. Once you increase the parallelism / throughput target by batching
> multiple subject IDs per request, however, VALUES does seem to outperform
> UNION. For those interested:
> 
> VALUES (1 subj * 8 props) * 5 parallel requests ~ 19ms at 95%
> VALUES (1 subj * 8 props) * 20 parallel requests ~ 144ms at 95%
> VALUES (4 subj * 8 props) * 5 parallel requests ~ 35ms at 95%
> VALUES (4 subj * 8 props) * 10 parallel requests ~ 75ms at 95%
> 
> UNION (1 subj * 8 props) * 5 parallel requests ~ 17ms at 95%
> UNION (1 subj * 8 props) * 20 parallel requests ~ 136ms at 95%
> UNION (4 subj * 8 props) * 5 parallel requests ~ 52ms at 95%
> UNION (4 subj * 8 props) * 10 parallel requests ~ 124ms at 95%


Re: Best practice to combine simple queries

Posted by Mike Welch <mj...@verizonmedia.com.INVALID>.
Thanks Jim, I did a bit of perf comparison (details below). Now my question
is: how do I use the Jena SelectBuilder interfaces to formulate such a
query? Calling addWhereValueVar(..) twice, once for ?s and once for ?p,
fails since the two lists are not the same length. It's trying to build a
single table like VALUES (?s ?p) { ... } rather than two separate VALUES
clauses.

Thanks,
- Mike

Simple perf test on a 48-core machine with 512 GB RAM and SSDs.

Empirically, when batching requests by subject, the VALUES approach and
UNION show negligible performance differences for a reasonable number of
props. Once you increase the parallelism / throughput target by batching
multiple subject IDs per request, however, VALUES does seem to outperform
UNION. For those interested:

VALUES (1 subj * 8 props) * 5 parallel requests ~ 19ms at 95%
VALUES (1 subj * 8 props) * 20 parallel requests ~ 144ms at 95%
VALUES (4 subj * 8 props) * 5 parallel requests ~ 35ms at 95%
VALUES (4 subj * 8 props) * 10 parallel requests ~ 75ms at 95%

UNION (1 subj * 8 props) * 5 parallel requests ~ 17ms at 95%
UNION (1 subj * 8 props) * 20 parallel requests ~ 136ms at 95%
UNION (4 subj * 8 props) * 5 parallel requests ~ 52ms at 95%
UNION (4 subj * 8 props) * 10 parallel requests ~ 124ms at 95%
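
The multi-subject configurations above (4 subj per request) imply
splitting the M subject URIs into fixed-size batches, with one request per
batch. A minimal sketch of that split, assuming a hypothetical helper
named SubjectBatcher.partition; the batch size is a tuning parameter that
the numbers above do not settle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jena.
final class SubjectBatcher {

    // Splits items into consecutive batches of at most batchSize elements.
    // The last batch may be smaller than the others.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> subjects = List.of("ns:subj1", "ns:subj2", "ns:subj3",
                "ns:subj4", "ns:subj5");
        // Each inner list would become the subject set of one request.
        System.out.println(partition(subjects, 4));
    }
}
```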


On Sun, Mar 29, 2020 at 3:10 AM Balhoff, Jim <ba...@renci.org> wrote:

> I usually do this sort of thing in one query using VALUES. For example:
>
> SELECT ?s ?p ?o
> WHERE {
>   VALUES ?s { ns:subj1 ns:subj2 ns:subj3 }
>   VALUES ?p { ns:prop1 ns:prop2 }
>   ?s ?p ?o .
> }
>
> Best regards,
> Jim

Re: Best practice to combine simple queries

Posted by "Balhoff, Jim" <ba...@renci.org>.
I usually do this sort of thing in one query using VALUES. For example:

SELECT ?s ?p ?o
WHERE {
  VALUES ?s { ns:subj1 ns:subj2 ns:subj3 }
  VALUES ?p { ns:prop1 ns:prop2 }
  ?s ?p ?o .
}

Best regards,
Jim
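
A minimal sketch of generating this query text from two lists in plain
Java. ValuesQueryBuilder and build are hypothetical names, not part of
Jena. The terms are concatenated unescaped, so this assumes trusted input
already in SPARQL syntax, with the ns: prefix declared elsewhere.

```java
import java.util.List;

// Hypothetical helper, not part of Jena: builds plain SPARQL text.
final class ValuesQueryBuilder {

    // Emits one VALUES clause per variable, so the two lists may have
    // different lengths; the join of the clauses covers every
    // (subject, property) combination.
    static String build(List<String> subjects, List<String> properties) {
        return "SELECT ?s ?p ?o\n"
             + "WHERE {\n"
             + "  VALUES ?s { " + String.join(" ", subjects) + " }\n"
             + "  VALUES ?p { " + String.join(" ", properties) + " }\n"
             + "  ?s ?p ?o .\n"
             + "}";
    }

    public static void main(String[] args) {
        System.out.println(build(
                List.of("ns:subj1", "ns:subj2", "ns:subj3"),
                List.of("ns:prop1", "ns:prop2")));
    }
}
```

Because ?s and ?p come back in every result row, the client can group the
rows by subject and property after a single round trip.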

