Posted to users@jena.apache.org by Dave Griffith <da...@data.world> on 2019/06/07 18:18:26 UTC

Re: Batching federated calls using VALUES block

After a bit of work, I have what appears to be a working version of Jena
with batching SERVICE calls. It's sort of complex, so I'll be adding more
tests and documentation before submitting a pull request to y'all.  Are
there any contributor docs I should read, particularly around coding
standards, configurability, or level of testing expected?  I'd hate to get
the etiquette wrong here.  Around level of testing in particular, this is
as I say a pretty complex feature and deserves to be fully tested, but I'd
hate to slow down your (pretty darn fast) build.

Thanks.  It's been a delight extending your work.

Dave Griffith
Principal Engineer
data.world

On Wed, May 1, 2019 at 4:34 AM Andy Seaborne <an...@apache.org> wrote:

> Dave,
>
> By changing the order of parts of the query, the number of SERVICE calls
> can change.  Sometimes it is better to grab more data, once, than many
> small calls. And not just for performance if the remote endpoint is
> across the unreliable internet.
>
> As Rob says, batching for SERVICE calls would be good to have.
>
>      Andy
>
> On 01/05/2019 09:40, Rob Vesse wrote:
> > Dave
> >
> > Yes this is what is happening.  This stems from the fact that ARQ is
> > designed as a lazy streaming evaluation engine, i.e. it tries to do the
> > least work possible to answer the query. This is why the underlying
> > implementation is all iterator driven.  In some cases the engine does have
> > to batch up everything in order to proceed, e.g. DISTINCT/aggregation.
> >
> > Introducing some degree of batching for SERVICE blocks might be a nice
> > optimisation. I think this will definitely be valuable to the community;
> > contributions are always appreciated.
> >
> > Thanks,
> >
> > Rob
> >
> > On 30/04/2019, 18:31, "Dave Griffith" <da...@data.world> wrote:
> >
> >      I'm tracking down an issue with a very slow federated query.  Looking
> >      through logs, Jena appears to be doing one call to the remote endpoint
> >      for every set of values that match locally.  This struck me as odd,
> >      since the SPARQL federation specs suggest that implementations may
> >      create "batched" queries to remote endpoints using VALUES blocks to
> >      pass multiple bindings.  Looking through the source, it appears that
> >      Jena isn't doing that, but instead actually is issuing one remote call
> >      per binding.
> >
> >      Am I correct in assuming that this optimization isn't being done, or
> >      am I missing something?  Looking through the source, it looks like it
> >      wouldn't be _too_ difficult to change the QueryIterService class to
> >      batch up some number of results into an OpTable.  OpAsQuery.asQuery
> >      would then render that as a VALUES block before calling the remote
> >      endpoint.  There are a variety of issues to be resolved, most
> >      especially around batch size, but those don't appear insurmountable.
> >      I haven't found any discussion of this possible optimization, but it's
> >      entirely possible I just didn't know where to look.  I'd be happy to
> >      do the work and submit a patch, but if there's a reason that people
> >      think this optimization shouldn't be done, I'd love to hear it before
> >      I start.
> >
> >      Thanks for reading, and I'd love to hear any thoughts on the matter.
> >
> >      Dave Griffith
> >      Principal Engineer
> >      data.world
> >
> >
> >
> >
> >
>
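
The batching idea discussed in the quoted messages above can be sketched
in a few lines. This is only an illustration of the technique, not Jena's
code and not the implementation Dave describes: the names batches,
values_block and remote_query are made up for this sketch, bindings are
modelled as plain dicts of already-serialised RDF terms, and the batch
size is an arbitrary parameter. Upstream bindings are still consumed
lazily, but in groups, and each group becomes a single remote query
carrying a VALUES block.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Hypothetical model: variable name -> already-serialised RDF term.
Binding = Dict[str, str]


def batches(bindings: Iterable[Binding], size: int) -> Iterator[List[Binding]]:
    """Lazily group an upstream binding iterator into lists of at most `size`."""
    it = iter(bindings)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def values_block(batch: List[Binding]) -> str:
    """Render one batch as a SPARQL VALUES block; unbound variables become UNDEF."""
    variables = sorted({var for binding in batch for var in binding})
    header = " ".join("?" + var for var in variables)
    rows = [
        "    (" + " ".join(binding.get(var, "UNDEF") for var in variables) + ")"
        for binding in batch
    ]
    return "VALUES (" + header + ") {\n" + "\n".join(rows) + "\n}"


def remote_query(pattern: str, batch: List[Binding]) -> str:
    """Build the query text for one remote call covering a whole batch."""
    return "SELECT * WHERE {\n" + values_block(batch) + "\n" + pattern + "\n}"


if __name__ == "__main__":
    local = [{"s": "<http://example.org/r%d>" % i} for i in range(5)]
    for batch in batches(local, 2):
        print(remote_query("    ?s ?p ?o .", batch))
        print()
```

With five local bindings and a batch size of 2 this produces three remote
queries instead of five. A real implementation would also have to join
the remote results back to the local bindings that produced them, and
deal with terms that cannot appear in a VALUES block, such as blank
nodes.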

Re: Batching federated calls using VALUES block

Posted by Andy Seaborne <an...@apache.org>.

On 07/06/2019 19:18, Dave Griffith wrote:
> After a bit of work, I have what appears to be a working version of Jena
> with batching SERVICE calls. It's sort of complex, so I'll be adding more
> tests and documentation before submitting a pull request to y'all.  Are
> there any contributor docs I should read, particularly around coding
> standards, configurability, or level of testing expected?  I'd hate to get
> the etiquette wrong here.

https://github.com/apache/jena/blob/master/CONTRIBUTING.md

Style is more of a "we prefer" - the most important thing is the 
contribution!

If it is modifying existing code, follow the style of the class.
The codebase has a long history - different styles in different places.

> Around level of testing in particular, this is
> as I say a pretty complex feature and deserves to be fully tested, but I'd
> hate to slow down your (pretty darn fast) build.

Forking off a Fuseki server is not too expensive - the build does it 
multiple times already.

This is what the jena-integration-tests/ module is for. You do of
course need Fuseki built to launch it, and client-server testing ends up
in this integration tests module.

> 
> Thanks.  It's been a delight extending your work.

Thank you!
Looking forward to the PR,

     Andy


Re: Batching federated calls using VALUES block

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Dave

Thanks for continuing to look at this.  We don't have strict code standards: with such an old code base there is a wide variety of code styles, so the general rule is to follow the surrounding code style.  https://jena.apache.org/getting_involved/reviewing_contributions.html details our reviewing guidelines, but these are fairly flexibly enforced.

Yes, good test coverage for something like this would be a must.  If the tests are particularly slow, they can always be put in a separate module and excluded from the faster "dev" profile.  You may want to create a separate tests module anyway, because then you'd be able to depend on Fuseki embedded and bring up Fuseki servers as part of your tests.  Since the SERVICE logic lives in ARQ, trying to depend on Fuseki from there would create a circular dependency.  See https://jena.apache.org/documentation/fuseki2/fuseki-run.html#fuseki-main, particularly the bit on Fuseki as a Configurable and Embeddable SPARQL Server.

Rob
