Posted to users@jena.apache.org by Adrian Gschwend <ml...@netlabs.org> on 2023/06/14 09:47:04 UTC

State of Elastic/Open Search support in Fuseki

According to https://jena.apache.org/documentation/query/text-query.html 
there was support for text search using Elastic instead of Lucene in 
Fuseki at some point at least. But from what I can see it was removed 
(?) in 4.x.

We have a use case where faceted search is important, and this is quite 
hard in SPARQL 1.1: paging and counting are less than ideal. Either the 
queries get very complex or the counts are wrong.
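
A minimal sketch of the kind of facet-count query this refers to, written
as a small Python helper that builds the SPARQL text. The `text:query` form
with `rdfs:label` follows the jena-text documentation linked above; the
function name, the facet property and the choice of `rdfs:label` as the
indexed property are assumptions for illustration only.

```python
def facet_count_query(search_term: str, facet_property: str) -> str:
    """Build a SPARQL 1.1 query that counts text-search hits per facet value."""
    # Escape backslashes and double quotes so the term stays a valid string literal.
    escaped = search_term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?facet (COUNT(DISTINCT ?s) AS ?hits)
WHERE {{
  ?s text:query (rdfs:label "{escaped}") .
  ?s <{facet_property}> ?facet .
}}
GROUP BY ?facet
ORDER BY DESC(?hits)
"""


if __name__ == "__main__":
    # Hypothetical facet property; replace with one from your own data.
    print(facet_count_query("council", "http://example.org/level"))
```

The `DISTINCT` inside `COUNT` is there in case one subject matches more
than once. A paged result list for the same search needs a second query
with `LIMIT`/`OFFSET`, which is where the complexity mentioned above
starts to add up.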

What was the reason for removing that code, lack of maintenance? If so, 
any ideas on how much work it would be to bring this to the 4.x codebase 
again? I guess it might also make sense to switch to OpenSearch 
instead, given the Elastic license issues.

Does anyone have (or has anyone had) Elastic in use with Fuseki and can 
share some experience? I found some posts here and there, but not many 
details about how the integration worked.

regards

Adrian


Re: State of Elastic/Open Search support in Fuseki

Posted by Nicholas Car <ni...@kurrawong.net>.
We use Lucene in Fuseki 4.x quite successfully. Perhaps support for Elastic was removed simply because Lucene is supported and is fine for most use cases.

Lucene support does not seem to make faceting available (as recently discussed here by David in my company), so there are likely Lucene improvements that can be made.

Can you articulate what advantages you see in Elastic/OpenSearch support over Lucene?

Nick

Re: State of Elastic/Open Search support in Fuseki

Posted by Andy Seaborne <an...@apache.org>.

On 19/06/2023 13:29, Adrian Gschwend wrote:
> On 16.06.23 21:53, Andy Seaborne wrote:
> 
> Hi Andy,
> 
>>  From the documentation:
> ah thanks, I read it but have to wrap my head around it with some 
> examples to understand what happens.
> 
>> There is also the model of "One document equals one entity" model that 
>> might be more appropriate faceted search. It returns the subject URI 
>> with a Lucene document for multiple triples.
> 
> same. That might be what I had in mind.
> 
>> There then needs to be a facet property function. Would someone like 
>> to sketch one out as a GH issue?
> 
> I'll try to come up with some examples so we can see what would be useful.
> 
>> ** ElasticSearch - if we can negotiate the licensing issues (the 
>> client libs are OSS but to test them needs a server so it impacts the 
>> build; there may be a testcontainers.io way round this, or optional 
>> tests - we need the build to be clean as well as the produced 
>> binaries), then this could be done and/or solr. It does need someone 
>> or someones to take an interest in this both now and for keeping the 
>> code maintained especially if any security issues arise.
> 
> But the licensing issues would be solved if we switch to OpenSearch or 
> am I missing something?

Probably not. The feature sets aren't identical, but for Jena usage the 
difference is probably not significant.

There is a testcontainer:
https://github.com/opensearch-project/opensearch-testcontainers

     Andy

> And I agree on the interest, I will think about it more and see if we 
> have a case that is worth spending some time/money on that.
> 
> regards
> 
> Adrian

Re: State of Elastic/Open Search support in Fuseki

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 16.06.23 21:53, Andy Seaborne wrote:

Hi Andy,

>  From the documentation:
ah thanks, I read it but have to wrap my head around it with some 
examples to understand what happens.

> There is also the model of "One document equals one entity" model that 
> might be more appropriate faceted search. It returns the subject URI 
> with a Lucene document for multiple triples.

same. That might be what I had in mind.

> There then needs to be a facet property function. Would someone like to 
> sketch one out as a GH issue?

I'll try to come up with some examples so we can see what would be useful.

> ** ElasticSearch - if we can negotiate the licensing issues (the client 
> libs are OSS but to test them needs a server so it impacts the build; 
> there may be a testcontainers.io way round this, or optional tests - we 
> need the build to be clean as well as the produced binaries), then this 
> could be done and/or solr. It does need someone or someones to take an 
> interest in this both now and for keeping the code maintained especially 
> if any security issues arise.

But the licensing issues would be solved if we switch to OpenSearch or 
am I missing something?

And I agree on the interest, I will think about it more and see if we 
have a case that is worth spending some time/money on that.

regards

Adrian

Re: State of Elastic/Open Search support in Fuseki

Posted by Andy Seaborne <an...@apache.org>.
** Faceted search

From the documentation:

There is also the "One document equals one entity" model, which 
might be more appropriate for faceted search. It returns the subject URI 
with a Lucene document for multiple triples.

"""
When using this integration model, text:query returns the subject URI 
for the document
"""

There then needs to be a facet property function. Would someone like to 
sketch one out as a GH issue?

** ElasticSearch - if we can negotiate the licensing issues (the client 
libs are OSS but to test them needs a server so it impacts the build; 
there may be a testcontainers.io way round this, or optional tests - we 
need the build to be clean as well as the produced binaries), then this 
could be done and/or solr. It does need someone or someones to take an 
interest in this both now and for keeping the code maintained especially 
if any security issues arise.

     Andy

Re: State of Elastic/Open Search support in Fuseki

Posted by Øyvind Gjesdal <oy...@gmail.com>.
Hi Adrian,

From the git history I can see that we were using Elasticsearch 6.4.3 and
Fuseki 3.10.0 when we experimented with it and had it running.

AFAIK both Lucene and Elasticsearch index the data for you as the
triplestore updates, without you having to do anything, using a normal
configuration.

We mostly ended up using the defaults for jena-text and Lucene, but I think
what I missed was changing the analyzer or tokenizer. On second thought,
maybe I could have used the Elasticsearch settings REST endpoint?
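
For illustration, a sketch of what a custom analyzer looks like in an
Elasticsearch/OpenSearch index-creation body, built here as a Python dict.
The `settings.analysis` and `mappings.properties` structure is the standard
one for these servers; the analyzer name, the field name and the filter
choice are assumptions, and nothing in this thread confirms whether the old
Fuseki integration would have respected index settings made this way.

```python
import json


def index_settings_body(text_field: str) -> dict:
    """Build an index-creation body that applies a custom analyzer to one text field."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name: lowercases and folds accents.
                    "folding_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                text_field: {"type": "text", "analyzer": "folding_analyzer"}
            }
        },
    }


if __name__ == "__main__":
    # The JSON printed here would be sent as the body when creating the index.
    print(json.dumps(index_settings_body("label"), indent=2))
```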

Best regards
Øyvind

Re: State of Elastic/Open Search support in Fuseki

Posted by Adrian Gschwend <ml...@netlabs.org>.
On 14.06.23 14:45, Øyvind Gjesdal wrote:

Hi Øyvind,


> Facet/aggregation was not implemented as extension functions in SPARQL and
> I believe that it also used the same abstraction described in the jena-text
> docs:
> 
>>   One Jena *triple* equals one Lucene *document*
> which makes aggregations/facets not available or usable neither from the
> Elasticsearch APIs.

Yes, I saw that and I also thought that's probably not ideal. I don't 
know much about Elastic in practice, I mainly read tutorials & 
documentation. What I had in mind was that we could define, for example 
via a SHACL shape (or something comparable), what a "document" contains. 
So the shapes would define how we see the document, and we could use 
this abstraction for search: the integration would take SHACL shapes, 
create a "document" from each that is consumable by Elastic, and then we 
could use this for search.
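
A rough sketch of that idea in plain Python, under strong simplifying
assumptions: the "shape" is reduced to a target class plus a mapping from
predicate IRIs to field names (roughly what `sh:targetClass` and
`sh:path` would express), and triples are plain string tuples. No real
SHACL parsing or Jena API is involved; all names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_documents(triples, target_class, properties):
    """Build one dict per entity of target_class, keeping only the mapped properties.

    triples      iterable of (subject, predicate, object) strings
    target_class IRI of the class whose instances become documents
    properties   mapping of predicate IRI -> document field name
    """
    triples = list(triples)
    # Every subject typed with the target class becomes one document.
    docs = {
        s: {"uri": s}
        for s, p, o in triples
        if p == RDF_TYPE and o == target_class
    }
    # Collect the values of the mapped properties into multi-valued fields.
    for s, p, o in triples:
        if s in docs and p in properties:
            docs[s].setdefault(properties[p], []).append(o)
    return docs
```

Each resulting dict could then be sent to the search index as a single
document, which is the "one document equals one entity" layout.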

The second thing is that I'm mainly interested in an integration where we 
don't have to update the Elastic index on our own. I guess that the 
Fuseki integration takes care of that so it's "in sync" all the time. I 
would want the Elastic API available as well, as it is easier to use 
for the facet use cases than pure SPARQL. Paging is not trivial in 
SPARQL for use cases like this; the Elastic API, however, is built for that.
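
To make the comparison concrete, here is a sketch of a search request body
that combines paging and a facet count in one call, using the standard
`from`/`size` and `terms` aggregation keys of the Elasticsearch/OpenSearch
query DSL. The field names (`label`, `level.keyword`) and the function
name are assumptions; they depend entirely on how documents are indexed.

```python
import json


def facet_search_body(text, facet_field, page, page_size=10):
    """Build a _search request body with from/size paging and a terms aggregation."""
    if page < 1:
        raise ValueError("page is 1-based")
    return {
        # Paging: skip the hits of all earlier pages.
        "from": (page - 1) * page_size,
        "size": page_size,
        # Full-text match on a hypothetical 'label' field.
        "query": {"match": {"label": text}},
        # Facet: counts per distinct value of facet_field over all matches.
        "aggs": {"facet": {"terms": {"field": facet_field, "size": 20}}},
    }


if __name__ == "__main__":
    print(json.dumps(facet_search_body("council", "level.keyword", page=2), indent=2))
```

The aggregation is computed over all matching documents, not only the
current page, so the hit list and the facet counts come back together.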

> We switched to jena-text with Lucene after some weeks, which didn't have
> aggregations either, but there was much more activity and usage for the
> module, and the options for configuring from the assembler files were much
> richer.

OK, any example of what you configure in there? I don't think I saw much 
in the documentation for that so far. Aggregations are definitely 
something I would like to have. One example is archival records, where 
we have a hierarchy in the data. I need to be able to show that 
hierarchy per record (which has its own IRI) and to browse by hierarchy 
levels as well. This is super easy to represent in RDF but super hard to 
query efficiently.

> At the moment I'm unsure if I inspected and looked at the Elasticsearch
> APIs directly to check the structure of the documents in the index itself,
> after indexing.

What versions did you work on with Elastic?

regards

Adrian


Re: State of Elastic/Open Search support in Fuseki

Posted by Øyvind Gjesdal <oy...@gmail.com>.
Hi Adrian,

We tried the elastic-search module when it was available and had your
use-case in mind with facets. But as far as I remember I don't think it was
possible to use aggregations (at least from the sparql side of things).
I understood the elasticsearch-module as an alternative to the lucene
module, used in a similar manner, using the same extension function
(text:query).

Facet/aggregation was not implemented as extension functions in SPARQL and
I believe that it also used the same abstraction described in the jena-text
docs:

>  One Jena *triple* equals one Lucene *document*

which means aggregations/facets are not available or usable from the
Elasticsearch APIs either.
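
A small self-contained illustration of why the document layout matters for
facets, using plain Python lists in place of a real index. The data, field
names and the naive substring "search" are all made up for the example; it
only shows the structural point, not how Lucene or Elasticsearch work
internally.

```python
from collections import Counter

# One index document per triple: each document holds a single field.
triple_docs = [
    {"uri": "ex:r1", "title": "Letter to the council"},
    {"uri": "ex:r1", "level": "item"},
    {"uri": "ex:r2", "title": "Council minutes"},
    {"uri": "ex:r2", "level": "file"},
]

# One index document per entity: all fields of a subject sit together.
entity_docs = [
    {"uri": "ex:r1", "title": "Letter to the council", "level": "item"},
    {"uri": "ex:r2", "title": "Council minutes", "level": "file"},
]


def facet_counts(docs, term, search_field, facet_field):
    """Count facet values over the documents whose search field contains the term."""
    hits = [d for d in docs if term.lower() in d.get(search_field, "").lower()]
    return Counter(d[facet_field] for d in hits if facet_field in d)


if __name__ == "__main__":
    print(facet_counts(triple_docs, "council", "title", "level"))
    print(facet_counts(entity_docs, "council", "title", "level"))
```

With one document per triple, the documents that match the search never
carry the facet field, so there is nothing to aggregate over. With one
document per entity, the same call yields a count per level.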

We switched to jena-text with Lucene after some weeks, which didn't have
aggregations either, but there was much more activity and usage for the
module, and the options for configuring from the assembler files were much
richer.

Just a disclaimer that this was a long time ago, and I could have
misunderstood how things worked.
At the moment I'm unsure if I inspected and looked at the Elasticsearch
APIs directly to check the structure of the documents in the index itself,
after indexing.

Best regards,
Øyvind



