Posted to users@jena.apache.org by Mikael Pesonen <mi...@lingsoft.fi> on 2020/01/09 09:34:35 UTC
Text search and similar
Hi,
I asked about these a few years ago, so maybe there are some new ideas.
1) Is it possible to configure the text index so that it automatically
adds, for example, all textual values (xsd:string etc.) to the index?
Currently every property has to be configured manually.
2) Is support planned for searching for similar resources, based on
the Lucene index?
Br
--
Re: Text search and similar
Posted by Mikael Pesonen <mi...@lingsoft.fi>.
Index all string literals:
https://issues.apache.org/jira/browse/JENA-1821
Search similar:
https://issues.apache.org/jira/browse/JENA-1822
--
Lingsoft - 30 years of Leading Language Management
www.lingsoft.fi
Speech Applications - Language Management - Translation - Reader's and Writer's Tools - Text Tools - E-books and M-books
Mikael Pesonen
System Engineer
e-mail: mikael.pesonen@lingsoft.fi
Tel. +358 2 279 3300
Time zone: GMT+2
Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND
Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND
Re: Text search and similar
Posted by Mikael Pesonen <mi...@lingsoft.fi>.
Hi Chris,
On 13/01/2020 20.48, Chris Tomlinson wrote:
> Hi Mikael,
>
>> On Jan 13, 2020, at 3:30 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
>> I'm not quite sure what resource localName and entmap mean but this would be ideal yes.
>>
>> Reason for this is that we are providing our customers a file/metadata service so we don't have info on what metadata is inputted. For that reason we are using external Lucene index now and that is a bit of hassle.
> The localName of a resource URI, e.g., skos:prefLabel, is “prefLabel”. The entmap is discussed <https://jena.apache.org/documentation/query/text-query.html#entity-map-definition> in the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation. The entmap associates an RDF property localName with a field in a Lucene document. This is what would be needed to use text:search to find triples. I.e., Lucene needs to know what field to search over for a given property.
>
> I’m still not seeing an answer regarding what constitutes "similar values” so I can’t respond to that.
About similar: it would be enough to be able to find similar triple
values. We store each document as plain text in a single value and
would like to find other values that are similar to it.
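One possible reading of "similar values" can be sketched in Python as cosine similarity over word counts. This is only an illustration of the request under that assumption; the function names (`term_vector`, `cosine_similarity`, `most_similar`) are hypothetical and nothing here is part of Jena or of the jena-text module.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lower-cased word tokens in a text value."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two text values based on raw term counts."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, values, top_n=3):
    """Return up to top_n (score, value) pairs, best match first."""
    scored = [(cosine_similarity(query, v), v) for v in values]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    stored = [
        "Annual report on language technology services",
        "Invoice for translation work",
        "Report on speech technology and language tools",
    ]
    for score, value in most_similar("language technology report", stored):
        print(f"{score:.2f}  {value}")
```

A ranking like this operates on whole literal values, which matches the case of one document stored per value; it says nothing about comparing entire resources.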
>
> Please use the Jena issue tracker <https://issues.apache.org/jira/browse/JENA> and open an issue for the feature you’re proposing and refer to the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> for information about what is currently supported and what configuration capabilities are provided.
Okay I'll open issues for both. Thanks!
>
> Thank you,
> Chris
Re: Text search and similar
Posted by Chris Tomlinson <ch...@gmail.com>.
Hi Mikael,
> On Jan 13, 2020, at 3:30 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>
>> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
> I'm not quite sure what resource localName and entmap mean but this would be ideal yes.
>
> Reason for this is that we are providing our customers a file/metadata service so we don't have info on what metadata is inputted. For that reason we are using external Lucene index now and that is a bit of hassle.
The localName of a resource URI, e.g., skos:prefLabel, is “prefLabel”. The entmap is discussed <https://jena.apache.org/documentation/query/text-query.html#entity-map-definition> in the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation. The entmap associates an RDF property localName with a field in a Lucene document. This is what would be needed to use text:search to find triples. I.e., Lucene needs to know what field to search over for a given property.
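As a rough illustration of building such a mapping on the fly, the Python sketch below derives a field name from each property's localName. The splitting rule (text after the last `#` or `/`), the collision handling, and the names `local_name` and `build_field_map` are assumptions made for this example; Jena applies its own rules for local names, and this is not the jena-text API.

```python
def local_name(uri):
    """Return the part of a URI after the last '#' or '/' (a simplification)."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


def build_field_map(property_uris):
    """Map each property URI to a field name based on its localName.

    Two properties can share a localName (e.g. rdfs:label and a custom
    'label'), so later ones get a numeric suffix. A real implementation
    would need a deliberate policy for such collisions.
    """
    field_map = {}
    used = set()
    for uri in property_uris:
        if uri in field_map:
            continue  # already mapped
        base = local_name(uri)
        name, n = base, 1
        while name in used:
            n += 1
            name = f"{base}{n}"
        used.add(name)
        field_map[uri] = name
    return field_map


if __name__ == "__main__":
    props = [
        "http://www.w3.org/2004/02/skos/core#prefLabel",
        "http://www.w3.org/2000/01/rdf-schema#label",
        "http://example.org/vocab/label",  # hypothetical property
    ]
    for uri, field in build_field_map(props).items():
        print(field, "<-", uri)
```

The collision case shows why the field name cannot simply be the localName in general: distinct properties must still end up in distinct fields for a per-property search to work.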
I’m still not seeing an answer regarding what constitutes “similar values”, so I can’t respond to that.
Please use the Jena issue tracker <https://issues.apache.org/jira/browse/JENA> to open an issue for the feature you’re proposing, and refer to the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation for information about what is currently supported and what configuration capabilities are provided.
Thank you,
Chris
Re: Text search and similar
Posted by Mikael Pesonen <mi...@lingsoft.fi>.
On 12/01/2020 21.50, Chris Tomlinson wrote:
> Hi Mikael,
>
>> On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>
>> Hi Chris,
>>
>> On 09/01/2020 17.50, Chris Tomlinson wrote:
>>> Hello Br,
>>>
>>>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I asked about these few years ago so maybe there is some new ideas.
>>>>
>>>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
>>> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
>> In ideal case all values that are of type string literal would be indexed. Querys would work as now, you would define the properties you are querying, for example
>>
>> (?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")
>>
>> Of course I don't know how hard this would be to implement.
> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
I'm not quite sure what resource localName and entmap mean, but yes,
this would be ideal.
The reason is that we provide our customers with a file/metadata
service, so we don't know in advance what metadata will be entered. For
that reason we are currently using an external Lucene index, which is a
bit of a hassle.
>
>
>>>> 2) Is there planned support for searching similar resources, based on the Lucene index?
>>> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>>>
>>> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
>> This would be fine. At least for our needs it would enough to find similar values only, not entire resources.
> I’m sorry I still don’t know what constitutes "similar values”. I’m guessing you’re referring to using Lucene fuzzy matches, proximity matches and the like. These are already supported to an extent (see Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html>).
>
> This sort of thing would not be released until Jena 3.15 at the earliest. I haven’t given any implementation thought to this other than what’s written here.
>
> Regards,
> Chris
Re: Text search and similar
Posted by Chris Tomlinson <ch...@gmail.com>.
Hi Mikael,
> On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>
>
> Hi Chris,
>
> On 09/01/2020 17.50, Chris Tomlinson wrote:
>> Hello Br,
>>
>>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>>
>>>
>>> Hi,
>>>
>>> I asked about these few years ago so maybe there is some new ideas.
>>>
>>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
>> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
> In ideal case all values that are of type string literal would be indexed. Querys would work as now, you would define the properties you are querying, for example
>
> (?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")
>
> Of course I don't know how hard this would be to implement.
So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
>>
>>> 2) Is there planned support for searching similar resources, based on the Lucene index?
>> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>>
>> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
> This would be fine. At least for our needs it would enough to find similar values only, not entire resources.
I’m sorry, I still don’t know what constitutes “similar values”. I’m guessing you’re referring to using Lucene fuzzy matches, proximity matches and the like. These are already supported to an extent (see Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html>).
This sort of thing would not be released until Jena 3.15 at the earliest. I haven’t given any implementation thought to this other than what’s written here.
Regards,
Chris
Re: Text search and similar
Posted by Mikael Pesonen <mi...@lingsoft.fi>.
Hi Chris,
On 09/01/2020 17.50, Chris Tomlinson wrote:
> Hello Br,
>
>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>
>> Hi,
>>
>> I asked about these few years ago so maybe there is some new ideas.
>>
>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
Ideally, all values that are string literals would be indexed. Queries
would work as they do now: you would specify the properties you are
querying, for example
(?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")
Of course, I don't know how hard this would be to implement.
>
>> 2) Is there planned support for searching similar resources, based on the Lucene index?
> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>
> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
This would be fine. At least for our needs it would be enough to find
similar values only, not entire resources.
>
> Chris
Re: Text search and similar
Posted by Chris Tomlinson <ch...@gmail.com>.
Hello Br,
> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>
>
> Hi,
>
> I asked about these few years ago so maybe there is some new ideas.
>
> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
No, it is not currently possible. Perhaps you could give more detail on how you would see such a feature being used, how you would handle the various literal datatypes (convert all to xsd:string?), and how you would search. Currently searches are focussed on one or more properties; a recent update allows a list of properties to be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html.
>
> 2) Is there planned support for searching similar resources, based on the Lucene index?
I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
Please note that the Jena+Lucene model is to index individual triples as Lucene documents, not entire graphs or models, which in turn leads to indexing and searching focussed on properties.
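The one-document-per-triple model, combined with the idea of indexing every string literal, can be sketched in Python as follows. This is a simplified illustration, not how jena-text builds its Lucene documents: the `Literal` type, the document keys (`uri`, `property`, `text`, `lang`) and the function names are assumptions made for the example.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
TEXT_DATATYPES = {XSD_STRING, RDF_LANG_STRING}


class Literal(NamedTuple):
    """A minimal stand-in for an RDF literal (hypothetical type)."""
    lexical: str
    datatype: str = XSD_STRING
    lang: Optional[str] = None


def triple_to_document(subject, predicate, obj):
    """Build one index document for one triple, or None if not textual."""
    if not isinstance(obj, Literal) or obj.datatype not in TEXT_DATATYPES:
        return None  # IRIs and non-string literals are not indexed
    doc = {"uri": subject, "property": predicate, "text": obj.lexical}
    if obj.lang is not None:
        doc["lang"] = obj.lang
    return doc


def index_triples(triples):
    """Return the documents for all indexable triples, one per triple."""
    docs = (triple_to_document(s, p, o) for s, p, o in triples)
    return [d for d in docs if d is not None]


if __name__ == "__main__":
    subject = "http://example.org/c1"  # hypothetical resource
    triples = [
        (subject, "http://www.w3.org/2004/02/skos/core#prefLabel",
         Literal("technology", RDF_LANG_STRING, "en")),
        (subject, "http://example.org/pages",
         Literal("42", "http://www.w3.org/2001/XMLSchema#integer")),
        (subject, "http://www.w3.org/2004/02/skos/core#broader",
         "http://example.org/c0"),
    ]
    for doc in index_triples(triples):
        print(doc)
```

Because each document carries a single property value, a search naturally targets properties rather than whole resources, which is the point made above.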
Chris