Posted to users@jena.apache.org by Mikael Pesonen <mi...@lingsoft.fi> on 2020/01/09 09:34:35 UTC

Text search and similar

Hi,

I asked about these a few years ago, so maybe there are some new ideas.

1) Is it possible to configure the text index so that it automatically 
adds, for example, all textual values (xsd:string etc.) to the index? 
Currently every property has to be configured manually.

2) Is there any planned support for searching for similar resources, 
based on the Lucene index?

Br

--

Re: Text search and similar

Posted by Mikael Pesonen <mi...@lingsoft.fi>.

Index all string literals:
https://issues.apache.org/jira/browse/JENA-1821

Search similar:
https://issues.apache.org/jira/browse/JENA-1822


On 13/01/2020 20.48, Chris Tomlinson wrote:
> Hi Mikael,
>
>> On Jan 13, 2020, at 3:30 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
>> I'm not quite sure what resource localName and entmap mean but this would be ideal yes.
>>
>> Reason for this is that we are providing our customers a file/metadata service so we don't have info on what metadata is inputted. For that reason we are using external Lucene index now and that is a bit of hassle.
> The localName of a resource URI, e.g., skos:prefLabel, is “prefLabel”. The entmap is discussed <https://jena.apache.org/documentation/query/text-query.html#entity-map-definition> in the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation. The entmap associates an RDF property localName with a field in a Lucene document. This is what would be needed to use text:search to find triples. I.e., Lucene needs to know what field to search over for a given property.
>
> I’m still not seeing an answer regarding what constitutes "similar values” so I can’t respond to that.
>
> Please use the Jena issue tracker <https://issues.apache.org/jira/browse/JENA> and open an issue for the feature you’re proposing and refer to the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> for information about what is currently supported and what configuration capabilities are provided.
>
> Thank you,
> Chris

-- 
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.pesonen@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND

Re: Text search and similar

Posted by Mikael Pesonen <mi...@lingsoft.fi>.

Hi Chris,

On 13/01/2020 20.48, Chris Tomlinson wrote:
> Hi Mikael,
>
>> On Jan 13, 2020, at 3:30 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
>> I'm not quite sure what resource localName and entmap mean but this would be ideal yes.
>>
>> Reason for this is that we are providing our customers a file/metadata service so we don't have info on what metadata is inputted. For that reason we are using external Lucene index now and that is a bit of hassle.
> The localName of a resource URI, e.g., skos:prefLabel, is “prefLabel”. The entmap is discussed <https://jena.apache.org/documentation/query/text-query.html#entity-map-definition> in the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation. The entmap associates an RDF property localName with a field in a Lucene document. This is what would be needed to use text:search to find triples. I.e., Lucene needs to know what field to search over for a given property.
>
> I’m still not seeing an answer regarding what constitutes "similar values” so I can’t respond to that.
About similar: it would be enough to be able to find similar triple 
values. We store each document as plain text in a single value and 
would like to find other values that are similar to it.
>
> Please use the Jena issue tracker <https://issues.apache.org/jira/browse/JENA> and open an issue for the feature you’re proposing and refer to the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> for information about what is currently supported and what configuration capabilities are provided.
Okay I'll open issues for both. Thanks!
>
> Thank you,
> Chris

Re: Text search and similar

Posted by Chris Tomlinson <ch...@gmail.com>.

Hi Mikael,

> On Jan 13, 2020, at 3:30 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
> 
>> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
> I'm not quite sure what resource localName and entmap mean but this would be ideal yes.
> 
> Reason for this is that we are providing our customers a file/metadata service so we don't have info on what metadata is inputted. For that reason we are using external Lucene index now and that is a bit of hassle.

The localName of a resource URI, e.g., skos:prefLabel, is “prefLabel”. The entmap is discussed <https://jena.apache.org/documentation/query/text-query.html#entity-map-definition> in the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> documentation. The entmap associates an RDF property localName with a field in a Lucene document. This is what would be needed to use text:search to find triples. I.e., Lucene needs to know what field to search over for a given property.
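
To make the localName and entmap idea concrete, here is a minimal Python sketch. It is not Jena code and does not call the Jena or Lucene APIs; the functions local_name and build_entmap, the "_2" suffix convention, and the sample triples are all hypothetical. It only illustrates how a field name could be derived from each property's localName for string-valued triples, which is roughly what building the entmap on the fly would involve.

```python
# Illustrative sketch only: not Jena's API. All names here are hypothetical.

def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


def build_entmap(triples):
    """Map each property URI that has a string-literal object to a field name.

    `triples` is an iterable of (subject, predicate, object) tuples. For this
    sketch, string literals are Python str values and every other kind of
    object (numbers, resources) is assumed to be some non-str value.
    """
    entmap = {}
    used = set()
    for _s, p, o in triples:
        if not isinstance(o, str) or p in entmap:
            continue
        field = local_name(p)
        # Two vocabularies can share a localName (e.g. "title"), so add a
        # numeric suffix when the plain localName is already taken.
        candidate, n = field, 2
        while candidate in used:
            candidate = f"{field}_{n}"
            n += 1
        used.add(candidate)
        entmap[p] = candidate
    return entmap


SKOS = "http://www.w3.org/2004/02/skos/core#"
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"

triples = [
    (EX + "c1", SKOS + "prefLabel", "technology"),
    (EX + "c1", DCT + "title", "Technology concept"),
    (EX + "c1", EX + "title", "Another title"),
    (EX + "c1", EX + "pages", 42),  # not a string literal, so not mapped
]

entmap = build_entmap(triples)
for prop, field in entmap.items():
    print(field, "<-", prop)
```

The name clash between the two "title" properties shows one reason a hand-written entmap is currently required: a localName alone does not always identify a property uniquely, so an automatic scheme would need some rule for disambiguating field names.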

I’m still not seeing an answer regarding what constitutes “similar values”, so I can’t respond to that.

Please use the Jena issue tracker <https://issues.apache.org/jira/browse/JENA> and open an issue for the feature you’re proposing and refer to the Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html> for information about what is currently supported and what configuration capabilities are provided.

Thank you,
Chris

Re: Text search and similar

Posted by Mikael Pesonen <mi...@lingsoft.fi>.


On 12/01/2020 21.50, Chris Tomlinson wrote:
> Hi Mikael,
>
>> On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>
>> Hi Chris,
>>
>> On 09/01/2020 17.50, Chris Tomlinson wrote:
>>> Hello Br,
>>>
>>>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I asked about these few years ago so maybe there is some new ideas.
>>>>
>>>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
>>> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
>> In ideal case all values that are of type string literal would be indexed. Querys would work as now, you would define the properties you are querying, for example
>>
>> *(?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en") Of course I don't know how hard this would be to implement. *
> So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.
I'm not quite sure what resource localName and entmap mean, but this 
would be ideal, yes.

The reason for this is that we provide our customers with a 
file/metadata service, so we don't know in advance what metadata will 
be input. For that reason we are currently using an external Lucene 
index, which is a bit of a hassle.
>
>
>>>> 2) Is there planned support for searching similar resources, based on the Lucene index?
>>> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>>>
>>> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
>> This would be fine. At least for our needs it would enough to find similar values only, not entire resources.
> I’m sorry I still don’t know what constitutes "similar values”. I’m guessing you’re referring to using Lucene fuzzy matches, proximity matches and the like. These are already supported to an extent (see Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html>).
>
> This sort of thing would not be released until Jena 3.15 at the earliest. I haven’t given any implementation thought to this other than what’s written here.
>
> Regards,
> Chris
>
>
>>> Chris
>>>
>>>> Br
>>>>
>>>> -- 
>>>>

Re: Text search and similar

Posted by Chris Tomlinson <ch...@gmail.com>.

Hi Mikael,

> On Jan 10, 2020, at 4:26 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
> 
> 
> Hi Chris,
> 
> On 09/01/2020 17.50, Chris Tomlinson wrote:
>> Hello Br,
>> 
>>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> I asked about these few years ago so maybe there is some new ideas.
>>> 
>>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
>> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
> In ideal case all values that are of type string literal would be indexed. Querys would work as now, you would define the properties you are querying, for example
> 
> *(?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en") Of course I don't know how hard this would be to implement. *

So, you're wanting objects of type xsd:string and rdf:langString to be indexed with the property/predicate appearing in the triple. This in turn would mean that a field name would need to be created based on the resource localName of the property and for rdf:langString a default lang field name would need to be defined in the assembler file along with whatever multi-language analyzer structure is needed. This is tantamount to creating the entmap for the Lucene index configuration on-the-fly.


>> 
>>> 2) Is there planned support for searching similar resources, based on the Lucene index?
>> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>> 
>> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
> This would be fine. At least for our needs it would enough to find similar values only, not entire resources.

I’m sorry, I still don’t know what constitutes “similar values”. I’m guessing you’re referring to using Lucene fuzzy matches, proximity matches and the like. These are already supported to an extent (see Jena Full Text Search <https://jena.apache.org/documentation/query/text-query.html>).
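
For reference, the fuzzy and proximity matches mentioned above use the standard Lucene query-parser syntax: a trailing ~N on a term for fuzzy matching and a trailing ~N on a quoted phrase for proximity. The Python helpers below are a hypothetical sketch that only builds such query strings; they do not call Jena or Lucene, and they assume the inputs contain no characters that would need escaping. The resulting string would go where "tech*" appears in the text:query example earlier in this thread.

```python
# Illustrative helpers: they only build Lucene query-parser strings.
# The function names are hypothetical; nothing here calls Jena or Lucene.

def fuzzy(term: str, max_edits: int = 2) -> str:
    """Build a fuzzy term query such as 'technology~2'.

    Lucene's fuzzy queries accept an edit distance of 0, 1 or 2.
    """
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    if not term or any(ch.isspace() for ch in term):
        raise ValueError("fuzzy() expects a single non-empty term")
    return f"{term}~{max_edits}"


def proximity(phrase: str, slop: int) -> str:
    """Build a proximity query: the phrase's words within `slop` positions."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    return f'"{phrase}"~{slop}'


print(fuzzy("technology"))
print(proximity("language management", 3))
```

Note that both forms match variations of a query the user types in. They do not take an existing literal value and rank other values by similarity to it, which may be closer to what is being asked for.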

This sort of thing would not be released until Jena 3.15 at the earliest. I haven’t given any implementation thought to this other than what’s written here.

Regards,
Chris


>> 
>> Chris
>> 
>>> Br
>>> 
>>> -- 
>>> 
>> 
> 

Re: Text search and similar

Posted by Mikael Pesonen <mi...@lingsoft.fi>.

Hi Chris,

On 09/01/2020 17.50, Chris Tomlinson wrote:
> Hello Br,
>
>> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
>>
>>
>> Hi,
>>
>> I asked about these few years ago so maybe there is some new ideas.
>>
>> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.
> No it is not currently possible. Perhaps more detail on how you would see using such a feature and how you would handle various literal datatypes (convert all to xsd:string?) and then how would you search, currently searches are focussed on one or more properties - a recent update allows to provide a list of properties that can be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html <https://jena.apache.org/documentation/query/text-query.html>.
In the ideal case, all values that are string literals would be 
indexed. Queries would work as they do now: you would define the 
properties you are querying, for example

(?concept ?score ?prefLabel) text:query (skos:prefLabel "tech*" "lang:en")

Of course, I don't know how hard this would be to implement.
>
>> 2) Is there planned support for searching similar resources, based on the Lucene index?
> I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.
>
> Please note that the Jena+Lucene model is to index individual triples as Lucene documents not entire graphs or models which in turn leads to indexing and searching focussed on properties.
This would be fine. At least for our needs it would be enough to find 
similar values only, not entire resources.
>
> Chris
>
>> Br
>>
>> -- 
>>
>


Re: Text search and similar

Posted by Chris Tomlinson <ch...@gmail.com>.

Hello Br,

> On Jan 9, 2020, at 3:34 AM, Mikael Pesonen <mi...@lingsoft.fi> wrote:
> 
> 
> Hi,
> 
> I asked about these few years ago so maybe there is some new ideas.
> 
> 1) Is it possible to config text index so that it would add, for example, all textual values (xsd:string etc) to index automatically? Now every property has to be configured manually.

No, it is not currently possible. Perhaps you could give more detail on how you would see using such a feature, how you would handle the various literal datatypes (convert all to xsd:string?), and how you would then search. Currently searches are focussed on one or more properties; a recent update allows a list of properties to be searched in a single Lucene search. More detail is available at https://jena.apache.org/documentation/query/text-query.html.

> 
> 2) Is there planned support for searching similar resources, based on the Lucene index?

I’m not aware of any such plans. More detail would be needed to evaluate feasibility, in particular how to recognize resources as similar.

Please note that the Jena+Lucene model is to index individual triples as Lucene documents, not entire graphs or models, which in turn leads to indexing and searching focussed on properties.

Chris

> 
> Br
> 
> -- 
>