Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2011/02/16 08:30:57 UTC

[jira] Created: (STANBOL-89) SolrYard uses string field for natural text queries

SolrYard uses string field for natural text queries
---------------------------------------------------

Key: STANBOL-89
URL: https://issues.apache.org/jira/browse/STANBOL-89
Project: Stanbol
Issue Type: Bug
Components: Entity Hub
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Priority: Minor

This issue describes a change to the way the SolrYard indexes values with the data type xsd:string, in order to improve support for natural language text searches over such values. The change removes a wrong assumption present in the current implementation. Details below.

Background:

The Entityhub distinguishes "natural language text" from normal values such as integers, floats, dates and string values. This is mainly because one might want to process natural language differently from normal string values. For example, when processing natural language text one might want to use whitespace tokenization, stop word filters and/or stemming, but for ISBN numbers, article numbers or postal codes such algorithms lead to unwanted effects.
This distinction is not specific to the Entityhub; it is also present within RDF. RDF defines "PlainLiterals" (with an optional xml:lang attribute) used to represent natural language text and "TypedLiterals" (with a data type, typically an xsd type) to represent other values (including xsd:string). This is also reflected in the RDF APIs, including Clerezza's RDF model.

Solr also provides a lot of functionality to improve the indexing and searching of natural language texts. Therefore the correct declaration of natural language texts and string values is important for getting the expected search results.
For natural language texts the Solr schema.xml used by the SolrYard defines a fieldType that uses the WhitespaceTokenizer, StopFilterFactory, WordDelimiterFilter and LowerCaseFilter. For English texts the SnowballPorterFilter (stemming) is used as well.
In contrast, string fields do not use any tokenizer.
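
The difference between the two kinds of field can be illustrated with a small stand-alone sketch. It does not use Solr; the class name, the method names and the stop word list are assumptions made up for illustration, and word delimiting and stemming are left out. The sketch also lower-cases before removing stop words, which is a simplification of the filter order given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a rough stand-in for the two field types, not the SolrYard code.
class AnalysisSketch {

    // Hypothetical stop word list; a real schema.xml configures its own.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

    /** Text field: split on whitespace, lower-case, drop stop words. */
    static List<String> analyzeText(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.trim().split("\\s+")) {
            String token = part.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** String field: the complete value is kept as one unprocessed term. */
    static List<String> analyzeString(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        System.out.println(analyzeText("Rupert Westenthaler"));   // [rupert, westenthaler]
        System.out.println(analyzeString("Rupert Westenthaler")); // [Rupert Westenthaler]
        System.out.println(analyzeString("978-3-16-148410-0"));   // identifier stays intact
    }
}
```

A value such as an ISBN is better served by analyzeString, because splitting or lower-casing it would only make exact lookups less reliable.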

The Problem:

Many developers of applications that produce RDF data do not use the RDF APIs correctly. TypedLiterals with the data type xsd:string are often used to create literals representing natural language texts. This is mostly because RDF APIs typically provide some kind of LiteralFactory to create RDF literals from Java objects, so passing a Java String instance representing a natural language text will create a TypedLiteral with the data type xsd:string. Even the Stanbol Enhancer is no exception, because it also creates TypedLiterals holding natural language texts! Developers usually only use PlainLiterals if there is a requirement to specify the language.
The conclusion is that components MUST NOT assume that string values do not represent natural language texts. However, they also cannot assume that all string values are in fact natural language texts.
The best solution is to let users define how to interpret the values when they interact with the data (at query time).

Old Implementation:

Prior to this change the SolrYard indexed "natural language text" and "string" values differently.
String values for a field were stored with the prefix "str" without any processing.
Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" for English texts, "@" for texts without a language) and processed by the tokenizer and filters described above. In addition, texts were also stored within a field with the prefix "_!@" that combined the natural language text values of all languages.
To include string values in the results of natural language text queries, such queries were created to also search within the "str" field. Here is an example of a query for "Rupert" within the field "rdfs:label":
"(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
However, this had one important shortcoming: the second term of the query searched within a field that is not suited for natural language text searches. To describe this in more detail, let's assume the value "Rupert Westenthaler" is defined in the following two ways:
(1) "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would end up as two tokens, "rupert" and "westenthaler", within the "@/rdfs:label/" and the "_!@/rdfs:label/" fields.
(2) "rdfs:label" -> "Rupert Westenthaler"^^xsd:string (TypedLiteral) would end up as one token, "Rupert Westenthaler", within the "str/rdfs:label/" field.

In case (1) the above query would select the document; in case (2) it would not. This is because the query assumes that it searches values indexed as natural language text, but the "str/rdfs:label/" field does not fulfill this requirement: its only term is the complete, unprocessed value, so the single word "Rupert" does not match.
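
The two cases can be replayed with a small in-memory model. This is only a sketch under assumptions: a "document" is a map from Solr field name to indexed terms, the text analysis is reduced to whitespace splitting and lower-casing, and all class and method names are made up for illustration. Only the field prefixes and the "prefix/property/" naming are taken from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: models the old indexing behaviour, not the SolrYard code.
class OldIndexingSketch {

    static String fieldName(String prefix, String property) {
        return prefix + "/" + property + "/";
    }

    /** Simplified text analysis: whitespace split plus lower-casing. */
    static List<String> textTerms(String value) {
        return Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** Case (1): a PlainLiteral is tokenized into "@{lang}" and the merged "_!@" field. */
    static Map<String, List<String>> indexPlainLiteral(String property, String lang, String value) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put(fieldName("@" + lang, property), textTerms(value));
        doc.put(fieldName("_!@", property), textTerms(value));
        return doc;
    }

    /** Case (2), old behaviour: an xsd:string value is stored only in "str", unprocessed. */
    static Map<String, List<String>> indexStringOld(String property, String value) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put(fieldName("str", property), List.of(value));
        return doc;
    }

    /** Old text query: (merged text field contains word) OR (str field equals word). */
    static boolean matchesOldTextQuery(Map<String, List<String>> doc, String property, String word) {
        boolean inText = doc.getOrDefault(fieldName("_!@", property), List.of())
                .contains(word.toLowerCase(Locale.ROOT));
        boolean inStr = doc.getOrDefault(fieldName("str", property), List.of())
                .contains(word);
        return inText || inStr;
    }

    public static void main(String[] args) {
        Map<String, List<String>> plain = indexPlainLiteral("rdfs:label", "", "Rupert Westenthaler");
        Map<String, List<String>> typed = indexStringOld("rdfs:label", "Rupert Westenthaler");
        System.out.println(matchesOldTextQuery(plain, "rdfs:label", "Rupert")); // true
        System.out.println(matchesOldTextQuery(typed, "rdfs:label", "Rupert")); // false
    }
}
```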

Solution:

The solution is to change the indexing so that string values are also indexed within the "_!@" field. This means that searches within that field assume that all string values actually represent natural language texts. Searches for string values need to use the "str" field. This ensures that string value searches (e.g. for an ISBN number) still work as intended, while searches for natural language texts also have access to string values.
As a positive side effect, natural language searches no longer need to search in two different fields (meaning that the OR clause shown in the example above is no longer needed).
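
A sketch of the changed behaviour, using the same kind of in-memory model (a map from field name to terms). The class and method names are hypothetical, the text analysis is reduced to whitespace splitting and lower-casing, and the escape helper only handles '!' and ':' to reproduce the example query shown earlier; it is not a complete Solr query escaper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: models the new indexing behaviour, not the SolrYard code.
class NewIndexingSketch {

    static String fieldName(String prefix, String property) {
        return prefix + "/" + property + "/";
    }

    /** Simplified text analysis: whitespace split plus lower-casing. */
    static List<String> textTerms(String value) {
        return Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** New behaviour: a string value is kept as-is in "str" AND tokenized into "_!@". */
    static Map<String, List<String>> indexString(String property, String value) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put(fieldName("str", property), List.of(value));
        doc.put(fieldName("_!@", property), textTerms(value));
        return doc;
    }

    /** Natural language search: only the merged text field is consulted. */
    static boolean matchesText(Map<String, List<String>> doc, String property, String word) {
        return doc.getOrDefault(fieldName("_!@", property), List.of())
                .contains(word.toLowerCase(Locale.ROOT));
    }

    /** String search: exact match against the unprocessed value. */
    static boolean matchesString(Map<String, List<String>> doc, String property, String value) {
        return doc.getOrDefault(fieldName("str", property), List.of()).contains(value);
    }

    /** Escapes only '!' and ':' (assumption: enough for these field names). */
    static String escape(String s) {
        return s.replace("!", "\\!").replace(":", "\\:");
    }

    /** The text query now needs a single clause instead of the former OR. */
    static String textQuery(String property, String word) {
        return escape(fieldName("_!@", property)) + ":" + word;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = indexString("rdfs:label", "Rupert Westenthaler");
        System.out.println(matchesText(doc, "rdfs:label", "Rupert"));                // true
        System.out.println(matchesString(doc, "rdfs:label", "Rupert Westenthaler")); // true
        System.out.println(matchesString(doc, "rdfs:label", "Rupert"));              // false
        System.out.println(textQuery("rdfs:label", "Rupert"));
    }
}
```

The exact-match behaviour of the "str" field is unchanged, so a lookup by complete value (for example an ISBN number) behaves as before.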

Additional Note:
It would also be possible to index natural language text values without a defined language within the string field. This would remove the assumption that each natural language text value does in fact represent natural language text and not a string. However, until someone can point to real-world cases where datasets wrongly use PlainLiterals instead of TypedLiterals with the data type xsd:string, there is no practical advantage to that.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by Enrico Daga <en...@gmail.com>.

Hi

On 17 February 2011 16:28, Rupert Westenthaler <rw...@apache.org> wrote:
> Hi
>
>> There was a discussion a few weeks ago on the list about when closing
>> issues. As I understood, fixed issues should be set to "Resolved" and
>> issues will be closed when we do a release.
> I can remember this discussion but honestly had not thought about it when
> closing this issue.
>
> However after checking I noticed that "resolved" seems to be no
> longer an option when closing an issue.
>
> The current options are
>  - fixed
>  - won't fix
>  - duplicate
>  - invalid
>  - incomplete
>  - cannot reproduce
>  - later
>  - not a problem
>
> So I suggest to use "fixed" in future
+1
>
> best
> Rupert
>
> On Thu, Feb 17, 2011 at 3:07 PM, Fabian Christ
> <ch...@googlemail.com> wrote:
>> Hi,
>>
>> this issue is "closed" with resolution "Fixed".
>>
>> There was a discussion a few weeks ago on the list about when closing
>> issues. As I understood, fixed issues should be set to "Resolved" and
>> issues will be closed when we do a release.
>>
>>  - Fabian
>>
>> 2011/2/16 Rupert Westenthaler (JIRA) <ji...@apache.org>:
>>>
>>>     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Rupert Westenthaler closed STANBOL-89.
>>> --------------------------------------
>>>
>>>    Resolution: Fixed
>>>
>>> Fixed with Revision 1071231
>>>
>>> This change does not invalidate old indexes, because text searches within string fields did not really work before.
>>> However, to benefit from this change one would need to update the indices.
>> --
>> Fabian
>>
>
>
>
> --
> | Rupert Westenthaler                            rwesten@apache.org
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>



-- 
Enrico Daga

--
http://www.enridaga.net
skype: enri-pan

Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by Fabian Christ <ch...@googlemail.com>.

Hi,

I tried it, and you can reopen and then resolve the issue with resolution = fixed.

I just did this for STANBOL-89.

Best,
 - Fabian




-- 
Fabian

Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by Rupert Westenthaler <rw...@apache.org>.

Hi

> There was a discussion a few weeks ago on the list about when closing
> issues. As I understood, fixed issues should be set to "Resolved" and
> issues will be closed when we do a release.
I can remember this discussion but honestly had not thought about it when
closing this issue.

However after checking I noticed that "resolved" seems to be no
longer an option when closing an issue.

The current options are
 - fixed
 - won't fix
 - duplicate
 - invalid
 - incomplete
 - cannot reproduce
 - later
 - not a problem

So I suggest to use "fixed" in future

best
Rupert




-- 
| Rupert Westenthaler                            rwesten@apache.org
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: [jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by Fabian Christ <ch...@googlemail.com>.

Hi,

this issue is "closed" with resolution "Fixed".

There was a discussion a few weeks ago on the list about when closing
issues. As I understood, fixed issues should be set to "Resolved" and
issues will be closed when we do a release.

 - Fabian

2011/2/16 Rupert Westenthaler (JIRA) <ji...@apache.org>:
>
>     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Rupert Westenthaler closed STANBOL-89.
> --------------------------------------
>
>    Resolution: Fixed
>
> Fixed with Revision 1071231
>
> This change does not invalidate old indexes, because text searches within string fields did not really work before.
> However, to benefit from this change one would need to update the indices.



-- 
Fabian

[jira] Reopened: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by "Fabian Christ (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Christ reopened STANBOL-89:
----------------------------------


Should be set to 'Resolved' instead of 'Closed'


[jira] Resolved: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by "Fabian Christ (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Christ resolved STANBOL-89.
----------------------------------

    Resolution: Fixed

Resolved with solution Fixed.


[jira] Closed: (STANBOL-89) SolrYard uses string field for natural text queries

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler closed STANBOL-89.
--------------------------------------

    Resolution: Fixed

Fixed with Revision 1071231

This change does invalidate old indexes, because text searches within the string field had not really worked before.
However, to benefit from this change one would need to update the indices.
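
Why an index has to be rebuilt to benefit can be illustrated with a small model. The sketch below is not Solr or SolrYard code: analyze_text() only imitates a whitespace tokenizer followed by lower-casing, the field names "str" and "text" stand in for the "str" and "_!@" prefixes from the issue description, and all function names are hypothetical.

```python
# Simplified model of the two field kinds discussed in STANBOL-89.
# Assumption: a text field splits on whitespace and lower-cases; the real
# Solr analyzer chain (stop words, word delimiter, stemming) is left out.

def analyze_text(value):
    """Tokenize like a natural language text field."""
    return {token.lower() for token in value.split()}


def analyze_string(value):
    """A string field keeps the whole value as one untouched token."""
    return {value}


def index_document(string_values, reindexed):
    """Build the token sets for the xsd:string values of one document.

    reindexed=False models an index built before the change (string values
    end up only in the "str" field); reindexed=True models an index rebuilt
    after it (string values are additionally analyzed into the text field).
    """
    index = {"str": set(), "text": set()}
    for value in string_values:
        index["str"] |= analyze_string(value)
        if reindexed:
            index["text"] |= analyze_text(value)
    return index


def matches_text_query(index, term):
    """Natural language search after the change: only the text field is queried."""
    return term.lower() in index["text"]


old_index = index_document(["Rupert Westenthaler"], reindexed=False)
new_index = index_document(["Rupert Westenthaler"], reindexed=True)

print(matches_text_query(old_index, "Rupert"))    # index built before the change
print(matches_text_query(new_index, "Rupert"))    # index rebuilt after the change
print("Rupert Westenthaler" in new_index["str"])  # exact string lookup
```

In this model the value "Rupert Westenthaler" typed as xsd:string is only found by a text query for "Rupert" once the document has been indexed again, while an exact lookup in the string field behaves the same before and after.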
