Posted to users@jena.apache.org by Håvard Ottestad <hm...@gmail.com> on 2020/09/12 10:55:35 UTC

Question about basic vs extended language ranges

Hi,

I’ve been trying to get basic language ranges working for the SHACL engine in RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena implement basic language ranges.

The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
Specifically sections
 -  2.1.  Basic Language Range
 - 3.3.1.  Basic Filtering

Looking at the ABNF in 2.1.

   language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
   alphanum         = ALPHA / DIGIT

It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even “a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.

It seems like the range “en” would match a tag “en-gb” and a tag “en”.
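
A minimal Python sketch of the grammar in section 2.1 and of basic filtering in section 3.3.1 may make these cases concrete. The names `is_basic_range` and `basic_match` are hypothetical and are not taken from Jena or RDF4J. Comparison is case-insensitive as RFC 4647 specifies, and the non-empty-tag rule for "*" follows SPARQL's langMatches definition rather than the RFC.

```python
import re

# RFC 4647 section 2.1:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(language_range: str) -> bool:
    """Return True if the string is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(language_range) is not None


def basic_match(tag: str, language_range: str) -> bool:
    """Basic filtering (RFC 4647 section 3.3.1), compared case-insensitively."""
    if not is_basic_range(language_range):
        raise ValueError(f"not a basic language range: {language_range!r}")
    if language_range == "*":
        # SPARQL: "*" matches any non-empty language tag.
        return tag != ""
    tag, language_range = tag.lower(), language_range.lower()
    # Match on equality, or on a prefix that is followed by "-" in the tag.
    return tag == language_range or tag.startswith(language_range + "-")
```

With this sketch, `is_basic_range("en-*")` and `is_basic_range("*-gb")` are both false, while `basic_match("en-gb", "en")` and `basic_match("en", "en")` are both true.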

I had a deep dive into the langMatch code in Jena and it seems to support “*” at any position in the range. 

Is Jena supporting part of the extended range specification, or am I missing something? (I have been missing a lot of things lately :P so I wouldn’t be surprised).

Cheers,
Håvard



PS: From 2.2.  Extended Language Range

   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
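
As an illustration only, the extended grammar from section 2.2 can be written as a regular expression. The function name `is_extended_range` is hypothetical; this sketch checks syntax only and says nothing about how any particular engine matches.

```python
import re

# RFC 4647 section 2.2:
#   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(
    r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*"
)


def is_extended_range(language_range: str) -> bool:
    """Return True if the string is syntactically an extended language range."""
    return EXTENDED_RANGE.fullmatch(language_range) is not None
```

Under this grammar “*-gb” and “en-*” are both accepted, which is exactly what the basic grammar rules out.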

Re: Question about basic vs extended language ranges

Posted by Håvard Ottestad <hm...@gmail.com>.

Hi Andy,

These official specs are really annoying to read, I must admit. For langMatches, the SPARQL spec says: “Returns true if language-tag (first argument) matches language-range (second argument) per the basic filtering scheme”

I guess since it doesn’t say “Returns true ONLY if ….” we are allowed to return true for language ranges outside of the basic range. So it’s a kind of minimum requirement, and if extended filtering weren’t incompatible with basic filtering we could simply use that (e.g. there exists a range that gives a different answer as a basic range than as an extended range: “de-DE” against the tag “de-Latn-DE” is false by basic but true by extended).

Cheers,
Håvard

> On 12 Sep 2020, at 20:51, Andy Seaborne <an...@apache.org> wrote:
> 
> 
> 
> On 12/09/2020 17:58, Håvard Ottestad wrote:
>> Hi Andy,
>> Thanks for answering.
>> Do I understand correctly that a range like “en-*” is not a basic range, and that Jena supporting it is not fully in line with what the SPARQL spec requires?
> 
> It is not a basic range.
> (Your choices are exactly "*", or a prefix of subtags)
> 
> The SPARQL spec does not require specific behaviour for language tag patterns outside basic.
> 
> "requires" is a tricky thing in SPARQL because there is an extensibility mechanism, e.g. xsd:date support in any function call.
> 
>    Andy

Re: Question about basic vs extended language ranges

Posted by Andy Seaborne <an...@apache.org>.


On 12/09/2020 17:58, Håvard Ottestad wrote:
> Hi Andy,
> 
> Thanks for answering.
> 
> Do I understand correctly that a range like “en-*” is not a basic range, and that Jena supporting it is not fully in line with what the SPARQL spec requires?

It is not a basic range.
(Your choices are exactly "*", or a prefix of subtags)

The SPARQL spec does not require specific behaviour for language tag 
patterns outside basic.

"requires" is a tricky thing in SPARQL because there is an extensibility 
mechanism, e.g. xsd:date support in any function call.

     Andy


Re: Question about basic vs extended language ranges

Posted by Håvard Ottestad <hm...@gmail.com>.

Hi Andy,

Thanks for answering.

Do I understand correctly that a range like “en-*” is not a basic range, and that Jena supporting it is not fully in line with what the SPARQL spec requires?

Cheers,
Håvard



> On 12 Sep 2020, at 18:31, Andy Seaborne <an...@apache.org> wrote:
> 
> This is from a discussion this last week:
> 
>    https://github.com/TopQuadrant/shacl/issues/100

Re: Question about basic vs extended language ranges

Posted by Andy Seaborne <an...@apache.org>.

This is from a discussion this last week:

     https://github.com/TopQuadrant/shacl/issues/100

On 12/09/2020 11:55, Håvard Ottestad wrote:
> Hi,
> 
> I’ve been trying to get basic language ranges working for the SHACL engine in RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena implement basic language ranges.
> 
> The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
> Specifically sections
>   -  2.1.  Basic Language Range
>   - 3.3.1.  Basic Filtering
> 
> Looking at the ABNF in 2.1.
> 
>     language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
>     alphanum         = ALPHA / DIGIT
> 
> It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even “a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
> 
> It seems like the range “en” would match a tag “en-gb” and a tag “en”.
> 
> I had a deep dive into the langMatch code in Jena and it seems to support “*” at any position in the range.
> 
> Is Jena supporting part of the extended range specification, 

Jena LangMatches supports basic matching as required by SPARQL and 
SHACL, and does match some cases of "-*" but not properly by full RFC 
4647. More by accident than design, I suspect.

Calling it "part of extended" is generous. It fails to match "-*" 
against multiple subtags.

Basic is not completely compatible with extended.

Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.

Extended is sensitive to the fact that the second subtag, 'script', is 
4ALPHA, and 'region' is 2ALPHA or 3DIGIT, so "de-DE" matches like 
"de-*-DE" on language and region, skipping script. Each part of a 
language tag has a slightly different syntax and extended filtering 
seems to depend on this to do its jump ahead for "-*".
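
To make the basic/extended difference concrete, here is a minimal Python sketch of the extended filtering steps in RFC 4647 section 3.3.2, next to a plain prefix-based basic match. The function names are hypothetical and this is not Jena's NodeFunctions.langMatches code. Neither function validates its input, and the only subtag shape the sketch inspects is the one-character "singleton" check of step 3D.

```python
def extended_match(tag: str, language_range: str) -> bool:
    """Extended filtering following RFC 4647 section 3.3.2 (no validation)."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 2: the first subtags must match ("*" matches anything).
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    # Step 3: walk the remaining subtags of the range.
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1              # 3A: wildcard, move on in the range only
        elif t >= len(tag_subtags):
            return False        # 3B: the tag has run out of subtags
        elif range_subtags[r] == tag_subtags[t]:
            r += 1              # 3C: subtags match, advance both lists
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False        # 3D: a singleton (e.g. "x") stops the skip
        else:
            t += 1              # 3E: skip this subtag of the tag
    return True                 # Step 4: the range is exhausted


def basic_match(tag: str, language_range: str) -> bool:
    """Basic filtering (RFC 4647 section 3.3.1), for comparison."""
    if language_range == "*":
        return True
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")
```

With these definitions, `extended_match("de-Latn-DE", "de-DE")` is true while `basic_match("de-Latn-DE", "de-DE")` is false, and the singleton rule makes `extended_match("de-x-DE", "de-DE")` false.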

I haven't got my head around the full impact of extended matching. It 
assumes valid language tags, yet invalid (by RFC 5646) language tags 
exist.

But SPARQL and Turtle have a catch-all parse syntax based on the earlier 
RFC 3066 and HTTP at the time.  And in the real world, bad tags are common.

"a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in 
several ways; it is legal by 3066.

To add to the language tag fun, RDF and RFC 4646 disagree on the 
canonical form of language tags.

> or am I missing something? (I have been missing a lot of things lately :P so I wouldn’t be surprised).

This? :-)
https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566

"""
The NodeFunctions.langMatches code does look like it gets basic matching 
right (as SPARQL requires), test cases to the contrary welcome, but the 
handling of extended matching looks wrong for "-*" with multiple 
occurrences of subtags.

Extended matching is complicated and relies on (1) valid language tag 
input (2) the different parts of a language tag having different syntax.

"de-DE" does not match "de-Latn-DE" by basic but does by extended.
"""

     Andy

> 
> Cheers,
> Håvard
> 
> 
> 
> PS: From 2.2.  Extended Language Range
> 
>     extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
> 
>