Posted to users@jena.apache.org by Rob Vesse <rv...@dotnetrdf.org> on 2014/09/15 11:31:07 UTC

Enable strict IRI parsing in query parser?

Is there an easy way to enable strict IRI parsing in the query parser?

For example the following user query is accepted by ARQ:

SELECT *
    WHERE {
      ?subject rdfs:subClassOf <http:/google.com> .
    }

Note the incorrect URI. When it is put through the IRI validator at
sparql.org, ARQ produces the following:

http:/google.com ==> http:/google.com
<http:/google.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A
component that is required by the scheme is missing.

Is there any way to get this behaviour from the query parser?

Rob

Re: Enable strict IRI parsing in query parser?

Posted by Andy Seaborne <an...@apache.org>.

On 17/09/14 09:46, Rob Vesse wrote:
> Closing the loop on this: I can confirm that the specific example is invalid.
>
> RFC 7230 (HTTP 1.1) Section 2.7.1
> (http://tools.ietf.org/html/rfc7230#section-2.7.1)
>
> A sender MUST NOT generate an "http" URI with an empty host
>     identifier.  A recipient that processes such a URI reference MUST
>     reject it as invalid
>
>
> So actually the IRI validator is quite correct in rejecting the example
> URI, because URIs of that form, while permitted by the generic syntax,
> are not allowed by the specific scheme.
>
> Rob

Thanks for checking 7230.

An empty host is the case where there is a // but no host name:

http://:1234/foobar   -- with port
http:///foobar        -- default port
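
The difference between a URI with no authority at all (such as http:/google.com) and one whose authority has an empty host can be sketched in plain Java. This is only an illustration and is not ARQ or jena-iri code: the class `HostPresence`, its enum and the `classify` method are names made up for this example, and the input is assumed to begin with a scheme followed by a colon.

```java
final class HostPresence {
    enum Host { NO_AUTHORITY, EMPTY_HOST, HAS_HOST }

    // Hypothetical helper: assumes the input starts with "scheme:" and does no other validation.
    static Host classify(String uri) {
        int colon = uri.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no scheme in: " + uri);
        }
        String rest = uri.substring(colon + 1);
        if (!rest.startsWith("//")) {
            return Host.NO_AUTHORITY;            // e.g. http:/google.com
        }
        // The authority runs from after "//" to the next '/', '?' or '#'.
        String afterSlashes = rest.substring(2);
        int end = afterSlashes.length();
        for (char stop : new char[] {'/', '?', '#'}) {
            int i = afterSlashes.indexOf(stop);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String authority = afterSlashes.substring(0, end);
        // Drop any userinfo, then any port (a colon after the closing ']' of an IP literal, if present).
        String hostPort = authority.substring(authority.lastIndexOf('@') + 1);
        int portColon = hostPort.lastIndexOf(':');
        boolean colonIsPortSeparator = portColon >= 0 && hostPort.indexOf(']') < portColon;
        String host = colonIsPortSeparator ? hostPort.substring(0, portColon) : hostPort;
        return host.isEmpty() ? Host.EMPTY_HOST : Host.HAS_HOST;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://:1234/foobar", "http:///foobar", "http:/google.com", "http://google.com/"
        };
        for (String uri : samples) {
            System.out.println(uri + " -> " + classify(uri));
        }
    }
}
```

Under these assumptions the two examples above classify as EMPTY_HOST, while http:/google.com classifies as NO_AUTHORITY.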

	Andy


Re: Enable strict IRI parsing in query parser?

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Closing the loop on this: I can confirm that the specific example is invalid.

RFC 7230 (HTTP 1.1) Section 2.7.1
(http://tools.ietf.org/html/rfc7230#section-2.7.1)

A sender MUST NOT generate an "http" URI with an empty host
   identifier.  A recipient that processes such a URI reference MUST
   reject it as invalid


So actually the IRI validator is quite correct in rejecting the example
URI, because URIs of that form, while permitted by the generic syntax,
are not allowed by the specific scheme.

Rob


Re: Enable strict IRI parsing in query parser?

Posted by Andy Seaborne <an...@apache.org>.

On 16/09/14 08:47, Rob Vesse wrote:
> Yes, it looks like email mangled it a bit, but you got the correct gist: an
> IRI with http:/, i.e. only a single slash followed by some further path
> components.
>
> If, as you say, this is a valid albeit unusual IRI, how come the IRI
> validator rejects it?  Is it requiring that all IRIs be absolute?

The IRI code has a bunch of things it detects.  The IRI factory is then 
set to decide what to treat as fatal errors and which to report as 
warnings but continue.

The IRI validator prints out all errors and all warnings that the IRI code
reports.  It is set to be more verbose than other code.
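
The split between detecting a problem and deciding how serious it is can be sketched as follows. This is a hypothetical illustration in plain Java and not the jena-iri API: the names `ViolationPolicySketch`, `Check`, `Severity`, `LENIENT` and `STRICT` are invented here, only the code name REQUIRED_COMPONENT_MISSING comes from the validator output above, and the detection rule (an http or https IRI without "//") is a deliberate simplification.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class ViolationPolicySketch {
    enum Check { REQUIRED_COMPONENT_MISSING }
    enum Severity { WARNING, ERROR }

    // Two hypothetical policies over the same detected checks.
    static final Map<Check, Severity> LENIENT =
            Collections.singletonMap(Check.REQUIRED_COMPONENT_MISSING, Severity.WARNING);
    static final Map<Check, Severity> STRICT =
            Collections.singletonMap(Check.REQUIRED_COMPONENT_MISSING, Severity.ERROR);

    // Detection step: reports what was found, without judging severity.
    static Set<Check> detect(String iri) {
        Set<Check> found = EnumSet.noneOf(Check.class);
        boolean httpLike = iri.regionMatches(true, 0, "http:", 0, 5)
                || iri.regionMatches(true, 0, "https:", 0, 6);
        if (httpLike && !iri.substring(iri.indexOf(':') + 1).startsWith("//")) {
            found.add(Check.REQUIRED_COMPONENT_MISSING);
        }
        return found;
    }

    // Policy step: reject only if a detected check is mapped to ERROR; otherwise warn and continue.
    static boolean accepts(String iri, Map<Check, Severity> policy) {
        for (Check check : detect(iri)) {
            if (policy.get(check) == Severity.ERROR) {
                return false;
            }
            System.err.println("warning: " + check + " in <" + iri + ">");
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http:/google.com", LENIENT));
        System.out.println(accepts("http:/google.com", STRICT));
    }
}
```

With the same detection code, the lenient policy accepts http:/google.com with a warning while the strict policy rejects it, which is the kind of difference described above between the query parser and the validator.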

------
Where are you parsing these queries?  App code? Fuseki?
It might make sense for relative URIs, in more circumstances, to default
to at least logged warnings, and maybe errors.

(Actually, http:/foo is an absolute URI! All "absolute" means is that it
has a scheme name.  As an http URI it is incomplete; I'm not sure if
there is a technical term for a "complete" HTTP URI with authority and
path.)


       URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

       hier-part   = "//" authority path-abempty
                   / path-absolute
                   / path-rootless
                   / path-empty

and the example is case 2 : path-absolute, not case 1.
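
As a rough check of which alternative applies, here is a small sketch using the JDK's `java.net.URI` alongside a hand-written classifier. The class `HierPartDemo` and the method `hierPartCase` are hypothetical names for this example, not part of ARQ, and the classifier assumes the input already has a scheme.

```java
import java.net.URI;

final class HierPartDemo {
    // Names the hier-part alternative matched by the text after "scheme:".
    static String hierPartCase(String uri) {
        int colon = uri.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no scheme in: " + uri);
        }
        String hier = uri.substring(colon + 1);
        // Cut off any query and fragment so only the hier-part remains.
        for (char stop : new char[] {'?', '#'}) {
            int i = hier.indexOf(stop);
            if (i >= 0) {
                hier = hier.substring(0, i);
            }
        }
        if (hier.startsWith("//")) return "authority path-abempty";
        if (hier.startsWith("/")) return "path-absolute";
        if (hier.isEmpty()) return "path-empty";
        return "path-rootless";
    }

    public static void main(String[] args) {
        URI u = URI.create("http:/foo");
        System.out.println("absolute:  " + u.isAbsolute());    // has a scheme
        System.out.println("authority: " + u.getAuthority());  // none present
        System.out.println("path:      " + u.getPath());
        System.out.println("hier-part: " + hierPartCase("http:/foo"));
    }
}
```

For http:/foo, `java.net.URI` reports the URI as absolute with no authority and a path of /foo, and the classifier reports path-absolute.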

	Andy



Re: Enable strict IRI parsing in query parser?

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Yes, it looks like email mangled it a bit, but you got the correct gist: an
IRI with http:/, i.e. only a single slash followed by some further path
components.

If, as you say, this is a valid albeit unusual IRI, how come the IRI
validator rejects it?  Is it requiring that all IRIs be absolute?

Rob


Re: Enable strict IRI parsing in query parser?

Posted by Andy Seaborne <an...@apache.org>.

On 15/09/14 11:25, Rob Vesse wrote:
> Found one way of doing this:
>
> query.setBaseURI(new IRIResolver());
>
> However, you have to do this in the setup of the parser before the query is
> parsed, which is not something your average user will have access to, and
> setting it after parsing has happened has no effect.
>
> So how would an average user who is not customising the query parser
> enable strict IRI parsing?
>
> Rob
>
> On 15/09/2014 10:31, "Rob Vesse" <rv...@dotnetrdf.org> wrote:
>
>> Is there an easy way to enable strict IRI parsing in the query parser?
>>
>> For example the following user query is accepted by ARQ:
>>
>> SELECT *
>>     WHERE {
>>       ?subject rdfs:subClassOf <http:/google.com <http://google.com/
>> <http://google.com/>>> .
>>     }

(not sure if email has damaged that example)

>>
>> Note the incorrect URI. When it is put through the IRI validator at
>> sparql.org, ARQ produces the following:
>>
>> http:/google.com ==> http:/google.com
>> <http:/google.com> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A
>> component that is required by the scheme is missing.
>>
>> Is there any way to get this behaviour from the query parser?
>>
>> Rob
>>

http:/path is a valid URI - it's a rather odd one but the host name is 
optional and when resolved will be the host name of the base.

It does occur for real on the web - e.g. https:/login swaps the protocol 
to https if you were using http: and it works whatever hostname you got 
to that page from.

	Andy





Re: Enable strict IRI parsing in query parser?

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Found one way of doing this:

query.setBaseURI(new IRIResolver());

However, you have to do this in the setup of the parser before the query is
parsed, which is not something your average user will have access to, and
setting it after parsing has happened has no effect.

So how would an average user who is not customising the query parser
enable strict IRI parsing?

Rob
