Posted to dev@spark.apache.org by yash datta <sa...@gmail.com> on 2018/01/12 04:05:11 UTC

[SQL] parse_url does not work for Internationalized domain names ?

Hi devs,

I stumbled across an interesting problem with the parse_url function that
was implemented in Spark in
https://issues.apache.org/jira/browse/SPARK-16281

When the URL contains an internationalized domain name, for example:

val url = "http://правительство.рф"  // ASCII form: http://xn--80aealotwbjpid2k.xn--p1ai

Spark's parse_url returns null, whereas Hive's version of parse_url works
fine.

On digging further, I found that the difference lies in the following call
in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}


while hive uses java.net.URL:

url = new URL(urlStr)


Sure enough, this simple test demonstrates URL works but URI does not in
this case:

import java.net.{URI, URL}

val url = "http://правительство.рф"

val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
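
One possible workaround, sketched here only as an illustration and not as what Spark or Hive actually does, is to convert the host to its ASCII (Punycode) form with java.net.IDN before handing it to java.net.URI. The helper name asciiHost and the object IdnHostSketch are hypothetical; the sketch assumes the input is a URL that java.net.URL can parse.

```scala
import java.net.{IDN, URI, URL}

object IdnHostSketch {
  // Hypothetical helper, not part of Spark: extract the host with
  // java.net.URL, then convert it to its ASCII (Punycode) form.
  def asciiHost(url: String): String =
    IDN.toASCII(new URL(url).getHost)

  def main(args: Array[String]): Unit = {
    val url = "http://правительство.рф"
    val ascii = asciiHost(url)

    // With an ASCII-only host, java.net.URI can parse the authority
    // as a server-based one, so getHost is no longer null.
    val uriHost = new URI(s"http://$ascii").getHost

    println(s"asciiHost = $ascii")
    println(s"uriHost   = $uriHost")
  }
}
```

For the URL above, the ASCII form is xn--80aealotwbjpid2k.xn--p1ai, which URI accepts as a regular host name. Whether parse_url should return the Unicode or the ASCII form in such a case is a separate design question.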


To reproduce the problem in spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL (the ASCII form of this host is xn--u8jxcyd029o9bg.test)

Could someone please explain the reason for using URI instead of URL? Does
this problem warrant creating a JIRA ticket?


Best Regards
Yash

-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.

Re: [SQL] parse_url does not work for Internationalized domain names ?

Posted by StanZhai <ma...@stanzhai.site>.
This problem was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to
improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```

I think it's a regression.
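
The difference can be reproduced outside Spark with a minimal sketch like the one below. It assumes the NULL comes from java.net.URI rejecting the unescaped double quotes in the query string, with the resulting URISyntaxException turned into null by getUrl; the object name QueryParseSketch is made up for this example.

```scala
import java.net.{URI, URL}
import scala.util.Try

object QueryParseSketch {
  val url = "http://stanzhai.site?p=[\"abc\"]"

  // java.net.URI validates the whole string and throws
  // URISyntaxException on the unescaped double quotes.
  val uriQuery: Try[String] = Try(new URI(url).getQuery)

  // java.net.URL does not validate the query component,
  // so construction succeeds and the raw query is returned.
  val urlQuery: Try[String] = Try(new URL(url).getQuery)

  def main(args: Array[String]): Unit = {
    println(s"uriQuery = $uriQuery")
    println(s"urlQuery = $urlQuery")
  }
}
```

The stricter validation in URI is what makes previously accepted inputs fail, which fits the regression described above.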



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: [SQL] parse_url does not work for Internationalized domain names ?

Posted by yash datta <sa...@gmail.com>.
Thanks for the prompt reply!

Opened a ticket here: https://issues.apache.org/jira/browse/SPARK-23056


BR
Yash

On Fri, Jan 12, 2018 at 3:41 PM, StanZhai <ma...@stanzhai.site> wrote:

> This problem was introduced by
> <https://issues.apache.org/jira/browse/SPARK-16826> which is designed to
> improve performance of PARSE_URL().
>
> The same issue exists in the following SQL:
>
> ```SQL
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
>
> // return null in Spark 2.1+
> // return ["abc"] less than Spark 2.1
> ```
>
> I think it's a regression.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


