Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/01/12 14:35:00 UTC

[jira] [Commented] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

    [ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324031#comment-16324031 ] 

Sean Owen commented on SPARK-23056:
-----------------------------------

[~saucam] that is not a valid URI or URL. I don't think this can be considered a bug. I'm surprised the URL class parses it, and I agree it's good to be consistent with Hive, but I'm not sure this behavior is guaranteed by the semantics of the function.

The switch to java.net.URI was made because the java.net.URL-based implementation was a big performance bottleneck. If there's a solution that avoids that bottleneck and is also lenient enough to match Hive, that could be OK, but I am not sure this should be considered a problem. You can URL-escape that URL.
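
As a rough illustration of the escaping suggested here, the sketch below converts an internationalized host name to its ASCII (Punycode) form with java.net.IDN before handing it to java.net.URI. The class name IdnHostExample and the helper httpUriForHost are hypothetical and are not part of Spark or Hive; this is one possible client-side workaround, not the fix adopted for this issue.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class IdnHostExample {
    // Hypothetical helper (not part of Spark or Hive): converts a Unicode
    // host name to its ASCII (Punycode) form and builds an http URI from it.
    public static URI httpUriForHost(String unicodeHost) throws URISyntaxException {
        String asciiHost = IDN.toASCII(unicodeHost);
        return new URI("http://" + asciiHost);
    }

    public static void main(String[] args) throws URISyntaxException {
        String host = "правительство.рф";

        // Parsing the raw string does not throw, but no host is recognized;
        // the issue description reports null here.
        System.out.println(new URI("http://" + host).getHost());

        // After conversion the host is plain ASCII and is recognized.
        URI escaped = httpUriForHost(host);
        System.out.println(escaped.getHost());

        // IDN.toUnicode maps the ASCII form back for display purposes.
        System.out.println(IDN.toUnicode(escaped.getHost()));
    }
}
```

The same conversion would have to be applied to the data before it reaches parse_url, and the returned host would then be the ASCII form rather than the original Unicode string.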

> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-23056
>                 URL: https://issues.apache.org/jira/browse/SPARK-23056
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.2, 2.3.0
>            Reporter: Yash Datta
>              Labels: regression
>
> When using internationalized domain names in URLs, such as:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null, but Hive's version of parse_url works fine.
> On digging further, I found that the difference is in the following call in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> whereas Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> Sure enough, this simple test demonstrates URL works but URI does not in this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem on spark-sql:
> {code:java}
> spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
> {code}
> This returns NULL.
> This problem was introduced by
> <https://issues.apache.org/jira/browse/SPARK-16826> which is designed to
> improve the performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> This returns null in Spark 2.1+, but returned ["abc"] in versions before Spark 2.1.
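
For the query-string case in the description, a minimal sketch of one way to escape the input is shown below. It relies on the multi-argument java.net.URI constructor, which percent-encodes characters that are not legal in a URI. The class name QueryEscapeExample and the helper escapeQuery are hypothetical, and the sketch assumes the scheme, authority, and query are already available as separate strings.

```java
import java.net.URI;
import java.net.URISyntaxException;

class QueryEscapeExample {
    // Hypothetical helper (not part of Spark or Hive): builds a URI from its
    // components so that illegal characters in the query are percent-encoded.
    public static URI escapeQuery(String scheme, String authority, String query)
            throws URISyntaxException {
        // Constructor arguments: scheme, authority, path, query, fragment.
        return new URI(scheme, authority, null, query, null);
    }

    public static void main(String[] args) {
        String query = "p=[\"abc\"]";

        try {
            // The single-argument constructor rejects the raw string because
            // the double quote is not a legal URI character.
            new URI("http://stanzhai.site?" + query);
        } catch (URISyntaxException e) {
            System.out.println("raw string rejected: " + e.getReason());
        }

        try {
            URI escaped = escapeQuery("http", "stanzhai.site", query);
            System.out.println(escaped);               // escaped form
            System.out.println(escaped.getRawQuery()); // still percent-encoded
            System.out.println(escaped.getQuery());    // decoded form
        } catch (URISyntaxException e) {
            System.out.println("escaping failed: " + e.getReason());
        }
    }
}
```

Note that a caller reading the raw query of the escaped URI sees the percent-encoded value, so it would need to decode the result to recover the original text.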



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
