Posted to issues@spark.apache.org by "Yash Datta (JIRA)" <ji...@apache.org> on 2018/01/12 08:47:00 UTC

[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL

     [ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yash Datta updated SPARK-23056:
-------------------------------
    Description: 
When a URL contains an internationalized domain name (IDN), for example:

val url = "http://правительство.рф"

parse_url returns null, although Hive's version of parse_url works fine.

Digging further, I found that the difference lies in the following call in Spark:


{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}



whereas Hive uses java.net.URL:

{code:java}
url = new URL(urlStr)
{code}

Sure enough, this simple test shows that URL works but URI does not in this case:

{code:java}
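import java.net.{URI, URL}

val url = "http://правительство.рф"

// URI cannot parse the non-ASCII host, so getHost is null (no exception is thrown)
val uriHost = new URI(url).getHost
// URL is lenient and returns the host as written
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф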
{code}
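
One possible workaround, offered only as a sketch and not as the fix adopted for this issue: convert the host to its ASCII (Punycode) form with java.net.IDN before handing it to URI. The object and method names below are hypothetical and are not part of Spark; the sketch also drops the query and fragment for brevity.

```scala
import java.net.{IDN, URI, URL}

// Hypothetical helper for illustration only; not Spark code.
object IdnHostSketch {
  // Host as java.net.URI sees it: None when URI cannot parse the host.
  def hostViaUri(url: String): Option[String] =
    Option(new URI(url).getHost)

  // Let the lenient java.net.URL extract the host, convert it to ASCII,
  // then rebuild a URI from the scheme, the ASCII host, and the path.
  def hostViaAsciiUri(url: String): Option[String] = {
    val lenient = new URL(url)
    val asciiHost = IDN.toASCII(lenient.getHost)
    Option(new URI(lenient.getProtocol, asciiHost, lenient.getPath, null).getHost)
  }
}
```

Note that hostViaAsciiUri returns the Punycode form of the host, not the Unicode form that URL returns; IDN.toUnicode could be applied to the result if the original spelling is wanted.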

To reproduce the problem on spark-sql:

{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
This returns NULL.

This regression was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to
improve the performance of PARSE_URL().

The same issue affects the following query:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```
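
A likely explanation, stated as an assumption since the report does not spell it out: java.net.URI rejects the double-quote characters in the query string with a URISyntaxException, which getUrl turns into null, while java.net.URL does not validate them. A minimal check outside Spark might look like this (the object name is hypothetical):

```scala
import java.net.{URI, URL}
import scala.util.Try

// Hypothetical standalone check; not Spark code.
object QueryCharsSketch {
  val raw = "http://stanzhai.site?p=[\"abc\"]"

  // Mirrors the try/catch in getUrl: a parse failure becomes None.
  val queryViaUri: Option[String] = Try(new URI(raw).getQuery).toOption

  // URL does not validate the characters in the query string.
  val queryViaUrl: String = new URL(raw).getQuery
}
```

If the assumption holds, queryViaUri is empty while queryViaUrl still carries the raw query, which would match the difference between Spark 2.1+ and earlier versions described above.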

> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-23056
>                 URL: https://issues.apache.org/jira/browse/SPARK-23056
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.2, 2.3.0
>            Reporter: Yash Datta



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
