Posted to issues@spark.apache.org by "Yash Datta (JIRA)" <ji...@apache.org> on 2018/01/12 08:47:00 UTC
[jira] [Updated] (SPARK-23056) parse_url regression when switched to using java.net.URI instead of java.net.URL
[ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yash Datta updated SPARK-23056:
-------------------------------
Description:
When a URL contains an internationalized domain name, for example:
val url = "http://правительство.рф"
parse_url returns null, although Hive's version of parse_url handles the same URL correctly.
Digging further, I found that the difference lies in the following call in Spark:
{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}
whereas Hive uses java.net.URL:
{code:java}
url = new URL(urlStr)
{code}
Sure enough, this simple test demonstrates that URL works but URI does not in this case:
{code:java}
import java.net.{URI, URL}

val url = "http://правительство.рф"
val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost
println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}
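To make the difference concrete, here is a minimal sketch of one possible fallback, assuming it is acceptable to consult java.net.URL when java.net.URI yields no host. The object name HostLookupSketch and the helper hostOf are hypothetical and are not part of Spark or of any proposed patch; the sketch only illustrates the behaviour described above.

```scala
import java.net.{URI, URL}
import scala.util.Try

// Hypothetical helper for illustration only; not Spark's implementation.
object HostLookupSketch {

  // Ask the strict parser (URI) first. A syntax error or a null host
  // both collapse to None.
  private def hostViaUri(raw: String): Option[String] =
    Try(new URI(raw)).toOption.flatMap(u => Option(u.getHost))

  // Lenient parser (URL). An empty host is treated as missing.
  private def hostViaUrl(raw: String): Option[String] =
    Try(new URL(raw)).toOption.flatMap(u => Option(u.getHost)).filter(_.nonEmpty)

  // Assumption: falling back to URL is acceptable when URI finds no host.
  def hostOf(raw: String): Option[String] =
    hostViaUri(raw).orElse(hostViaUrl(raw))

  def main(args: Array[String]): Unit = {
    println(hostOf("http://правительство.рф"))
    println(hostOf("http://spark.apache.org/docs"))
  }
}
```

Note that new URI(...) does not throw for this input: it accepts the string but reports no host, so the null comes from getHost rather than from the catch branch in getUrl. Whether a fallback like this would preserve the performance gain that motivated the switch to URI is a separate question that this sketch does not address.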
To reproduce the problem on spark-sql:
{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
This returns NULL.
This problem was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to
improve the performance of PARSE_URL().
The same issue affects the following SQL:
```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```
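The query-string case can be checked outside Spark in the same way as the host case. The sketch below assumes the NULL comes from java.net.URI rejecting the double-quote characters in the query, which makes getUrl return null; the object name QueryParsingSketch and its helpers are hypothetical names used only for illustration.

```scala
import java.net.{URI, URL}
import scala.util.Try

// Hypothetical comparison of the two parsers; not Spark's code.
object QueryParsingSketch {

  // Same URL as in the SQL example above.
  val raw: String = "http://stanzhai.site?p=[\"abc\"]"

  // Strict parser: a URISyntaxException becomes None.
  def queryViaUri(s: String): Option[String] =
    Try(new URI(s)).toOption.flatMap(u => Option(u.getRawQuery))

  // Lenient parser: URL does not validate the characters in the query.
  def queryViaUrl(s: String): Option[String] =
    Try(new URL(s)).toOption.flatMap(u => Option(u.getQuery))

  def main(args: Array[String]): Unit = {
    println(s"URI query = ${queryViaUri(raw)}")
    println(s"URL query = ${queryViaUrl(raw)}")
  }
}
```

Unlike the internationalized-domain case, here the URI constructor itself fails, so the whole URL is lost rather than only the host.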
> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
> Key: SPARK-23056
> URL: https://issues.apache.org/jira/browse/SPARK-23056
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.3, 2.2.2, 2.3.0
> Reporter: Yash Datta