Posted to dev@spark.apache.org by Daniel Mateus Pires <dm...@gmail.com> on 2018/07/27 17:34:09 UTC

[Spark SQL] Future of CalendarInterval

Hi Sparkers! (maybe Sparkles ?)

I just wanted to bring up the apparently "controversial" Calendar Interval topic.

I worked on: https://issues.apache.org/jira/browse/SPARK-24702, https://github.com/apache/spark/pull/21706

The user was reporting an unexpected behaviour where he/she wasn’t able to cast to a Calendar Interval type.

In the current version of Spark the following code works:
scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
+----------------+
|               a|
+----------------+
|interval 1 hours|
+----------------+

While the following doesn’t:
spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()


Since the DataFrame API equivalent of the SQL already worked, I thought adding the SQL function would be an easy decision (to make the two APIs consistent)
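For reference, the DataFrame-side cast can also be written against the DataType object instead of the type-name string; this is just a sketch of what already works today (assuming a spark-shell session):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.CalendarIntervalType

// equivalent to .cast("calendarinterval"), but via the public type object
spark.sql("SELECT 'interval 1 hour' AS a")
  .select(col("a").cast(CalendarIntervalType))
  .show()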

However, I got push-back on the PR on the basis that “we do not plan to expose Calendar Interval as a public type”
Should we reach a consensus on either removing CalendarIntervalType from the public DataFrame API, or making the SQL side consistent with it?

--
Best regards,
Daniel Mateus Pires
Data Engineer @ Hudson's Bay Company

Re: [Spark SQL] Future of CalendarInterval

Posted by Hyukjin Kwon <gu...@gmail.com>.
FYI, org.apache.spark.unsafe.types.CalendarInterval is undocumented in both the scaladoc and the javadoc (the entire unsafe module is), but org.apache.spark.sql.types.CalendarIntervalType is exposed:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.CalendarIntervalType
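A small sketch of how that surface shows up from user code (nothing here beyond what the examples further down this thread already demonstrate):

// the SQL type is importable from the documented org.apache.spark.sql.types package...
import org.apache.spark.sql.types.CalendarIntervalType
// ...while the value class lives in the undocumented unsafe module
import org.apache.spark.unsafe.types.CalendarInterval

val df = spark.sql("SELECT interval 1 day AS i")
df.dtypes   // Array((i,CalendarIntervalType)) -- the type already leaks into the public API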

+1 for starting the discussion after 2.4.0. I would suggest deferring, as I said in the PR.

On Sun, 29 Jul 2018 at 18:58, Daniel Mateus Pires <dm...@gmail.com> wrote:

> Sounds good! @Xiao
>
> @Reynold AFAIK the only data type that is valid to cast to Calendar
> Interval is VARCHAR
>
> here is Postgres:
>
> postgres=# select CAST(CAST(interval '1 hour' AS varchar) AS interval);
>  interval
> ----------
>  01:00:00
> (1 row)
>
> (snippet comes from the JIRA)
>
> Thanks,
>
> Daniel
>
>
> On 27 July 2018 at 20:38, Xiao Li <ga...@gmail.com> wrote:
>
>> The code freeze of the upcoming release Spark 2.4 is very close. How
>> about revisiting this and explicitly defining the support scope
>> of CalendarIntervalType in the next release (Spark 3.0)?
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> 2018-07-27 10:45 GMT-07:00 Reynold Xin <rx...@databricks.com>:
>>
>>> CalendarInterval is definitely externally visible.
>>>
>>> E.g. sql("select interval 1 day").dtypes would return "Array[(String,
>>> String)] = Array((interval 1 days,CalendarIntervalType))"
>>>
>>> However, I'm not sure what it means to support casting. What are the
>>> semantics for casting from any other data type to calendar interval? I can
>>> see string casting and casting from itself, but not any other data types.
>>>
>>>
>>>
>>>
>>> On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires <dm...@gmail.com>
>>> wrote:
>>>
>>>> Hi Sparkers! (maybe Sparkles ?)
>>>>
>>>> I just wanted to bring up the apparently "controversial" Calendar
>>>> Interval topic.
>>>>
>>>> I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
>>>> https://github.com/apache/spark/pull/21706
>>>>
>>>> The user was reporting an unexpected behaviour where he/she wasn’t able
>>>> to cast to a Calendar Interval type.
>>>>
>>>> In the current version of Spark the following code works:
>>>>
>>>> scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
>>>> +----------------+
>>>> |               a|
>>>> +----------------+
>>>> |interval 1 hours|
>>>> +----------------+
>>>>
>>>>
>>>> While the following doesn’t:
>>>> spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()
>>>>
>>>>
>>>> Since the DataFrame API equivalent of the SQL worked, I thought adding
>>>> it would be an easy decision to make (to make it consistent)
>>>>
>>>> However, I got push-back on the PR on the basis that “*we do not plan
>>>> to expose Calendar Interval as a public type*”
>>>> Should there be a consensus on either cleaning up the public DataFrame
>>>> API out of CalendarIntervalType OR making it consistent with the SQL ?
>>>>
>>>> --
>>>> Best regards,
>>>> Daniel Mateus Pires
>>>> Data Engineer @ Hudson's Bay Company
>>>>
>>>
>>
>

Re: [Spark SQL] Future of CalendarInterval

Posted by Daniel Mateus Pires <dm...@gmail.com>.
Sounds good! @Xiao

@Reynold AFAIK the only data type that can validly be cast to Calendar
Interval is VARCHAR

here is Postgres:

postgres=# select CAST(CAST(interval '1 hour' AS varchar) AS interval);
 interval
----------
 01:00:00
(1 row)

(snippet comes from the JIRA)
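For comparison, the same round trip seems to go through on the Spark DataFrame side as well (a rough sketch; the interval-to-string leg relies on Spark's general cast-anything-to-string rule):

import org.apache.spark.sql.functions.col

// string -> interval -> string, mirroring the Postgres varchar round trip above
spark.sql("SELECT 'interval 1 hour' AS a")
  .select(col("a").cast("calendarinterval").cast("string"))
  .show()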

Thanks,

Daniel


On 27 July 2018 at 20:38, Xiao Li <ga...@gmail.com> wrote:

> The code freeze of the upcoming release Spark 2.4 is very close. How about
> revisiting this and explicitly defining the support scope
> of CalendarIntervalType in the next release (Spark 3.0)?
>
> Thanks,
>
> Xiao
>
>
> 2018-07-27 10:45 GMT-07:00 Reynold Xin <rx...@databricks.com>:
>
>> CalendarInterval is definitely externally visible.
>>
>> E.g. sql("select interval 1 day").dtypes would return "Array[(String,
>> String)] = Array((interval 1 days,CalendarIntervalType))"
>>
>> However, I'm not sure what it means to support casting. What are the
>> semantics for casting from any other data type to calendar interval? I can
>> see string casting and casting from itself, but not any other data types.
>>
>>
>>
>>
>> On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires <dm...@gmail.com>
>> wrote:
>>
>>> Hi Sparkers! (maybe Sparkles ?)
>>>
>>> I just wanted to bring up the apparently "controversial" Calendar
>>> Interval topic.
>>>
>>> I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
>>> https://github.com/apache/spark/pull/21706
>>>
>>> The user was reporting an unexpected behaviour where he/she wasn’t able
>>> to cast to a Calendar Interval type.
>>>
>>> In the current version of Spark the following code works:
>>>
>>> scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
>>> +----------------+
>>> |               a|
>>> +----------------+
>>> |interval 1 hours|
>>> +----------------+
>>>
>>>
>>> While the following doesn’t:
>>> spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()
>>>
>>>
>>> Since the DataFrame API equivalent of the SQL worked, I thought adding
>>> it would be an easy decision to make (to make it consistent)
>>>
>>> However, I got push-back on the PR on the basis that “*we do not plan
>>> to expose Calendar Interval as a public type*”
>>> Should there be a consensus on either cleaning up the public DataFrame
>>> API out of CalendarIntervalType OR making it consistent with the SQL ?
>>>
>>> --
>>> Best regards,
>>> Daniel Mateus Pires
>>> Data Engineer @ Hudson's Bay Company
>>>
>>
>

Re: [Spark SQL] Future of CalendarInterval

Posted by Xiao Li <ga...@gmail.com>.
The code freeze of the upcoming release Spark 2.4 is very close. How about
revisiting this and explicitly defining the support scope
of CalendarIntervalType in the next release (Spark 3.0)?

Thanks,

Xiao


2018-07-27 10:45 GMT-07:00 Reynold Xin <rx...@databricks.com>:

> CalendarInterval is definitely externally visible.
>
> E.g. sql("select interval 1 day").dtypes would return "Array[(String,
> String)] = Array((interval 1 days,CalendarIntervalType))"
>
> However, I'm not sure what it means to support casting. What are the
> semantics for casting from any other data type to calendar interval? I can
> see string casting and casting from itself, but not any other data types.
>
>
>
>
> On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires <dm...@gmail.com>
> wrote:
>
>> Hi Sparkers! (maybe Sparkles ?)
>>
>> I just wanted to bring up the apparently "controversial" Calendar
>> Interval topic.
>>
>> I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
>> https://github.com/apache/spark/pull/21706
>>
>> The user was reporting an unexpected behaviour where he/she wasn’t able
>> to cast to a Calendar Interval type.
>>
>> In the current version of Spark the following code works:
>>
>> scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
>> +----------------+
>> |               a|
>> +----------------+
>> |interval 1 hours|
>> +----------------+
>>
>>
>> While the following doesn’t:
>> spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()
>>
>>
>> Since the DataFrame API equivalent of the SQL worked, I thought adding it
>> would be an easy decision to make (to make it consistent)
>>
>> However, I got push-back on the PR on the basis that “*we do not plan to
>> expose Calendar Interval as a public type*”
>> Should there be a consensus on either cleaning up the public DataFrame
>> API out of CalendarIntervalType OR making it consistent with the SQL ?
>>
>> --
>> Best regards,
>> Daniel Mateus Pires
>> Data Engineer @ Hudson's Bay Company
>>
>

Re: [Spark SQL] Future of CalendarInterval

Posted by Reynold Xin <rx...@databricks.com>.
CalendarInterval is definitely externally visible.

E.g. sql("select interval 1 day").dtypes would return "Array[(String,
String)] = Array((interval 1 days,CalendarIntervalType))"

However, I'm not sure what it means to support casting. What are the
semantics for casting from any other data type to calendar interval? I can
see string casting and casting from itself, but not any other data types.
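One way to probe what the analyzer currently accepts, from the DataFrame side (just a sketch against the 2.3/2.4 behaviour; the exact error text may vary):

import spark.implicits._
import org.apache.spark.sql.functions.col

// string -> interval resolves fine
Seq("interval 1 hour").toDF("s").select(col("s").cast("calendarinterval")).show()

// whereas e.g. int -> interval should fail analysis with a "cannot cast" error
Seq(1).toDF("i").select(col("i").cast("calendarinterval")).show()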




On Fri, Jul 27, 2018 at 10:34 AM Daniel Mateus Pires <dm...@gmail.com>
wrote:

> Hi Sparkers! (maybe Sparkles ?)
>
> I just wanted to bring up the apparently "controversial" Calendar Interval
> topic.
>
> I worked on: https://issues.apache.org/jira/browse/SPARK-24702,
> https://github.com/apache/spark/pull/21706
>
> The user was reporting an unexpected behaviour where he/she wasn’t able to
> cast to a Calendar Interval type.
>
> In the current version of Spark the following code works:
>
> scala> spark.sql("SELECT 'interval 1 hour' as a").select(col("a").cast("calendarinterval")).show()
> +----------------+
> |               a|
> +----------------+
> |interval 1 hours|
> +----------------+
>
>
> While the following doesn’t:
> spark.sql("SELECT CALENDARINTERVAL('interval 1 hour') as a").show()
>
>
> Since the DataFrame API equivalent of the SQL worked, I thought adding it
> would be an easy decision to make (to make it consistent)
>
> However, I got push-back on the PR on the basis that “*we do not plan to
> expose Calendar Interval as a public type*”
> Should there be a consensus on either cleaning up the public DataFrame API
> out of CalendarIntervalType OR making it consistent with the SQL ?
>
> --
> Best regards,
> Daniel Mateus Pires
> Data Engineer @ Hudson's Bay Company
>