Posted to dev@spark.apache.org by Cristian O <cr...@googlemail.com> on 2015/12/09 17:34:06 UTC

SQL language vs DataFrame API

Hi,

I was wondering what the "official" view is on feature parity between SQL
and the DF API. Docs are pretty sparse on the SQL front, and it seems that
some features are, at various times, supported in only one of the Spark SQL
dialect, the HiveQL dialect, and the DF API. DF.cube(), DISTRIBUTE BY and
CACHE LAZY are some examples.
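To make the asymmetry concrete, here is a rough sketch of the cube() case in Scala. This is illustrative only: the table name and columns are made up, and it assumes a Spark 1.5-era HiveContext, since the CUBE syntax is only understood by the HiveQL dialect while cube() lives on the DataFrame API:

```scala
// Sketch only: assumes a HiveContext named hiveContext and a registered
// table "sales" with columns (region, product, amount) -- all hypothetical.
import org.apache.spark.sql.functions._

val salesDF = hiveContext.table("sales")

// DataFrame API: multi-dimensional aggregation via cube()
val byCubeDF = salesDF
  .cube("region", "product")
  .agg(sum("amount").as("total"))

// HiveQL dialect: the same aggregation expressed in SQL
val byCubeSQL = hiveContext.sql(
  """SELECT region, product, SUM(amount) AS total
    |FROM sales
    |GROUP BY region, product WITH CUBE""".stripMargin)
```

Depending on which parser a query goes through, only one of the two forms may be accepted, which is exactly the kind of gap I mean.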

Is there an explicit goal of having consistent support for all features in
both the DF API and SQL?

Thanks,
Cristian

Re: SQL language vs DataFrame API

Posted by Stephen Boesch <ja...@gmail.com>.
Is this a candidate for the version 1.X/2.0 split?


Re: SQL language vs DataFrame API

Posted by Michael Armbrust <mi...@databricks.com>.
Yeah, I would like to address any actual gaps in functionality that are
present.

On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris <cr...@gmail.com>
wrote:

> The reason I'm asking is that in larger projects it's important to be
> able to stick to a particular programming style. Some people are more
> comfortable with SQL, while others find the DF API more suitable, but it's
> important to have full expressivity in both, so that teams can adopt one
> approach rather than having to mix and match to get full functionality.

Re: SQL language vs DataFrame API

Posted by Xiao Li <ga...@gmail.com>.
That sounds great! When it is decided, please let us know and we can add
more features and make it ANSI SQL compliant.

Thank you!

Xiao Li


2015-12-09 11:31 GMT-08:00 Michael Armbrust <mi...@databricks.com>:

> I don't plan to abandon HiveQL compatibility, but I'd like to see us move
> towards something with more SQL compliance (perhaps just newer versions of
> the HiveQL parser).  Exactly which parser will do that for us is under
> investigation.
>
> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li <ga...@gmail.com> wrote:
>
>> Hi, Michael,
>>
>> Does that mean SqlContext will be built on HiveQL in the near future?
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>> 2015-12-09 10:36 GMT-08:00 Michael Armbrust <mi...@databricks.com>:
>>
>>> I think that it is generally good to have parity when the functionality
>>> is useful.  However, in some cases various features are there just to
>>> maintain compatibility with other system.  For example CACHE TABLE is eager
>>> because Shark's cache table was.  df.cache() is lazy because Spark's cache
>>> is.  Does that mean that we need to add some eager caching mechanism to
>>> dataframes to have parity?  Probably not, users can just call .count() if
>>> they want to force materialization.
>>>
>>> Regarding the differences between HiveQL and the SQLParser, I think we
>>> should get rid of the SQL parser.  Its kind of a hack that I built just so
>>> that there was some SQL story for people who didn't compile with Hive.
>>> Moving forward, I'd like to see the distinction between the HiveContext and
>>> SQLContext removed and we can standardize on a single parser.  For this
>>> reason I'd be opposed to spending a lot of dev/reviewer time on adding
>>> features there.
>>>
>>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>>> cristian.b.opris@googlemail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was wondering what the "official" view is on feature parity between
>>>> SQL and DF apis. Docs are pretty sparse on the SQL front, and it seems that
>>>> some features are only supported at various times in only one of Spark SQL
>>>> dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
>>>> are some examples
>>>>
>>>> Is there an explicit goal of having consistent support for all features
>>>> in both DF and SQL ?
>>>>
>>>> Thanks,
>>>> Cristian
>>>>
>>>
>>>
>>
>

Re: SQL language vs DataFrame API

Posted by Michael Armbrust <mi...@databricks.com>.
I don't plan to abandon HiveQL compatibility, but I'd like to see us move
towards something with more SQL compliance (perhaps just newer versions of
the HiveQL parser).  Exactly which parser will do that for us is under
investigation.


Re: SQL language vs DataFrame API

Posted by Xiao Li <ga...@gmail.com>.
Hi, Michael,

Does that mean SqlContext will be built on HiveQL in the near future?

Thanks,

Xiao Li


2015-12-09 10:36 GMT-08:00 Michael Armbrust <mi...@databricks.com>:

> I think that it is generally good to have parity when the functionality is
> useful.  However, in some cases various features are there just to maintain
> compatibility with other system.  For example CACHE TABLE is eager because
> Shark's cache table was.  df.cache() is lazy because Spark's cache is.
> Does that mean that we need to add some eager caching mechanism to
> dataframes to have parity?  Probably not, users can just call .count() if
> they want to force materialization.
>
> Regarding the differences between HiveQL and the SQLParser, I think we
> should get rid of the SQL parser.  Its kind of a hack that I built just so
> that there was some SQL story for people who didn't compile with Hive.
> Moving forward, I'd like to see the distinction between the HiveContext and
> SQLContext removed and we can standardize on a single parser.  For this
> reason I'd be opposed to spending a lot of dev/reviewer time on adding
> features there.
>
> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
> cristian.b.opris@googlemail.com> wrote:
>
>> Hi,
>>
>> I was wondering what the "official" view is on feature parity between SQL
>> and DF apis. Docs are pretty sparse on the SQL front, and it seems that
>> some features are only supported at various times in only one of Spark SQL
>> dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
>> are some examples
>>
>> Is there an explicit goal of having consistent support for all features
>> in both DF and SQL ?
>>
>> Thanks,
>> Cristian
>>
>
>

Re: SQL language vs DataFrame API

Posted by Michael Armbrust <mi...@databricks.com>.
I think that it is generally good to have parity when the functionality is
useful.  However, in some cases various features are there just to maintain
compatibility with other systems.  For example, CACHE TABLE is eager
because Shark's cache table was.  df.cache() is lazy because Spark's cache
is.  Does that mean that we need to add some eager caching mechanism to
DataFrames to have parity?  Probably not; users can just call .count() if
they want to force materialization.
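To sketch the difference (illustrative only -- assumes a HiveContext and a hypothetical registered table "events" backed by a DataFrame df):

```scala
// SQL side: CACHE TABLE is eager (Shark-compatible behaviour), so the
// scan and materialization happen immediately, unless LAZY is given.
sqlContext.sql("CACHE TABLE events")
sqlContext.sql("CACHE LAZY TABLE events")  // the lazy variant

// DataFrame side: cache() only marks the plan for caching; nothing is
// materialized until some action runs.
df.cache()
df.count()  // running an action forces materialization, mirroring eager caching
```

So the "missing" eager df.cache() is really just cache() followed by any action.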

Regarding the differences between HiveQL and the SQLParser, I think we
should get rid of the SQL parser.  It's kind of a hack that I built just so
that there was some SQL story for people who didn't compile with Hive.
Moving forward, I'd like to see the distinction between the HiveContext and
SQLContext removed so that we can standardize on a single parser.  For this
reason I'd be opposed to spending a lot of dev/reviewer time on adding
features there.
