Posted to dev@spark.apache.org by Shuai Lin <li...@gmail.com> on 2017/01/22 12:51:47 UTC

A question about creating persistent table when in-memory catalog is used

Hi all,

Currently, when the in-memory catalog is used, e.g. via `--conf
spark.sql.catalogImplementation=in-memory`, we can create a persistent
table, but inserting into that table fails with the error message "Hive
support is required to insert into the following tables..".

    sql("create table t1 (id int, name string, dept string)") // OK
    sql("insert into t1 values (1, 'name1', 'dept1')")  // ERROR


This doesn't make sense to me: if we can't insert into the table, it will
always be empty and thus of no use. But I wonder if there are other good
reasons for the current logic. If not, I would propose raising an error
when creating the table in the first place.

Thanks!

Regards,
Shuai Lin (@lins05)

Re: A question about creating persistent table when in-memory catalog is used

Posted by Shuai Lin <li...@gmail.com>.
I see, thanks for the info!

Re: A question about creating persistent table when in-memory catalog is used

Posted by Xiao Li <ga...@gmail.com>.
Reynold mentioned the direction we are heading in. Many of the PRs the
community has submitted are aimed at this target, and there is still a lot
of work we need to do to achieve it.

For example, for some serdes the Hive metastore will infer the schema when
one is not provided, but our InMemoryCatalog does not have such a
capability. Thus, we need to see how to resolve this.
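To make that concrete, the classic example is an Avro serde table whose
columns come from the Avro schema rather than from the DDL. This is only a
sketch of the Hive-side pattern (the table name and schema literal are
made up for illustration):

    sql("""
      CREATE TABLE events
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.literal' =
        '{"type":"record","name":"Event","fields":[{"name":"id","type":"int"}]}')
    """)
    // Note there is no column list: the Hive metastore derives the columns
    // from avro.schema.literal, which is exactly the kind of metadata
    // InMemoryCatalog cannot reconstruct on its own.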

Hopefully this answers your question. BTW, the issue you mentioned at the
beginning has been resolved; please fetch the latest master. You can no
longer create such a Hive serde table without Hive support.

Thanks,

Xiao Li


Re: A question about creating persistent table when in-memory catalog is used

Posted by Shuai Lin <li...@gmail.com>.
Cool, thanks for the info.

> I think this is something we are going to change to completely decouple
> the Hive support and catalog.


Is there a ticket for this? I searched JIRA and only found "SPARK-16275:
Implement all the Hive fallback functions", which seems related.


Re: A question about creating persistent table when in-memory catalog is used

Posted by Xiao Li <ga...@gmail.com>.
Agree. : )

Re: A question about creating persistent table when in-memory catalog is used

Posted by Reynold Xin <rx...@databricks.com>.
To be clear, there are two separate "Hive"s we are talking about here: one
is the catalog, and the other is the Hive serde and UDF support. We want to
get to a point where the choice of catalog does not impact any
functionality in Spark other than where the catalog is stored.
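As a rough illustration of today's coupling (a sketch, not code from this
thread): enableHiveSupport() currently flips both switches at once, i.e.
it selects the Hive metastore as the catalog and enables the serde/UDF
machinery:

    import org.apache.spark.sql.SparkSession

    // Today one builder call controls both concerns: it sets
    // spark.sql.catalogImplementation to "hive" and pulls in Hive
    // serde/UDF support. The goal is to make these independent, so that
    // choosing the in-memory catalog costs no functionality.
    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()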


Re: A question about creating persistent table when in-memory catalog is used

Posted by Xiao Li <ga...@gmail.com>.
We have a pending PR to block users from creating Hive serde tables when
using InMemoryCatalog. See https://github.com/apache/spark/pull/16587; I
believe it answers your question.

BTW, we can still create regular data source tables and insert data into
them. The major difference is whether the metadata is persistently stored
or not.
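For example, something like this works fine under the in-memory catalog.
This is only a sketch; the table and column names are made up:

    // A data source table: the USING clause routes storage through
    // Spark's built-in Parquet source, so no Hive code is involved.
    sql("create table t2 (id int, name string) using parquet")  // OK
    sql("insert into t2 values (1, 'name1')")                   // OK
    sql("select * from t2").show()

The `using parquet` clause is what makes this a data source table rather
than a Hive serde table.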

Thanks,

Xiao Li

Re: A question about creating persistent table when in-memory catalog is used

Posted by Reynold Xin <rx...@databricks.com>.
I think this is something we are going to change to completely decouple the
Hive support and catalog.

