
Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2019/12/06 18:35:21 UTC

FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Hi, All.

I want to share the following change to the community.

    SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

This was merged today, and Spark's `CREATE TABLE` now uses Spark's
default data source instead of the `hive` provider. This is a big
improvement for Apache Spark 3.0, but it might surprise some users. (Please
note that there is a fallback option for them.)

Thank you, Yi, Wenchen, Xiao.

Cheers,
Dongjoon.

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Wenchen, could you start a new thread? Many people have probably already
muted this one, and it isn't really on topic.

The question that needs to be discussed is whether this is a safe change
for the 3.1 release, and reusing an old thread is not a great way to get
people's attention about something potentially harmful like that.

On Tue, Dec 1, 2020 at 10:46 AM Wenchen Fan <cl...@gmail.com> wrote:

> I'm reviving this thread because this feature was reverted before the 3.0
> release, and now we are trying to add it back since the CREATE TABLE syntax
> is unified.
>
> The benefits are pretty clear: CREATE TABLE by default (without USING or
> STORED AS) should create native tables that work best with Spark. You can
> see all the benefits listed in https://github.com/apache/spark/pull/30554.
>
> I'm sending this email to collect feedback about the risks. AFAIK
> the broken use cases are:
> 1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
> insertions, they want to read the data files directly from the table location.
> Because the file format is changed from Hive text to Parquet, this use case
> may be broken.
> 2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
> or `LOAD DATA`. These two are Hive-specific commands and don't work with
> Spark native tables.
> 3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
> with different serdes to this table. Spark doesn't allow a native
> partitioned table to have partitions with different formats.
>
> From my personal experience, Hive text tables are usually used to
> import CSV-like data. It's very likely that people will create a Hive text
> table explicitly, as they need the Hive syntax to specify options like the
> delimiter. Besides, I'm not sure how many Spark users are using this
> feature, as the native CSV data source can do the same job.
>
> I'd consider it a bad user experience if a simple `CREATE TABLE` gives
> users a very slow table. Changing it to return a native Parquet table doesn't
> seem to break many people, but I could be wrong.
>
> Please reply to this thread if you know more use cases that may be
> affected by this change, and share your thoughts.
>
> Thanks,
> Wenchen

-- 
Ryan Blue
Software Engineer
Netflix

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Posted by Wenchen Fan <cl...@gmail.com>.

I'm reviving this thread because this feature was reverted before the 3.0
release, and now we are trying to add it back since the CREATE TABLE syntax
is unified.

The benefits are pretty clear: CREATE TABLE by default (without USING or
STORED AS) should create native tables that work best with Spark. You can
see all the benefits listed in https://github.com/apache/spark/pull/30554.
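
As a rough illustration of the rule described above, the following Python
sketch models which provider a `CREATE TABLE` statement ends up with. It is
not Spark's parser: the function name `resolve_provider` and the
`legacy_hive_default` parameter are made up for this example, and it looks
only at the two clauses named in this email (`USING` and `STORED AS`). The
`parquet` default mirrors the out-of-the-box value of
`spark.sql.sources.default`.

```python
import re


def resolve_provider(ddl, legacy_hive_default=False, sources_default="parquet"):
    """Toy model of the provider chosen for a CREATE TABLE statement.

    Hypothetical helper for illustration only; Spark's real logic lives in
    its SQL parser and analyzer and handles many more clauses.
    """
    # An explicit USING clause names the provider directly.
    using = re.search(r"\bUSING\s+(\w+)", ddl, flags=re.IGNORECASE)
    if using:
        return using.group(1).lower()
    # STORED AS is Hive syntax, so the result is a Hive table.
    if re.search(r"\bSTORED\s+AS\b", ddl, flags=re.IGNORECASE):
        return "hive"
    # Neither clause: the old behavior picks hive, the proposed behavior
    # picks the default data source.
    return "hive" if legacy_hive_default else sources_default


for stmt in (
    "CREATE TABLE t (id INT)",
    "CREATE TABLE t (id INT) STORED AS TEXTFILE",
    "CREATE TABLE t (id INT) USING csv",
):
    print(stmt, "->", resolve_provider(stmt))
```

Only the first statement changes meaning under the proposal; statements that
already spell out `USING` or `STORED AS` resolve the same way either way.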

I'm sending this email to collect feedback about the risks. AFAIK
the broken use cases are:
1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
insertions, they want to read the data files directly from the table location.
Because the file format is changed from Hive text to Parquet, this use case
may be broken.
2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
or `LOAD DATA`. These two are Hive-specific commands and don't work with
Spark native tables.
3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
with different serdes to this table. Spark doesn't allow a native
partitioned table to have partitions with different formats.
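
To make the first case concrete, here is a minimal Python sketch of a script
that reads a table's data files directly. It assumes the Hive text table uses
Hive's default field delimiter (Ctrl-A, `\x01`) and relies on standard,
unencrypted Parquet files starting with the magic bytes `PAR1`. The function
names are hypothetical.

```python
HIVE_DEFAULT_FIELD_DELIM = "\x01"  # Ctrl-A, Hive's default text field separator
PARQUET_MAGIC = b"PAR1"            # leading bytes of an unencrypted Parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start with the Parquet magic number."""
    return data[:4] == PARQUET_MAGIC


def read_hive_text_rows(data: bytes):
    """Split a Hive text data file into rows of string fields.

    Raises ValueError when handed a Parquet file, which is what a script
    written for Hive text would now find at the table location.
    """
    if looks_like_parquet(data):
        raise ValueError("file is Parquet, not Hive text; use a Parquet reader")
    text = data.decode("utf-8")
    # One row per newline-terminated line; skip the empty tail after the last one.
    return [line.split(HIVE_DEFAULT_FIELD_DELIM) for line in text.split("\n") if line]


# Works on a Hive text file ...
print(read_hive_text_rows(b"1\x01alice\n2\x01bob\n"))

# ... but fails once the same table location holds Parquet files.
try:
    read_hive_text_rows(b"PAR1" + b"\x00" * 8)
except ValueError as err:
    print("error:", err)
```

A script like this keeps working only if the table is created with an
explicit Hive text format rather than relying on the default.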

From my personal experience, Hive text tables are usually used to
import CSV-like data. It's very likely that people will create a Hive text
table explicitly, as they need the Hive syntax to specify options like the
delimiter. Besides, I'm not sure how many Spark users are using this
feature, as the native CSV data source can do the same job.

I'd consider it a bad user experience if a simple `CREATE TABLE` gives
users a very slow table. Changing it to return a native Parquet table doesn't
seem to break many people, but I could be wrong.

Please reply to this thread if you know more use cases that may be affected
by this change, and share your thoughts.

Thanks,
Wenchen

On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro <li...@gmail.com>
wrote:

> Oh, looks nice. Thanks for sharing, Dongjoon.
>
> Bests,
> Takeshi
>
> On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I want to share the following change to the community.
>>
>>     SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>>
>> This was merged today, and Spark's `CREATE TABLE` now uses Spark's
>> default data source instead of the `hive` provider. This is a big
>> improvement for Apache Spark 3.0, but it might surprise some users. (Please
>> note that there is a fallback option for them.)
>>
>> Thank you, Yi, Wenchen, Xiao.
>>
>> Cheers,
>> Dongjoon.
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Posted by Takeshi Yamamuro <li...@gmail.com>.

Oh, looks nice. Thanks for sharing, Dongjoon.

Bests,
Takeshi


-- 
---
Takeshi Yamamuro