Posted to user@spark.apache.org by Horváth Péter Gergely <ho...@gmail.com> on 2019/02/13 11:37:42 UTC

Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

Dear All,

I am facing a strange issue with Spark 2.3, where I would like to create a
MANAGED table out of the content of a DataFrame with the storage path
overridden.

Apparently, when one tries to create a Hive table via
DataFrameWriter.saveAsTable, supplying a "path" option causes Spark to
automatically create an external table.

This demonstrates the behaviour:

scala> val numbersDF = sc.parallelize((1 to 100).toList).toDF("numbers")
numbersDF: org.apache.spark.sql.DataFrame = [numbers: int]

scala> numbersDF.write.format("orc").saveAsTable("numbers_table1")

scala> spark.sql("describe formatted numbers_table1").filter(_.get(0).toString == "Type").show
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    Type|  MANAGED|       |
+--------+---------+-------+


scala> numbersDF.write.format("orc").option("path", "/user/foobar/numbers_table_data").saveAsTable("numbers_table2")

scala> spark.sql("describe formatted numbers_table2").filter(_.get(0).toString == "Type").show
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    Type| EXTERNAL|       |
+--------+---------+-------+



I am wondering if there is any way to force the creation of a managed table
with a custom path (which, as far as I know, should be possible via standard
Hive DDL).

I often have trouble finding the appropriate documentation for the options
accepted by Spark APIs. Could someone please point me in the right direction
and tell me where these things are documented?

Thanks,
Peter

Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

Posted by Chris Teoh <ch...@gmail.com>.
Thanks Peter.

I'm not sure if that is possible yet. The closest I can think of to
achieve what you want is to try something like:
df.createOrReplaceTempView("mytable")  // registerTempTable is deprecated in Spark 2.x
spark.sql("create table mymanagedtable as select * from mytable")

I haven't used CTAS in Spark SQL before, but I have heard it works. It would
infer the schema for you, and from what I have heard, CTAS creates managed
tables.
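A minimal end-to-end sketch of that CTAS approach (untested; the table and
view names are illustrative, and it assumes a session with Hive support
enabled):

```scala
// Sketch of the CTAS workaround described above (names are illustrative).
// CTAS lets Spark infer the schema from the view, and since no explicit
// path is supplied, the resulting table should come out as MANAGED under
// the warehouse directory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ctas-managed-table")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val numbersDF = (1 to 100).toDF("numbers")
numbersDF.createOrReplaceTempView("numbers_view")

// CREATE TABLE AS SELECT: schema inferred from the view, no path given.
spark.sql("CREATE TABLE numbers_managed STORED AS ORC AS SELECT * FROM numbers_view")

// Verify the table type the same way as in the original post.
spark.sql("DESCRIBE FORMATTED numbers_managed")
  .filter(_.getString(0) == "Type")
  .show()
```

Note that this only yields the default warehouse location, not a custom one,
so it answers the schema-inference half of the question but not the custom
path.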

Let me know if this works for you.

Kind Regards
Chris

On Thu, 14 Feb 2019 at 03:08 Horváth Péter Gergely <
horvath.peter.gergely@gmail.com> wrote:


Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

Posted by Horváth Péter Gergely <ho...@gmail.com>.
Hi Chris,

Thank you for the input, I know I can always write the table DDL manually.

But here I would like to rely on Spark generating the schema. What I don't
understand is the change in the behaviour of Spark: having the storage path
specified does not necessarily mean it should be an external table.

Is there any way to control/override this?

Thanks,
Peter
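One untested workaround sketch for the question above: let saveAsTable
create the table as MANAGED (no "path" option), then repoint it with
ALTER TABLE ... SET LOCATION. The table name and target path below are
illustrative, and the exact behaviour may vary across Spark/Hive versions:

```scala
// Untested sketch: create the table MANAGED first, then change its
// storage location afterwards. ALTER TABLE ... SET LOCATION only updates
// the metastore pointer; it does not move any existing data files.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("managed-table-custom-path")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val numbersDF = (1 to 100).toDF("numbers")

// No "path" option, so Spark creates a MANAGED table in the warehouse dir.
numbersDF.write.format("orc").saveAsTable("numbers_table_managed")

// Repoint the table at the desired location. The table type should stay
// MANAGED because it was not created with an explicit path.
spark.sql("ALTER TABLE numbers_table_managed SET LOCATION '/user/foobar/numbers_table_data'")
```

Whether DROP TABLE then deletes data under the new location depends on the
Hive/Spark version, so verify that before relying on this.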


On Wed, Feb 13, 2019, 13:09 Chris Teoh <chris.teoh@gmail.com> wrote:


Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

Posted by Chris Teoh <ch...@gmail.com>.
Hey there,

Could you not just create a managed table using DDL in Spark SQL and
then write the DataFrame to the underlying folder, or use Spark SQL to do
an insert?

Alternatively, try CREATE TABLE AS SELECT (CTAS). IIRC, Hive creates managed
tables this way.

I've not confirmed this works, but I think it might be worth trying.

I hope that helps.

Kind regards
Chris

On Wed., 13 Feb. 2019, 10:44 pm Horváth Péter Gergely, <
horvath.peter.gergely@gmail.com> wrote:
