Posted to user@spark.apache.org by Christian Perez <ch...@svds.com> on 2015/03/19 17:00:12 UTC

saveAsTable broken in v1.3 DataFrames?

Hi all,

DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
schema _and_ storage format in the Hive metastore, so that the table
cannot be read from inside Hive. Spark itself can read the table, but
Hive throws a serialization error because it doesn't know the
underlying data is Parquet.

// in spark-shell; the import is needed for .toDF if your shell doesn't already provide it
import sqlContext.implicits._

val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
df.saveAsTable("spark_test_foo")

Expected:

COLUMNS(
  education BIGINT,
  income BIGINT
)

SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

Actual:

COLUMNS(
  col array<string> COMMENT "from deserializer"
)

SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

---

Manually changing the schema and storage format restores access in Hive
and doesn't affect Spark. Note also that the Hive table property
"spark.sql.sources.schema" is correct. At first glance, it looks like
the schema is serialized correctly when written to the Hive metastore
but not deserialized properly when read back.
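
For anyone who hits the same thing, the manual fix amounts to something
like this in the Hive CLI (a sketch; exact DDL depends on your Hive
version, and REPLACE COLUMNS has to come first, while the table still
has its native SerDe):

ALTER TABLE spark_test_foo REPLACE COLUMNS (education BIGINT, income BIGINT);
ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET;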

I'm tracing execution through source code... but before I get any
deeper, can anyone reproduce this behavior?

Cheers,

Christian

-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christian@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: saveAsTable broken in v1.3 DataFrames?

Posted by Michael Armbrust <mi...@databricks.com>.
I believe that you can get what you want by using HiveQL instead of the
pure programmatic API.  This is a little verbose, so perhaps a specialized
function would also be useful here.  I'm not sure I would call it
saveAsExternalTable, as there are also "external" Spark SQL data source
tables that have nothing to do with Hive.

The following should create a proper Hive table:
df.registerTempTable("df")
sqlContext.sql("CREATE TABLE newTable AS SELECT * FROM df")

At the very least we should clarify in the documentation to avoid future
confusion.  The piggybacking is a little unfortunate but also gives us a
lot of new functionality that we can't get when strictly following the way
that Hive expects tables to be formatted.

I'd suggest opening a JIRA for the specialized method you describe.  Feel
free to mention me and Yin in a comment when you create it.

On Fri, Mar 20, 2015 at 12:55 PM, Christian Perez <ch...@svds.com>
wrote:

> Any other users interested in a feature
> DataFrame.saveAsExternalTable() for making _useful_ external tables in
> Hive, or am I the only one? Bueller? If I start a PR for this, will it
> be taken seriously?
>
> [earlier quoted messages trimmed; they appear in full elsewhere in this archive]

Re: saveAsTable broken in v1.3 DataFrames?

Posted by Christian Perez <ch...@svds.com>.
Any other users interested in a feature
DataFrame.saveAsExternalTable() for making _useful_ external tables in
Hive, or am I the only one? Bueller? If I start a PR for this, will it
be taken seriously?

On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <ch...@svds.com> wrote:
> [quoted messages trimmed; they appear in full elsewhere in this archive]



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christian@svds.com
@cp_phd



Re: saveAsTable broken in v1.3 DataFrames?

Posted by Christian Perez <ch...@svds.com>.
Hi Yin,

Thanks for the clarification. My first reaction is that if this is the
intended behavior, it is a wasted opportunity. Why create a managed
table in Hive that cannot be read from inside Hive? I think I
understand now that you are essentially piggybacking on Hive's
metastore to persist table info between/across sessions, but I imagine
others might expect more (as I have).

We find ourselves wanting to do work in Spark and persist the results
where other users (e.g. analysts using Tableau connected to
Hive/Impala) can explore them. I imagine this is very common. I can, of
course, save the data as Parquet and create an external table in Hive
(which I will do now), but saveAsTable seems much less useful to me now.
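
Concretely, the workaround I have in mind looks something like this (the
path and the external table name are placeholders; adjust the column
types to your schema):

// write plain Parquet files; no metastore entry is created
df.saveAsParquetFile("hdfs:///tmp/spark_test_foo")

// then register an external table over the files, here via the HiveContext
sqlContext.sql("""
  CREATE EXTERNAL TABLE spark_test_foo_ext (education INT, income INT)
  STORED AS PARQUET
  LOCATION 'hdfs:///tmp/spark_test_foo'
""")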

Any other opinions?

Cheers,

C

On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
> [quoted messages trimmed; they appear in full elsewhere in this archive]



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christian@svds.com
@cp_phd



Re: saveAsTable broken in v1.3 DataFrames?

Posted by Yin Huai <yh...@databricks.com>.
I meant that table properties and SerDe properties are used to store the
metadata of a Spark SQL data source table. We do not set other fields like
the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a
data source table should not show unrelated stuff like the SerDe lib and
InputFormat. I have created https://issues.apache.org/jira/browse/SPARK-6413
to track the improvement to the output of the DESCRIBE statement.
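
For example, in Hive:

SHOW TBLPROPERTIES spark_test_foo;
-- the spark.sql.sources.* entries are where the real metadata lives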

On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:

> [quoted message trimmed; it appears in full elsewhere in this archive]

Re: saveAsTable broken in v1.3 DataFrames?

Posted by Yin Huai <yh...@databricks.com>.
Hi Christian,

Your table is stored correctly in Parquet format.

For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We are only using Hive's metastore to store the metadata (to be specific,
only table properties and SerDe properties). When you look at the table
properties, there will be a field called "spark.sql.sources.provider" whose
value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
look at your files in the file system: they are stored as Parquet.
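
A quick way to confirm (assuming the default warehouse location; adjust
the path for your setup):

// read the files back directly as Parquet, bypassing the metastore entry
val check = sqlContext.parquetFile("/user/hive/warehouse/spark_test_foo")
check.printSchema()  // shows the education/income columns and their types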

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <ch...@svds.com>
wrote:

> [quoted message trimmed; it appears in full at the top of this archive]