Posted to dev@spark.apache.org by The Watcher <wa...@gmail.com> on 2015/02/19 23:50:27 UTC

Spark SQL, Hive & Parquet data types

Still trying to get my head around Spark SQL & Hive.

1) Let's assume I *only* use Spark SQL to create and insert data into Hive
tables, declared in a Hive metastore.

Does it matter at all whether Hive supports the data types I need with Parquet,
or is all that matters what Catalyst & Spark's Parquet relation support?

Case in point: timestamps & Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them as of 0.14
So would I be able to read/write timestamps natively in Spark 1.2? Spark
1.3?

I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that? Doesn't the read path go through Spark SQL to read the
Parquet file?
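
For concreteness, a minimal sketch of the scenario in (1), written for
spark-shell with Hive support; the table, its columns, and the
"staging_events" source table are all made-up names:

    // Sketch for spark-shell (Spark 1.2/1.3 built with Hive); `sc` is the
    // shell-provided SparkContext. All names below are hypothetical.
    import org.apache.spark.sql.hive.HiveContext

    val hive = new HiveContext(sc)

    // The table lives in the Hive metastore but is only ever touched from Spark SQL.
    hive.sql(
      """CREATE TABLE IF NOT EXISTS events (
        |  id INT,
        |  event_time TIMESTAMP
        |)
        |STORED AS PARQUET
      """.stripMargin)

    // Write path: populate it from a previously registered source table.
    hive.sql("INSERT INTO TABLE events SELECT id, event_time FROM staging_events")

    // Read path: the question is whether this goes through Spark SQL's own
    // Parquet support or through Hive's, and hence whose timestamp support matters.
    hive.sql("SELECT id, event_time FROM events").collect().foreach(println)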

2) Is there planned support for Hive 0.14?

Thanks

Re: Spark SQL, Hive & Parquet data types

Posted by yash datta <sa...@gmail.com>.
For the old Parquet path (available in 1.2.1), I made a few changes for
being able to read/write to a table partitioned on a timestamp-type column:

https://github.com/apache/spark/pull/4469
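
For illustration only, a sketch of the kind of table that PR targets: a
metastore Parquet table partitioned on a TIMESTAMP column. Written for
spark-shell on the old Parquet path; every table and column name here is
made up:

    // Old Parquet path, spark-shell style (`sc` comes from the shell).
    import org.apache.spark.sql.hive.HiveContext

    val hive = new HiveContext(sc)

    hive.sql(
      """CREATE TABLE IF NOT EXISTS events_by_ts (
        |  id INT,
        |  payload STRING
        |)
        |PARTITIONED BY (event_ts TIMESTAMP)
        |STORED AS PARQUET
      """.stripMargin)

    // Writing into (and reading back from) a single timestamp partition is the
    // case the PR is about; "staging_events" is a hypothetical source table.
    hive.sql(
      """INSERT INTO TABLE events_by_ts
        |PARTITION (event_ts = '2015-02-20 00:00:00')
        |SELECT id, payload FROM staging_events
      """.stripMargin)

    hive.sql("SELECT * FROM events_by_ts WHERE event_ts = '2015-02-20 00:00:00'").collect()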


On Fri, Feb 20, 2015 at 8:28 PM, The Watcher <wa...@gmail.com> wrote:

> >
> >
> >    1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
> >    its own Parquet support to handle both read path and write path when
> >    dealing with Parquet tables declared in Hive metastore, as long as
> you’re
> >    not writing to a partitioned table. So yes, you can.
> >
> Ah, I had missed the part about being partitioned or not. Is this related
> to the work being done on ParquetRelation2?
>
> We will indeed write to a partitioned table: does neither the read nor the
> write path go through Spark SQL's Parquet support in that case? Is there a
> JIRA/PR I can monitor to see when this would change?
>
> Thanks
>



-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.

Re: Spark SQL, Hive & Parquet data types

Posted by Cheng Lian <li...@gmail.com>.
Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the
old one is tightly coupled with the Spark SQL framework, while the new
one is based on the data sources API. In both versions, we try to intercept
operations over Parquet tables registered in the metastore when possible for
better performance (mainly filter push-down optimization and extra
metadata for more accurate schema inference). The distinctions are:

 1. For the old version (set spark.sql.parquet.useDataSourceApi to false):

    When spark.sql.hive.convertMetastoreParquet is set to true, we
    “hijack” the read path. Namely, whenever you query a Parquet table
    registered in the metastore, we’re using our own Parquet implementation.

    For the write path, we fall back to the default Hive SerDe implementation
    (namely Spark SQL’s InsertIntoHiveTable operator).

 2. For the new data source version (set spark.sql.parquet.useDataSourceApi
    to true, which is the default value in master and branch-1.3):

    When spark.sql.hive.convertMetastoreParquet is set to true, we
    “hijack” both the read and write paths, but if you’re writing to a
    partitioned table, we still fall back to the default Hive SerDe
    implementation.

For Spark 1.2.0, only 1 applies. Spark 1.2.0 also has a Parquet data
source, but it’s not enabled unless you use the data sources API
specific DDL (CREATE TEMPORARY TABLE <table-name> USING <data-source>).
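
To make the two switches concrete, a minimal sketch for spark-shell on Spark
1.3 (`sc` is provided by the shell; the temporary table name and the path are
hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    val hive = new HiveContext(sc)

    // Choose between the old Parquet path and the new data sources based one
    // (the latter is the default in master / branch-1.3).
    hive.setConf("spark.sql.parquet.useDataSourceApi", "true")

    // Let Spark SQL take over ("hijack") metastore Parquet tables instead of
    // going through the Hive SerDe.
    hive.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // Data sources API specific DDL: registers an existing Parquet file as a
    // temporary table without involving the Hive metastore at all.
    hive.sql(
      """CREATE TEMPORARY TABLE parquet_events
        |USING org.apache.spark.sql.parquet
        |OPTIONS (path 'hdfs:///tmp/events.parquet')
      """.stripMargin)

    hive.sql("SELECT COUNT(*) FROM parquet_events").collect().foreach(println)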

Cheng

On 2/23/15 10:05 PM, The Watcher wrote:

>> Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
>> own Parquet support to read partitioned Parquet tables declared in Hive
>> metastore. Only writing to partitioned tables is not covered yet. These
>> improvements will be included in Spark 1.3.0.
>>
>> Just created SPARK-5948 to track writing to partitioned Parquet tables.
>>
> Ok, this is still a little confusing.
>
> Since I am able in 1.2.0 to write to a partitioned Hive table by registering my
> SchemaRDD and calling INSERT INTO "the hive partitioned table" SELECT "the
> registered", what is the write path in this case? Full Hive with a
> SparkSQL<->Hive bridge?
> If that were the case, why wouldn't SKEWED ON be honored (see another
> thread I opened)?
>
> Thanks
>

Re: Spark SQL, Hive & Parquet data types

Posted by The Watcher <wa...@gmail.com>.
>
> Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
> own Parquet support to read partitioned Parquet tables declared in Hive
> metastore. Only writing to partitioned tables is not covered yet. These
> improvements will be included in Spark 1.3.0.
>
> Just created SPARK-5948 to track writing to partitioned Parquet tables.
>
Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering my
SchemaRDD and calling INSERT INTO "the hive partitioned table" SELECT "the
registered", what is the write path in this case? Full Hive with a
SparkSQL<->Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks
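
For reference, a sketch of the 1.2.0 flow described above, in spark-shell
style (`sc` comes from the shell); the case class, the temporary table, and
the partitioned Hive table "events_by_day" are all made-up names:

    import org.apache.spark.sql.hive.HiveContext

    val hive = new HiveContext(sc)
    import hive.createSchemaRDD  // implicit RDD -> SchemaRDD conversion (Spark 1.2)

    case class Event(id: Int, value: Double)

    // Build a SchemaRDD and register it as a temporary table.
    val staging = sc.parallelize(Seq(Event(1, 0.5), Event(2, 1.5)))
    staging.registerTempTable("staging_events")

    // Insert into a static partition of an existing partitioned Hive table
    // declared in the metastore.
    hive.sql(
      """INSERT INTO TABLE events_by_day
        |PARTITION (dt = '2015-02-23')
        |SELECT id, value FROM staging_events
      """.stripMargin)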

Re: Spark SQL, Hive & Parquet data types

Posted by Cheng Lian <li...@gmail.com>.
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses 
its own Parquet support to read partitioned Parquet tables declared in 
Hive metastore. Only writing to partitioned tables is not covered yet. 
These improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.

Cheng

On 2/20/15 10:58 PM, The Watcher wrote:
>>
>>     1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
>>     its own Parquet support to handle both read path and write path when
>>     dealing with Parquet tables declared in Hive metastore, as long as you’re
>>     not writing to a partitioned table. So yes, you can.
>>
> Ah, I had missed the part about being partitioned or not. Is this related
> to the work being done on ParquetRelation2?
>
> We will indeed write to a partitioned table: does neither the read nor the
> write path go through Spark SQL's Parquet support in that case? Is there a
> JIRA/PR I can monitor to see when this would change?
>
> Thanks
>




Re: Spark SQL, Hive & Parquet data types

Posted by The Watcher <wa...@gmail.com>.
>
>
>    1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
>    its own Parquet support to handle both read path and write path when
>    dealing with Parquet tables declared in Hive metastore, as long as you’re
>    not writing to a partitioned table. So yes, you can.
>
Ah, I had missed the part about being partitioned or not. Is this related
to the work being done on ParquetRelation2?

We will indeed write to a partitioned table: does neither the read nor the
write path go through Spark SQL's Parquet support in that case? Is there a
JIRA/PR I can monitor to see when this would change?

Thanks

Re: Spark SQL, Hive & Parquet data types

Posted by Cheng Lian <li...@gmail.com>.
For the second question, we do plan to support Hive 0.14, possibly in 
Spark 1.4.0.

For the first question:

 1. In Spark 1.2.0, the Parquet support code doesn’t support timestamp
    type, so you can’t.
 2. In Spark 1.3.0, timestamp support was added, also Spark SQL uses its
    own Parquet support to handle both read path and write path when
    dealing with Parquet tables declared in Hive metastore, as long as
    you’re not writing to a partitioned table. So yes, you can.

The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports 
timestamp type natively. However, the Parquet versions bundled with Hive 
0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively. Neither of them 
supports timestamp type. Hive 0.14.0 “supports” read/write timestamp 
from/to Parquet by converting timestamps from/to Parquet binaries. 
Similarly, Impala converts timestamp into Parquet int96. This can be 
annoying for Spark SQL, because we must interpret Parquet files in 
different ways according to the original writer of the file. As Parquet 
matures, recent Parquet versions support more and more standard data 
types. Mappings from complex nested types to Parquet types are also 
being standardized (see
https://github.com/apache/incubator-parquet-mr/pull/83).
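
As a concrete point of reference, here is a hedged sketch of the Spark 1.3
side of this, using Spark SQL's own Parquet support with no Hive involved
(spark-shell style; the output path is made up). The intent is only to show a
TimestampType column going through the native read and write paths:

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._

    case class Event(id: Int, ts: Timestamp)

    // DataFrame with a TimestampType column, written by Spark SQL's own
    // Parquet support (Parquet 1.6.0rc3 in Spark 1.3.0).
    val df = sc.parallelize(Seq(
      Event(1, Timestamp.valueOf("2015-02-19 23:50:27")),
      Event(2, Timestamp.valueOf("2015-02-20 06:28:00"))
    )).toDF()

    df.saveAsParquetFile("/tmp/events_parquet")               // write path
    val readBack = sqlCtx.parquetFile("/tmp/events_parquet")  // read path
    readBack.printSchema()  // ts should come back as a timestamp column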

On 2/20/15 6:50 AM, The Watcher wrote:

> Still trying to get my head around Spark SQL & Hive.
>
> 1) Let's assume I *only* use Spark SQL to create and insert data into Hive
> tables, declared in a Hive metastore.
>
> Does it matter at all whether Hive supports the data types I need with Parquet,
> or is all that matters what Catalyst & Spark's Parquet relation support?
>
> Case in point: timestamps & Parquet
> * Parquet now supports them as per
> https://github.com/Parquet/parquet-mr/issues/218
> * Hive only supports them as of 0.14
> So would I be able to read/write timestamps natively in Spark 1.2? Spark
> 1.3?
>
> I have found this thread
> http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
> which seems to indicate that the data types supported by Hive would matter
> to Spark SQL.
> If so, why is that? Doesn't the read path go through Spark SQL to read the
> Parquet file?
>
> 2) Is there planned support for Hive 0.14?
>
> Thanks
>