Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2017/09/15 00:31:54 UTC

CHAR implementation?

Hi, All.

Currently, Spark shows different behavior when we use CHAR types.

spark-sql> CREATE TABLE t1(a CHAR(3));
spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t1;
a   3
spark-sql> SELECT a, length(a) FROM t2;
a   3
spark-sql> SELECT a, length(a) FROM t3;
a 2
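
(For reference, under the current behavior the padding for the Parquet-backed
table can be emulated manually with `rpad`; this is only a rough workaround
sketch, not a proposal.)

spark-sql> SELECT rpad(a, 3, ' '), length(rpad(a, 3, ' ')) FROM t3;
a   3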

The reason I'm asking here is that this has been the default behavior of
`STORED AS PARQUET` in Spark for a long time. (Spark 1.6.3 behaves the same way.)

For me, `CREATE TABLE t1(a CHAR(3))` shows the correct result in Spark, but
Parquet has also become the de facto standard in Spark. (I'm not comparing this
with other DBMSs.)

I'm wondering which way we need to go, or want to go, in Spark.

Bests,
Dongjoon.

Re: CHAR implementation?

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Ryan!

Yes, right. If we turn off `spark.sql.hive.convertMetastoreParquet`, Spark
pads the value with spaces.
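
For example, a quick check in the same spark-sql session looks roughly like this:

spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
spark-sql> SELECT a, length(a) FROM t3;
a   3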

For ORC CHAR, it's the same: ORC only handles truncation on write.
The padding is handled on the Hive side in `HiveCharWritable` via
`HiveBaseChar.java` on read.
Spark's ORCFileFormat uses `HiveCharWritable`, so the space is padded whether
`spark.sql.hive.convertMetastoreOrc` is false or true. I was able to test this
in the following PR; previously, it was blocked for another reason.

https://github.com/apache/spark/pull/19235
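
With the change in that PR applied, a quick check looks roughly like this
(and the result is the same with the flag set to true):

spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
spark-sql> SELECT a, length(a) FROM t2;
a   3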

It seems that we may choose between:
- adding the padding logic inside the Spark Parquet reader, or
- ignoring it for performance/backward compatibility.

The Data Source V2 read path was merged into master today.
We will be changing the code base anyway in Spark 2.3.

Bests,
Dongjoon


On Fri, Sep 15, 2017 at 10:05 AM, Ryan Blue <rb...@netflix.com> wrote:

> My guess is that this is because Parquet doesn't have a CHAR type. CHAR
> semantics should be applied to strings by Spark for Parquet.
>
> The reason from Parquet's perspective not to support CHAR is that we have
> no expectation that it is a portable type. Non-SQL writers aren't going to
> pad values with spaces, and it is a terrible idea for Parquet to silently
> alter or truncate data to fit the CHAR type. There's also no performance
> gain from CHAR because multi-byte UTF-8 characters prevent us from using a
> fixed-length binary field. The conclusion we came to is that CHAR is a
> SQL-only type and has to be enforced by SQL engines: Spark should pad or
> truncate values, and expect Parquet to faithfully hand back exactly what
> was stored.
>
> If Spark doesn't have logic for this, then it is probably relying on the
> Hive serde to pad the first case. I'm not sure what ORC does, maybe it has
> a native CHAR type.
>
> rb
>
> On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Currently, Spark shows different behavior when we use CHAR types.
>>
>> spark-sql> CREATE TABLE t1(a CHAR(3));
>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t2;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> The reason I'm asking here is that this has been the default behavior of
>> `STORED AS PARQUET` in Spark for a long time. (Spark 1.6.3 behaves the same way.)
>>
>> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct result in Spark, but
>> Parquet has also become the de facto standard in Spark. (I'm not comparing this
>> with other DBMSs.)
>>
>> I'm wondering which way we need to go, or want to go, in Spark.
>>
>> Bests,
>> Dongjoon.
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: CHAR implementation?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
My guess is that this is because Parquet doesn't have a CHAR type. CHAR
semantics should be applied to strings by Spark for Parquet.

The reason from Parquet's perspective not to support CHAR is that we have
no expectation that it is a portable type. Non-SQL writers aren't going to
pad values with spaces, and it is a terrible idea for Parquet to silently
alter or truncate data to fit the CHAR type. There's also no performance
gain from CHAR because multi-byte UTF-8 characters prevent us from using a
fixed-length binary field. The conclusion we came to is that CHAR is a
SQL-only type and has to be enforced by SQL engines: Spark should pad or
truncate values, and expect Parquet to faithfully hand back exactly what
was stored.
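
In SQL terms, that enforcement for CHAR(3) would behave roughly like the
following sketch: pad or truncate on write, and read the stored value back
as-is. (This is only an illustration of the intended semantics, not a specific
implementation.)

spark-sql> SELECT rpad(substr('a ', 1, 3), 3, ' '), length(rpad(substr('a ', 1, 3), 3, ' '));
a   3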

If Spark doesn't have logic for this, then it is probably relying on the
Hive serde to pad the first case. I'm not sure what ORC does, maybe it has
a native CHAR type.

rb

On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> Currently, Spark shows different behavior when we use CHAR types.
>
> spark-sql> CREATE TABLE t1(a CHAR(3));
> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
> spark-sql> SELECT a, length(a) FROM t1;
> a   3
> spark-sql> SELECT a, length(a) FROM t2;
> a   3
> spark-sql> SELECT a, length(a) FROM t3;
> a 2
>
> The reason I'm asking here is that this has been the default behavior of
> `STORED AS PARQUET` in Spark for a long time. (Spark 1.6.3 behaves the same way.)
>
> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct result in Spark, but
> Parquet has also become the de facto standard in Spark. (I'm not comparing this
> with other DBMSs.)
>
> I'm wondering which way we need to go, or want to go, in Spark.
>
> Bests,
> Dongjoon.
>



-- 
Ryan Blue
Software Engineer
Netflix