Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2020/03/14 22:51:35 UTC

FYI: The evolution on `CHAR` type behavior

Hi, All.

Apache Spark has suffered from a known consistency issue in `CHAR`
type behavior across its usages and configurations. However, the evolution
has been gradually moving toward consistency inside Apache
Spark because we don't have `CHAR` officially. The following is a summary.

In 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive
behavior.)

    spark-sql> CREATE TABLE t1(a CHAR(3));
    spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
    spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

    spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a   3
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
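
A quick way to check the fallback in the same session is to flip the
configuration and re-read the table (a hedged illustration; it assumes
`spark.sql.hive.convertMetastoreParquet` is settable at runtime):

    spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
    spark-sql> SELECT a, length(a) FROM t3;
    a   3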

Since 2.4.0, `STORED AS ORC` became consistent.
(`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
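
Similarly, a hedged check of the ORC fallback, assuming the configuration
is settable at runtime:

    spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
    spark-sql> SELECT a, length(a) FROM t2;
    a   3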

Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
consistent.
(`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
fallback to Hive behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a 2
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2
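
Because this configuration only affects table creation, a hedged check of
the 3.0 fallback needs a fresh table (`t4` is a hypothetical name):

    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
    spark-sql> CREATE TABLE t4(a CHAR(3));
    spark-sql> INSERT INTO TABLE t4 SELECT 'a ';
    spark-sql> SELECT a, length(a) FROM t4;
    a   3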

In addition, in 3.0.0, SPARK-31147 aims to ban the `CHAR/VARCHAR` types in
the following syntax, to be safe.

    CREATE TABLE t(a CHAR(3));
    https://github.com/apache/spark/pull/27902

This email is sent out to inform you based on the new policy we voted on.
The recommendation is to always use Apache Spark's native type `String`.

Bests,
Dongjoon.

References:
1. "CHAR implementation?", 2017/09/15

https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06

https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
I haven't spent enough time thinking about it to give a strong opinion, but this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently swapped the two arguments. And trim has also been in common use for a long time.
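
For example, a single two-argument call silently changes meaning with the
argument order (a hedged illustration, not tied to a specific release):

    spark-sql> SELECT trim('xy', 'yxSparkyy');

Read as trim(trimStr, srcStr), this returns 'Spark'; read as the swapped
trim(srcStr, trimStr), it returns an empty string, because every character
of 'xy' is in the trim set.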

CHAR is an undocumented data type without clearly defined semantics. It's not great that we are changing the value here, but the value is already fucked up. It depends on the underlying data source, and random configs that are seemingly unrelated (orc) would impact the value.

On Mon, Mar 16, 2020 at 4:01 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Hi, Reynold.
> (And +Michael Armbrust)
> 
> 
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted the `TRIM` functions?
> 
> 
> > Are we sure "not padding" is "incorrect"?
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> 
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav.sengupta@gmail.com > wrote:
> 
> 
>> Hi,
>> 
>> 
>> 100% agree with Reynold.
>> 
>> 
>> 
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> 
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@databricks.com > wrote:
>> 
>> 
>>> Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>> 
>>> 
>>> 
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>> "Snowflake currently deviates from common CHAR semantics in that
>>> strings shorter than the maximum length are not space-padded at the end."
>>> 
>>> 
>>> 
>>> MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>
>>>
>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>> 
>>>> Hi, Reynold.
>>>> 
>>>> 
>>>> Please see the following for the context.
>>>> 
>>>> 
>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>> syntax"
>>>> 
>>>> 
>>>> I raised the above issue according to the new rubric, and the banning was
>>>> the proposed alternative to reduce the potential issue.
>>>> 
>>>> 
>>>> Please give us your opinion since it's still a PR.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@databricks.com > wrote:
>>>> 
>>>> 
>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>> of both new and old users?
>>>>> 
>>>>> 
>>>>> For old users, their old code that was working for char(3) would now stop
>>>>> working. 
>>>>> 
>>>>> 
>>>>> For new users, depending on whether the underlying metastore char(3) is
>>>>> either supported but different from ansi Sql (which is not that big of a
>>>>> deal if we explain it) or not supported. 
>>>>> 
>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>> 
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> Apache Spark has suffered from a known consistency issue in `CHAR`
>>>>>> type behavior across its usages and configurations. However, the evolution
>>>>>> has been gradually moving toward consistency inside Apache
>>>>>> Spark because we don't have `CHAR` officially. The following is a
>>>>>> summary.
>>>>>> 
>>>>>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following different result.
>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>>> Hive behavior.)
>>>>>> 
>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>> 
>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>>>> behavior.)
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>>>>> consistent.
>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>> fallback to Hive behavior.)
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>>>>> following syntax to be safe.
>>>>>> 
>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>> 
>>>>>> This email is sent out to inform you based on the new policy we voted on.
>>>>>> The recommendation is to always use Apache Spark's native type `String`.
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> References:
>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>> syntax", 2019/12/06
>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Maryann Xue <ma...@databricks.com>.
It would be super weird not to support VARCHAR as a SQL engine. Banning CHAR
is probably fine, as its semantics are genuinely confusing.
We can issue a warning when parsing VARCHAR with a limit and suggest the
usage of String instead.

On Tue, Mar 17, 2020 at 10:27 AM Wenchen Fan <cl...@gmail.com> wrote:

> I agree that Spark can define the semantics of CHAR(x) differently from
> the SQL standard (no padding), and ask the data sources to follow it. But
> the problem is, some data sources may not be able to skip padding, like the
> Hive serde table.
>
> On the other hand, it's easier to require padding for CHAR(x). Even if
> some data sources don't support padding, Spark can simply do the padding at
> the read time, using the `rpad` function. However, if CHAR(x) is rarely
> used, maybe we should just ban it and only keep it for Hive compatibility,
> to save our work.
>
> VARCHAR(x) is a different story as it's a commonly used data type in
> databases. It has a length limitation which can help the backend engine to
> make better decisions when dealing with it. Currently Spark just treats
> VARCHAR(x) as string type, which works fine in most cases, but different
> data sources may have different behaviors during writing. For example,
> the pgsql JDBC data source fails the write if the length limitation is hit,
> the Hive serde table simply truncates the chars exceeding the length
> limitation, and the Parquet data source writes whatever string it gets.
>
> We can just document that, the underlying data source may or may not
> enforce the length limitation of VARCHAR(x). Or we can make VARCHAR(x) a
> first-class data type, which requires a lot more changes (type coercion,
> type cast, etc.).
>
> Before we make a final decision, I think it's reasonable to ban
> CHAR/VARCHAR in non-Hive-serde tables in 3.0, so that we don't introduce
> silent result changes here.
>
> Any ideas are welcome!
>
> Thanks,
> Wenchen
>
> On Tue, Mar 17, 2020 at 11:29 AM Stephen Coy <sc...@infomedia.com.au.invalid>
> wrote:
>
>> I don’t think I can recall any usages of type CHAR in any situation.
>>
>> Really, its only use (on any traditional SQL database) would be when you
>> *want* a fixed-width character column that has been right-padded with
>> spaces.
>>
>>
>> On 17 Mar 2020, at 12:13 pm, Reynold Xin <rx...@databricks.com> wrote:
>>
>> For sure.
>>
>> There's another reason I feel char is not that important and it's more
>> important to be internally consistent (e.g. all data sources support it
>> with the same behavior, vs. one data source doing one behavior and another
>> doing the other). char was created at a time when cpu was slow and storage was
>> expensive, and being able to pack things nicely at fixed length was highly
>> useful. The fact that it was padded was initially done for performance, not
>> for the padding itself. A lot has changed since char was invented, and with
>> modern technologies (columnar, dictionary encoding, etc) there is little
>> reason to use a char data type for anything. As a matter of fact, Spark
>> internally converts char type to string to work with.
>>
>>
>> I see two solutions really.
>>
>> 1. We require padding, and ban all uses of char when it is not properly
>> padded. This would ban all the native data sources, which are the primary
>> way people use Spark. This leaves only char support for tables going
>> through Hive serdes, which are slow to begin with. It is basically Dongjoon
>> and Wenchen's suggestion. This turns char support into a compatibility
>> feature only for some Hive tables that cannot be converted into Spark
>> native data sources. This has confusing end-user behavior because depending
>> on whether that Hive table is converted into Spark native data sources, we
>> might or might not support char type.
>>
>> An extension to the above is to introduce padding for char type across
>> the board, and make char type a first-class data type. There is a lot of
>> work to introduce another data type, especially for one that has virtually no
>> usage
>> <https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string> and
>> its usage will likely continue to decline in the future (just reason from
>> first principle based on the reason char was introduced in the first place).
>>
>> Now I'm assuming it's a lot of work to do char properly. But if it is not
>> the case (e.g. just a simple rule to insert padding at planning time), then
>> maybe it's worth doing it this way. I'm totally OK with this too.
>>
>> What I'd oppose is to just ban char for the native data sources, and do
>> not have a plan to address this problem systematically.
>>
>>
>> 2. Just forget about padding, like what Snowflake and MySQL have done.
>> Document that char(x) is just an alias for string. And then move on. Almost
>> no work needs to be done...
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Thank you for sharing and confirming.
>>>
>>> We had better consider all heterogeneous customers in the world. And, I
>>> also have experience with non-negligible cases on-prem.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> −User
>>>>
>>>> char barely showed up (honestly negligible). I was comparing select vs
>>>> select.
>>>>
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com
>>>> > wrote:
>>>>
>>>>> Ur, are you comparing the number of SELECT statements with TRIM and
>>>>> CREATE statements with `CHAR`?
>>>>>
>>>>> > I looked up our usage logs (sorry I can't share this publicly) and
>>>>> trim has at least four orders of magnitude higher usage than char.
>>>>>
>>>>> We need to discuss more about what to do. This thread is what I
>>>>> expected exactly. :)
>>>>>
>>>>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>>>> flawed when plenty of other popular database systems also deviate away from
>>>>> the standard on this specific behavior.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>>>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>>>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>>>>> flawed when plenty of other popular database systems also deviate away from
>>>>>> the standard on this specific behavior.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I looked up our usage logs (sorry I can't share this publicly) and
>>>>>>> trim has at least four orders of magnitude higher usage than char.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <
>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you, Stephen and Reynold.
>>>>>>>>
>>>>>>>> To Reynold.
>>>>>>>>
>>>>>>>> The way I see the following is a little different.
>>>>>>>>
>>>>>>>>       > CHAR is an undocumented data type without clearly defined
>>>>>>>> semantics.
>>>>>>>>
>>>>>>>> Let me describe it from an Apache Spark user's point of view.
>>>>>>>>
>>>>>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql`
>>>>>>>> function) at Apache Spark 1.x without much documentation. In addition,
>>>>>>>> there still exists an effort trying to keep it alive in the 3.0.0 era.
>>>>>>>>
>>>>>>>>        https://issues.apache.org/jira/browse/SPARK-31088
>>>>>>>>        Add back HiveContext and createExternalTable
>>>>>>>>
>>>>>>>> Historically, we tried to make many SQL-based customers migrate
>>>>>>>> their workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>>>>>>
>>>>>>>> Although Apache Spark didn't have a good document about the
>>>>>>>> inconsistent behavior among its data sources, Apache Hive has been
>>>>>>>> providing its documentation and many customers rely on that behavior.
>>>>>>>>
>>>>>>>>       -
>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>>>>>>
>>>>>>>> At that time, frequently in on-prem Hadoop clusters by well-known
>>>>>>>> vendors, many existing huge tables were created by Apache Hive, not Apache
>>>>>>>> Spark. And, Apache Spark is used for boosting SQL performance with its
>>>>>>>> *caching*. This was true because Apache Spark was added into the
>>>>>>>> Hadoop-vendor products later than Apache Hive.
>>>>>>>>
>>>>>>>> Until the turning point at Apache Spark 2.0, we tried to catch
>>>>>>>> up more features to be consistent at least with Hive tables in Apache Hive
>>>>>>>> and Apache Spark because two SQL engines share the same tables.
>>>>>>>>
>>>>>>>> For the following, technically, while Apache Hive hasn't changed
>>>>>>>> its existing behavior in this part, Apache Spark has inevitably evolved
>>>>>>>> by moving away from its original behaviors one-by-one.
>>>>>>>>
>>>>>>>>       >  the value is already fucked up
>>>>>>>>
>>>>>>>> The following is the change log.
>>>>>>>>
>>>>>>>>       - When we switched the default value of
>>>>>>>> `convertMetastoreParquet`. (at Apache Spark 1.2)
>>>>>>>>       - When we switched the default value of `convertMetastoreOrc`
>>>>>>>> (at Apache Spark 2.4)
>>>>>>>>       - When we switched `CREATE TABLE` itself. (Change `TEXT`
>>>>>>>> table to `PARQUET` table at Apache Spark 3.0)
>>>>>>>>
>>>>>>>> To sum up, this has been a well-known issue in the community and
>>>>>>>> among the customers.
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <sc...@infomedia.com.au>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I’m kind of new around here, but I have had experience with all of
>>>>>>>>> the so-called “big iron” databases such as Oracle, IBM DB2 and
>>>>>>>>> Microsoft SQL Server as well as Postgresql.
>>>>>>>>>
>>>>>>>>> They all support the notion of “ANSI padding” for CHAR columns -
>>>>>>>>> which means that such columns are always space padded, and they default to
>>>>>>>>> having this enabled (for ANSI compliance).
>>>>>>>>>
>>>>>>>>> MySQL also supports it, but it defaults to leaving it disabled for
>>>>>>>>> historical reasons not unlike what we have here.
>>>>>>>>>
>>>>>>>>> In my opinion we should push toward standards compliance where
>>>>>>>>> possible and then document where it cannot work.
>>>>>>>>>
>>>>>>>>> If users don’t like the padding on CHAR columns then they should
>>>>>>>>> change to VARCHAR - I believe that was its purpose in the first place, and
>>>>>>>>> it does not dictate any sort of “padding".
>>>>>>>>>
>>>>>>>>> I can see why you might “ban” the use of CHAR columns where they
>>>>>>>>> cannot be consistently supported, but VARCHAR is a different animal and I
>>>>>>>>> would expect it to work consistently everywhere.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Steve C
>>>>>>>>>
>>>>>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <
>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi, Reynold.
>>>>>>>>> (And +Michael Armbrust)
>>>>>>>>>
>>>>>>>>> If you think so, do you think it's okay that we change the return
>>>>>>>>> value silently? Then, I'm wondering why we reverted the `TRIM` functions?
>>>>>>>>>
>>>>>>>>> > Are we sure "not padding" is "incorrect"?
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
>>>>>>>>> gourav.sengupta@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> 100% agree with Reynold.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Gourav Sengupta
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>>>>>>>
>>>>>>>>>>> I don't know whether ANSI SQL actually requires padding, but
>>>>>>>>>>> plenty of databases don't actually pad.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>>>>>>>> "Snowflake currently deviates from common CHAR semantics in that strings
>>>>>>>>>>> shorter than the maximum length are not space-padded at the end."
>>>>>>>>>>>
>>>>>>>>>>> MySQL:
>>>>>>>>>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <
>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Reynold.
>>>>>>>>>>>>
>>>>>>>>>>>> Please see the following for the context.
>>>>>>>>>>>>
>>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>>>>>>> "Revert SPARK-30098 Use default datasource as provider for
>>>>>>>>>>>> CREATE TABLE syntax"
>>>>>>>>>>>>
>>>>>>>>>>>> I raised the above issue according to the new rubric, and the
>>>>>>>>>>>> banning was the proposed alternative to reduce the potential issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Please give us your opinion since it's still a PR.
>>>>>>>>>>>>
>>>>>>>>>>>> Bests,
>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse
>>>>>>>>>>>>> the hell out of both new and old users?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For old users, their old code that was working for char(3)
>>>>>>>>>>>>> would now stop working.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For new users, depending on whether the underlying metastore
>>>>>>>>>>>>> char(3) is either supported but different from ansi Sql (which is not that
>>>>>>>>>>>>> big of a deal if we explain it) or not supported.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <
>>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, All.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apache Spark has suffered from a known consistency issue
>>>>>>>>>>>>>> in `CHAR` type behavior across its usages and configurations. However, the
>>>>>>>>>>>>>> evolution has been gradually moving toward consistency
>>>>>>>>>>>>>> inside Apache Spark because we don't have `CHAR` officially. The following
>>>>>>>>>>>>>> is a summary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following
>>>>>>>>>>>>>> different result.
>>>>>>>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a
>>>>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a
>>>>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS`
>>>>>>>>>>>>>> clause) became consistent.
>>>>>>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true`
>>>>>>>>>>>>>> provides a fallback to Hive behavior.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR`
>>>>>>>>>>>>>> type in the following syntax to be safe.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This email is sent out to inform you based on the new policy
>>>>>>>>>>>>>> we voted on.
>>>>>>>>>>>>>> The recommendation is to always use Apache Spark's native type
>>>>>>>>>>>>>> `String`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> References:
>>>>>>>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for
>>>>>>>>>>>>>> CREATE TABLE syntax", 2019/12/06
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>
>>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Wenchen Fan <cl...@gmail.com>.
I agree that Spark can define the semantics of CHAR(x) differently from the
SQL standard (no padding), and ask the data sources to follow it. But the
problem is, some data sources may not be able to skip padding, like the
Hive serde table.

On the other hand, it's easier to require padding for CHAR(x). Even if some
data sources don't support padding, Spark can simply do the padding at
read time, using the `rpad` function. However, if CHAR(x) is rarely used,
maybe we should just ban it and only keep it for Hive compatibility, to
save our work.
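
For example, padded semantics can be restored at read time on top of a
source that stored the raw value (a hedged sketch against the `t3` table
from the original email):

    spark-sql> SELECT rpad(a, 3, ' '), length(rpad(a, 3, ' ')) FROM t3;
    a   3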

VARCHAR(x) is a different story as it's a commonly used data type in
databases. It has a length limitation which can help the backend engine to
make better decisions when dealing with it. Currently Spark just treats
VARCHAR(x) as string type, which works fine in most cases, but different
data sources may have different behaviors during writing. For example,
the pgsql JDBC data source fails the write if the length limitation is hit,
the Hive serde table simply truncates the chars exceeding the length
limitation, and the Parquet data source writes whatever string it gets.
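
A hedged sketch of the Parquet case, assuming a release (e.g. 2.4.x) that
still accepts VARCHAR in datasource DDL; `v1` is a hypothetical table:

    spark-sql> CREATE TABLE v1(a VARCHAR(3)) USING PARQUET;
    spark-sql> INSERT INTO TABLE v1 SELECT 'abcdef';
    spark-sql> SELECT a, length(a) FROM v1;
    abcdef 6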

We can just document that, the underlying data source may or may not
enforce the length limitation of VARCHAR(x). Or we can make VARCHAR(x) a
first-class data type, which requires a lot more changes (type coercion,
type cast, etc.).

Before we make a final decision, I think it's reasonable to ban CHAR/VARCHAR
in non-Hive-serde tables in 3.0, so that we don't introduce silent result
changes here.

Any ideas are welcome!

Thanks,
Wenchen

On Tue, Mar 17, 2020 at 11:29 AM Stephen Coy <sc...@infomedia.com.au.invalid>
wrote:

> I don’t think I can recall any usages of type CHAR in any situation.
>
> Really, its only use (on any traditional SQL database) would be when you
> *want* a fixed-width character column that has been right-padded with
> spaces.
>
>
> On 17 Mar 2020, at 12:13 pm, Reynold Xin <rx...@databricks.com> wrote:
>
> For sure.
>
> There's another reason I feel char is not that important and it's more
> important to be internally consistent (e.g. all data sources support it
> with the same behavior, vs. one data source doing one behavior and another doing
> the other). char was created at a time when cpu was slow and storage was
> expensive, and being able to pack things nicely at fixed length was highly
> useful. The fact that it was padded was initially done for performance, not
> for the padding itself. A lot has changed since char was invented, and with
> modern technologies (columnar, dictionary encoding, etc) there is little
> reason to use a char data type for anything. As a matter of fact, Spark
> internally converts char type to string to work with.
>
>
> I see two solutions really.
>
> 1. We require padding, and ban all uses of char when it is not properly
> padded. This would ban all the native data sources, which are the primary
> way people use Spark. This leaves only char support for tables going
> through Hive serdes, which are slow to begin with. It is basically Dongjoon
> and Wenchen's suggestion. This turns char support into a compatibility
> feature only for some Hive tables that cannot be converted into Spark
> native data sources. This has confusing end-user behavior because depending
> on whether that Hive table is converted into Spark native data sources, we
> might or might not support char type.
>
> An extension to the above is to introduce padding for char type across the
> board, and make char type a first-class data type. There is a lot of work
> to introduce another data type, especially for one that has virtually no
> usage
> <https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string> and
> its usage will likely continue to decline in the future (just reason from
> first principle based on the reason char was introduced in the first place).
>
> Now I'm assuming it's a lot of work to do char properly. But if it is not
> the case (e.g. just a simple rule to insert padding at planning time), then
> maybe it's worth doing it this way. I'm totally OK with this too.
>
> What I'd oppose is to just ban char for the native data sources, and do
> not have a plan to address this problem systematically.
>
>
> 2. Just forget about padding, like what Snowflake and MySQL have done.
> Document that char(x) is just an alias for string. And then move on. Almost
> no work needs to be done...
>
>
>
>
>
>
>
> On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Thank you for sharing and confirming.
>>
>> We had better consider all heterogeneous customers in the world. And, I
>> also have experience with non-negligible cases on-prem.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> −User
>>>
>>> char barely showed up (honestly negligible). I was comparing select vs
>>> select.
>>>
>>>
>>>
>>> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Ur, are you comparing the number of SELECT statements with TRIM and
>>>> CREATE statements with `CHAR`?
>>>>
>>>> > I looked up our usage logs (sorry I can't share this publicly) and
>>>> trim has at least four orders of magnitude higher usage than char.
>>>>
>>>> We need to discuss more about what to do. This thread is what I
>>>> expected exactly. :)
>>>>
>>>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>>> flawed when plenty of other popular database systems also deviate away from
>>>> the standard on this specific behavior.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>>>> flawed when plenty of other popular database systems also deviate away from
>>>>> the standard on this specific behavior.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> I looked up our usage logs (sorry I can't share this publicly) and
>>>>>> trim has at least four orders of magnitude higher usage than char.
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <
>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you, Stephen and Reynold.
>>>>>>>
>>>>>>> To Reynold.
>>>>>>>
>>>>>>> The way I see the following is a little different.
>>>>>>>
>>>>>>>       > CHAR is an undocumented data type without clearly defined
>>>>>>> semantics.
>>>>>>>
>>>>>>> Let me describe it from an Apache Spark user's point of view.
>>>>>>>
>>>>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql`
>>>>>>> function) at Apache Spark 1.x without much documentation. In addition,
>>>>>>> there still exists an effort trying to keep it alive in the 3.0.0 era.
>>>>>>>
>>>>>>>        https://issues.apache.org/jira/browse/SPARK-31088
>>>>>>>        Add back HiveContext and createExternalTable
>>>>>>>
>>>>>>> Historically, we tried to make many SQL-based customers migrate their
>>>>>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>>>>>
>>>>>>> Although Apache Spark didn't have a good document about the
>>>>>>> inconsistent behavior among its data sources, Apache Hive has been
>>>>>>> providing its documentation and many customers rely on that behavior.
>>>>>>>
>>>>>>>       -
>>>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>>>>>
>>>>>>> At that time, frequently in on-prem Hadoop clusters by well-known
>>>>>>> vendors, many existing huge tables were created by Apache Hive, not Apache
>>>>>>> Spark. And, Apache Spark is used for boosting SQL performance with its
>>>>>>> *caching*. This was true because Apache Spark was added into the
>>>>>>> Hadoop-vendor products later than Apache Hive.
>>>>>>>
>>>>>>> Until the turning point at Apache Spark 2.0, we tried to catch
>>>>>>> up more features to be consistent at least with Hive tables in Apache Hive
>>>>>>> and Apache Spark because two SQL engines share the same tables.
>>>>>>>
>>>>>>> For the following, technically, while Apache Hive hasn't changed
>>>>>>> its existing behavior in this part, Apache Spark has inevitably evolved
>>>>>>> by moving away from its original behaviors one-by-one.
>>>>>>>
>>>>>>>       >  the value is already fucked up
>>>>>>>
>>>>>>> The following is the change log.
>>>>>>>
>>>>>>>       - When we switched the default value of
>>>>>>> `convertMetastoreParquet`. (at Apache Spark 1.2)
>>>>>>>       - When we switched the default value of `convertMetastoreOrc`
>>>>>>> (at Apache Spark 2.4)
>>>>>>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table
>>>>>>> to `PARQUET` table at Apache Spark 3.0)
>>>>>>>
>>>>>>> To sum up, this has been a well-known issue in the community and
>>>>>>> among the customers.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <sc...@infomedia.com.au>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I’m kind of new around here, but I have had experience with all of
>>>>>>>> the so-called “big iron” databases such as Oracle, IBM DB2 and
>>>>>>>> Microsoft SQL Server as well as Postgresql.
>>>>>>>>
>>>>>>>> They all support the notion of “ANSI padding” for CHAR columns -
>>>>>>>> which means that such columns are always space padded, and they default to
>>>>>>>> having this enabled (for ANSI compliance).
>>>>>>>>
>>>>>>>> MySQL also supports it, but it defaults to leaving it disabled for
>>>>>>>> historical reasons not unlike what we have here.
>>>>>>>>
>>>>>>>> In my opinion we should push toward standards compliance where
>>>>>>>> possible and then document where it cannot work.
>>>>>>>>
>>>>>>>> If users don’t like the padding on CHAR columns then they should
>>>>>>>> change to VARCHAR - I believe that was its purpose in the first place, and
>>>>>>>> it does not dictate any sort of “padding".
>>>>>>>>
>>>>>>>> I can see why you might “ban” the use of CHAR columns where they
>>>>>>>> cannot be consistently supported, but VARCHAR is a different animal and I
>>>>>>>> would expect it to work consistently everywhere.
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Steve C
>>>>>>>>
>>>>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <do...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi, Reynold.
>>>>>>>> (And +Michael Armbrust)
>>>>>>>>
>>>>>>>> If you think so, do you think it's okay that we change the return
>>>>>>>> value silently? Then, I'm wondering why we reverted the `TRIM` functions?
>>>>>>>>
>>>>>>>> > Are we sure "not padding" is "incorrect"?
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
>>>>>>>> gourav.sengupta@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> 100% agree with Reynold.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Gourav Sengupta
>>>>>>>>>
>>>>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>>>>>>
>>>>>>>>>> I don't know whether ANSI SQL actually requires padding, but
>>>>>>>>>> plenty of databases don't actually pad.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>>>>>>> "Snowflake currently deviates from common CHAR semantics in that strings
>>>>>>>>>> shorter than the maximum length are not space-padded at the end."
>>>>>>>>>>
>>>>>>>>>> MySQL:
>>>>>>>>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <
>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, Reynold.
>>>>>>>>>>>
>>>>>>>>>>> Please see the following for the context.
>>>>>>>>>>>
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>>>>>> "Revert SPARK-30098 Use default datasource as provider for
>>>>>>>>>>> CREATE TABLE syntax"
>>>>>>>>>>>
>>>>>>>>>>> I raised the above issue according to the new rubric, and the
>>>>>>>>>>> banning was the proposed alternative to reduce the potential issue.
>>>>>>>>>>>
>>>>>>>>>>> Please give us your opinion since it's still a PR.
>>>>>>>>>>>
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the
>>>>>>>>>>>> hell out of both new and old users?
>>>>>>>>>>>>
>>>>>>>>>>>> For old users, their old code that was working for char(3)
>>>>>>>>>>>> would now stop working.
>>>>>>>>>>>>
>>>>>>>>>>>> For new users, depending on whether the underlying metastore
>>>>>>>>>>>> char(3) is either supported but different from ansi Sql (which is not that
>>>>>>>>>>>> big of a deal if we explain it) or not supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <
>>>>>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, All.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apache Spark has suffered from a known consistency issue
>>>>>>>>>>>>> in `CHAR` type behavior across its usages and configurations. However, the
>>>>>>>>>>>>> evolution has been gradually moving toward consistency
>>>>>>>>>>>>> inside Apache Spark because we don't have `CHAR` officially. The following
>>>>>>>>>>>>> is a summary.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following
>>>>>>>>>>>>> different result.
>>>>>>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a
>>>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>>>>>>>
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a
>>>>>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>     a   3
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS`
>>>>>>>>>>>>> clause) became consistent.
>>>>>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true`
>>>>>>>>>>>>> provides a fallback to Hive behavior.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>>>>>     a 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR`
>>>>>>>>>>>>> type in the following syntax to be safe.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>>>>>>>>>
>>>>>>>>>>>>> This email is sent out to inform you based on the new policy
>>>>>>>>>>>>> we voted on.
>>>>>>>>>>>>> The recommendation is to always use Apache Spark's native type
>>>>>>>>>>>>> `String`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> References:
>>>>>>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for
>>>>>>>>>>>>> CREATE TABLE syntax", 2019/12/06
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Stephen Coy <sc...@infomedia.com.au.INVALID>.
I don’t think I can recall any usage of type CHAR in any situation.

Really, its only use (on any traditional SQL database) would be when you *want* a fixed-width character column that has been right-padded with spaces.


On 17 Mar 2020, at 12:13 pm, Reynold Xin <rx...@databricks.com> wrote:


For sure.

There's another reason I feel char is not that important, and it's more important to be internally consistent (e.g. all data sources support it with the same behavior, vs. one data source doing one thing and another doing the other). char was created at a time when CPUs were slow and storage was expensive, and being able to pack things nicely at fixed length was highly useful. The fact that it was padded was initially done for performance, not for the padding itself. A lot has changed since char was invented, and with modern technologies (columnar, dictionary encoding, etc.) there is little reason to use a char data type for anything. As a matter of fact, Spark internally converts the char type to string to work with.


I see two solutions really.

1. We require padding, and ban all uses of char when it is not properly padded. This would ban char for all the native data sources, which are the primary way people use Spark. This leaves char support only for tables going through Hive serdes, which are slow to begin with. It is basically Dongjoon and Wenchen's suggestion. This turns char support into a compatibility feature only for some Hive tables that cannot be converted into Spark native data sources. This has confusing end-user behavior, because depending on whether a given Hive table is converted into a Spark native data source, we might or might not support the char type.

An extension to the above is to introduce padding for the char type across the board, and make char a first-class data type. There is a lot of work in introducing another data type, especially for one that has virtually no usage (https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string) and whose usage will likely continue to decline in the future (just reasoning from first principles based on why char was introduced in the first place).

Now I'm assuming it's a lot of work to do char properly. But if that is not the case (e.g. it's just a simple rule to insert padding at planning time), then maybe it's worth doing it this way. I'm totally OK with this too.

What I'd oppose is to just ban char for the native data sources, and do not have a plan to address this problem systematically.


2. Just forget about padding, like what Snowflake and MySQL have done. Document that char(x) is just an alias for string. And then move on. Almost no work needs to be done...







On Mon, Mar 16, 2020 at 5:54 PM, Dongjoon Hyun <do...@gmail.com> wrote:
Thank you for sharing and confirming.

We had better consider all the heterogeneous customers in the world. And I also have experience with the non-negligible cases on-prem.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <rx...@databricks.com> wrote:
−User

char barely showed up (honestly negligible). I was comparing select vs select.



On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <do...@gmail.com> wrote:
Ur, are you comparing the number of SELECT statements with TRIM to the number of CREATE statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.

We need to discuss more about what to do. This thread is what I expected exactly. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from the standard on this specific behavior.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rx...@databricks.com> wrote:
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from the standard on this specific behavior.




On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <rx...@databricks.com> wrote:
I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.


On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <do...@gmail.com> wrote:
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

      > CHAR is an undocumented data type without clearly defined semantics.

Let me describe it from an Apache Spark user's point of view.

Apache Spark started to claim `HiveContext` (and the `hql/hiveql` functions) at Apache Spark 1.x without much documentation. In addition, there still exists an effort to keep it alive in the 3.0.0 era.

       https://issues.apache.org/jira/browse/SPARK-31088
       Add back HiveContext and createExternalTable

Historically, we tried to make many SQL-based customers migrate their workloads from Apache Hive into Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent behavior among its data sources, Apache Hive has been providing its documentation, and many customers rely on that behavior.

      - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, in on-prem Hadoop clusters from well-known vendors, many existing huge tables had been created by Apache Hive, not Apache Spark, and Apache Spark was frequently used for boosting SQL performance with its *caching*. This was true because Apache Spark was added into the Hadoop vendors' products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on features so that Apache Hive and Apache Spark would be consistent, at least for Hive tables, because the two SQL engines share the same tables.

For the following, technically, Apache Hive hasn't changed its existing behavior in this part, while Apache Spark has inevitably evolved by moving away from its original behaviors one by one.

      >  the value is already fucked up

The following is the change log.

      - When we switched the default value of `convertMetastoreParquet`. (at Apache Spark 1.2)
      - When we switched the default value of `convertMetastoreOrc` (at Apache Spark 2.4)
      - When we switched `CREATE TABLE` itself. (Change `TEXT` table to `PARQUET` table at Apache Spark 3.0)
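
For reference, a minimal sketch of the per-session fallback settings tied to each switch above (the same configuration names mentioned earlier in this thread; setting them is one way to restore the Hive behavior):

    spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
    spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;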

To sum up, this has been a well-known issue in the community and among the customers.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <sc...@infomedia.com.au> wrote:
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server, as well as Postgresql.

They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always space padded, and they default to having this enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to VARCHAR - I believe that was its purpose in the first place, and it does not dictate any sort of “padding”.

I can see why you might “ban” the use of CHAR columns where they cannot be consistently supported, but VARCHAR is a different animal and I would expect it to work consistently everywhere.
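
As a minimal sketch of the distinction, under the ANSI padding semantics described above (the “big iron” behavior, not what Spark’s native sources currently do):

    CREATE TABLE t(c CHAR(3), v VARCHAR(3));
    INSERT INTO t VALUES ('a ', 'a ');
    SELECT length(c), length(v) FROM t;
    3 2

That is, the CHAR value is right-padded to length 3, while the VARCHAR value keeps its length of 2.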


Cheers,

Steve C

On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <do...@gmail.com> wrote:

Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value silently? Then, I'm wondering why we reverted `TRIM` functions then?

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <go...@gmail.com> wrote:
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com> wrote:
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html : "Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end."

MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql








On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <do...@gmail.com> wrote:
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax"

I raised the above issue according to the new rubric, and the banning was the proposed alternative to reduce the potential issue.

Please give us your opinion since it's still a PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com> wrote:
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users?

For old users, their old code that was working for char(3) would now stop working.

For new users, the underlying metastore char(3) is either supported but different from ANSI SQL (which is not that big of a deal if we explain it) or not supported.




Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
I agree it sucks. We started with a decision that might have made sense back in 2013 (let's use Hive as the default source and, guess what, pick the slowest possible serde by default). We have been paying that debt ever since.

Thanks for bringing this thread up though. We don't have a clear solution yet, but at least it made a lot of people aware of the issues.


Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Technically, I have been suffering from (1) `CREATE TABLE` due to its many
differences for a long time (since 2017). So, I had a wrong assumption about
the implication of "(2) FYI: SPARK-30098 Use default datasource as
provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel
the same way. However, it was a lot to me. Also, switching
`convertMetastoreOrc` at 2.4 was a big change to me, although there will be
no difference for Parquet-only users.

Dongjoon.

> References:
> 1. "CHAR implementation?", 2017/09/15
>
https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06
>
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E




Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
You are joking when you said "informed widely and discussed in many ways twice", right?

This thread doesn't even talk about char/varchar: https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

(Yes it talked about changing the default data source provider, but that's just one of the ways we are exposing this char/varchar issue).


Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 for Wenchen's suggestion.

I believe that the difference and effects are informed widely and discussed
in many ways twice.

First, this was shared last December.

    "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06

https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Second (at this time in this thread), this has been discussed according to
the new community rubric.

    - https://spark.apache.org/versioning-policy.html (Section:
"Considerations When Breaking APIs")

Thank you all.

Bests,
Dongjoon.


Re: FYI: The evolution on `CHAR` type behavior

Posted by Wenchen Fan <cl...@gmail.com>.
OK let me put a proposal here:

1. Permanently ban CHAR for native data source tables, and only keep it for
Hive compatibility.
It's OK to forget about padding like what Snowflake and MySQL have done.
But it's hard for Spark to require consistent behavior for the CHAR type
across all data sources. Since the CHAR type is not that useful nowadays, it
seems OK to just ban it (see the sketch after this list). Another way is to
document that the padding of the CHAR type is data source dependent, but
it's a bit weird to leave this inconsistency in Spark.

2. Leave VARCHAR unchanged in 3.0
The VARCHAR type is so widely used in databases that it would be weird if
Spark didn't support it. VARCHAR is exactly the same as Spark's StringType
when the length limitation is not hit, and I'm fine with temporarily leaving
this flaw in 3.0; users may hit behavior changes when their string values
hit the VARCHAR length limitation.

3. Finalize the VARCHAR behavior in 3.1
For now I have 2 ideas:
a) Make VARCHAR(x) a first-class data type. This means Spark data sources
should support VARCHAR, and CREATE TABLE should fail if a column is VARCHAR
type and the underlying data source doesn't support it (e.g. JSON/CSV).
Type cast, type coercion, table insertion, etc. should be updated as well.
b) Simply document that the underlying data source may or may not enforce
the length limitation of VARCHAR(x).
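
To make (1) and (3.a) concrete, a rough sketch of the behavior they would
imply (a hypothetical session; the error messages are made up for
illustration and are not current Spark behavior):

    spark-sql> CREATE TABLE t1(a CHAR(3)) USING parquet;
    Error: CHAR type is not supported for native data source tables    (hypothetical)
    spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS PARQUET;           -- still allowed for Hive compatibility

    spark-sql> CREATE TABLE t3(a VARCHAR(3)) USING json;
    Error: data source 'json' does not support VARCHAR(3)              (hypothetical)
    spark-sql> CREATE TABLE t4(a VARCHAR(3)) USING parquet;
    spark-sql> INSERT INTO t4 VALUES ('abcd');
    Error: value 'abcd' exceeds the VARCHAR(3) length limitation       (hypothetical)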

Please let me know if you have different ideas.

Thanks,
Wenchen


Re: FYI: The evolution on `CHAR` type behavior

Posted by Michael Armbrust <mi...@databricks.com>.
>
> What I'd oppose is to just ban char for the native data sources, and do
> not have a plan to address this problem systematically.
>

+1


> Just forget about padding, like what Snowflake and MySQL have done.
> Document that char(x) is just an alias for string. And then move on. Almost
> no work needs to be done...
>

+1

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
For sure.

There's another reason I feel char is not that important, and it's more important to be internally consistent (e.g. all data sources support it with the same behavior, vs. one data source doing one thing and another doing the other). char was created at a time when CPUs were slow and storage was expensive, and being able to pack things nicely at fixed length was highly useful. The fact that it was padded was initially done for performance, not for the padding itself. A lot has changed since char was invented, and with modern technologies (columnar, dictionary encoding, etc.) there is little reason to use a char data type for anything. As a matter of fact, Spark internally converts the char type to string to work with.
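
(As a quick sketch of that last point — the exact output can vary by Spark version and catalog:

    spark-sql> CREATE TABLE t(a CHAR(3)) USING parquet;
    spark-sql> DESC t;
    a string

i.e. the column comes back as a plain string on the Spark side.)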

I see two solutions really.

1. We require padding, and ban all uses of char when it is not properly padded. This would ban char for all the native data sources, which are the primary way people use Spark. This leaves char support only for tables going through Hive serdes, which are slow to begin with. It is basically Dongjoon and Wenchen's suggestion. This turns char support into a compatibility feature only for some Hive tables that cannot be converted into Spark native data sources. This has confusing end-user behavior, because depending on whether a given Hive table is converted into a Spark native data source, we might or might not support the char type.

An extension to the above is to introduce padding for the char type across the board, and make char a first-class data type. There is a lot of work in introducing another data type, especially for one that has virtually no usage (https://trends.google.com/trends/explore?geo=US&q=hive%20char,hive%20string) and whose usage will likely continue to decline in the future (just reasoning from first principles based on why char was introduced in the first place).

Now I'm assuming it's a lot of work to do char properly. But if that is not the case (e.g. it's just a simple rule to insert padding at planning time), then maybe it's worth doing it this way. I'm totally OK with this too.
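
For instance, such a rule could rewrite every read of a CHAR(n) column c into rpad(c, n, ' ') at planning time. A sketch with Spark's existing rpad function (the automatic rewrite itself is hypothetical), run against the t3 table from the original summary:

    spark-sql> SELECT rpad(a, 3, ' '), length(rpad(a, 3, ' ')) FROM t3;
    a   3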

What I'd oppose is to just ban char for the native data sources, and do not have a plan to address this problem systematically.

2. Just forget about padding, like what Snowflake and MySQL have done. Document that char(x) is just an alias for string. And then move on. Almost no work needs to be done...
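
Under (2), CHAR(3) would behave exactly like STRING — a sketch of the implied semantics, which matches the native-source behavior already shown in this thread:

    spark-sql> CREATE TABLE t(a CHAR(3)) USING parquet;
    spark-sql> INSERT INTO t SELECT 'a ';
    spark-sql> SELECT a, length(a) FROM t;
    a 2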


Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for sharing and confirming.

We had better consider all the heterogeneous customers in the world. And I
also have experience with the non-negligible cases on-prem.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin <rx...@databricks.com> wrote:

> −User
>
> char barely showed up (honestly negligible). I was comparing select vs
> select.
>
>
>
> On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Ur, are you comparing the number of SELECT statement with TRIM and CREATE
>> statements with `CHAR`?
>>
>> > I looked up our usage logs (sorry I can't share this publicly) and trim
>> has at least four orders of magnitude higher usage than char.
>>
>> We need to discuss more about what to do. This thread is what I expected
>> exactly. :)
>>
>> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
>> it). I was merely pointing out that if we deviate away from SQL standard in
>> any way we are considered "wrong" or "incorrect". That argument itself is
>> flawed when plenty of other popular database systems also deviate away from
>> the standard on this specific behavior.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> BTW I'm not opposing us sticking to SQL standard (I'm in general for
>>> it). I was merely pointing out that if we deviate away from SQL standard in
>>> any way we are considered "wrong" or "incorrect". That argument itself is
>>> flawed when plenty of other popular database systems also deviate away from
>>> the standard on this specific behavior.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> I looked up our usage logs (sorry I can't share this publicly) and trim
>>>> has at least four orders of magnitude higher usage than char.
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com
>>>> > wrote:
>>>>
>>>>> Thank you, Stephen and Reynold.
>>>>>
>>>>> To Reynold.
>>>>>
>>>>> The way I see the following is a little different.
>>>>>
>>>>>       > CHAR is an undocumented data type without clearly defined
>>>>> semantics.
>>>>>
>>>>> Let me describe in Apache Spark User's View point.
>>>>>
>>>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql`
>>>>> function) at Apache Spark 1.x without much documentation. In addition,
>>>>> there still exists an effort which is trying to keep it in 3.0.0 age.
>>>>>
>>>>>        https://issues.apache.org/jira/browse/SPARK-31088
>>>>>        Add back HiveContext and createExternalTable
>>>>>
>>>>> Historically, we tried to make many SQL-based customer migrate their
>>>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>>>
>>>>> Although Apache Spark didn't have a good document about the
>>>>> inconsistent behavior among its data sources, Apache Hive has been
>>>>> providing its documentation and many customers rely the behavior.
>>>>>
>>>>>       -
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>>>
>>>>> At that time, frequently in on-prem Hadoop clusters by well-known
>>>>> vendors, many existing huge tables were created by Apache Hive, not Apache
>>>>> Spark. And, Apache Spark is used for boosting SQL performance with its
>>>>> *caching*. This was true because Apache Spark was added into the
>>>>> Hadoop-vendor products later than Apache Hive.
>>>>>
>>>>> Until the turning point at Apache Spark 2.0, we tried to catch up on
>>>>> features to be consistent, at least for Hive tables, across Apache Hive
>>>>> and Apache Spark, because the two SQL engines share the same tables.
>>>>>
>>>>> Regarding the following, technically, while Apache Hive hasn't changed
>>>>> its existing behavior in this part, Apache Spark has inevitably evolved
>>>>> by moving away from its original behaviors one by one.
>>>>>
>>>>>       >  the value is already fucked up
>>>>>
>>>>> The following is the change log.
>>>>>
>>>>>       - When we switched the default value of
>>>>> `convertMetastoreParquet`. (at Apache Spark 1.2)
>>>>>       - When we switched the default value of `convertMetastoreOrc`
>>>>> (at Apache Spark 2.4)
>>>>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table
>>>>> to `PARQUET` table at Apache Spark 3.0)
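>>>>>
>>>>> As a reference, a minimal sketch of how one of these legacy configs
>>>>> restores the Hive behavior (an illustrative session only; the table
>>>>> name is hypothetical):
>>>>>
>>>>>     spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
>>>>>     spark-sql> CREATE TABLE probe(a CHAR(3)) STORED AS PARQUET;
>>>>>     spark-sql> INSERT INTO TABLE probe SELECT 'a ';
>>>>>     spark-sql> SELECT a, length(a) FROM probe;
>>>>>     a   3    -- with the fallback; the converted path returns 2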
>>>>>
>>>>> To sum up, this has been a well-known issue in the community and among
>>>>> the customers.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <sc...@infomedia.com.au>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I’m kind of new around here, but I have had experience with all of
>>>>>> the so-called “big iron” databases such as Oracle, IBM DB2 and
>>>>>> Microsoft SQL Server, as well as PostgreSQL.
>>>>>>
>>>>>> They all support the notion of “ANSI padding” for CHAR columns -
>>>>>> which means that such columns are always space padded, and they default to
>>>>>> having this enabled (for ANSI compliance).
>>>>>>
>>>>>> MySQL also supports it, but it defaults to leaving it disabled for
>>>>>> historical reasons not unlike what we have here.
>>>>>>
>>>>>> In my opinion we should push toward standards compliance where
>>>>>> possible and then document where it cannot work.
>>>>>>
>>>>>> If users don’t like the padding on CHAR columns then they should
>>>>>> change to VARCHAR - I believe that was its purpose in the first place, and
>>>>>> it does not dictate any sort of “padding”.
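>>>>>>
>>>>>> For example, in an ANSI-padding database the distinction looks roughly
>>>>>> like this (a sketch only, with a hypothetical table; exact functions
>>>>>> and display vary by vendor):
>>>>>>
>>>>>>     CREATE TABLE t(c CHAR(3), v VARCHAR(3));
>>>>>>     INSERT INTO t VALUES ('a', 'a');
>>>>>>     SELECT c, v FROM t;  -- c comes back as 'a  ' (padded), v as 'a'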
>>>>>>
>>>>>> I can see why you might “ban” the use of CHAR columns where they
>>>>>> cannot be consistently supported, but VARCHAR is a different animal and I
>>>>>> would expect it to work consistently everywhere.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Steve C
>>>>>>
>>>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <do...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi, Reynold.
>>>>>> (And +Michael Armbrust)
>>>>>>
>>>>>> If you think so, do you think it's okay that we change the return
>>>>>> value silently? If so, I'm wondering why we reverted the `TRIM` functions.
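>>>>>>
>>>>>> For context, the two-argument TRIM is ambiguous about argument order,
>>>>>> so swapping the order silently changes results. A sketch of the
>>>>>> ambiguity only (not asserting which release used which order):
>>>>>>
>>>>>>     SELECT trim('xxhixx', 'x');
>>>>>>     -- read as (srcStr, trimStr): trim 'x' from 'xxhixx'       -> 'hi'
>>>>>>     -- read as (trimStr, srcStr): trim {'x','h','i'} from 'x'  -> ''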
>>>>>>
>>>>>> > Are we sure "not padding" is "incorrect"?
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
>>>>>> gourav.sengupta@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> 100% agree with Reynold.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav Sengupta
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>>>>
>>>>>>>> I don't know whether ANSI SQL actually requires padding, but plenty
>>>>>>>> of databases don't actually pad.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>>>>> "Snowflake currently deviates from common CHAR semantics in that strings
>>>>>>>> shorter than the maximum length are not space-padded at the end."
>>>>>>>>
>>>>>>>> MySQL:
>>>>>>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
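>>>>>>>>
>>>>>>>> e.g., per the Snowflake doc above, a quick sketch of the non-padding
>>>>>>>> behavior (hypothetical table name):
>>>>>>>>
>>>>>>>>     CREATE TABLE s(c CHAR(3));
>>>>>>>>     INSERT INTO s VALUES ('a');
>>>>>>>>     SELECT c, LENGTH(c) FROM s;  -- 'a', 1: not space-padded at the end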
>>>>>>>>
>>>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <
>>>>>>>> dongjoon.hyun@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, Reynold.
>>>>>>>>>
>>>>>>>>> Please see the following for the context.
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE
>>>>>>>>> TABLE syntax"
>>>>>>>>>
>>>>>>>>> I raised the above issue according to the new rubric, and banning
>>>>>>>>> the type was the proposed alternative to reduce the potential issue.
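>>>>>>>>>
>>>>>>>>> Concretely, the intent of the ban is roughly the following (a sketch,
>>>>>>>>> not the final error message or syntax):
>>>>>>>>>
>>>>>>>>>     spark-sql> CREATE TABLE t(a CHAR(3));   -- would be rejected
>>>>>>>>>     spark-sql> CREATE TABLE t(a STRING);    -- recommended native type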
>>>>>>>>>
>>>>>>>>> Please give us your opinion while it's still a PR.
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>>
>>>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the
>>>>>>>>>> hell out of both new and old users?
>>>>>>>>>>
>>>>>>>>>> For old users, their old code that was working for char(3) would
>>>>>>>>>> now stop working.
>>>>>>>>>>
>>>>>>>>>> For new users, it depends on whether the underlying metastore's
>>>>>>>>>> char(3) is either supported but different from ANSI SQL (which is not
>>>>>>>>>> that big of a deal if we explain it) or not supported at all.
>>>>>>>>>>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
−User

char barely showed up (honestly negligible). I was comparing SELECT vs SELECT, i.e. TRIM usage in SELECT statements vs CHAR usage in SELECT statements.

On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Ur, are you comparing the number of SELECT statements with TRIM and CREATE
> statements with `CHAR`?
> 
> > I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> We need to discuss more about what to do. This thread is exactly what I
> expected. :)
> 
> > BTW I'm not opposing us sticking to SQL standard (I'm in general for
> it). I was merely pointing out that if we deviate away from SQL standard
> in any way we are considered "wrong" or "incorrect". That argument itself
> is flawed when plenty of other popular database systems also deviate away
> from the standard on this specific behavior.
> 
> 
> Bests,
> Dongjoon.
> 

Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Ur, are you comparing the number of SELECT statements with TRIM and CREATE
statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim
has at least four orders of magnitude higher usage than char.

We need to discuss more about what to do. This thread is exactly what I
expected. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
I was merely pointing out that if we deviate away from SQL standard in any
way we are considered "wrong" or "incorrect". That argument itself is
flawed when plenty of other popular database systems also deviate away from
the standard on this specific behavior.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin <rx...@databricks.com> wrote:

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
> I was merely pointing out that if we deviate away from SQL standard in any
> way we are considered "wrong" or "incorrect". That argument itself is
> flawed when plenty of other popular database systems also deviate away from
> the standard on this specific behavior.

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from the standard on this specific behavior.

On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < rxin@databricks.com > wrote:

> 
> I looked up our usage logs (sorry I can't share this publicly) and trim
> has at least four orders of magnitude higher usage than char.
> 
> 
> 
> 
> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com
> ( dongjoon.hyun@gmail.com ) > wrote:
> 
>> Thank you, Stephen and Reynold.
>> 
>> 
>> To Reynold.
>> 
>> 
>> The way I see the following is a little different.
>> 
>> 
>>       > CHAR is an undocumented data type without clearly defined
>> semantics.
>> 
>> Let me describe in Apache Spark User's View point.
>> 
>> 
>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at
>> Apache Spark 1.x without much documentation. In addition, there still
>> exists an effort which is trying to keep it in 3.0.0 age.
>> 
>>        https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 (
>> https://issues.apache.org/jira/browse/SPARK-31088 )
>>        Add back HiveContext and createExternalTable
>> 
>> Historically, we tried to make many SQL-based customer migrate their
>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>> 
>> Although Apache Spark didn't have a good document about the inconsistent
>> behavior among its data sources, Apache Hive has been providing its
>> documentation and many customers rely the behavior.
>> 
>>       - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ LanguageManual+Types
>> ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types )
>> 
>> At that time, frequently in on-prem Hadoop clusters by well-known vendors,
>> many existing huge tables were created by Apache Hive, not Apache Spark.
>> And, Apache Spark is used for boosting SQL performance with its *caching*.
>> This was true because Apache Spark was added into the Hadoop-vendor
>> products later than Apache Hive.
>> 
>> 
>> Until the turning point at Apache Spark 2.0, we tried to catch up on more
>> features to stay consistent at least for Hive tables, because the two SQL
>> engines share the same tables.
>> 
>> For the following, technically, while Apache Hive hasn't changed its
>> existing behavior in this part, Apache Spark has inevitably evolved by moving
>> away from its original behaviors one-by-one.
>> 
>> 
>>       >  the value is already fucked up
>> 
>> 
>> The following is the change log.
>> 
>>       - When we switched the default value of `convertMetastoreParquet`.
>> (at Apache Spark 1.2)
>>       - When we switched the default value of `convertMetastoreOrc` (at
>> Apache Spark 2.4)
>>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
>> `PARQUET` table at Apache Spark 3.0)
>> 
>> To sum up, this has been a well-known issue in the community and among the
>> customers.
>> 
>> Bests,
>> Dongjoon.
>> 
>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@infomedia.com.au > wrote:
>> 
>> 
>>> Hi there,
>>> 
>>> 
>>> I’m kind of new around here, but I have had experience with all of the
>>> so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>>> Server, as well as PostgreSQL.
>>> 
>>> 
>>> They all support the notion of “ANSI padding” for CHAR columns - which
>>> means that such columns are always space padded, and they default to
>>> having this enabled (for ANSI compliance).
>>> 
>>> 
>>> MySQL also supports it, but it defaults to leaving it disabled for
>>> historical reasons not unlike what we have here.
>>> 
>>> 
>>> In my opinion we should push toward standards compliance where possible
>>> and then document where it cannot work.
>>> 
>>> 
>>> If users don’t like the padding on CHAR columns then they should change to
>>> VARCHAR - I believe that was its purpose in the first place, and it does
>>> not dictate any sort of “padding".
>>> 
>>> 
>>> I can see why you might “ban” the use of CHAR columns where they cannot be
>>> consistently supported, but VARCHAR is a different animal and I would
>>> expect it to work consistently everywhere.
>>> 
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Steve C
>>> 
>>> 
>>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>> 
>>>> Hi, Reynold.
>>>> (And +Michael Armbrust)
>>>> 
>>>> 
>>>> If you think so, do you think it's okay that we change the return value
>>>> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>>>> 
>>>> 
>>>> > Are we sure "not padding" is "incorrect"?
>>>> 
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> 
>>>> 
>>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav.sengupta@gmail.com > wrote:
>>>> 
>>>> 
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> 100% agree with Reynold.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>> 
>>>>> 
>>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@databricks.com > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Are we sure "not padding" is "incorrect"?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>>>>> databases don't actually pad.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>>> "Snowflake currently deviates from common CHAR semantics in that
>>>>>> strings shorter than the maximum length are not space-padded at the end."
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>>> 
>>>>>>> Hi, Reynold.
>>>>>>> 
>>>>>>> 
>>>>>>> Please see the following for the context.
>>>>>>> 
>>>>>>> 
>>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>> syntax"
>>>>>>> 
>>>>>>> 
>>>>>>> I raised the above issue according to the new rubric, and the banning was
>>>>>>> the proposed alternative to reduce the potential issue.
>>>>>>> 
>>>>>>> 
>>>>>>> Please give us your opinion since it's still a PR.
>>>>>>> 
>>>>>>> 
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>> 
>>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@databricks.com > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>>>>> of both new and old users?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> For old users, their old code that was working for char(3) would now stop
>>>>>>>> working. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> For new users, it depends on whether the underlying metastore char(3) is
>>>>>>>> either supported but different from ANSI SQL (which is not that big of a
>>>>>>>> deal if we explain it) or not supported.
>>>>>>>> 
>>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Hi, All.
>>>>>>>>> 
>>>>>>>>> Apache Spark has suffered from a known consistency issue on `CHAR`
>>>>>>>>> type behavior among its usages and configurations. However, the evolution
>>>>>>>>> direction has been gradually moving forward to be consistent inside Apache
>>>>>>>>> Spark because we don't have `CHAR` officially. The following is the
>>>>>>>>> summary.
>>>>>>>>> 
>>>>>>>>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different result.
>>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>>>>>> Hive behavior.)
>>>>>>>>> 
>>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>>>> 
>>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>>> 
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>     a   3
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>     a   3
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>     a 2
>>>>>>>>> 
>>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>>>>>>> behavior.)
>>>>>>>>> 
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>     a   3
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>     a 2
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>     a 2
>>>>>>>>> 
>>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>>>>>>>> consistent.
>>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>>>>> fallback to Hive behavior.)
>>>>>>>>> 
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>>     a 2
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>>     a 2
>>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>>     a 2
>>>>>>>>> 
>>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>>>>>>>> following syntax to be safe.
>>>>>>>>> 
>>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>>    https://github.com/apache/spark/pull/27902
>>>>>>>>> 
>>>>>>>>> This email is sent out to inform you based on the new policy we voted on.
>>>>>>>>> The recommendation is to always use Apache Spark's native type `String`.
>>>>>>>>> 
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>> 
>>>>>>>>> References:
>>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>>      https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>> syntax", 2019/12/06
>>>>>>>>>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.

On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Thank you, Stephen and Reynold.
> 
> 
> To Reynold.
> 
> 
> The way I see the following is a little different.
> 
> 
>       > CHAR is an undocumented data type without clearly defined
> semantics.
> 
> Let me describe it from an Apache Spark user's viewpoint.
> 
> 
> Apache Spark started to offer `HiveContext` (and the `hql/hiveql` functions) at
> Apache Spark 1.x without much documentation. In addition, there still
> exists an effort trying to keep it alive in the 3.0.0 era.
> 
>        https://issues.apache.org/jira/browse/SPARK-31088
>        Add back HiveContext and createExternalTable
> 
> Historically, we tried to make many SQL-based customers migrate their
> workloads from Apache Hive into Apache Spark through `HiveContext`.
> 
> Although Apache Spark didn't have good documentation about the inconsistent
> behavior among its data sources, Apache Hive has been providing its
> documentation and many customers rely on that behavior.
> 
>       - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
> 
> At that time, frequently in on-prem Hadoop clusters from well-known vendors,
> many existing huge tables had been created by Apache Hive, not Apache Spark,
> and Apache Spark was used for boosting SQL performance with its *caching*.
> This was true because Apache Spark was added into the Hadoop-vendor
> products later than Apache Hive.
> 
> 
> Until the turning point at Apache Spark 2.0, we tried to catch up on more
> features to stay consistent at least for Hive tables, because the two SQL
> engines share the same tables.
> 
> For the following, technically, while Apache Hive hasn't changed its
> existing behavior in this part, Apache Spark has inevitably evolved by moving
> away from its original behaviors one-by-one.
> 
> 
>       >  the value is already fucked up
> 
> 
> The following is the change log.
> 
>       - When we switched the default value of `convertMetastoreParquet`.
> (at Apache Spark 1.2)
>       - When we switched the default value of `convertMetastoreOrc` (at
> Apache Spark 2.4)
>       - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
> `PARQUET` table at Apache Spark 3.0)
> 
> To sum up, this has been a well-known issue in the community and among the
> customers.
> 
> Bests,
> Dongjoon.
> 
> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@infomedia.com.au > wrote:
> 
> 
>> Hi there,
>> 
>> 
>> I’m kind of new around here, but I have had experience with all of the
>> so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
>> Server, as well as PostgreSQL.
>> 
>> 
>> They all support the notion of “ANSI padding” for CHAR columns - which
>> means that such columns are always space padded, and they default to
>> having this enabled (for ANSI compliance).
>> 
>> 
>> MySQL also supports it, but it defaults to leaving it disabled for
>> historical reasons not unlike what we have here.
>> 
>> 
>> In my opinion we should push toward standards compliance where possible
>> and then document where it cannot work.
>> 
>> 
>> If users don’t like the padding on CHAR columns then they should change to
>> VARCHAR - I believe that was its purpose in the first place, and it does
>> not dictate any sort of “padding".
>> 
>> 
>> I can see why you might “ban” the use of CHAR columns where they cannot be
>> consistently supported, but VARCHAR is a different animal and I would
>> expect it to work consistently everywhere.
>> 
>> 
>> 
>> 
>> Cheers,
>> 
>> 
>> Steve C
>> 
>> 
>>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>> 
>>> Hi, Reynold.
>>> (And +Michael Armbrust)
>>> 
>>> 
>>> If you think so, do you think it's okay that we change the return value
>>> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>>> 
>>> 
>>> > Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> 
>>> 
>>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav.sengupta@gmail.com > wrote:
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> 
>>>> 100% agree with Reynold.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav Sengupta
>>>> 
>>>> 
>>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@databricks.com > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> Are we sure "not padding" is "incorrect"?
>>>>> 
>>>>> 
>>>>> 
>>>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>>>> databases don't actually pad.
>>>>> 
>>>>> 
>>>>> 
>>>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>>>> "Snowflake currently deviates from common CHAR semantics in that
>>>>> strings shorter than the maximum length are not space-padded at the end."
>>>>> 
>>>>> 
>>>>> 
>>>>> MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>> 
>>>>>> Hi, Reynold.
>>>>>> 
>>>>>> 
>>>>>> Please see the following for the context.
>>>>>> 
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>> syntax"
>>>>>> 
>>>>>> 
>>>>>> I raised the above issue according to the new rubric, and the banning was
>>>>>> the proposed alternative to reduce the potential issue.
>>>>>> 
>>>>>> 
>>>>>> Please give us your opinion since it's still a PR.
>>>>>> 
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@databricks.com > wrote:
>>>>>> 
>>>>>> 
>>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>>>> of both new and old users?
>>>>>>> 
>>>>>>> 
>>>>>>> For old users, their old code that was working for char(3) would now stop
>>>>>>> working. 
>>>>>>> 
>>>>>>> 
>>>>>>> For new users, it depends on whether the underlying metastore char(3) is
>>>>>>> either supported but different from ANSI SQL (which is not that big of a
>>>>>>> deal if we explain it) or not supported.
>>>>>>> 
>>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> Hi, All.
>>>>>>>> 
>>>>>>>> Apache Spark has suffered from a known consistency issue on `CHAR`
>>>>>>>> type behavior among its usages and configurations. However, the evolution
>>>>>>>> direction has been gradually moving forward to be consistent inside Apache
>>>>>>>> Spark because we don't have `CHAR` officially. The following is the
>>>>>>>> summary.
>>>>>>>> 
>>>>>>>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives the following different result.
>>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>>>>> Hive behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>>> 
>>>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>>>>>> behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a   3
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>>>>>>> consistent.
>>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>>>> fallback to Hive behavior.)
>>>>>>>> 
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>>>     a 2
>>>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>>>     a 2
>>>>>>>> 
>>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>>>>>>> following syntax to be safe.
>>>>>>>> 
>>>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>>>    https://github.com/apache/spark/pull/27902
>>>>>>>> 
>>>>>>>> This email is sent out to inform you based on the new policy we voted on.
>>>>>>>> The recommendation is to always use Apache Spark's native type `String`.
>>>>>>>> 
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>> 
>>>>>>>> References:
>>>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>>>      https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> syntax", 2019/12/06
>>>>>>>>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

      > CHAR is an undocumented data type without clearly defined semantics.

Let me describe it from an Apache Spark user's viewpoint.

Apache Spark started to offer `HiveContext` (and the `hql/hiveql` functions) at
Apache Spark 1.x without much documentation. In addition, there still
exists an effort trying to keep it alive in the 3.0.0 era.

       https://issues.apache.org/jira/browse/SPARK-31088
       Add back HiveContext and createExternalTable

Historically, we tried to make many SQL-based customers migrate their
workloads from Apache Hive into Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent
behavior among its data sources, Apache Hive has been providing its
documentation and many customers rely on that behavior.

      - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, frequently in on-prem Hadoop clusters from well-known vendors,
many existing huge tables had been created by Apache Hive, not Apache Spark,
and Apache Spark was used for boosting SQL performance with its *caching*.
This was true because Apache Spark was added into the Hadoop-vendor
products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on more
features to stay consistent at least for Hive tables, because the two SQL
engines share the same tables.

For the following, technically, while Apache Hive hasn't changed its
existing behavior in this part, Apache Spark has inevitably evolved by moving
away from its original behaviors one-by-one.

      >  the value is already fucked up

The following is the change log; a minimal config sketch follows the list.

      - When we switched the default value of `convertMetastoreParquet`.
(at Apache Spark 1.2)
      - When we switched the default value of `convertMetastoreOrc` (at
Apache Spark 2.4)
      - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
`PARQUET` table at Apache Spark 3.0)
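
For anyone pinning down one behavior while migrating, the legacy
configurations named above can be set per session. A minimal spark-sql
sketch (whether a given config is honored mid-session may vary by version,
so treat this as a sketch rather than a recipe; the table name `t` is
illustrative):

    spark-sql> -- opt back into the Hive-compatible (padded) behavior:
    spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
    spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
    spark-sql> -- or sidestep the inconsistency entirely with the native type:
    spark-sql> CREATE TABLE t(a STRING);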

To sum up, this has been a well-known issue in the community and among the
customers.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy <sc...@infomedia.com.au> wrote:

> Hi there,
>
> I’m kind of new around here, but I have had experience with all of the
> so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
> Server, as well as PostgreSQL.
>
> They all support the notion of “ANSI padding” for CHAR columns - which
> means that such columns are always space padded, and they default to having
> this enabled (for ANSI compliance).
>
> MySQL also supports it, but it defaults to leaving it disabled for
> historical reasons not unlike what we have here.
>
> In my opinion we should push toward standards compliance where possible
> and then document where it cannot work.
>
> If users don’t like the padding on CHAR columns then they should change to
> VARCHAR - I believe that was its purpose in the first place, and it does
> not dictate any sort of “padding".
>
> I can see why you might “ban” the use of CHAR columns where they cannot be
> consistently supported, but VARCHAR is a different animal and I would
> expect it to work consistently everywhere.
>
>
> Cheers,
>
> Steve C
>
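
To make the CHAR/VARCHAR distinction above concrete, here is a minimal
sketch of the strict ANSI-padding semantics being described. It is generic
SQL under the assumption of an ANSI-padding engine; the table name
`pad_demo` is illustrative, and the commented results show the ANSI outcome
only (as the Snowflake and MySQL links further down the thread show,
individual products may differ):

    CREATE TABLE pad_demo(c CHAR(3), v VARCHAR(3));
    INSERT INTO pad_demo VALUES ('a', 'a');
    SELECT c, v FROM pad_demo;
    -- ANSI padding: c comes back as 'a  ' (blank-padded to the declared width 3),
    -- while v comes back as 'a' (VARCHAR stores the value as-is, no padding).
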
> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
> Hi, Reynold.
> (And +Michael Armbrust)
>
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>
> > Are we sure "not padding" is "incorrect"?
>
> Bests,
> Dongjoon.
>
>
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
>
>> Hi,
>>
>> 100% agree with Reynold.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Are we sure "not padding" is "incorrect"?
>>>
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>>
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>> "Snowflake currently deviates from common CHAR semantics in that strings
>>> shorter than the maximum length are not space-padded at the end."
>>>
>>> MySQL:
>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, Reynold.
>>>>
>>>> Please see the following for the context.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>> syntax"
>>>>
>>>> I raised the above issue according to the new rubric, and the banning
>>>> was the proposed alternative to reduce the potential issue.
>>>>
>>>> Please give us your opinion since it's still PR.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell
>>>>> out of both new and old users?
>>>>>
>>>>> For old users, their old code that was working for char(3) would now
>>>>> stop working.
>>>>>
>>>>> For new users, depending on whether the underlying metastore char(3)
>>>>> is either supported but different from ansi Sql (which is not that big of a
>>>>> deal if we explain it) or not supported.
>>>>>
>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Apache Spark has been suffered from a known consistency issue on
>>>>>> `CHAR` type behavior among its usages and configurations. However, the
>>>>>> evolution direction has been gradually moving forward to be consistent
>>>>>> inside Apache Spark because we don't have `CHAR` offically. The following
>>>>>> is the summary.
>>>>>>
>>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different
>>>>>> result.
>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback
>>>>>> to Hive behavior.)
>>>>>>
>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>>
>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>>
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>>
>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
>>>>>> Hive behavior.)
>>>>>>
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>>
>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>>>> became consistent.
>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>> fallback to Hive behavior.)
>>>>>>
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>>
>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
>>>>>> the following syntax to be safe.
>>>>>>
>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>>
>>>>>> This email is sent out to inform you based on the new policy we voted.
>>>>>> The recommendation is always using Apache Spark's native type
>>>>>> `String`.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> References:
>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>
>>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>>>> TABLE syntax", 2019/12/06
>>>>>>
>>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>>
>>>>>
>>>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Stephen Coy <sc...@infomedia.com.au.INVALID>.
Hi there,

I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server as well as PostgreSQL.

They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always space padded, and they default to having this enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to VARCHAR - I believe that was its purpose in the first place, and it does not dictate any sort of “padding”.
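
As a minimal sketch of that distinction (hypothetical table name; assuming an engine with ANSI padding enabled, as in the databases above):

    CREATE TABLE pad_demo(c CHAR(5), v VARCHAR(5));
    INSERT INTO pad_demo VALUES ('abc', 'abc');
    -- With ANSI padding, c is stored space-padded as 'abc  ' (length 5),
    -- while v keeps exactly 'abc' (length 3); no padding is added.
    SELECT c, v FROM pad_demo;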

I can see why you might “ban” the use of CHAR columns where they cannot be consistently supported, but VARCHAR is a different animal and I would expect it to work consistently everywhere.


Cheers,

Steve C

On 17 Mar 2020, at 10:01 am, Dongjoon Hyun <do...@gmail.com>> wrote:

Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value silently? Then, I'm wondering why we reverted `TRIM` functions then?

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <go...@gmail.com>> wrote:
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com>> wrote:

Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html : "Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end."

MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql








On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <do...@gmail.com>> wrote:
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax"

I raised the above issue according to the new rubric, and the banning was the proposed alternative to reduce the potential issue.

Please give us your opinion since it's still PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com>> wrote:
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users?

For old users, their old code that was working for char(3) would now stop working.

For new users, depending on whether the underlying metastore char(3) is either supported but different from ansi Sql (which is not that big of a deal if we explain it) or not supported.

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>> wrote:
Hi, All.

Apache Spark has been suffered from a known consistency issue on `CHAR` type behavior among its usages and configurations. However, the evolution direction has been gradually moving forward to be consistent inside Apache Spark because we don't have `CHAR` offically. The following is the summary.

With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive behavior.)

    spark-sql> CREATE TABLE t1(a CHAR(3));
    spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
    spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

    spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
    spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a   3
    spark-sql> SELECT a, length(a) FROM t3;
    a 2

Since 2.4.0, `STORED AS ORC` became consistent.
(`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a   3
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2

Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became consistent.
(`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a fallback to Hive behavior.)

    spark-sql> SELECT a, length(a) FROM t1;
    a 2
    spark-sql> SELECT a, length(a) FROM t2;
    a 2
    spark-sql> SELECT a, length(a) FROM t3;
    a 2

In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the following syntax to be safe.

    CREATE TABLE t(a CHAR(3));
    https://github.com/apache/spark/pull/27902

This email is sent out to inform you based on the new policy we voted.
The recommendation is always using Apache Spark's native type `String`.

Bests,
Dongjoon.

References:
1. "CHAR implementation?", 2017/09/15
     https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", 2019/12/06
    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E


Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value
silently? If so, I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> 100% agree with Reynold.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com> wrote:
>
>> Are we sure "not padding" is "incorrect"?
>>
>> I don't know whether ANSI SQL actually requires padding, but plenty of
>> databases don't actually pad.
>>
>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>> "Snowflake currently deviates from common CHAR semantics in that strings
>> shorter than the maximum length are not space-padded at the end."
>>
>> MySQL:
>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <do...@gmail.com>
>> wrote:
>>
>>> Hi, Reynold.
>>>
>>> Please see the following for the context.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31136
>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax"
>>>
>>> I raised the above issue according to the new rubric, and the banning
>>> was the proposed alternative to reduce the potential issue.
>>>
>>> Please give us your opinion since it's still PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell
>>>> out of both new and old users?
>>>>
>>>> For old users, their old code that was working for char(3) would now
>>>> stop working.
>>>>
>>>> For new users, depending on whether the underlying metastore char(3) is
>>>> either supported but different from ansi Sql (which is not that big of a
>>>> deal if we explain it) or not supported.
>>>>
>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Apache Spark has been suffered from a known consistency issue on
>>>>> `CHAR` type behavior among its usages and configurations. However, the
>>>>> evolution direction has been gradually moving forward to be consistent
>>>>> inside Apache Spark because we don't have `CHAR` offically. The following
>>>>> is the summary.
>>>>>
>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different
>>>>> result.
>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>
>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>>> became consistent.
>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>> fallback to Hive behavior.)
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
>>>>> the following syntax to be safe.
>>>>>
>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>     https://github.com/apache/spark/pull/27902
>>>>>
>>>>> This email is sent out to inform you based on the new policy we voted.
>>>>> The recommendation is always using Apache Spark's native type `String`.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> References:
>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>
>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>>> TABLE syntax", 2019/12/06
>>>>>
>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>
>>>>
>>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <rx...@databricks.com> wrote:

> Are we sure "not padding" is "incorrect"?
>
> I don't know whether ANSI SQL actually requires padding, but plenty of
> databases don't actually pad.
>
> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
> "Snowflake currently deviates from common CHAR semantics in that strings
> shorter than the maximum length are not space-padded at the end."
>
> MySQL:
> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>
>
>
>
>
>
>
>
> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, Reynold.
>>
>> Please see the following for the context.
>>
>> https://issues.apache.org/jira/browse/SPARK-31136
>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax"
>>
>> I raised the above issue according to the new rubric, and the banning was
>> the proposed alternative to reduce the potential issue.
>>
>> Please give us your opinion since it's still PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com> wrote:
>>
>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>> of both new and old users?
>>>
>>> For old users, their old code that was working for char(3) would now
>>> stop working.
>>>
>>> For new users, depending on whether the underlying metastore char(3) is
>>> either supported but different from ansi Sql (which is not that big of a
>>> deal if we explain it) or not supported.
>>>
>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Apache Spark has been suffered from a known consistency issue on `CHAR`
>>>> type behavior among its usages and configurations. However, the evolution
>>>> direction has been gradually moving forward to be consistent inside Apache
>>>> Spark because we don't have `CHAR` offically. The following is the summary.
>>>>
>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>> Hive behavior.)
>>>>
>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>
>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>> behavior.)
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a   3
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>> became consistent.
>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>> fallback to Hive behavior.)
>>>>
>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>     a 2
>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>     a 2
>>>>
>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
>>>> the following syntax to be safe.
>>>>
>>>>     CREATE TABLE t(a CHAR(3));
>>>>     https://github.com/apache/spark/pull/27902
>>>>
>>>> This email is sent out to inform you based on the new policy we voted.
>>>> The recommendation is always using Apache Spark's native type `String`.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> References:
>>>> 1. "CHAR implementation?", 2017/09/15
>>>>
>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>> TABLE syntax", 2019/12/06
>>>>
>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>
>>>
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html : "Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end."

MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
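
For instance, the MySQL behavior referenced above can be checked with a quick sketch (hypothetical table name; assuming the default sql_mode, where trailing pad spaces are stripped on retrieval):

    CREATE TABLE char_demo(c CHAR(5));
    INSERT INTO char_demo VALUES ('abc');
    -- MySQL right-pads to 'abc  ' in storage but strips trailing spaces
    -- on read by default, so this returns 3 rather than 5:
    SELECT CHAR_LENGTH(c) FROM char_demo;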

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:

> 
> Hi, Reynold.
> 
> 
> Please see the following for the context.
> 
> 
> https://issues.apache.org/jira/browse/SPARK-31136
> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax"
> 
> 
> I raised the above issue according to the new rubric, and the banning was
> the proposed alternative to reduce the potential issue.
> 
> 
> Please give us your opinion since it's still PR.
> 
> 
> Bests,
> Dongjoon.
> 
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rxin@databricks.com> wrote:
> 
> 
>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>> of both new and old users?
>> 
>> 
>> For old users, their old code that was working for char(3) would now stop
>> working. 
>> 
>> 
>> For new users, depending on whether the underlying metastore char(3) is
>> either supported but different from ansi Sql (which is not that big of a
>> deal if we explain it) or not supported. 
>> 
>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>> 
>> 
>>> Hi, All.
>>> 
>>> Apache Spark has been suffered from a known consistency issue on `CHAR`
>>> type behavior among its usages and configurations. However, the evolution
>>> direction has been gradually moving forward to be consistent inside Apache
>>> Spark because we don't have `CHAR` offically. The following is the
>>> summary.
>>> 
>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>> Hive behavior.)
>>> 
>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>> 
>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>> behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>> consistent.
>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>> following syntax to be safe.
>>> 
>>>     CREATE TABLE t(a CHAR(3));
>>>     https://github.com/apache/spark/pull/27902
>>> 
>>> This email is sent out to inform you based on the new policy we voted.
>>> The recommendation is always using Apache Spark's native type `String`.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> References:
>>> 1. "CHAR implementation?", 2017/09/15
>>>      https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>>     https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> 
>> 
>> 
> 
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html ( https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#:~:text=CHAR%20%2C%20CHARACTER,(1)%20is%20the%20default.&text=Snowflake%20currently%20deviates%20from%20common,space%2Dpadded%20at%20the%20end. ) : "Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end."

MySQL: https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Hi, Reynold.
> 
> 
> Please see the following for the context.
> 
> 
> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 (
> https://issues.apache.org/jira/browse/SPARK-31136 )
> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax"
> 
> 
> I raised the above issue according to the new rubric, and the banning was
> the proposed alternative to reduce the potential issue.
> 
> 
> Please give us your opinion since it's still PR.
> 
> 
> Bests,
> Dongjoon.
> 
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com (
> rxin@databricks.com ) > wrote:
> 
> 
>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>> of both new and old users?
>> 
>> 
>> For old users, their old code that was working for char(3) would now stop
>> working. 
>> 
>> 
>> For new users, depending on whether the underlying metastore char(3) is
>> either supported but different from ansi Sql (which is not that big of a
>> deal if we explain it) or not supported. 
>> 
>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com
>> ( dongjoon.hyun@gmail.com ) > wrote:
>> 
>> 
>>> Hi, All.
>>> 
>>> Apache Spark has been suffered from a known consistency issue on `CHAR`
>>> type behavior among its usages and configurations. However, the evolution
>>> direction has been gradually moving forward to be consistent inside Apache
>>> Spark because we don't have `CHAR` offically. The following is the
>>> summary.
>>> 
>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>> Hive behavior.)
>>> 
>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>> 
>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>> behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>> consistent.
>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>> following syntax to be safe.
>>> 
>>>     CREATE TABLE t(a CHAR(3));
>>>     https://github.com/apache/spark/pull/27902
>>> 
>>> This email is sent out to inform you, per the new policy we voted on.
>>> The recommendation is to always use Apache Spark's native type `String`.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> References:
>>> 1. "CHAR implementation?", 2017/09/15
>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax", 2019/12/06
>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> 
>> 
>> 
> 
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax"

I raised the above issue according to the new rubric, and the ban was
proposed as an alternative to reduce the potential issues.

Please give us your opinion since it's still a PR under review.
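
For context, here is a minimal sketch of the default-provider difference at stake (illustrative and untested; the config name and expected lengths come from the summary quoted below, and the table names are made up):

    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;
    spark-sql> CREATE TABLE h(a CHAR(3));
    spark-sql> INSERT INTO TABLE h SELECT 'a ';
    spark-sql> SELECT a, length(a) FROM h;   -- Hive table: padded, so 3
    spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=false;
    spark-sql> CREATE TABLE n(a CHAR(3));
    spark-sql> INSERT INTO TABLE n SELECT 'a ';
    spark-sql> SELECT a, length(a) FROM n;   -- native table: not padded, so 2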

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin <rx...@databricks.com> wrote:

> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
> of both new and old users?
>
> For old users, their old code that was working for char(3) would now stop
> working.
>
> For new users, it depends on whether the underlying metastore's char(3)
> is supported but different from ANSI SQL (which is not that big of a
> deal if we explain it) or not supported at all.
>
> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Apache Spark has suffered from a known consistency issue on `CHAR`
>> type behavior among its usages and configurations. However, the evolution
>> direction has been gradually moving toward consistency inside Apache
>> Spark because we don't have `CHAR` officially. The following is the summary.
>>
>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following different result.
>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>> Hive behavior.)
>>
>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> Since 2.4.0, `STORED AS ORC` became consistent.
>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>> behavior.)
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a   3
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>> consistent.
>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>> fallback to Hive behavior.)
>>
>>     spark-sql> SELECT a, length(a) FROM t1;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t2;
>>     a 2
>>     spark-sql> SELECT a, length(a) FROM t3;
>>     a 2
>>
>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>> following syntax to be safe.
>>
>>     CREATE TABLE t(a CHAR(3));
>>     https://github.com/apache/spark/pull/27902
>>
>> This email is sent out to inform you, per the new policy we voted on.
>> The recommendation is to always use Apache Spark's native type `String`.
>>
>> Bests,
>> Dongjoon.
>>
>> References:
>> 1. "CHAR implementation?", 2017/09/15
>>
>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax", 2019/12/06
>>
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>
>

Re: FYI: The evolution on `CHAR` type behavior

Posted by Reynold Xin <rx...@databricks.com>.
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of
both new and old users?

For old users, their old code that was working for char(3) would now stop
working.

For new users, it depends on whether the underlying metastore's char(3)
is supported but different from ANSI SQL (which is not that big of a
deal if we explain it) or not supported at all.
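
To make the first concern concrete, a hypothetical sketch (the exact failure mode is whatever SPARK-31147, quoted below, ends up implementing):

    spark-sql> CREATE TABLE t(a CHAR(3));
    -- today: accepted, with CHAR handling varying by provider and config
    -- under the proposed ban: the statement itself would be rejected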

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> Apache Spark has suffered from a known consistency issue on `CHAR`
> type behavior among its usages and configurations. However, the evolution
> direction has been gradually moving toward consistency inside Apache
> Spark because we don't have `CHAR` officially. The following is the summary.
>
> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces the following different result.
> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
> Hive behavior.)
>
>     spark-sql> CREATE TABLE t1(a CHAR(3));
>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>
>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>
>     spark-sql> SELECT a, length(a) FROM t1;
>     a   3
>     spark-sql> SELECT a, length(a) FROM t2;
>     a   3
>     spark-sql> SELECT a, length(a) FROM t3;
>     a 2
>
> Since 2.4.0, `STORED AS ORC` became consistent.
> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
> behavior.)
>
>     spark-sql> SELECT a, length(a) FROM t1;
>     a   3
>     spark-sql> SELECT a, length(a) FROM t2;
>     a 2
>     spark-sql> SELECT a, length(a) FROM t3;
>     a 2
>
> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
> consistent.
> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
> fallback to Hive behavior.)
>
>     spark-sql> SELECT a, length(a) FROM t1;
>     a 2
>     spark-sql> SELECT a, length(a) FROM t2;
>     a 2
>     spark-sql> SELECT a, length(a) FROM t3;
>     a 2
>
> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
> following syntax to be safe.
>
>     CREATE TABLE t(a CHAR(3));
>     https://github.com/apache/spark/pull/27902
>
> This email is sent out to inform you, per the new policy we voted on.
> The recommendation is to always use Apache Spark's native type `String`.
>
> Bests,
> Dongjoon.
>
> References:
> 1. "CHAR implementation?", 2017/09/15
>
> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>
> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>
