Posted to user@spark.apache.org by Venkatesan Muniappan <m....@gmail.com> on 2022/03/11 17:58:36 UTC
Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
hi Spark Team,
I have raised a question about Spark on Stack Overflow. When you get a
chance, can you please take a look and help me?
https://stackoverflow.com/q/71431757/5927843
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
hi,
Does anybody have a better suggestion for my problem?
Thanks,
Venkat
2016173438
On Fri, Mar 11, 2022 at 4:43 PM Venkatesan Muniappan <
m.venkatbeece@gmail.com> wrote:
> ok. I work for an org where such upgrades take a few months. Not an
> immediate task.
>
> Thanks,
> Venkat
> 2016173438
>
>
> On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Yes, in Spark 3.1.1. Best to upgrade to Spark 3+.
>>
>>
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <
>> m.venkatbeece@gmail.com> wrote:
>>
>>> Thank you. I am trying to get the table definitions for existing
>>> tables. BTW, the create and show commands that you executed, were they on
>>> Spark 3.x?
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>>
>>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Well I do not know what has changed. However, this should not affect
>>>> your work.
>>>>
>>>>
>>>> Try to create table in Spark
>>>>
>>>>
>>>> sqltext: String =
>>>>
>>>> CREATE TABLE if not exists test.etcs(
>>>>
>>>> ID INT
>>>>
>>>> , CLUSTERED INT
>>>>
>>>> , SCATTERED INT
>>>>
>>>> , RANDOMISED INT
>>>>
>>>> , RANDOM_STRING VARCHAR(50)
>>>>
>>>> , SMALL_VC VARCHAR(10)
>>>>
>>>> , PADDING VARCHAR(4000)
>>>>
>>>> , PADDING2 STRING
>>>>
>>>> )
>>>>
>>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>>
>>>> STORED AS PARQUET
>>>>
>>>> TBLPROPERTIES (
>>>>
>>>> "parquet.compress"="SNAPPY"
>>>>
>>>> )
>>>>
>>>>
>>>> scala> spark.sql (sqltext)
>>>>
>>>> scala> spark.sql("show create table test.etcs").show(false)
>>>>
>>>>
>>>> +----------------------------------------------------+
>>>> |createtab_stmt                                      |
>>>> +----------------------------------------------------+
>>>> |CREATE TABLE `test`.`etcs` (
>>>>   `ID` INT,
>>>>   `CLUSTERED` INT,
>>>>   `SCATTERED` INT,
>>>>   `RANDOMISED` INT,
>>>>   `RANDOM_STRING` VARCHAR(50),
>>>>   `SMALL_VC` VARCHAR(10),
>>>>   `PADDING` VARCHAR(4000),
>>>>   `PADDING2` STRING)
>>>> USING parquet
>>>> CLUSTERED BY (ID)
>>>> INTO 256 BUCKETS
>>>> TBLPROPERTIES (
>>>>   'transient_lastDdlTime' = '1647033659',
>>>>   'parquet.compress' = 'SNAPPY')|
>>>> +----------------------------------------------------+
>>>>
>>>>
>>>> Note that columns are OK.
>>>>
>>>>
>>>> Also check this link for the differences between CHAR, VARCHAR and
>>>> STRING types in Hive
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>>>> m.venkatbeece@gmail.com> wrote:
>>>>
>>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>>> VARCHAR and CHAR show properly in Spark 3. Do you know what changed
>>>>> in Spark 3 that made this possible? Or how can I achieve the same output
>>>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>>
>>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Hive 3.1.1
>>>>>> Spark 3.1.1
>>>>>>
>>>>>> Your Stack Overflow question says, and I quote:
>>>>>>
>>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>>>> since it is faster compared to beeline. One downside of using spark for
>>>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>>>> as String and it doesn't preserve the length information that goes with
>>>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>>>
>>>>>> Given below the sample create table command and its show create table
>>>>>> output."
>>>>>>
>>>>>> Create a simple table in *Hive* in the test database:
>>>>>>
>>>>>> hive> *use test;*
>>>>>> OK
>>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>>> OK
>>>>>> hive> *desc formatted etc;*
>>>>>> # col_name data_type comment
>>>>>> *id bigint*
>>>>>> *col1 varchar(30)*
>>>>>> *col2 string*
>>>>>>
>>>>>> # Detailed Table Information
>>>>>> Database: test
>>>>>> OwnerType: USER
>>>>>> Owner: hduser
>>>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>>>> LastAccessTime: UNKNOWN
>>>>>> Retention: 0
>>>>>> Location:
>>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>>> Table Type: MANAGED_TABLE
>>>>>> Table Parameters:
>>>>>> COLUMN_STATS_ACCURATE
>>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>>> bucketing_version 2
>>>>>> numFiles 0
>>>>>> numRows 0
>>>>>> rawDataSize 0
>>>>>> totalSize 0
>>>>>> transient_lastDdlTime 1647023374
>>>>>>
>>>>>> # Storage Information
>>>>>> SerDe Library:
>>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>>>> OutputFormat:
>>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>>> Compressed: No
>>>>>> Num Buckets: -1
>>>>>> Bucket Columns: []
>>>>>> Sort Columns: []
>>>>>> Storage Desc Params:
>>>>>> serialization.format 1
>>>>>>
>>>>>> Now let's go to spark-shell:
>>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>>
>>>>>> +--------------------------------------------------+
>>>>>> |createtab_stmt                                    |
>>>>>> +--------------------------------------------------+
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>> *  `id` BIGINT,*
>>>>>> *  `col1` VARCHAR(30),*
>>>>>> *  `col2` STRING)*
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647023374')|
>>>>>> +--------------------------------------------------+
>>>>>>
>>>>>> You can see that Spark shows the columns correctly.
>>>>>>
>>>>>> Now let us create the same table in Hive through Beeline:
>>>>>>
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>>>
>>>>>> No rows affected (0.019 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>>>> col1 VARCHAR(30), col2 STRING)*
>>>>>>
>>>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>>
>>>>>> . . . . . . . . . . . . . . . . . . >
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | col_name                      | data_type                                          | comment    |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | # col_name                    | data_type                                          | comment    |
>>>>>> *| id                            | bigint                                             |            |*
>>>>>> *| col1                          | varchar(30)                                        |            |*
>>>>>> *| col2                          | string                                             |            |*
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Detailed Table Information  | NULL                                               | NULL       |
>>>>>> | Database:                     | test                                               | NULL       |
>>>>>> | OwnerType:                    | USER                                               | NULL       |
>>>>>> | Owner:                        | hduser                                             | NULL       |
>>>>>> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL       |
>>>>>> | LastAccessTime:               | UNKNOWN                                            | NULL       |
>>>>>> | Retention:                    | 0                                                  | NULL       |
>>>>>> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL       |
>>>>>> | Table Type:                   | MANAGED_TABLE                                      | NULL       |
>>>>>> | Table Parameters:             | NULL                                               | NULL       |
>>>>>> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>>>> |                               | bucketing_version                                  | 2          |
>>>>>> |                               | numFiles                                           | 0          |
>>>>>> |                               | numRows                                            | 0          |
>>>>>> |                               | rawDataSize                                        | 0          |
>>>>>> |                               | totalSize                                          | 0          |
>>>>>> |                               | transient_lastDdlTime                              | 1647024660 |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Storage Information         | NULL                                               | NULL       |
>>>>>> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL       |
>>>>>> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL       |
>>>>>> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
>>>>>> | Compressed:                   | No                                                 | NULL       |
>>>>>> | Num Buckets:                  | -1                                                 | NULL       |
>>>>>> | Bucket Columns:               | []                                                 | NULL       |
>>>>>> | Sort Columns:                 | []                                                 | NULL       |
>>>>>> | Storage Desc Params:          | NULL                                               | NULL       |
>>>>>> |                               | serialization.format                               | 1          |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>>
>>>>>> 33 rows selected (0.159 seconds)
>>>>>>
>>>>>> Now check that in spark-shell again
>>>>>>
>>>>>>
>>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>>
>>>>>>
>>>>>> +--------------------------------------------------+
>>>>>> |createtab_stmt                                    |
>>>>>> +--------------------------------------------------+
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>> *  `id` BIGINT,*
>>>>>> *  `col1` VARCHAR(30),*
>>>>>> *  `col2` STRING)*
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647024660')|
>>>>>> +--------------------------------------------------+
>>>>>>
>>>>>>
>>>>>> It shows OK. So, in summary, you get the column definitions in Spark
>>>>>> as you defined them in Hive.
>>>>>>
>>>>>>
>>>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>>>> Beeline 2.1.1", refers to older versions of Spark and Hive, which may
>>>>>> have had this issue.
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>>>> m.venkatbeece@gmail.com> wrote:
>>>>>>
>>>>>>> hi Spark Team,
>>>>>>>
>>>>>>> I have raised a question about Spark on Stack Overflow. When you
>>>>>>> get a chance, can you please take a look and help me?
>>>>>>>
>>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Venkat
>>>>>>> 2016173438
>>>>>>>
>>>>>>
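[Editor's note] For readers who, like the original poster, are stuck on Spark 2.4.x: Spark 2.x maps Hive CHAR/VARCHAR columns to StringType, but to my knowledge it keeps the raw Hive type string in each column's metadata (under a key such as HIVE_TYPE_STRING), so the length information is often still recoverable from `df.schema` without SHOW CREATE TABLE. The sketch below is illustrative only: the JSON is a hypothetical sample of what `spark.table("test.etc").schema.json()` might return for the example table, and the metadata key should be verified against your Spark build.

```python
import json

# Hypothetical schema JSON, mirroring what Spark 2.4's df.schema.json()
# might produce for test.etc. The HIVE_TYPE_STRING key is an assumption
# based on how Spark 2.x's Hive catalog is understood to store raw types.
schema_json = """
{"type":"struct","fields":[
  {"name":"id","type":"long","nullable":true,"metadata":{}},
  {"name":"col1","type":"string","nullable":true,
   "metadata":{"HIVE_TYPE_STRING":"varchar(30)"}},
  {"name":"col2","type":"string","nullable":true,"metadata":{}}]}
"""

def hive_column_types(schema: str) -> dict:
    """Return column -> type, preferring the preserved Hive type
    (e.g. varchar(30)) over Spark's mapped type (string)."""
    out = {}
    for field in json.loads(schema)["fields"]:
        out[field["name"]] = field["metadata"].get(
            "HIVE_TYPE_STRING", field["type"])
    return out

print(hive_column_types(schema_json))
# -> {'id': 'long', 'col1': 'varchar(30)', 'col2': 'string'}
```

With the types recovered this way, a DDL generator on Spark 2.4.x could emit `varchar(30)` instead of `string`, matching what Beeline reports.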
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
ok. I work for an org where such upgrades take a few months. Not an
immediate task.
Thanks,
Venkat
2016173438
On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mi...@gmail.com>
wrote:
> yes in spark 3.1.1. Best to upgrade it to spark 3+.
>
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <
> m.venkatbeece@gmail.com> wrote:
>
>> Thank you. I am trying to get the table definition for the existing
>> tables. BTW, the create and show command that you executed, was it on Spark
>> 3.x ? .
>>
>> Thanks,
>> Venkat
>> 2016173438
>>
>>
>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Well I do not know what has changed. However, this should not affect
>>> your work.
>>>
>>>
>>> Try to create table in Spark
>>>
>>>
>>> sqltext: String =
>>>
>>> CREATE TABLE if not exists test.etcs(
>>>
>>> ID INT
>>>
>>> , CLUSTERED INT
>>>
>>> , SCATTERED INT
>>>
>>> , RANDOMISED INT
>>>
>>> , RANDOM_STRING VARCHAR(50)
>>>
>>> , SMALL_VC VARCHAR(10)
>>>
>>> , PADDING VARCHAR(4000)
>>>
>>> , PADDING2 STRING
>>>
>>> )
>>>
>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>
>>> STORED AS PARQUET
>>>
>>> TBLPROPERTIES (
>>>
>>> "parquet.compress"="SNAPPY"
>>>
>>> )
>>>
>>>
>>> scala> spark.sql (sqltext)
>>>
>>> scala> spark.sql("show create table test.etcs").show(false)
>>>
>>>
>>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>
>>> |createtab_stmt
>>>
>>>
>>>
>>> |
>>>
>>>
>>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>
>>> |CREATE TABLE `test`.`etcs` (
>>>
>>> `ID` INT,
>>>
>>> `CLUSTERED` INT,
>>>
>>> `SCATTERED` INT,
>>>
>>> `RANDOMISED` INT,
>>>
>>> `RANDOM_STRING` VARCHAR(50),
>>>
>>> `SMALL_VC` VARCHAR(10),
>>>
>>> `PADDING` VARCHAR(4000),
>>>
>>> `PADDING2` STRING)
>>>
>>> USING parquet
>>>
>>> CLUSTERED BY (ID)
>>>
>>> INTO 256 BUCKETS
>>>
>>> TBLPROPERTIES (
>>>
>>> 'transient_lastDdlTime' = '1647033659',
>>>
>>> 'parquet.compress' = 'SNAPPY')
>>>
>>> |
>>>
>>>
>>> +------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> Note that columns are OK.
>>>
>>>
>>> Also check this link for the differences between CHAR, VARCHAR and
>>> STRING types in Hive
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>>> m.venkatbeece@gmail.com> wrote:
>>>
>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>> VARCHAR and CHAR are properly showing in Spark 3. Do you know what changed
>>>> in Spark 3 that made this possible?. Or how can I achieve the same output
>>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>>
>>>> Thanks,
>>>> Venkat
>>>> 2016173438
>>>>
>>>>
>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Hive 3.1.1
>>>>> Spark 3.1.1
>>>>>
>>>>> Your stack overflow issue raised and I quote:
>>>>>
>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>>> since it is faster compared to beeline. One downside of using spark for
>>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>>> as String and it doesn't preserve the length information that goes with
>>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>>
>>>>> Given below the sample create table command and its show create table
>>>>> output."
>>>>>
>>>>> Create a simple table in *Hive* in test database
>>>>>
>>>>> hive> *use test;*
>>>>> OK
>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>> OK
>>>>> hive> *desc formatted etc;*
>>>>> # col_name data_type comment
>>>>> *id bigint*
>>>>> *col1 varchar(30)*
>>>>> *col2 string*
>>>>>
>>>>> # Detailed Table Information
>>>>> Database: test
>>>>> OwnerType: USER
>>>>> Owner: hduser
>>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>>> LastAccessTime: UNKNOWN
>>>>> Retention: 0
>>>>> Location:
>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>> Table Type: MANAGED_TABLE
>>>>> Table Parameters:
>>>>> COLUMN_STATS_ACCURATE
>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>> bucketing_version 2
>>>>> numFiles 0
>>>>> numRows 0
>>>>> rawDataSize 0
>>>>> totalSize 0
>>>>> transient_lastDdlTime 1647023374
>>>>>
>>>>> # Storage Information
>>>>> SerDe Library:
>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>>> OutputFormat:
>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>> Compressed: No
>>>>> Num Buckets: -1
>>>>> Bucket Columns: []
>>>>> Sort Columns: []
>>>>> Storage Desc Params:
>>>>> serialization.format 1
>>>>>
>>>>> Now let's go to spark-shell
>>>>> ^
>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> |createtab_stmt
>>>>>
>>>>> |
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> |CREATE TABLE `test`.`etc` (
>>>>> * `id` BIGINT,*
>>>>> * `col1` VARCHAR(30),*
>>>>> * `col2` STRING)*
>>>>> USING text
>>>>> TBLPROPERTIES (
>>>>> 'bucketing_version' = '2',
>>>>> 'transient_lastDdlTime' = '1647023374')
>>>>> |
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> You can see Spark shows columns correctly
>>>>>
>>>>> Now let us go and create the same table in hive through beeline
>>>>>
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>>
>>>>> No rows affected (0.019 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>>> col1 VARCHAR(30), col2 STRING)*
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . >
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> | col_name | data_type
>>>>> | comment |
>>>>>
>>>>>
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> | # col_name | data_type
>>>>> | comment |
>>>>>
>>>>> *| id | bigint
>>>>> | |*
>>>>>
>>>>> *| col1 | varchar(30)
>>>>> | |*
>>>>>
>>>>> *| col2 | string
>>>>> | |*
>>>>>
>>>>> | | NULL
>>>>> | NULL |
>>>>>
>>>>> | # Detailed Table Information | NULL
>>>>> | NULL |
>>>>>
>>>>> | Database: | test
>>>>> | NULL |
>>>>>
>>>>> | OwnerType: | USER
>>>>> | NULL |
>>>>>
>>>>> | Owner: | hduser
>>>>> | NULL |
>>>>>
>>>>> | CreateTime: | Fri Mar 11 18:51:00 GMT 2022
>>>>> | NULL |
>>>>>
>>>>> | LastAccessTime: | UNKNOWN
>>>>> | NULL |
>>>>>
>>>>> | Retention: | 0
>>>>> | NULL |
>>>>>
>>>>> | Location: |
>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL
>>>>> |
>>>>>
>>>>> | Table Type: | MANAGED_TABLE
>>>>> | NULL |
>>>>>
>>>>> | Table Parameters: | NULL
>>>>> | NULL |
>>>>>
>>>>> | | COLUMN_STATS_ACCURATE
>>>>> |
>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>> |
>>>>>
>>>>> | | bucketing_version
>>>>> | 2 |
>>>>>
>>>>> | | numFiles
>>>>> | 0 |
>>>>>
>>>>> | | numRows
>>>>> | 0 |
>>>>>
>>>>> | | rawDataSize
>>>>> | 0 |
>>>>>
>>>>> | | totalSize
>>>>> | 0 |
>>>>>
>>>>> | | transient_lastDdlTime
>>>>> | 1647024660 |
>>>>>
>>>>> | | NULL
>>>>> | NULL |
>>>>>
>>>>> | # Storage Information | NULL
>>>>> | NULL |
>>>>>
>>>>> | SerDe Library: |
>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL
>>>>> |
>>>>>
>>>>> | InputFormat: |
>>>>> org.apache.hadoop.mapred.TextInputFormat | NULL
>>>>> |
>>>>>
>>>>> | OutputFormat: |
>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL
>>>>> |
>>>>>
>>>>> | Compressed: | No
>>>>> | NULL |
>>>>>
>>>>> | Num Buckets: | -1
>>>>> | NULL |
>>>>>
>>>>> | Bucket Columns: | []
>>>>> | NULL |
>>>>>
>>>>> | Sort Columns: | []
>>>>> | NULL |
>>>>>
>>>>> | Storage Desc Params: | NULL
>>>>> | NULL |
>>>>>
>>>>> | | serialization.format
>>>>> | 1 |
>>>>>
>>>>>
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> 33 rows selected (0.159 seconds)
>>>>>
>>>>> Now check that in spark-shell again
>>>>>
>>>>>
>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> |createtab_stmt
>>>>>
>>>>> |
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> |CREATE TABLE `test`.`etc` (
>>>>>
>>>>> * `id` BIGINT,*
>>>>>
>>>>> * `col1` VARCHAR(30),*
>>>>>
>>>>> * `col2` STRING)*
>>>>>
>>>>> USING text
>>>>>
>>>>> TBLPROPERTIES (
>>>>>
>>>>> 'bucketing_version' = '2',
>>>>>
>>>>> 'transient_lastDdlTime' = '1647024660')
>>>>>
>>>>> |
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> It shows OK. Soo in summary you get column definitions in Spark as
>>>>> you have defined them in Hive
>>>>>
>>>>>
>>>>> In your statement above and I quote "I am using Spark 2.4.1 and
>>>>> Beeline 2.1.1", refers to older versions of Spark and hive which may
>>>>> have had such issues.
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>> view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>>> m.venkatbeece@gmail.com> wrote:
>>>>>
>>>>>> hi Spark Team,
>>>>>>
>>>>>> I have raised a question on Spark through Stackoverflow. When you get
>>>>>> a chance, can you please take a look and help me ?.
>>>>>>
>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>
>>>>>> Thanks,
>>>>>> Venkat
>>>>>> 2016173438
>>>>>>
>>>>>
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
yes in spark 3.1.1. Best to upgrade it to spark 3+.
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <m....@gmail.com>
wrote:
> Thank you. I am trying to get the table definition for the existing
> tables. BTW, the create and show command that you executed, was it on Spark
> 3.x ? .
>
> Thanks,
> Venkat
> 2016173438
>
>
> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Well I do not know what has changed. However, this should not affect your
>> work.
>>
>>
>> Try to create table in Spark
>>
>>
>> sqltext: String =
>>
>> CREATE TABLE if not exists test.etcs(
>>
>> ID INT
>>
>> , CLUSTERED INT
>>
>> , SCATTERED INT
>>
>> , RANDOMISED INT
>>
>> , RANDOM_STRING VARCHAR(50)
>>
>> , SMALL_VC VARCHAR(10)
>>
>> , PADDING VARCHAR(4000)
>>
>> , PADDING2 STRING
>>
>> )
>>
>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>
>> STORED AS PARQUET
>>
>> TBLPROPERTIES (
>>
>> "parquet.compress"="SNAPPY"
>>
>> )
>>
>>
>> scala> spark.sql (sqltext)
>>
>> scala> spark.sql("show create table test.etcs").show(false)
>>
>>
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>
>> |createtab_stmt
>>
>>
>>
>> |
>>
>>
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>
>> |CREATE TABLE `test`.`etcs` (
>>
>> `ID` INT,
>>
>> `CLUSTERED` INT,
>>
>> `SCATTERED` INT,
>>
>> `RANDOMISED` INT,
>>
>> `RANDOM_STRING` VARCHAR(50),
>>
>> `SMALL_VC` VARCHAR(10),
>>
>> `PADDING` VARCHAR(4000),
>>
>> `PADDING2` STRING)
>>
>> USING parquet
>>
>> CLUSTERED BY (ID)
>>
>> INTO 256 BUCKETS
>>
>> TBLPROPERTIES (
>>
>> 'transient_lastDdlTime' = '1647033659',
>>
>> 'parquet.compress' = 'SNAPPY')
>>
>> |
>>
>>
>> +------------------------------------------------------------------------------------------------------------------
>>
>>
>> Note that columns are OK.
>>
>>
>> Also check this link for the differences between CHAR, VARCHAR and STRING
>> types in Hive
>>
>>
>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>
>>
>> HTH
>>
>>
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>> m.venkatbeece@gmail.com> wrote:
>>
>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>> VARCHAR and CHAR are properly showing in Spark 3. Do you know what changed
>>> in Spark 3 that made this possible?. Or how can I achieve the same output
>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>>
>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hive 3.1.1
>>>> Spark 3.1.1
>>>>
>>>> Your stack overflow issue raised and I quote:
>>>>
>>>> "I have a need to generate DDL statements for Hive tables & views
>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>> since it is faster compared to beeline. One downside of using spark for
>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>> as String and it doesn't preserve the length information that goes with
>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>
>>>> Given below the sample create table command and its show create table
>>>> output."
>>>>
>>>> Create a simple table in *Hive* in test database
>>>>
>>>> hive> *use test;*
>>>> OK
>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>> OK
>>>> hive> *desc formatted etc;*
>>>> # col_name data_type comment
>>>> *id bigint*
>>>> *col1 varchar(30)*
>>>> *col2 string*
>>>>
>>>> # Detailed Table Information
>>>> Database: test
>>>> OwnerType: USER
>>>> Owner: hduser
>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>> LastAccessTime: UNKNOWN
>>>> Retention: 0
>>>> Location:
>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>> Table Type: MANAGED_TABLE
>>>> Table Parameters:
>>>> COLUMN_STATS_ACCURATE
>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>> bucketing_version 2
>>>> numFiles 0
>>>> numRows 0
>>>> rawDataSize 0
>>>> totalSize 0
>>>> transient_lastDdlTime 1647023374
>>>>
>>>> # Storage Information
>>>> SerDe Library:
>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>> OutputFormat:
>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>> Compressed: No
>>>> Num Buckets: -1
>>>> Bucket Columns: []
>>>> Sort Columns: []
>>>> Storage Desc Params:
>>>> serialization.format 1
>>>>
>>>> Now let's go to spark-shell
>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>> |createtab_stmt
>>>>
>>>> |
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>> |CREATE TABLE `test`.`etc` (
>>>> * `id` BIGINT,*
>>>> * `col1` VARCHAR(30),*
>>>> * `col2` STRING)*
>>>> USING text
>>>> TBLPROPERTIES (
>>>> 'bucketing_version' = '2',
>>>> 'transient_lastDdlTime' = '1647023374')
>>>> |
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> You can see Spark shows columns correctly
>>>>
>>>> Now let us go and create the same table in hive through beeline
>>>>
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>
>>>> No rows affected (0.019 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>> col1 VARCHAR(30), col2 STRING)*
>>>>
>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>
>>>> . . . . . . . . . . . . . . . . . . >
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> | col_name | data_type
>>>> | comment |
>>>>
>>>>
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> | # col_name | data_type
>>>> | comment |
>>>>
>>>> *| id | bigint
>>>> | |*
>>>>
>>>> *| col1 | varchar(30)
>>>> | |*
>>>>
>>>> *| col2 | string
>>>> | |*
>>>>
>>>> | | NULL
>>>> | NULL |
>>>>
>>>> | # Detailed Table Information | NULL
>>>> | NULL |
>>>>
>>>> | Database: | test
>>>> | NULL |
>>>>
>>>> | OwnerType: | USER
>>>> | NULL |
>>>>
>>>> | Owner: | hduser
>>>> | NULL |
>>>>
>>>> | CreateTime: | Fri Mar 11 18:51:00 GMT 2022
>>>> | NULL |
>>>>
>>>> | LastAccessTime: | UNKNOWN
>>>> | NULL |
>>>>
>>>> | Retention: | 0
>>>> | NULL |
>>>>
>>>> | Location: |
>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL
>>>> |
>>>>
>>>> | Table Type: | MANAGED_TABLE
>>>> | NULL |
>>>>
>>>> | Table Parameters: | NULL
>>>> | NULL |
>>>>
>>>> | | COLUMN_STATS_ACCURATE
>>>> |
>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>> |
>>>>
>>>> | | bucketing_version
>>>> | 2 |
>>>>
>>>> | | numFiles
>>>> | 0 |
>>>>
>>>> | | numRows
>>>> | 0 |
>>>>
>>>> | | rawDataSize
>>>> | 0 |
>>>>
>>>> | | totalSize
>>>> | 0 |
>>>>
>>>> | | transient_lastDdlTime
>>>> | 1647024660 |
>>>>
>>>> | | NULL
>>>> | NULL |
>>>>
>>>> | # Storage Information | NULL
>>>> | NULL |
>>>>
>>>> | SerDe Library: |
>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL
>>>> |
>>>>
>>>> | InputFormat: |
>>>> org.apache.hadoop.mapred.TextInputFormat | NULL
>>>> |
>>>>
>>>> | OutputFormat: |
>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL
>>>> |
>>>>
>>>> | Compressed: | No
>>>> | NULL |
>>>>
>>>> | Num Buckets: | -1
>>>> | NULL |
>>>>
>>>> | Bucket Columns: | []
>>>> | NULL |
>>>>
>>>> | Sort Columns: | []
>>>> | NULL |
>>>>
>>>> | Storage Desc Params: | NULL
>>>> | NULL |
>>>>
>>>> | | serialization.format
>>>> | 1 |
>>>>
>>>>
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> 33 rows selected (0.159 seconds)
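As an aside, the pipe-delimited beeline output above lends itself to simple parsing when generating DDL programmatically. A minimal sketch, assuming the rows have been unwrapped back onto single lines (unlike the email-wrapped transcript above); `parse_desc_formatted` is a hypothetical helper, not part of beeline:

```python
# Pull (column, type) pairs out of beeline's table-formatted
# `desc formatted` output. Reads rows until the blank separator row that
# ends the column section. Sample rows abbreviated from this thread.

def parse_desc_formatted(lines):
    """Yield (col_name, data_type) pairs from beeline table output."""
    cols = []
    for line in lines:
        if not line.startswith("|"):
            continue                      # skip the +---+ border rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 2:
            continue
        name, dtype = cells[0], cells[1]
        if name == "col_name" or name.startswith("#"):
            continue                      # header rows
        if not name:                      # blank row ends the column section
            break
        cols.append((name, dtype))
    return cols

sample = [
    "| # col_name | data_type | comment |",
    "| id | bigint |  |",
    "| col1 | varchar(30) |  |",
    "| col2 | string |  |",
    "|  | NULL | NULL |",
    "| # Detailed Table Information | NULL | NULL |",
]
print(parse_desc_formatted(sample))
```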
>>>>
>>>> Now check that in spark-shell again
>>>>
>>>>
>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> |createtab_stmt
>>>>
>>>> |
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> |CREATE TABLE `test`.`etc` (
>>>>
>>>> * `id` BIGINT,*
>>>>
>>>> * `col1` VARCHAR(30),*
>>>>
>>>> * `col2` STRING)*
>>>>
>>>> USING text
>>>>
>>>> TBLPROPERTIES (
>>>>
>>>> 'bucketing_version' = '2',
>>>>
>>>> 'transient_lastDdlTime' = '1647024660')
>>>>
>>>> |
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>>
>>>> It shows OK. So in summary, you get the column definitions in Spark as
>>>> you have defined them in Hive.
>>>>
>>>>
>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>> Beeline 2.1.1", refers to older versions of Spark and Hive which may
>>>> have had such issues.
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>> m.venkatbeece@gmail.com> wrote:
>>>>
>>>>> hi Spark Team,
>>>>>
>>>>> I have raised a question on Spark through Stackoverflow. When you get
>>>>> a chance, can you please take a look and help me?
>>>>>
>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>
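[Editorial note between posts: for anyone stuck on Spark 2.4, one possible workaround is suggested by the observation that Spark 2.x appears to record the original Hive type of a CHAR/VARCHAR column in the column metadata under the key HIVE_TYPE_STRING, even though the schema itself reports the column as string — worth verifying on your build. In PySpark 2.4 you could feed the sketch below `[f.jsonValue() for f in spark.table("test.etc").schema.fields]`; here the field list is hard-coded so the sketch stands alone, and `column_ddl` is a hypothetical helper, not a Spark API.]

```python
# Sketch of a possible Spark 2.4 workaround: render column DDL, preferring
# the Hive type preserved in column metadata (key "HIVE_TYPE_STRING" - an
# assumption to verify) over the schema's generic "string" type.

def column_ddl(fields):
    """Render column definitions, preferring the preserved Hive type."""
    parts = []
    for f in fields:
        hive_type = f.get("metadata", {}).get("HIVE_TYPE_STRING", f["type"])
        parts.append("`%s` %s" % (f["name"], hive_type.upper()))
    return ",\n".join(parts)

# Hard-coded stand-in for what StructField.jsonValue() might return for
# the test.etc table discussed in this thread.
fields = [
    {"name": "id",   "type": "bigint", "metadata": {}},
    {"name": "col1", "type": "string",
     "metadata": {"HIVE_TYPE_STRING": "varchar(30)"}},
    {"name": "col2", "type": "string", "metadata": {}},
]
print(column_ddl(fields))
```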
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
Thank you. I am trying to get the table definition for the existing tables.
BTW, the create and show commands that you executed, were they on Spark 3.x?
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
Well I do not know what has changed. However, this should not affect your
work.
Try to create the table in Spark:
sqltext: String =
CREATE TABLE if not exists test.etcs(
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(4000)
, PADDING2 STRING
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS PARQUET
TBLPROPERTIES (
"parquet.compress"="SNAPPY"
)
scala> spark.sql (sqltext)
scala> spark.sql("show create table test.etcs").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt
|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etcs` (
`ID` INT,
`CLUSTERED` INT,
`SCATTERED` INT,
`RANDOMISED` INT,
`RANDOM_STRING` VARCHAR(50),
`SMALL_VC` VARCHAR(10),
`PADDING` VARCHAR(4000),
`PADDING2` STRING)
USING parquet
CLUSTERED BY (ID)
INTO 256 BUCKETS
TBLPROPERTIES (
'transient_lastDdlTime' = '1647033659',
'parquet.compress' = 'SNAPPY')
|
+------------------------------------------------------------------------------------------------------------------
Note that columns are OK.
Also check this link for the differences between CHAR, VARCHAR and STRING
types in Hive
https://cwiki.apache.org/confluence/display/hive/languagemanual+types
HTH
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
Thank you Mich Talebzadeh for your answer. It's good to know that VARCHAR
and CHAR are properly showing in Spark 3. Do you know what changed in Spark
3 that made this possible? Or how can I achieve the same output in Spark
2.4.1? If there are some conf options, that would be helpful.
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hive 3.1.1
Spark 3.1.1
Your Stack Overflow issue states, and I quote:
"I have a need to generate DDL statements for Hive tables & views
programmatically. I tried using Spark and Beeline for this task. Beeline
takes around 5-10 seconds for each of the statements whereas Spark
completes the same thing in a few milliseconds. I am planning to use Spark
since it is faster compared to beeline. One downside of using spark for
getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
as String and it doesn't preserve the length information that goes with
CHAR,VARCHAR data types. At the same time beeline preserves the data type
and the length information for CHAR,VARCHAR data types. *I am using Spark
2.4.1 and Beeline 2.1.1.*
Given below the sample create table command and its show create table
output."
Create a simple table in Hive in the test database:
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name            data_type       comment
id                    bigint
col1                  varchar(30)
col2                  string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell:
scala> spark.sql("show create table test.etc").show(false)
+----------------------------------------------------------------+
|createtab_stmt                                                  |
+----------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647023374')
|
+----------------------------------------------------------------+
You can see that Spark shows the columns correctly.
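For context, a likely explanation (worth verifying against the release notes for your version): Spark 3.1 added first-class CHAR/VARCHAR handling (SPARK-33480), which is why the lengths now survive in SHOW CREATE TABLE. The same release added a legacy switch to restore the old treat-as-string behaviour; the property name below is an assumption to check against your version's documentation:

```sql
-- Spark 3.1+: opt back into the pre-3.1 behaviour where CHAR/VARCHAR
-- are treated as plain STRING (assumed property name; verify first)
SET spark.sql.legacy.charVarcharAsString=true;
```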
Now let us go and create the same table in Hive through beeline:
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
+-------------------------------+------------------------------------------------------------+------------+
| col_name                      | data_type                                                  | comment    |
+-------------------------------+------------------------------------------------------------+------------+
| # col_name                    | data_type                                                  | comment    |
| id                            | bigint                                                     |            |
| col1                          | varchar(30)                                                |            |
| col2                          | string                                                     |            |
|                               | NULL                                                       | NULL       |
| # Detailed Table Information  | NULL                                                       | NULL       |
| Database:                     | test                                                       | NULL       |
| OwnerType:                    | USER                                                       | NULL       |
| Owner:                        | hduser                                                     | NULL       |
| CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                               | NULL       |
| LastAccessTime:               | UNKNOWN                                                    | NULL       |
| Retention:                    | 0                                                          | NULL       |
| Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc         | NULL       |
| Table Type:                   | MANAGED_TABLE                                              | NULL       |
| Table Parameters:             | NULL                                                       | NULL       |
|                               | COLUMN_STATS_ACCURATE                                      | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
|                               | bucketing_version                                          | 2          |
|                               | numFiles                                                   | 0          |
|                               | numRows                                                    | 0          |
|                               | rawDataSize                                                | 0          |
|                               | totalSize                                                  | 0          |
|                               | transient_lastDdlTime                                      | 1647024660 |
|                               | NULL                                                       | NULL       |
| # Storage Information         | NULL                                                       | NULL       |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL       |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat                   | NULL       |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL       |
| Compressed:                   | No                                                         | NULL       |
| Num Buckets:                  | -1                                                         | NULL       |
| Bucket Columns:               | []                                                         | NULL       |
| Sort Columns:                 | []                                                         | NULL       |
| Storage Desc Params:          | NULL                                                       | NULL       |
|                               | serialization.format                                       | 1          |
+-------------------------------+------------------------------------------------------------+------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again:
scala> spark.sql("show create table test.etc").show(false)
+----------------------------------------------------------------+
|createtab_stmt                                                  |
+----------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647024660')
|
+----------------------------------------------------------------+
It shows OK. So, in summary, you get the column definitions in Spark exactly
as you defined them in Hive.
Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
2.1.1", refers to older versions of Spark and Hive, which may have had such
issues.
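Since the original goal is generating DDL for many tables programmatically, the per-table calls can be wrapped in a small loop. A minimal PySpark-style sketch, assuming an active SparkSession named `spark` and Spark 3.1+ so that the CHAR/VARCHAR lengths are preserved (database and table names are illustrative):

```python
def ddl_statements(spark, database: str):
    """Yield the SHOW CREATE TABLE text for every table in `database`.

    `spark` is assumed to be an active SparkSession; on Spark 3.1+ the
    output preserves CHAR/VARCHAR lengths as defined in Hive.
    """
    # SHOW TABLES returns rows with a `tableName` column.
    tables = spark.sql(f"SHOW TABLES IN `{database}`").collect()
    for row in tables:
        name = row["tableName"]
        ddl = spark.sql(f"SHOW CREATE TABLE `{database}`.`{name}`").first()
        yield ddl["createtab_stmt"]

# Example use (requires a real SparkSession):
#   for stmt in ddl_statements(spark, "test"):
#       print(stmt)
```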
HTH
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.