Posted to user@spark.apache.org by Venkatesan Muniappan <m....@gmail.com> on 2022/03/11 17:58:36 UTC
Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
hi Spark Team,
I have raised a question about Spark on Stack Overflow. When you get a
chance, can you please take a look and help me?
https://stackoverflow.com/q/71431757/5927843
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
hi,
Does anybody have a better suggestion for my problem?
Thanks,
Venkat
2016173438
On Fri, Mar 11, 2022 at 4:43 PM Venkatesan Muniappan <
m.venkatbeece@gmail.com> wrote:
> ok. I work for an org where such upgrades take a few months. Not an
> immediate task.
>
> Thanks,
> Venkat
> 2016173438
>
>
> On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Yes, in Spark 3.1.1. Best to upgrade to Spark 3+.
>>
>>
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <
>> m.venkatbeece@gmail.com> wrote:
>>
>>> Thank you. I am trying to get the table definitions for existing
>>> tables. BTW, the create and show commands that you executed, were they on
>>> Spark 3.x?
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>>
>>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Well I do not know what has changed. However, this should not affect
>>>> your work.
>>>>
>>>>
>>>> Try to create table in Spark
>>>>
>>>>
>>>> sqltext: String =
>>>>
>>>> CREATE TABLE if not exists test.etcs(
>>>>
>>>> ID INT
>>>>
>>>> , CLUSTERED INT
>>>>
>>>> , SCATTERED INT
>>>>
>>>> , RANDOMISED INT
>>>>
>>>> , RANDOM_STRING VARCHAR(50)
>>>>
>>>> , SMALL_VC VARCHAR(10)
>>>>
>>>> , PADDING VARCHAR(4000)
>>>>
>>>> , PADDING2 STRING
>>>>
>>>> )
>>>>
>>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>>
>>>> STORED AS PARQUET
>>>>
>>>> TBLPROPERTIES (
>>>>
>>>> "parquet.compress"="SNAPPY"
>>>>
>>>> )
>>>>
>>>>
>>>> scala> spark.sql (sqltext)
>>>>
>>>> scala> spark.sql("show create table test.etcs").show(false)
>>>>
>>>>
>>>> +----------------------------------------------------+
>>>> |createtab_stmt                                      |
>>>> +----------------------------------------------------+
>>>> |CREATE TABLE `test`.`etcs` (
>>>>   `ID` INT,
>>>>   `CLUSTERED` INT,
>>>>   `SCATTERED` INT,
>>>>   `RANDOMISED` INT,
>>>>   `RANDOM_STRING` VARCHAR(50),
>>>>   `SMALL_VC` VARCHAR(10),
>>>>   `PADDING` VARCHAR(4000),
>>>>   `PADDING2` STRING)
>>>> USING parquet
>>>> CLUSTERED BY (ID)
>>>> INTO 256 BUCKETS
>>>> TBLPROPERTIES (
>>>>   'transient_lastDdlTime' = '1647033659',
>>>>   'parquet.compress' = 'SNAPPY')|
>>>> +----------------------------------------------------+
>>>>
>>>>
>>>> Note that columns are OK.
>>>>
>>>>
>>>> Also check this link for the differences between CHAR, VARCHAR and
>>>> STRING types in Hive
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>>>> m.venkatbeece@gmail.com> wrote:
>>>>
>>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>>> VARCHAR and CHAR show properly in Spark 3. Do you know what changed
>>>>> in Spark 3 that made this possible? Or how can I achieve the same output
>>>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>>
>>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>>> Hive 3.1.1
>>>>>> Spark 3.1.1
>>>>>>
>>>>>> Your Stack Overflow question says, and I quote:
>>>>>>
>>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>>>> since it is faster compared to beeline. One downside of using spark for
>>>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>>>> as String and it doesn't preserve the length information that goes with
>>>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>>>
>>>>>> Given below the sample create table command and its show create table
>>>>>> output."
>>>>>>
>>>>>> Create a simple table in *Hive* in the test database:
>>>>>>
>>>>>> hive> *use test;*
>>>>>> OK
>>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>>> OK
>>>>>> hive> *desc formatted etc;*
>>>>>> # col_name data_type comment
>>>>>> *id bigint*
>>>>>> *col1 varchar(30)*
>>>>>> *col2 string*
>>>>>>
>>>>>> # Detailed Table Information
>>>>>> Database: test
>>>>>> OwnerType: USER
>>>>>> Owner: hduser
>>>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>>>> LastAccessTime: UNKNOWN
>>>>>> Retention: 0
>>>>>> Location:
>>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>>> Table Type: MANAGED_TABLE
>>>>>> Table Parameters:
>>>>>> COLUMN_STATS_ACCURATE
>>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>>> bucketing_version 2
>>>>>> numFiles 0
>>>>>> numRows 0
>>>>>> rawDataSize 0
>>>>>> totalSize 0
>>>>>> transient_lastDdlTime 1647023374
>>>>>>
>>>>>> # Storage Information
>>>>>> SerDe Library:
>>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>>>> OutputFormat:
>>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>>> Compressed: No
>>>>>> Num Buckets: -1
>>>>>> Bucket Columns: []
>>>>>> Sort Columns: []
>>>>>> Storage Desc Params:
>>>>>> serialization.format 1
>>>>>>
>>>>>> Now let's go to spark-shell:
>>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>>
>>>>>> +--------------------------------------------------+
>>>>>> |createtab_stmt                                    |
>>>>>> +--------------------------------------------------+
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>> *  `id` BIGINT,*
>>>>>> *  `col1` VARCHAR(30),*
>>>>>> *  `col2` STRING)*
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647023374')|
>>>>>> +--------------------------------------------------+
>>>>>>
>>>>>> You can see that Spark shows the columns correctly.
>>>>>>
>>>>>> Now let us create the same table in Hive through Beeline:
>>>>>>
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>>>
>>>>>> No rows affected (0.019 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>>>> col1 VARCHAR(30), col2 STRING)*
>>>>>>
>>>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>>
>>>>>> . . . . . . . . . . . . . . . . . . >
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | col_name                      | data_type                                          | comment    |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | # col_name                    | data_type                                          | comment    |
>>>>>> *| id                            | bigint                                             |            |*
>>>>>> *| col1                          | varchar(30)                                        |            |*
>>>>>> *| col2                          | string                                             |            |*
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Detailed Table Information  | NULL                                               | NULL       |
>>>>>> | Database:                     | test                                               | NULL       |
>>>>>> | OwnerType:                    | USER                                               | NULL       |
>>>>>> | Owner:                        | hduser                                             | NULL       |
>>>>>> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL       |
>>>>>> | LastAccessTime:               | UNKNOWN                                            | NULL       |
>>>>>> | Retention:                    | 0                                                  | NULL       |
>>>>>> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL       |
>>>>>> | Table Type:                   | MANAGED_TABLE                                      | NULL       |
>>>>>> | Table Parameters:             | NULL                                               | NULL       |
>>>>>> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>>>> |                               | bucketing_version                                  | 2          |
>>>>>> |                               | numFiles                                           | 0          |
>>>>>> |                               | numRows                                            | 0          |
>>>>>> |                               | rawDataSize                                        | 0          |
>>>>>> |                               | totalSize                                          | 0          |
>>>>>> |                               | transient_lastDdlTime                              | 1647024660 |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Storage Information         | NULL                                               | NULL       |
>>>>>> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL       |
>>>>>> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL       |
>>>>>> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
>>>>>> | Compressed:                   | No                                                 | NULL       |
>>>>>> | Num Buckets:                  | -1                                                 | NULL       |
>>>>>> | Bucket Columns:               | []                                                 | NULL       |
>>>>>> | Sort Columns:                 | []                                                 | NULL       |
>>>>>> | Storage Desc Params:          | NULL                                               | NULL       |
>>>>>> |                               | serialization.format                               | 1          |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>>
>>>>>> 33 rows selected (0.159 seconds)
>>>>>>
>>>>>> Now check that in spark-shell again
>>>>>>
>>>>>>
>>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>>
>>>>>>
>>>>>> +--------------------------------------------------+
>>>>>> |createtab_stmt                                    |
>>>>>> +--------------------------------------------------+
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>> *  `id` BIGINT,*
>>>>>> *  `col1` VARCHAR(30),*
>>>>>> *  `col2` STRING)*
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647024660')|
>>>>>> +--------------------------------------------------+
>>>>>>
>>>>>>
>>>>>> It shows OK. So, in summary, you get the column definitions in Spark
>>>>>> as you defined them in Hive.
>>>>>>
>>>>>>
>>>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>>>> Beeline 2.1.1", refers to older versions of Spark and Hive, which may
>>>>>> have had this issue.
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>>>> m.venkatbeece@gmail.com> wrote:
>>>>>>
>>>>>>> hi Spark Team,
>>>>>>>
>>>>>>> I have raised a question about Spark on Stack Overflow. When you
>>>>>>> get a chance, can you please take a look and help me?
>>>>>>>
>>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Venkat
>>>>>>> 2016173438
>>>>>>>
>>>>>>
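[Editor's note] For readers who, like the original poster, are stuck on Spark 2.4.x: Spark 2.x maps Hive CHAR/VARCHAR columns to StringType, but to my knowledge it keeps the raw Hive type string in each column's metadata (under a key such as HIVE_TYPE_STRING), so the length information is often still recoverable from `df.schema` without SHOW CREATE TABLE. The sketch below is illustrative only: the JSON is a hypothetical sample of what `spark.table("test.etc").schema.json()` might return for the example table, and the metadata key should be verified against your Spark build.

```python
import json

# Hypothetical schema JSON, mirroring what Spark 2.4's df.schema.json()
# might produce for test.etc. The HIVE_TYPE_STRING key is an assumption
# based on how Spark 2.x's Hive catalog is understood to store raw types.
schema_json = """
{"type":"struct","fields":[
  {"name":"id","type":"long","nullable":true,"metadata":{}},
  {"name":"col1","type":"string","nullable":true,
   "metadata":{"HIVE_TYPE_STRING":"varchar(30)"}},
  {"name":"col2","type":"string","nullable":true,"metadata":{}}]}
"""

def hive_column_types(schema: str) -> dict:
    """Return column -> type, preferring the preserved Hive type
    (e.g. varchar(30)) over Spark's mapped type (string)."""
    out = {}
    for field in json.loads(schema)["fields"]:
        out[field["name"]] = field["metadata"].get(
            "HIVE_TYPE_STRING", field["type"])
    return out

print(hive_column_types(schema_json))
# -> {'id': 'long', 'col1': 'varchar(30)', 'col2': 'string'}
```

With the types recovered this way, a DDL generator on Spark 2.4.x could emit `varchar(30)` instead of `string`, matching what Beeline reports.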
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
ok. I work for an org where such upgrades take a few months. Not an
immediate task.
Thanks,
Venkat
2016173438
On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mi...@gmail.com>
wrote:
> yes in spark 3.1.1. Best to upgrade it to spark 3+.
>
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <
> m.venkatbeece@gmail.com> wrote:
>
>> Thank you. I am trying to get the table definition for the existing
>> tables. BTW, the create and show command that you executed, was it on Spark
>> 3.x ? .
>>
>> Thanks,
>> Venkat
>> 2016173438
>>
>>
>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Well I do not know what has changed. However, this should not affect
>>> your work.
>>>
>>>
>>> Try to create table in Spark
>>>
>>>
>>> sqltext: String =
>>>
>>> CREATE TABLE if not exists test.etcs(
>>>
>>> ID INT
>>>
>>> , CLUSTERED INT
>>>
>>> , SCATTERED INT
>>>
>>> , RANDOMISED INT
>>>
>>> , RANDOM_STRING VARCHAR(50)
>>>
>>> , SMALL_VC VARCHAR(10)
>>>
>>> , PADDING VARCHAR(4000)
>>>
>>> , PADDING2 STRING
>>>
>>> )
>>>
>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>
>>> STORED AS PARQUET
>>>
>>> TBLPROPERTIES (
>>>
>>> "parquet.compress"="SNAPPY"
>>>
>>> )
>>>
>>>
>>> scala> spark.sql (sqltext)
>>>
>>> scala> spark.sql("show create table test.etcs").show(false)
>>>
>>>
>>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>
>>> |createtab_stmt
>>>
>>>
>>>
>>> |
>>>
>>>
>>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>
>>> |CREATE TABLE `test`.`etcs` (
>>>
>>> `ID` INT,
>>>
>>> `CLUSTERED` INT,
>>>
>>> `SCATTERED` INT,
>>>
>>> `RANDOMISED` INT,
>>>
>>> `RANDOM_STRING` VARCHAR(50),
>>>
>>> `SMALL_VC` VARCHAR(10),
>>>
>>> `PADDING` VARCHAR(4000),
>>>
>>> `PADDING2` STRING)
>>>
>>> USING parquet
>>>
>>> CLUSTERED BY (ID)
>>>
>>> INTO 256 BUCKETS
>>>
>>> TBLPROPERTIES (
>>>
>>> 'transient_lastDdlTime' = '1647033659',
>>>
>>> 'parquet.compress' = 'SNAPPY')
>>>
>>> |
>>>
>>>
>>> +------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> Note that columns are OK.
>>>
>>>
>>> Also check this link for the differences between CHAR, VARCHAR and
>>> STRING types in Hive
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>>> m.venkatbeece@gmail.com> wrote:
>>>
>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>> VARCHAR and CHAR are properly showing in Spark 3. Do you know what changed
>>>> in Spark 3 that made this possible?. Or how can I achieve the same output
>>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>>
>>>> Thanks,
>>>> Venkat
>>>> 2016173438
>>>>
>>>>
>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>> Hive 3.1.1
>>>>> Spark 3.1.1
>>>>>
>>>>> Your stack overflow issue raised and I quote:
>>>>>
>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>>> since it is faster compared to beeline. One downside of using spark for
>>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>>> as String and it doesn't preserve the length information that goes with
>>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>>
>>>>> Given below the sample create table command and its show create table
>>>>> output."
>>>>>
>>>>> Create a simple table in *Hive* in test database
>>>>>
>>>>> hive> *use test;*
>>>>> OK
>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>> OK
>>>>> hive> *desc formatted etc;*
>>>>> # col_name data_type comment
>>>>> *id bigint*
>>>>> *col1 varchar(30)*
>>>>> *col2 string*
>>>>>
>>>>> # Detailed Table Information
>>>>> Database: test
>>>>> OwnerType: USER
>>>>> Owner: hduser
>>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>>> LastAccessTime: UNKNOWN
>>>>> Retention: 0
>>>>> Location:
>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>> Table Type: MANAGED_TABLE
>>>>> Table Parameters:
>>>>> COLUMN_STATS_ACCURATE
>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>> bucketing_version 2
>>>>> numFiles 0
>>>>> numRows 0
>>>>> rawDataSize 0
>>>>> totalSize 0
>>>>> transient_lastDdlTime 1647023374
>>>>>
>>>>> # Storage Information
>>>>> SerDe Library:
>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>>> OutputFormat:
>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>> Compressed: No
>>>>> Num Buckets: -1
>>>>> Bucket Columns: []
>>>>> Sort Columns: []
>>>>> Storage Desc Params:
>>>>> serialization.format 1
>>>>>
>>>>> Now let's go to spark-shell
>>>>> ^
>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> |createtab_stmt
>>>>>
>>>>> |
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> |CREATE TABLE `test`.`etc` (
>>>>> * `id` BIGINT,*
>>>>> * `col1` VARCHAR(30),*
>>>>> * `col2` STRING)*
>>>>> USING text
>>>>> TBLPROPERTIES (
>>>>> 'bucketing_version' = '2',
>>>>> 'transient_lastDdlTime' = '1647023374')
>>>>> |
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> You can see Spark shows columns correctly
>>>>>
>>>>> Now let us go and create the same table in hive through beeline
>>>>>
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>>
>>>>> No rows affected (0.019 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>>> col1 VARCHAR(30), col2 STRING)*
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . >
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> | col_name | data_type
>>>>> | comment |
>>>>>
>>>>>
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> | # col_name | data_type
>>>>> | comment |
>>>>>
>>>>> *| id | bigint
>>>>> | |*
>>>>>
>>>>> *| col1 | varchar(30)
>>>>> | |*
>>>>>
>>>>> *| col2 | string
>>>>> | |*
>>>>>
>>>>> | | NULL
>>>>> | NULL |
>>>>>
>>>>> | # Detailed Table Information | NULL
>>>>> | NULL |
>>>>>
>>>>> | Database: | test
>>>>> | NULL |
>>>>>
>>>>> | OwnerType: | USER
>>>>> | NULL |
>>>>>
>>>>> | Owner: | hduser
>>>>> | NULL |
>>>>>
>>>>> | CreateTime: | Fri Mar 11 18:51:00 GMT 2022
>>>>> | NULL |
>>>>>
>>>>> | LastAccessTime: | UNKNOWN
>>>>> | NULL |
>>>>>
>>>>> | Retention: | 0
>>>>> | NULL |
>>>>>
>>>>> | Location: |
>>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL
>>>>> |
>>>>>
>>>>> | Table Type: | MANAGED_TABLE
>>>>> | NULL |
>>>>>
>>>>> | Table Parameters: | NULL
>>>>> | NULL |
>>>>>
>>>>> | | COLUMN_STATS_ACCURATE
>>>>> |
>>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>> |
>>>>>
>>>>> | | bucketing_version
>>>>> | 2 |
>>>>>
>>>>> | | numFiles
>>>>> | 0 |
>>>>>
>>>>> | | numRows
>>>>> | 0 |
>>>>>
>>>>> | | rawDataSize
>>>>> | 0 |
>>>>>
>>>>> | | totalSize
>>>>> | 0 |
>>>>>
>>>>> | | transient_lastDdlTime
>>>>> | 1647024660 |
>>>>>
>>>>> | | NULL
>>>>> | NULL |
>>>>>
>>>>> | # Storage Information | NULL
>>>>> | NULL |
>>>>>
>>>>> | SerDe Library: |
>>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL
>>>>> |
>>>>>
>>>>> | InputFormat: |
>>>>> org.apache.hadoop.mapred.TextInputFormat | NULL
>>>>> |
>>>>>
>>>>> | OutputFormat: |
>>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL
>>>>> |
>>>>>
>>>>> | Compressed: | No
>>>>> | NULL |
>>>>>
>>>>> | Num Buckets: | -1
>>>>> | NULL |
>>>>>
>>>>> | Bucket Columns: | []
>>>>> | NULL |
>>>>>
>>>>> | Sort Columns: | []
>>>>> | NULL |
>>>>>
>>>>> | Storage Desc Params: | NULL
>>>>> | NULL |
>>>>>
>>>>> | | serialization.format
>>>>> | 1 |
>>>>>
>>>>>
>>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>>
>>>>> 33 rows selected (0.159 seconds)
>>>>>
>>>>> Now check that in spark-shell again
>>>>>
>>>>>
>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> |createtab_stmt
>>>>>
>>>>> |
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>> |CREATE TABLE `test`.`etc` (
>>>>>
>>>>> * `id` BIGINT,*
>>>>>
>>>>> * `col1` VARCHAR(30),*
>>>>>
>>>>> * `col2` STRING)*
>>>>>
>>>>> USING text
>>>>>
>>>>> TBLPROPERTIES (
>>>>>
>>>>> 'bucketing_version' = '2',
>>>>>
>>>>> 'transient_lastDdlTime' = '1647024660')
>>>>>
>>>>> |
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> It shows OK. Soo in summary you get column definitions in Spark as
>>>>> you have defined them in Hive
>>>>>
>>>>>
>>>>> In your statement above and I quote "I am using Spark 2.4.1 and
>>>>> Beeline 2.1.1", refers to older versions of Spark and hive which may
>>>>> have had such issues.
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>> view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>>> m.venkatbeece@gmail.com> wrote:
>>>>>
>>>>>> hi Spark Team,
>>>>>>
>>>>>> I have raised a question on Spark through Stackoverflow. When you get
>>>>>> a chance, can you please take a look and help me ?.
>>>>>>
>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>
>>>>>> Thanks,
>>>>>> Venkat
>>>>>> 2016173438
>>>>>>
>>>>>
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
yes in spark 3.1.1. Best to upgrade it to spark 3+.
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <m....@gmail.com>
wrote:
> Thank you. I am trying to get the table definition for the existing
> tables. BTW, the create and show command that you executed, was it on Spark
> 3.x ? .
>
> Thanks,
> Venkat
> 2016173438
>
>
> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>> Well I do not know what has changed. However, this should not affect your
>> work.
>>
>>
>> Try to create table in Spark
>>
>>
>> sqltext: String =
>>
>> CREATE TABLE if not exists test.etcs(
>>
>> ID INT
>>
>> , CLUSTERED INT
>>
>> , SCATTERED INT
>>
>> , RANDOMISED INT
>>
>> , RANDOM_STRING VARCHAR(50)
>>
>> , SMALL_VC VARCHAR(10)
>>
>> , PADDING VARCHAR(4000)
>>
>> , PADDING2 STRING
>>
>> )
>>
>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>
>> STORED AS PARQUET
>>
>> TBLPROPERTIES (
>>
>> "parquet.compress"="SNAPPY"
>>
>> )
>>
>>
>> scala> spark.sql (sqltext)
>>
>> scala> spark.sql("show create table test.etcs").show(false)
>>
>>
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>
>> |createtab_stmt
>>
>>
>>
>> |
>>
>>
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>
>> |CREATE TABLE `test`.`etcs` (
>>
>> `ID` INT,
>>
>> `CLUSTERED` INT,
>>
>> `SCATTERED` INT,
>>
>> `RANDOMISED` INT,
>>
>> `RANDOM_STRING` VARCHAR(50),
>>
>> `SMALL_VC` VARCHAR(10),
>>
>> `PADDING` VARCHAR(4000),
>>
>> `PADDING2` STRING)
>>
>> USING parquet
>>
>> CLUSTERED BY (ID)
>>
>> INTO 256 BUCKETS
>>
>> TBLPROPERTIES (
>>
>> 'transient_lastDdlTime' = '1647033659',
>>
>> 'parquet.compress' = 'SNAPPY')
>>
>> |
>>
>>
>> +------------------------------------------------------------------------------------------------------------------
>>
>>
>> Note that columns are OK.
>>
>>
>> Also check this link for the differences between CHAR, VARCHAR and STRING
>> types in Hive
>>
>>
>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>
>>
>> HTH
>>
>>
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>> m.venkatbeece@gmail.com> wrote:
>>
>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>> VARCHAR and CHAR are properly showing in Spark 3. Do you know what changed
>>> in Spark 3 that made this possible?. Or how can I achieve the same output
>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>>
>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> Hive 3.1.1
>>>> Spark 3.1.1
>>>>
>>>> Your stack overflow issue raised and I quote:
>>>>
>>>> "I have a need to generate DDL statements for Hive tables & views
>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>> completes the same thing in a few milliseconds. I am planning to use Spark
>>>> since it is faster compared to beeline. One downside of using spark for
>>>> getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
>>>> as String and it doesn't preserve the length information that goes with
>>>> CHAR,VARCHAR data types. At the same time beeline preserves the data type
>>>> and the length information for CHAR,VARCHAR data types. *I am using
>>>> Spark 2.4.1 and Beeline 2.1.1.*
>>>>
>>>> Given below the sample create table command and its show create table
>>>> output."
>>>>
>>>> Create a simple table in *Hive* in test database
>>>>
>>>> hive> *use test;*
>>>> OK
>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>> OK
>>>> hive> *desc formatted etc;*
>>>> # col_name data_type comment
>>>> *id bigint*
>>>> *col1 varchar(30)*
>>>> *col2 string*
>>>>
>>>> # Detailed Table Information
>>>> Database: test
>>>> OwnerType: USER
>>>> Owner: hduser
>>>> CreateTime: Fri Mar 11 18:29:34 GMT 2022
>>>> LastAccessTime: UNKNOWN
>>>> Retention: 0
>>>> Location:
>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>> Table Type: MANAGED_TABLE
>>>> Table Parameters:
>>>> COLUMN_STATS_ACCURATE
>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>> bucketing_version 2
>>>> numFiles 0
>>>> numRows 0
>>>> rawDataSize 0
>>>> totalSize 0
>>>> transient_lastDdlTime 1647023374
>>>>
>>>> # Storage Information
>>>> SerDe Library:
>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>> InputFormat: org.apache.hadoop.mapred.TextInputFormat
>>>> OutputFormat:
>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>> Compressed: No
>>>> Num Buckets: -1
>>>> Bucket Columns: []
>>>> Sort Columns: []
>>>> Storage Desc Params:
>>>> serialization.format 1
>>>>
>>>> Now let's go to spark-shell
>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>> |createtab_stmt
>>>>
>>>> |
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>> |CREATE TABLE `test`.`etc` (
>>>> * `id` BIGINT,*
>>>> * `col1` VARCHAR(30),*
>>>> * `col2` STRING)*
>>>> USING text
>>>> TBLPROPERTIES (
>>>> 'bucketing_version' = '2',
>>>> 'transient_lastDdlTime' = '1647023374')
>>>> |
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> You can see Spark shows columns correctly
>>>>
>>>> Now let us go and create the same table in hive through beeline
>>>>
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default>* use test*
>>>>
>>>> No rows affected (0.019 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT,
>>>> col1 VARCHAR(30), col2 STRING)*
>>>>
>>>> . . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
>>>>
>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>
>>>> . . . . . . . . . . . . . . . . . . >
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> | col_name | data_type
>>>> | comment |
>>>>
>>>>
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> | # col_name | data_type
>>>> | comment |
>>>>
>>>> *| id | bigint
>>>> | |*
>>>>
>>>> *| col1 | varchar(30)
>>>> | |*
>>>>
>>>> *| col2 | string
>>>> | |*
>>>>
>>>> | | NULL
>>>> | NULL |
>>>>
>>>> | # Detailed Table Information | NULL
>>>> | NULL |
>>>>
>>>> | Database: | test
>>>> | NULL |
>>>>
>>>> | OwnerType: | USER
>>>> | NULL |
>>>>
>>>> | Owner: | hduser
>>>> | NULL |
>>>>
>>>> | CreateTime: | Fri Mar 11 18:51:00 GMT 2022
>>>> | NULL |
>>>>
>>>> | LastAccessTime: | UNKNOWN
>>>> | NULL |
>>>>
>>>> | Retention: | 0
>>>> | NULL |
>>>>
>>>> | Location: |
>>>> hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL
>>>> |
>>>>
>>>> | Table Type: | MANAGED_TABLE
>>>> | NULL |
>>>>
>>>> | Table Parameters: | NULL
>>>> | NULL |
>>>>
>>>> | | COLUMN_STATS_ACCURATE
>>>> |
>>>> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>> |
>>>>
>>>> | | bucketing_version
>>>> | 2 |
>>>>
>>>> | | numFiles
>>>> | 0 |
>>>>
>>>> | | numRows
>>>> | 0 |
>>>>
>>>> | | rawDataSize
>>>> | 0 |
>>>>
>>>> | | totalSize
>>>> | 0 |
>>>>
>>>> | | transient_lastDdlTime
>>>> | 1647024660 |
>>>>
>>>> | | NULL
>>>> | NULL |
>>>>
>>>> | # Storage Information | NULL
>>>> | NULL |
>>>>
>>>> | SerDe Library: |
>>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL
>>>> |
>>>>
>>>> | InputFormat: |
>>>> org.apache.hadoop.mapred.TextInputFormat | NULL
>>>> |
>>>>
>>>> | OutputFormat: |
>>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL
>>>> |
>>>>
>>>> | Compressed: | No
>>>> | NULL |
>>>>
>>>> | Num Buckets: | -1
>>>> | NULL |
>>>>
>>>> | Bucket Columns: | []
>>>> | NULL |
>>>>
>>>> | Sort Columns: | []
>>>> | NULL |
>>>>
>>>> | Storage Desc Params: | NULL
>>>> | NULL |
>>>>
>>>> | | serialization.format
>>>> | 1 |
>>>>
>>>>
>>>> +-------------------------------+----------------------------------------------------+----------------------------------------------------+
>>>>
>>>> 33 rows selected (0.159 seconds)
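As an aside, the pipe-delimited beeline output above lends itself to simple parsing when generating DDL programmatically. A minimal sketch, assuming the rows have been unwrapped back onto single lines (unlike the email-wrapped transcript above); `parse_desc_formatted` is a hypothetical helper, not part of beeline:

```python
# Pull (column, type) pairs out of beeline's table-formatted
# `desc formatted` output. Reads rows until the blank separator row that
# ends the column section. Sample rows abbreviated from this thread.

def parse_desc_formatted(lines):
    """Yield (col_name, data_type) pairs from beeline table output."""
    cols = []
    for line in lines:
        if not line.startswith("|"):
            continue                      # skip the +---+ border rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 2:
            continue
        name, dtype = cells[0], cells[1]
        if name == "col_name" or name.startswith("#"):
            continue                      # header rows
        if not name:                      # blank row ends the column section
            break
        cols.append((name, dtype))
    return cols

sample = [
    "| # col_name | data_type | comment |",
    "| id | bigint |  |",
    "| col1 | varchar(30) |  |",
    "| col2 | string |  |",
    "|  | NULL | NULL |",
    "| # Detailed Table Information | NULL | NULL |",
]
print(parse_desc_formatted(sample))
```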
>>>>
>>>> Now check that in spark-shell again
>>>>
>>>>
>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> |createtab_stmt
>>>>
>>>> |
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>> |CREATE TABLE `test`.`etc` (
>>>>
>>>> * `id` BIGINT,*
>>>>
>>>> * `col1` VARCHAR(30),*
>>>>
>>>> * `col2` STRING)*
>>>>
>>>> USING text
>>>>
>>>> TBLPROPERTIES (
>>>>
>>>> 'bucketing_version' = '2',
>>>>
>>>> 'transient_lastDdlTime' = '1647024660')
>>>>
>>>> |
>>>>
>>>>
>>>> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>
>>>>
>>>> It shows OK. So in summary, you get the column definitions in Spark as
>>>> you have defined them in Hive.
>>>>
>>>>
>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>> Beeline 2.1.1", refers to older versions of Spark and Hive which may
>>>> have had such issues.
>>>>
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>> m.venkatbeece@gmail.com> wrote:
>>>>
>>>>> hi Spark Team,
>>>>>
>>>>> I have raised a question on Spark through Stackoverflow. When you get
>>>>> a chance, can you please take a look and help me?
>>>>>
>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>
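[Editorial note between posts: for anyone stuck on Spark 2.4, one possible workaround is suggested by the observation that Spark 2.x appears to record the original Hive type of a CHAR/VARCHAR column in the column metadata under the key HIVE_TYPE_STRING, even though the schema itself reports the column as string — worth verifying on your build. In PySpark 2.4 you could feed the sketch below `[f.jsonValue() for f in spark.table("test.etc").schema.fields]`; here the field list is hard-coded so the sketch stands alone, and `column_ddl` is a hypothetical helper, not a Spark API.]

```python
# Sketch of a possible Spark 2.4 workaround: render column DDL, preferring
# the Hive type preserved in column metadata (key "HIVE_TYPE_STRING" - an
# assumption to verify) over the schema's generic "string" type.

def column_ddl(fields):
    """Render column definitions, preferring the preserved Hive type."""
    parts = []
    for f in fields:
        hive_type = f.get("metadata", {}).get("HIVE_TYPE_STRING", f["type"])
        parts.append("`%s` %s" % (f["name"], hive_type.upper()))
    return ",\n".join(parts)

# Hard-coded stand-in for what StructField.jsonValue() might return for
# the test.etc table discussed in this thread.
fields = [
    {"name": "id",   "type": "bigint", "metadata": {}},
    {"name": "col1", "type": "string",
     "metadata": {"HIVE_TYPE_STRING": "varchar(30)"}},
    {"name": "col2", "type": "string", "metadata": {}},
]
print(column_ddl(fields))
```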
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
Thank you. I am trying to get the table definition for the existing tables.
BTW, the create and show commands that you executed, were they on Spark 3.x?
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
Well I do not know what has changed. However, this should not affect your
work.
Try to create the table in Spark:
sqltext: String =
CREATE TABLE if not exists test.etcs(
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(4000)
, PADDING2 STRING
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS PARQUET
TBLPROPERTIES (
"parquet.compress"="SNAPPY"
)
scala> spark.sql (sqltext)
scala> spark.sql("show create table test.etcs").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt
|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etcs` (
`ID` INT,
`CLUSTERED` INT,
`SCATTERED` INT,
`RANDOMISED` INT,
`RANDOM_STRING` VARCHAR(50),
`SMALL_VC` VARCHAR(10),
`PADDING` VARCHAR(4000),
`PADDING2` STRING)
USING parquet
CLUSTERED BY (ID)
INTO 256 BUCKETS
TBLPROPERTIES (
'transient_lastDdlTime' = '1647033659',
'parquet.compress' = 'SNAPPY')
|
+------------------------------------------------------------------------------------------------------------------
Note that columns are OK.
Also check this link for the differences between CHAR, VARCHAR and STRING
types in Hive
https://cwiki.apache.org/confluence/display/hive/languagemanual+types
HTH
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Venkatesan Muniappan <m....@gmail.com>.
Thank you Mich Talebzadeh for your answer. It's good to know that VARCHAR
and CHAR are properly showing in Spark 3. Do you know what changed in Spark
3 that made this possible? Or how can I achieve the same output in Spark
2.4.1? If there are some conf options, that would be helpful.
Thanks,
Venkat
2016173438
Re: Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING
Posted by Mich Talebzadeh <mi...@gmail.com>.
Hive 3.1.1
Spark 3.1.1
Your Stack Overflow issue states, and I quote:
"I have a need to generate DDL statements for Hive tables & views
programmatically. I tried using Spark and Beeline for this task. Beeline
takes around 5-10 seconds for each of the statements whereas Spark
completes the same thing in a few milliseconds. I am planning to use Spark
since it is faster compared to beeline. One downside of using spark for
getting DDL statements from the hive is, it treats CHAR, VARCHAR characters
as String and it doesn't preserve the length information that goes with
CHAR,VARCHAR data types. At the same time beeline preserves the data type
and the length information for CHAR,VARCHAR data types. *I am using Spark
2.4.1 and Beeline 2.1.1.*
Given below the sample create table command and its show create table
output."
Create a simple table in Hive in the test database:
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name            data_type       comment
id                    bigint
col1                  varchar(30)
col2                  string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell:
scala> spark.sql("show create table test.etc").show(false)
+----------------------------------------------------------------+
|createtab_stmt                                                  |
+----------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647023374')
|
+----------------------------------------------------------------+
You can see that Spark shows the columns correctly.
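For context, a likely explanation (worth verifying against the release notes for your version): Spark 3.1 added first-class CHAR/VARCHAR handling (SPARK-33480), which is why the lengths now survive in SHOW CREATE TABLE. The same release added a legacy switch to restore the old treat-as-string behaviour; the property name below is an assumption to check against your version's documentation:

```sql
-- Spark 3.1+: opt back into the pre-3.1 behaviour where CHAR/VARCHAR
-- are treated as plain STRING (assumed property name; verify first)
SET spark.sql.legacy.charVarcharAsString=true;
```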
Now let us go and create the same table in Hive through beeline:
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
+-------------------------------+------------------------------------------------------------+------------+
| col_name                      | data_type                                                  | comment    |
+-------------------------------+------------------------------------------------------------+------------+
| # col_name                    | data_type                                                  | comment    |
| id                            | bigint                                                     |            |
| col1                          | varchar(30)                                                |            |
| col2                          | string                                                     |            |
|                               | NULL                                                       | NULL       |
| # Detailed Table Information  | NULL                                                       | NULL       |
| Database:                     | test                                                       | NULL       |
| OwnerType:                    | USER                                                       | NULL       |
| Owner:                        | hduser                                                     | NULL       |
| CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                               | NULL       |
| LastAccessTime:               | UNKNOWN                                                    | NULL       |
| Retention:                    | 0                                                          | NULL       |
| Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc         | NULL       |
| Table Type:                   | MANAGED_TABLE                                              | NULL       |
| Table Parameters:             | NULL                                                       | NULL       |
|                               | COLUMN_STATS_ACCURATE                                      | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
|                               | bucketing_version                                          | 2          |
|                               | numFiles                                                   | 0          |
|                               | numRows                                                    | 0          |
|                               | rawDataSize                                                | 0          |
|                               | totalSize                                                  | 0          |
|                               | transient_lastDdlTime                                      | 1647024660 |
|                               | NULL                                                       | NULL       |
| # Storage Information         | NULL                                                       | NULL       |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL       |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat                   | NULL       |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL       |
| Compressed:                   | No                                                         | NULL       |
| Num Buckets:                  | -1                                                         | NULL       |
| Bucket Columns:               | []                                                         | NULL       |
| Sort Columns:                 | []                                                         | NULL       |
| Storage Desc Params:          | NULL                                                       | NULL       |
|                               | serialization.format                                       | 1          |
+-------------------------------+------------------------------------------------------------+------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again:
scala> spark.sql("show create table test.etc").show(false)
+----------------------------------------------------------------+
|createtab_stmt                                                  |
+----------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647024660')
|
+----------------------------------------------------------------+
It shows OK. So, in summary, you get the column definitions in Spark exactly
as you defined them in Hive.
Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
2.1.1", refers to older versions of Spark and Hive, which may have had such
issues.
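Since the original goal is generating DDL for many tables programmatically, the per-table calls can be wrapped in a small loop. A minimal PySpark-style sketch, assuming an active SparkSession named `spark` and Spark 3.1+ so that the CHAR/VARCHAR lengths are preserved (database and table names are illustrative):

```python
def ddl_statements(spark, database: str):
    """Yield the SHOW CREATE TABLE text for every table in `database`.

    `spark` is assumed to be an active SparkSession; on Spark 3.1+ the
    output preserves CHAR/VARCHAR lengths as defined in Hive.
    """
    # SHOW TABLES returns rows with a `tableName` column.
    tables = spark.sql(f"SHOW TABLES IN `{database}`").collect()
    for row in tables:
        name = row["tableName"]
        ddl = spark.sql(f"SHOW CREATE TABLE `{database}`.`{name}`").first()
        yield ddl["createtab_stmt"]

# Example use (requires a real SparkSession):
#   for stmt in ddl_statements(spark, "test"):
#       print(stmt)
```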
HTH
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.