Posted to dev@spark.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/12/07 05:28:48 UTC

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

OK, I found another possible bug in Hive.

My current workaround is to use ALTER TABLE ... CHANGE to rename the columns.

The problem is that after renaming the columns, their values all became NULL.

Before renaming:
scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])

Execute renaming:
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
res13: org.apache.spark.sql.SchemaRDD =
SchemaRDD[972] at RDD at SchemaRDD.scala:108
== Query Plan ==
<Native command: executed by Hive>

After renaming:
scala> sql("select cre_ts from pmt limit 1").collect
res16: Array[org.apache.spark.sql.Row] = Array([null])

I created a JIRA for it:

  https://issues.apache.org/jira/browse/SPARK-4781
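
In the meantime I'll keep the original column names and alias them at query
time instead; an untested sketch of what I mean:

scala> sql("select `sorted::cre_ts` as cre_ts from pmt limit 1").collect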


Jianshi

On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <ji...@gmail.com>
wrote:

> Hmm... another issue I found with this approach is that ANALYZE TABLE ...
> COMPUTE STATISTICS will fail to attach the metadata to the table, and later
> broadcast joins and the like will fail...
>
> Any idea how to fix this issue?
>
> Jianshi
>
> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> Very interesting: the line doing the drop table throws an exception.
>> After removing it, everything works.
>>
>> Jianshi
>>
>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <ji...@gmail.com>
>> wrote:
>>
>>> Here's the solution I got after talking with Liancheng:
>>>
>>> 1) use backquotes `..` to wrap all illegal characters
>>>
>>>     val rdd = parquetFile(file)
>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>
>>>     val ddl_13 = s"""
>>>       |CREATE EXTERNAL TABLE $name (
>>>       |  $schema
>>>       |)
>>>       |STORED AS PARQUET
>>>       |LOCATION '$file'
>>>       """.stripMargin
>>>
>>>     sql(ddl_13)
>>>
>>> 2) create a new schema and use applySchema to generate a new SchemaRDD;
>>> I had to drop and re-register the table
>>>
>>>     val t = table(name)
>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>> s.name.replaceAll(".*?::", ""))))
>>>     sql(s"drop table $name")
>>>     applySchema(t, newSchema).registerTempTable(name)
>>>
>>> I'm testing it for now.
>>>
>>> Thanks for the help!
>>>
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>> for Spark to consume.
>>>>
>>>> However, due to a Pig limitation, the generated schema contains Pig's
>>>> identifiers
>>>>
>>>> e.g.
>>>> sorted::id, sorted::cre_ts, ...
>>>>
>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>
>>>>   create external table pmt (
>>>>     sorted::id bigint
>>>>   )
>>>>   stored as parquet
>>>>   location '...'
>>>>
>>>> Obviously it didn't work. I also tried removing the identifier
>>>> sorted::, but the resulting rows contained only nulls.
>>>>
>>>> Any idea how to create a table in HiveContext from these Parquet files?
>>>>
>>>> Thanks,
>>>> Jianshi
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

Posted by Jianshi Huang <ji...@gmail.com>.
FYI,

The latest Hive 0.14/Parquet will have column renaming support.
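
If I understand the change correctly, the rename itself should then stick
for Parquet-backed tables, possibly gated by a table property (the property
name below is my assumption, please verify against the Hive 0.14 docs):

scala> // 'parquet.column.index.access' is my assumption, not verified
scala> sql("alter table pmt set tblproperties ('parquet.column.index.access'='true')")
scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")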

Jianshi

On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> You might also try out the recently added support for views.
>
> On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> Ah... I see. Thanks for pointing it out.
>>
>> Then it means we cannot mount an external table using customized column
>> names. Hmm...
>>
>> Then the only option left is to use a subquery to add a bunch of column
>> aliases. I'll try it later.
>>
>> Thanks,
>> Jianshi
>>
>> On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <mi...@databricks.com>
>> wrote:
>>
>>> This is by Hive's design. From the Hive documentation:
>>>
>>>> The column change command will only modify Hive's metadata, and will not
>>>> modify data. Users should make sure the actual data layout of the
>>>> table/partition conforms with the metadata definition.
>>>
>>>
>>>
>>> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <ji...@gmail.com>
>>> wrote:
>>>
>>>> OK, I found another possible bug in Hive.
>>>>
>>>> My current workaround is to use ALTER TABLE ... CHANGE to rename the
>>>> columns.
>>>>
>>>> The problem is that after renaming the columns, their values all
>>>> became NULL.
>>>>
>>>> Before renaming:
>>>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>>>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>>>
>>>> Execute renaming:
>>>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>>>> res13: org.apache.spark.sql.SchemaRDD =
>>>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>>>> == Query Plan ==
>>>> <Native command: executed by Hive>
>>>>
>>>> After renaming:
>>>> scala> sql("select cre_ts from pmt limit 1").collect
>>>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>>>
>>>> I created a JIRA for it:
>>>>
>>>>   https://issues.apache.org/jira/browse/SPARK-4781
>>>>
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>>>> later broadcast joins and the like will fail...
>>>>>
>>>>> Any idea how to fix this issue?
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <jianshi.huang@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Very interesting: the line doing the drop table throws an exception.
>>>>>> After removing it, everything works.
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <
>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>
>>>>>>> Here's the solution I got after talking with Liancheng:
>>>>>>>
>>>>>>> 1) use backquotes `..` to wrap all illegal characters
>>>>>>>
>>>>>>>     val rdd = parquetFile(file)
>>>>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>>>>
>>>>>>>     val ddl_13 = s"""
>>>>>>>       |CREATE EXTERNAL TABLE $name (
>>>>>>>       |  $schema
>>>>>>>       |)
>>>>>>>       |STORED AS PARQUET
>>>>>>>       |LOCATION '$file'
>>>>>>>       """.stripMargin
>>>>>>>
>>>>>>>     sql(ddl_13)
>>>>>>>
>>>>>>> 2) create a new schema and use applySchema to generate a new
>>>>>>> SchemaRDD; I had to drop and re-register the table
>>>>>>>
>>>>>>>     val t = table(name)
>>>>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name
>>>>>>> = s.name.replaceAll(".*?::", ""))))
>>>>>>>     sql(s"drop table $name")
>>>>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>>>>
>>>>>>> I'm testing it for now.
>>>>>>>
>>>>>>> Thanks for the help!
>>>>>>>
>>>>>>>
>>>>>>> Jianshi
>>>>>>>
>>>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <
>>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I had to use Pig for some preprocessing and to generate Parquet
>>>>>>>> files for Spark to consume.
>>>>>>>>
>>>>>>>> However, due to a Pig limitation, the generated schema contains
>>>>>>>> Pig's identifiers
>>>>>>>>
>>>>>>>> e.g.
>>>>>>>> sorted::id, sorted::cre_ts, ...
>>>>>>>>
>>>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>>>
>>>>>>>>   create external table pmt (
>>>>>>>>     sorted::id bigint
>>>>>>>>   )
>>>>>>>>   stored as parquet
>>>>>>>>   location '...'
>>>>>>>>
>>>>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>>>>> sorted::, but the resulting rows contained only nulls.
>>>>>>>>
>>>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>>>> files?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jianshi
>>>>>>>> --
>>>>>>>> Jianshi Huang
>>>>>>>>
>>>>>>>> LinkedIn: jianshi
>>>>>>>> Twitter: @jshuang
>>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jianshi Huang
>>>>>>>
>>>>>>> LinkedIn: jianshi
>>>>>>> Twitter: @jshuang
>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

Posted by Michael Armbrust <mi...@databricks.com>.
You might also try out the recently added support for views.
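
Something along these lines should work (an untested sketch; pmt_view is
just an illustrative name):

sql("""
  CREATE VIEW pmt_view AS
  SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts FROM pmt
""")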

On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> Ah... I see. Thanks for pointing it out.
>
> Then it means we cannot mount an external table using customized column
> names. Hmm...
>
> Then the only option left is to use a subquery to add a bunch of column
> aliases. I'll try it later.
>
> Thanks,
> Jianshi
>
> On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> This is by Hive's design. From the Hive documentation:
>>
>>> The column change command will only modify Hive's metadata, and will not
>>> modify data. Users should make sure the actual data layout of the
>>> table/partition conforms with the metadata definition.
>>
>>
>>
>> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <ji...@gmail.com>
>> wrote:
>>
>>> OK, I found another possible bug in Hive.
>>>
>>> My current workaround is to use ALTER TABLE ... CHANGE to rename the
>>> columns.
>>>
>>> The problem is that after renaming the columns, their values all became
>>> NULL.
>>>
>>> Before renaming:
>>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>>
>>> Execute renaming:
>>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>>> res13: org.apache.spark.sql.SchemaRDD =
>>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>>> == Query Plan ==
>>> <Native command: executed by Hive>
>>>
>>> After renaming:
>>> scala> sql("select cre_ts from pmt limit 1").collect
>>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>>
>>> I created a JIRA for it:
>>>
>>>   https://issues.apache.org/jira/browse/SPARK-4781
>>>
>>>
>>> Jianshi
>>>
>>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>>> later broadcast joins and the like will fail...
>>>>
>>>> Any idea how to fix this issue?
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Very interesting: the line doing the drop table throws an exception.
>>>>> After removing it, everything works.
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <jianshi.huang@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Here's the solution I got after talking with Liancheng:
>>>>>>
>>>>>> 1) use backquotes `..` to wrap all illegal characters
>>>>>>
>>>>>>     val rdd = parquetFile(file)
>>>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>>>
>>>>>>     val ddl_13 = s"""
>>>>>>       |CREATE EXTERNAL TABLE $name (
>>>>>>       |  $schema
>>>>>>       |)
>>>>>>       |STORED AS PARQUET
>>>>>>       |LOCATION '$file'
>>>>>>       """.stripMargin
>>>>>>
>>>>>>     sql(ddl_13)
>>>>>>
>>>>>> 2) create a new schema and use applySchema to generate a new
>>>>>> SchemaRDD; I had to drop and re-register the table
>>>>>>
>>>>>>     val t = table(name)
>>>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>>>>> s.name.replaceAll(".*?::", ""))))
>>>>>>     sql(s"drop table $name")
>>>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>>>
>>>>>> I'm testing it for now.
>>>>>>
>>>>>> Thanks for the help!
>>>>>>
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <
>>>>>> jianshi.huang@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I had to use Pig for some preprocessing and to generate Parquet
>>>>>>> files for Spark to consume.
>>>>>>>
>>>>>>> However, due to a Pig limitation, the generated schema contains
>>>>>>> Pig's identifiers
>>>>>>>
>>>>>>> e.g.
>>>>>>> sorted::id, sorted::cre_ts, ...
>>>>>>>
>>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>>
>>>>>>>   create external table pmt (
>>>>>>>     sorted::id bigint
>>>>>>>   )
>>>>>>>   stored as parquet
>>>>>>>   location '...'
>>>>>>>
>>>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>>>> sorted::, but the resulting rows contained only nulls.
>>>>>>>
>>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>>> files?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jianshi
>>>>>>> --
>>>>>>> Jianshi Huang
>>>>>>>
>>>>>>> LinkedIn: jianshi
>>>>>>> Twitter: @jshuang
>>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

Posted by Jianshi Huang <ji...@gmail.com>.
Ah... I see. Thanks for pointing it out.

Then it means we cannot mount an external table using customized column names.
Hmm...

Then the only option left is to use a subquery to add a bunch of column
aliases. I'll try it later.
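
Probably something like this (untested; pmt_renamed is just a placeholder):

    // select with aliases, then register under a clean name
    val renamed = sql("""
      select `sorted::id` as id, `sorted::cre_ts` as cre_ts from pmt
    """)
    renamed.registerTempTable("pmt_renamed")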

Thanks,
Jianshi

On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust <mi...@databricks.com>
wrote:

> This is by Hive's design. From the Hive documentation:
>
>> The column change command will only modify Hive's metadata, and will not
>> modify data. Users should make sure the actual data layout of the
>> table/partition conforms with the metadata definition.
>
>
>
> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> OK, I found another possible bug in Hive.
>>
>> My current workaround is to use ALTER TABLE ... CHANGE to rename the
>> columns.
>>
>> The problem is that after renaming the columns, their values all became
>> NULL.
>>
>> Before renaming:
>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>
>> Execute renaming:
>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>> res13: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>> == Query Plan ==
>> <Native command: executed by Hive>
>>
>> After renaming:
>> scala> sql("select cre_ts from pmt limit 1").collect
>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>
>> I created a JIRA for it:
>>
>>   https://issues.apache.org/jira/browse/SPARK-4781
>>
>>
>> Jianshi
>>
>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <ji...@gmail.com>
>> wrote:
>>
>>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>> later broadcast joins and the like will fail...
>>>
>>> Any idea how to fix this issue?
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Very interesting: the line doing the drop table throws an exception.
>>>> After removing it, everything works.
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Here's the solution I got after talking with Liancheng:
>>>>>
>>>>> 1) use backquotes `..` to wrap all illegal characters
>>>>>
>>>>>     val rdd = parquetFile(file)
>>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>>
>>>>>     val ddl_13 = s"""
>>>>>       |CREATE EXTERNAL TABLE $name (
>>>>>       |  $schema
>>>>>       |)
>>>>>       |STORED AS PARQUET
>>>>>       |LOCATION '$file'
>>>>>       """.stripMargin
>>>>>
>>>>>     sql(ddl_13)
>>>>>
>>>>> 2) create a new schema and use applySchema to generate a new SchemaRDD;
>>>>> I had to drop and re-register the table
>>>>>
>>>>>     val t = table(name)
>>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>>>> s.name.replaceAll(".*?::", ""))))
>>>>>     sql(s"drop table $name")
>>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>>
>>>>> I'm testing it for now.
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>>
>>>>> Jianshi
>>>>>
>>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <jianshi.huang@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>>>> for Spark to consume.
>>>>>>
>>>>>> However, due to a Pig limitation, the generated schema contains Pig's
>>>>>> identifiers
>>>>>>
>>>>>> e.g.
>>>>>> sorted::id, sorted::cre_ts, ...
>>>>>>
>>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>>
>>>>>>   create external table pmt (
>>>>>>     sorted::id bigint
>>>>>>   )
>>>>>>   stored as parquet
>>>>>>   location '...'
>>>>>>
>>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>>> sorted::, but the resulting rows contained only nulls.
>>>>>>
>>>>>> Any idea how to create a table in HiveContext from these Parquet
>>>>>> files?
>>>>>>
>>>>>> Thanks,
>>>>>> Jianshi
>>>>>> --
>>>>>> Jianshi Huang
>>>>>>
>>>>>> LinkedIn: jianshi
>>>>>> Twitter: @jshuang
>>>>>> Github & Blog: http://huangjs.github.com/
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

Posted by Michael Armbrust <mi...@databricks.com>.
This is by Hive's design. From the Hive documentation:

> The column change command will only modify Hive's metadata, and will not
> modify data. Users should make sure the actual data layout of the
> table/partition conforms with the metadata definition.
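
My reading of why you see nulls: the Parquet reader resolves columns by
name, so after the rename the metastore asks for cre_ts while the file
footer still says sorted::cre_ts, and the missing column comes back as
null. You can confirm it against the file's own schema, e.g.:

    // path is illustrative; this prints sorted::id, sorted::cre_ts, ...
    // no matter what the metastore now calls the columns
    val pf = parquetFile("/path/to/pmt")
    pf.schema.fields.map(_.name).foreach(println)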



On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> OK, I found another possible bug in Hive.
>
> My current workaround is to use ALTER TABLE ... CHANGE to rename the
> columns.
>
> The problem is that after renaming the columns, their values all became
> NULL.
>
> Before renaming:
> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>
> Execute renaming:
> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
> res13: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[972] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> <Native command: executed by Hive>
>
> After renaming:
> scala> sql("select cre_ts from pmt limit 1").collect
> res16: Array[org.apache.spark.sql.Row] = Array([null])
>
> I created a JIRA for it:
>
>   https://issues.apache.org/jira/browse/SPARK-4781
>
>
> Jianshi
>
> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
>> Hmm... another issue I found with this approach is that ANALYZE TABLE
>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>> later broadcast joins and the like will fail...
>>
>> Any idea how to fix this issue?
>>
>> Jianshi
>>
>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang <ji...@gmail.com>
>> wrote:
>>
>>> Very interesting: the line doing the drop table throws an exception.
>>> After removing it, everything works.
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang <ji...@gmail.com>
>>> wrote:
>>>
>>>> Here's the solution I got after talking with Liancheng:
>>>>
>>>> 1) use backquotes `..` to wrap all illegal characters
>>>>
>>>>     val rdd = parquetFile(file)
>>>>     val schema = rdd.schema.fields.map(f => s"`${f.name}`
>>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>>
>>>>     val ddl_13 = s"""
>>>>       |CREATE EXTERNAL TABLE $name (
>>>>       |  $schema
>>>>       |)
>>>>       |STORED AS PARQUET
>>>>       |LOCATION '$file'
>>>>       """.stripMargin
>>>>
>>>>     sql(ddl_13)
>>>>
>>>> 2) create a new schema and use applySchema to generate a new SchemaRDD;
>>>> I had to drop and re-register the table
>>>>
>>>>     val t = table(name)
>>>>     val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
>>>> s.name.replaceAll(".*?::", ""))))
>>>>     sql(s"drop table $name")
>>>>     applySchema(t, newSchema).registerTempTable(name)
>>>>
>>>> I'm testing it for now.
>>>>
>>>> Thanks for the help!
>>>>
>>>>
>>>> Jianshi
>>>>
>>>> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang <ji...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I had to use Pig for some preprocessing and to generate Parquet files
>>>>> for Spark to consume.
>>>>>
>>>>> However, due to a Pig limitation, the generated schema contains Pig's
>>>>> identifiers
>>>>>
>>>>> e.g.
>>>>> sorted::id, sorted::cre_ts, ...
>>>>>
>>>>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>>>>
>>>>>   create external table pmt (
>>>>>     sorted::id bigint
>>>>>   )
>>>>>   stored as parquet
>>>>>   location '...'
>>>>>
>>>>> Obviously it didn't work. I also tried removing the identifier
>>>>> sorted::, but the resulting rows contained only nulls.
>>>>>
>>>>> Any idea how to create a table in HiveContext from these Parquet files?
>>>>>
>>>>> Thanks,
>>>>> Jianshi
>>>>> --
>>>>> Jianshi Huang
>>>>>
>>>>> LinkedIn: jianshi
>>>>> Twitter: @jshuang
>>>>> Github & Blog: http://huangjs.github.com/
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jianshi Huang
>>>>
>>>> LinkedIn: jianshi
>>>> Twitter: @jshuang
>>>> Github & Blog: http://huangjs.github.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
