
Posted to user@spark.apache.org by kpeng1 <kp...@gmail.com> on 2015/03/16 18:55:22 UTC

Creating a hive table on top of a parquet file written out by spark

Hi All,

I wrote out a complex Parquet file from Spark SQL and now I am trying to put
a Hive table on top.  I am running into issues with creating the Hive table
itself.  Here is the JSON that I wrote out to Parquet using Spark SQL:
{"user_id":"4513","providers":[{"id":"4220","name":"dbmvl","behaviors":{"b1":"gxybq","b2":"ntfmx"}},{"id":"4173","name":"dvjke","behaviors":{"b1":"sizow","b2":"knuuc"}}]}
{"user_id":"3960","providers":[{"id":"1859","name":"ponsv","behaviors":{"b1":"ahfgc","b2":"txpea"}},{"id":"103","name":"uhqqo","behaviors":{"b1":"lktyo","b2":"ituxy"}}]}
{"user_id":"567","providers":[{"id":"9622","name":"crjju","behaviors":{"b1":"rhaqc","b2":"npnot"}},{"id":"6965","name":"fnheh","behaviors":{"b1":"eipse","b2":"nvxqk"}}]}

I basically created a HiveContext, read in the JSON file using jsonFile,
and then wrote it back out using saveAsParquetFile.

Afterwards I tried to create a Hive table on top of the Parquet file.
Here is the HiveQL that I have:
create table test (mycol STRUCT<user_id:String,
providers:ARRAY<STRUCT<id:String, name:String,
behaviors:MAP<String, String>>>>) stored as parquet;
Alter table test set location 'hdfs:///tmp/test.parquet';

I get errors when I try to do a select * on the table:
Failed with exception java.io.IOException:java.lang.IllegalStateException:
Column mycol at index 0 does not exist in {providers=providers,
user_id=user_id}
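
A note on the error above: it lists the file's top-level fields as providers
and user_id, while the table declares a single column named mycol, so there
is nothing for Hive to match by that name. Below is a minimal sketch in plain
Python (no Spark or Hive needed) that derives a top-level column list from the
first sample record. The helper names hive_type and top_level_columns are made
up for illustration, and mapping JSON objects to STRUCT rather than MAP is an
assumption about how jsonFile infers nested objects; compare it against the
schema Spark actually reports before relying on it.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive type string.

    Only strings, objects and arrays are handled, which is all the
    sample data contains. Objects become STRUCT (an assumption, see above).
    """
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, dict):
        fields = ", ".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumes a non-empty array whose elements all look like the first one.
        return f"ARRAY<{hive_type(value[0])}>"
    raise TypeError(f"unhandled JSON value: {value!r}")


def top_level_columns(record_json):
    """Return (name, hive_type) pairs for the top-level fields of one record."""
    record = json.loads(record_json)
    return [(name, hive_type(value)) for name, value in record.items()]


# First record from the original post.
sample = (
    '{"user_id":"4513","providers":['
    '{"id":"4220","name":"dbmvl","behaviors":{"b1":"gxybq","b2":"ntfmx"}},'
    '{"id":"4173","name":"dvjke","behaviors":{"b1":"sizow","b2":"knuuc"}}]}'
)

columns = top_level_columns(sample)
column_list = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
ddl = f"create table test (\n  {column_list}\n) stored as parquet;"
print(ddl)
```

The printed statement declares user_id and providers as separate columns
instead of wrapping them in mycol; the "Alter table test set location" step
from the post would still be needed afterwards.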





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-hive-table-on-top-of-a-parquet-file-written-out-by-spark-tp22084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Creating a hive table on top of a parquet file written out by spark

Posted by Cheng Lian <li...@gmail.com>.

Which version of Spark SQL were you using? I suspect this is related to 
the nullability issue we fixed in 1.3.0.

SQLContext.jsonFile automatically infers the schema from a sample of the
provided data (the sampling ratio is 1.0 by default). Before 1.3.0, if the
sampled data doesn't contain null values, the inferred schema is
non-nullable. However, Hive tables are always nullable. For most file
formats, this is OK. But in Parquet, nullability is significant.
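
To make the mismatch concrete, here is a toy model in plain Python. It is not
Spark's inference code; infer_nullability and its assume_nullable flag are
hypothetical names, and the records are flattened to the user_id field only.
It just contrasts "nullable only if a null was seen in the sample" (the
pre-1.3.0 behaviour described above) with "always nullable" (Hive's view).

```python
def infer_nullability(records, assume_nullable=False):
    """Return {field: is_nullable} for a list of flat dict records.

    assume_nullable=False: a field is nullable only if some sampled record
    has it missing or set to None (the pre-1.3.0 behaviour described above).
    assume_nullable=True: every field is nullable (how Hive sees its tables).
    """
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)
    return {
        name: assume_nullable or any(r.get(name) is None for r in records)
        for name in fields
    }


# user_id values from the three sample records; none of them is null.
sampled = [{"user_id": "4513"}, {"user_id": "3960"}, {"user_id": "567"}]

spark_view = infer_nullability(sampled)
hive_view = infer_nullability(sampled, assume_nullable=True)

# Fields on which the two sides disagree about nullability.
mismatched = [name for name in spark_view if spark_view[name] != hive_view[name]]
print(spark_view, hive_view, mismatched)
```

With data that happens to contain no nulls, every field ends up in
mismatched, which is the situation the sample records in this thread are in.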

Cheng

