Posted to dev@parquet.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/06/11 05:44:34 UTC

Use Parquet as data format, need to have schema embedded (saved from Pig, loaded to Spark)

Hey guys,

I'm having trouble loading Parquet files in Spark that I saved from Pig. I
need to do it this way because the schema is only defined in Pig (in a
customized Loader) and it contains more than 200 columns...

This is what I'm doing in Pig:

SET parquet.compression snappy;
store xxx into 'data/parquet/xxx/snapshot=2014-06-08' using
parquet.pig.ParquetStorer();

I couldn't read it back in Pig using ParquetLoader, nor could I load it
into Spark.

In Spark, I'm also confused about what the key and value classes should be
(the value is definitely not String, right?). This is what I tried:

val xxx = sc.newAPIHadoopFile(
  "data/parquet/xxx/snapshot=2014-06-08",
  classOf[ParquetInputFormat[String]],
  classOf[Void],
  classOf[String])
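
For reference, here is roughly what I imagine the read is supposed to look
like, based on the Parquet examples I've seen. This is an untested sketch: I'm
assuming the example GroupReadSupport is an acceptable way to materialize
generic records, and that the read support class has to be set on the job
configuration before calling newAPIHadoopFile:

import org.apache.hadoop.mapreduce.Job
import parquet.hadoop.ParquetInputFormat
import parquet.hadoop.example.GroupReadSupport
import parquet.example.data.Group

// Untested sketch -- the key class for ParquetInputFormat is always Void,
// and the value class comes from the configured ReadSupport (here Group).
val job = Job.getInstance()   // new Job() on Hadoop 1.x
ParquetInputFormat.setReadSupportClass(job, classOf[GroupReadSupport])
val xxx = sc.newAPIHadoopFile(
  "data/parquet/xxx/snapshot=2014-06-08",
  classOf[ParquetInputFormat[Group]],
  classOf[Void],
  classOf[Group],
  job.getConfiguration)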


So my question is: am I doing anything wrong? And is there a better way to
do this?
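
One alternative I've been wondering about (but haven't verified) is whether
Spark SQL can recover the schema from the Parquet file footers by itself, e.g.:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// parquetFile should infer the schema from the Parquet metadata,
// so the 200+ columns wouldn't need to be declared by hand.
val xxx = sqlContext.parquetFile("data/parquet/xxx/snapshot=2014-06-08")
xxx.registerAsTable("xxx")

Does that preserve the schema written by ParquetStorer?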

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/