Posted to user@spark.apache.org by Ashok Kumar <as...@yahoo.com.INVALID> on 2016/07/04 06:53:07 UTC

ORC or parquet with Spark

With Spark, which file format is better to use for caching: Parquet or ORC?
Obviously ORC can be used with Hive.
My question is whether Spark can use the file-level, stripe-level, and row-group statistics stored in an ORC file?
Otherwise, to me both Parquet and ORC are simply files kept on HDFS; they do not offer any caching of their own to make reads faster.
So if Spark ignores the underlying statistics in ORC files, does it matter which file format to use with Spark?
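For context, here is a minimal sketch of the setting I mean (assuming the Spark 2.x `SparkSession` API; the HDFS paths and column name are hypothetical). Spark exposes a SQL option, `spark.sql.orc.filterPushdown`, that controls whether filter predicates are pushed down into the ORC reader so that stripes and row groups whose min/max statistics exclude the predicate can be skipped; the Parquet equivalent, `spark.sql.parquet.filterPushdown`, is on by default.

```scala
import org.apache.spark.sql.SparkSession

object OrcVsParquetPushdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pushdown-demo")
      .getOrCreate()

    // ORC predicate pushdown is disabled by default in this era of Spark;
    // enable it so ORC file/stripe/row-group statistics can be used.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    // Parquet pushdown (spark.sql.parquet.filterPushdown) defaults to true.

    // Hypothetical datasets kept on HDFS.
    val orcDf  = spark.read.orc("hdfs:///data/events_orc")
    val parqDf = spark.read.parquet("hdfs:///data/events_parq")

    // With pushdown enabled, this filter can skip ORC stripes whose
    // min/max statistics rule out the predicate, instead of scanning
    // the whole file.
    orcDf.filter("event_date = '2016-07-04'").count()
    parqDf.filter("event_date = '2016-07-04'").count()

    spark.stop()
  }
}
```

The caching question is separate: neither format caches anything itself, but `df.cache()` on either one stores the decoded columnar data in Spark's memory, at which point the on-disk format matters mostly for the initial scan.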
Thanks