Posted to user@spark.apache.org by Patrick McCarthy <pm...@dstillery.com.INVALID> on 2019/11/01 15:33:28 UTC

Best practices for data lake file storage

Hi List,

I'm looking for resources to learn how best to store data on disk for
later access.

For a while my team has been using Spark on top of our existing HDFS/Hive
cluster without much say in what format is used to store the data. I'd
like to learn more about how to re-stage my data to speed up my own
analyses, and to start building the expertise to define new data stores.

One example of a problem I'm facing is data which is written to Hive
using a custom protobuf SerDe. The data contains many deeply nested
types (arrays of structs of arrays of ...) and I often need only a few
elements of any particular record, yet the format forces Spark to
deserialize the entire object.
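
To make that concrete, here's roughly the workaround I've been imagining:
re-stage the table as Parquet and then select only the nested fields a
given analysis needs. This is just a sketch in PySpark; the table, path,
and column names below are made up.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("restage-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

# Reading the protobuf-backed Hive table deserializes every record in full.
events = spark.table("raw.events_protobuf")  # hypothetical table name

# Re-stage as Parquet so later reads can prune to just the needed columns.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")               # hypothetical partition column
    .parquet("/warehouse/restaged/events"))  # hypothetical path

# Later analyses select only the nested fields they need; Parquet's
# columnar layout lets Spark skip everything else.
slim = (spark.read.parquet("/warehouse/restaged/events")
        .select("event_date", "payload.device.os"))  # hypothetical nested path
slim.show(5)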

The sorts of information I'm looking for:

   - Do's and Don'ts of laying out a Parquet schema
   - Measuring / debugging read speed
   - How to bucket, index, etc. (see the rough sketch after this list for
     the kind of thing I mean)
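
For that last point, this is the kind of layout I've been wondering
about: bucketing a re-staged copy of the table on a join key so that
later joins can avoid a shuffle. Again, just a sketch; the bucket count
and the table/column names are guesses on my part.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketing-example")
         .enableHiveSupport()
         .getOrCreate())

events = spark.read.parquet("/warehouse/restaged/events")  # hypothetical path

# Bucket and sort by the key used in joins/filters; bucketBy only works
# with saveAsTable, so this lands as a managed table.
(events.write
    .mode("overwrite")
    .bucketBy(64, "user_id")    # hypothetical key column and bucket count
    .sortBy("user_id")
    .saveAsTable("analytics.events_bucketed"))  # hypothetical table name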

Thanks!