Posted to user@spark.apache.org by Vitaliy Pisarev <vi...@biocatch.com> on 2018/03/29 12:54:01 UTC

Best practices for optimizing the structure of a Parquet schema

There is a lot of talk about how, in order to really benefit from fast
queries over Parquet and HDFS, the data needs to be stored in a manner
that is friendly to compression.
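
For concreteness, here is a minimal PySpark sketch of choosing the
compression codec explicitly at write time (the data and path are
placeholders; "snappy" is Spark's default codec for Parquet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # placeholder data

# Pick the Parquet compression codec at write time; "gzip" trades
# more CPU for smaller files than the default "snappy".
df.write.option("compression", "snappy").parquet("/tmp/example")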

Unfortunately, I did not find any specific guidelines or tips online that
describe the dos and don'ts of designing a Parquet schema.

I am wondering whether someone here can share such material, or his or
her own experience with this.

For example:

I have the following logical structure that I want to store:

{
     root: [
         [int, int, float, float],
         [int, int, float, float],
         [int, int, float, float],
         ....,
         .....
     ]
}

This is, of course, a list of lists. All the sublists are actually vectors
of the same length, where the coordinates match in meaning and type.
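
To make the discussion concrete, here is how I picture the nested form
expressed as a Spark schema (a sketch; the field names c0..c3 are
placeholders):

from pyspark.sql.types import (StructType, StructField, ArrayType,
                               IntegerType, FloatType)

# The nested form: a single "root" column holding an array of
# 4-field structs, one struct per vector.
vector = StructType([
    StructField("c0", IntegerType()),
    StructField("c1", IntegerType()),
    StructField("c2", FloatType()),
    StructField("c3", FloatType()),
])
schema = StructType([StructField("root", ArrayType(vector))])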

If I understand correctly, the best way to *store* this structure is to
go with the columnar paradigm, where I will have 4 very long vectors, one
for each coordinate, rather than many short vectors.
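
That is, something like the following flattened layout, where each
sublist becomes a row and each coordinate becomes a top-level column
(a sketch; the column names, sample data, and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Flattened layout: one row per vector, one top-level column per
# coordinate (types are inferred from the sample data here).
rows = [(1, 2, 0.5, 1.5), (3, 4, 2.5, 3.5)]
df = spark.createDataFrame(rows, ["c0", "c1", "c2", "c3"])
df.write.parquet("/tmp/vectors_flat")

Each of c0..c3 then lands in Parquet as one long column chunk that is
encoded and compressed independently of the others.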

What other considerations can I apply?