Posted to user@spark.apache.org by Vitaliy Pisarev <vi...@biocatch.com> on 2018/03/29 12:54:01 UTC
Best practices for optimizing the structure of parquet schema
There is a lot of talk about how, in order to really benefit from fast queries
over Parquet on HDFS, the data must be stored in a manner that is
friendly to compression.
Unfortunately, I did not find any specific guidelines or tips online
describing the dos and don'ts
of designing a Parquet schema.
I am wondering whether someone here can share such material, or
their own experience on the subject.
For example:
I have the following logical structure that I want to store:
{
  root: [
    [int, int, float, float],
    [int, int, float, float],
    [int, int, float, float],
    ....
  ]
}
This is of course a list of lists. All the sublists are vectors of
the same length, where the coordinates match in meaning and type.
If I understand correctly, the best way to *store* this structure is to
follow the columnar paradigm: four very long vectors, one per
coordinate, rather than many short vectors.
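To make the idea concrete, here is a minimal stdlib-only Python sketch
(the sample values are made up for illustration) that transposes the
row-oriented list of [int, int, float, float] vectors into four
per-coordinate columns. This is essentially the layout a columnar format
like Parquet stores: each column then holds values of a single type,
which is what makes encodings such as run-length and dictionary
compression effective.

```python
# Row-oriented input: each sublist is one [int, int, float, float] vector.
# Sample values are hypothetical, just to show the transposition.
rows = [
    [1, 2, 0.5, 0.7],
    [3, 4, 1.5, 2.7],
    [5, 6, 2.5, 4.7],
]

# Columnar layout: one long homogeneous list per coordinate.
columns = [list(col) for col in zip(*rows)]

print(columns)
# [[1, 3, 5], [2, 4, 6], [0.5, 1.5, 2.5], [0.7, 2.7, 4.7]]
```

Written this way as four top-level columns (rather than one deeply
nested list-of-lists column), each column can be compressed and scanned
independently, and queries that touch only some coordinates read only
those columns.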
What other consideration can I apply?