Posted to dev@parquet.apache.org by Asifali Dauva <as...@nielsen.com> on 2017/06/19 23:41:19 UTC

Storing Map in Parquet

Hello,

We have a use case where we wish to store a JavaRDD<Map<String,String>>
into Parquet. This JavaRDD<Map<String,String>> is produced by a map-reduce
job. The problem is that the keys in Map<String,String> are not known
beforehand, so it is not possible to define the schema for the data
upfront. I know that we can define the schema in the program, but that is
expensive, since the following steps need to be taken.

a) JavaRDD<Map<String,String>> must be mapped to
JavaPairRDD<StructType,Map<String,String>>
and persisted.

b) For each unique StructType generated in step a), the corresponding
Map<String,String> records must be read (from persistent storage) and
written out in Parquet format.
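If a single merged schema is acceptable, the two steps above could be collapsed into one extra pass over the data: union the keys of every record, then emit one schema covering all of them. A minimal pure-Java sketch of that idea (the class and method names are hypothetical, and it assumes all values stay strings and every key is a legal Parquet column name):

```java
import java.util.*;

public class MergedSchema {
    // First pass: union the keys seen across all records into one key set.
    static SortedSet<String> mergeKeys(List<Map<String, String>> records) {
        SortedSet<String> keys = new TreeSet<>();
        for (Map<String, String> record : records) {
            keys.addAll(record.keySet());
        }
        return keys;
    }

    // Build a Parquet message type with one optional UTF8 column per key.
    // Records missing a key would simply leave that column null.
    static String schemaFor(Collection<String> keys) {
        StringBuilder sb = new StringBuilder("message record {\n");
        for (String key : keys) {
            sb.append("  optional binary ").append(key).append(" (UTF8);\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = Arrays.asList(
                Map.of("user", "alice", "country", "US"),
                Map.of("user", "bob", "age", "34"));
        System.out.println(schemaFor(mergeKeys(records)));
    }
}
```

On an RDD, the same union could be computed with a map over key sets followed by a reduce, at the cost of one extra pass before writing.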

Is it possible for Parquet to keep a mutable schema and update it based on
each data record? Then, when it is time to write the metadata to storage,
it would write the final updated schema. Basically, I want Parquet to
infer the schema from each data record rather than being given one
upfront.
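An alternative that sidesteps schema inference entirely is to store each record as a single Parquet MAP column, so the keys live in the data rather than in the schema. The standard MAP logical-type layout looks roughly like this (the column name "attrs" is just an example):

```java
public class MapColumnSchema {
    // A fixed Parquet schema with one MAP<string,string> column: map keys
    // are stored as data, so they never need to be known when the schema
    // is defined.
    static final String MAP_SCHEMA = String.join("\n",
            "message record {",
            "  optional group attrs (MAP) {",
            "    repeated group key_value {",
            "      required binary key (UTF8);",
            "      optional binary value (UTF8);",
            "    }",
            "  }",
            "}");

    public static void main(String[] args) {
        System.out.println(MAP_SCHEMA);
    }
}
```

In Spark this corresponds to a schema with a single MapType(StringType, StringType) field, so a JavaRDD<Map<String,String>> could be written without ever inspecting the keys, at the cost of losing per-key columns (and their predicate pushdown) in the file.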

Thank you.