Posted to dev@parquet.apache.org by Yan Qi <ya...@gmail.com> on 2015/07/20 19:40:54 UTC

How to ease the pain of loading the large Summary file?

Right now we are using a MapReduce job to convert some data and store the
result in the Parquet format. The size can be tens of terabytes, leading to
a pretty large summary file (i.e., _metadata).

When we try to use another MapReduce job to read the result, it takes
forever to load the metadata.

We are wondering if it is possible to reduce (ideally eliminate) the cost
of loading the summary file when starting an MR job.


Thanks,

Yan

Re: How to ease the pain of loading the large Summary file?

Posted by Ryan Blue <bl...@cloudera.com>.
On 07/20/2015 10:54 AM, Alex Levenson wrote:
> You can push the reading of the summary file to the mappers instead of
> reading it on the submitter node:
>
> ParquetInputFormat.setTaskSideMetaData(conf, true);

This is the default from 1.6.0 forward.

> or set "parquet.task.side.metadata" to true in your configuration. We
> had a similar issue: by default the client reads the summary file on the
> submitter node, which takes a lot of time and memory. This flag fixes the
> issue for us by instead reading each individual file's metadata from the
> file footer in the mappers (each mapper reads only the metadata it needs).
>
> Another option, which is something we've been talking about in the past, is
> to disable creating this metadata file at all; we've seen that creating it
> can be expensive too, and if you use the task-side metadata approach it's
> never used.

There's an option to suppress the files, which I recommend. Now that 
file metadata is handled on the tasks, there's not much need for the 
summary files.
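
For reference, here is a minimal write-side sketch of what suppressing the
summary files looks like in the job driver. The "parquet.enable.summary-metadata"
key below is my recollection of the switch the output committer checks; please
verify it against the parquet-mr version you're running.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ParquetWriteJobSetup {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Skip writing the _metadata summary file on job commit.
        // ("parquet.enable.summary-metadata" is believed to be the key the
        // output committer checks in this era of parquet-mr; verify it
        // against your version.)
        conf.setBoolean("parquet.enable.summary-metadata", false);
        Job job = Job.getInstance(conf, "convert-to-parquet");
        // ... set the mapper, ParquetOutputFormat, write support class and
        // output path as usual ...
        return job;
      }
    }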

rb

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: How to ease the pain of loading the large Summary file?

Posted by Alex Levenson <al...@twitter.com.INVALID>.
You can push the reading of the summary file to the mappers instead of
reading it on the submitter node:

ParquetInputFormat.setTaskSideMetaData(conf, true);

or set "parquet.task.side.metadata" to true in your configuration. We
had a similar issue: by default the client reads the summary file on the
submitter node, which takes a lot of time and memory. This flag fixes the
issue for us by instead reading each individual file's metadata from the
file footer in the mappers (each mapper reads only the metadata it needs).
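
Here's a read-side driver sketch with that setting applied, assuming the
Hadoop 2 mapreduce API (depending on your parquet-mr version,
ParquetInputFormat lives under either the parquet.hadoop or
org.apache.parquet.hadoop package):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ParquetReadJobSetup {
      public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Read each file's metadata from its footer in the mappers instead
        // of loading the _metadata summary file on the submitter node.
        // This is the property that ParquetInputFormat.setTaskSideMetaData
        // sets under the hood.
        conf.setBoolean("parquet.task.side.metadata", true);
        Job job = Job.getInstance(conf, "read-parquet");
        // ... set ParquetInputFormat as the input format, the read support
        // class, mapper and input paths as usual ...
        return job;
      }
    }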

Another option, which is something we've been talking about in the past, is
to disable creating this metadata file at all; we've seen that creating it
can be expensive too, and if you use the task-side metadata approach it's
never used.

On Mon, Jul 20, 2015 at 10:40 AM, Yan Qi <ya...@gmail.com> wrote:

> Right now we are using a MapReduce job to convert some data and store the
> result in the Parquet format. The size can be tens of terabytes, leading to
> a pretty large summary file (i.e., _metadata).
>
> When we try to use another MapReduce job to read the result, it takes
> forever to load the metadata.
>
> We are wondering if it is possible to reduce (ideally eliminate) the cost
> of loading the summary file when starting an MR job.
>
>
> Thanks,
>
> Yan
>



-- 
Alex Levenson
@THISWILLWORK