Posted to dev@drill.apache.org by François Méthot <fm...@gmail.com> on 2016/11/02 00:24:14 UTC

Re: Limit the number of output parquet files in CTAS

Thanks Andries,

I experimented with the order by and it works as you mentioned.
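
For reference, this is roughly the shape of what I ran; the table and
column names below are placeholders for our actual ones:

  CREATE TABLE dfs.tmp.`report_merged` AS
  SELECT col_a, col_b, col_c
  FROM dfs.data.`source_table`
  ORDER BY col_a;

The ORDER BY forces the rows through a single merge ahead of the writer,
so the result comes out as a single stream and far fewer files than one
per scanner fragment.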

I will do some reading and experimentation with store.partition.hash_distribute.
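
If it helps anyone else, this is roughly what I plan to try; the names
are placeholders and the option is still flagged as alpha:

  ALTER SESSION SET `store.partition.hash_distribute` = true;

  CREATE TABLE dfs.tmp.`report_partitioned`
  PARTITION BY (event_date) AS
  SELECT event_date, col_a, col_b
  FROM dfs.data.`source_table`;

With hash distribution on, rows are hashed on the partition key so each
partition value is written by a single fragment, which should cut down
the number of small files per partition.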

Francois




On Mon, Oct 31, 2016 at 4:24 PM, Andries Engelbrecht <aengelbrecht@maprtech.com> wrote:

> You can try and set store.partition.hash_distribute to true, but it is
> still listed as an alpha feature.
>
> You can also add a sort operation (order by) to the CTAS statement to
> force a single data stream at output. I believe this was discussed a while
> back on the user list.
>
> Ideally you want to look at the data set size and how much parallelism
> would work best in your environment for reading the data later.
>
> --Andries
>
>
> > On Oct 31, 2016, at 12:57 PM, François Méthot <fm...@gmail.com> wrote:
> >
> > Hi,
> >
> > Is there a way to limit the number of files produced by a CTAS query?
> > I would like the speed benefits of having hundreds of scanner fragments
> > but don't want to deal with hundreds of output files.
> >
> > Our use case right now uses 880 threads to scan and produces a report
> > output spread over... 880 parquet files.
> > Each resulting file is ~7 MB.
> >
> > The only way I found to reduce those files to a smaller set is to perform
> > a second CTAS query on the aggregated files with
> > planner.width.max_per_query set to a smaller number (a rough sketch of
> > this appears below).
> >
> > Any possible way to do this in one query?
> >
> > Thanks
> > Francois
>
>
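
A rough sketch of the two-step workaround described in the quoted message;
the table names and the width value are placeholders:

  -- Step 1: wide CTAS for scan speed, produces many small files
  CREATE TABLE dfs.tmp.`report_raw` AS
  SELECT col_a, col_b, col_c
  FROM dfs.data.`source_table`;

  -- Step 2: rewrite the result with limited parallelism to compact the files
  ALTER SESSION SET `planner.width.max_per_query` = 10;

  CREATE TABLE dfs.tmp.`report_compacted` AS
  SELECT * FROM dfs.tmp.`report_raw`;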