
Posted to user@drill.apache.org by Reed Villanueva <rv...@ucera.org> on 2018/09/25 22:02:36 UTC

Possible to limit the amount of files produced by Drill when running CREATE TABLE statements?

Is it possible to limit the number of files used to create / represent a
table when using Apache Drill's CREATE TABLE statement?

I currently have sets of Parquet files stored in HDFS and am converting them
to TSVs via Drill CREATE TABLE, e.g.

    alter session set `store.format`='tsv';
    create table dfs.ucera_internal.`/my/workspace/path/tablename/tsv` as
    select col1, col2, from_unixtime(extract_date/1000) as etl_date
    from dfs.ucera_internal.`/my/workspace/path/tablename/parquet`;

The problem is that this process can turn ~12 Parquet files into ~30
TSV files, which causes problems for downstream operations. Is
there a way to limit how many files are written when creating this
TSV version of the table?

Could not find any such info in the docs (here
https://drill.apache.org/docs/create-table-as-ctas/ or here
https://drill.apache.org/docs/configuration-options-introduction/), though
the PARTITION BY clause appears to come close (
https://drill.apache.org/docs/partition-by-clause/#creating-a-partitioned-table-of-ngram-data)
(but not all the tables have nice partitionable fields).

-- 
This electronic message is intended only for the named recipient, and may
contain information that is confidential or privileged. If you are not the
intended recipient, you are hereby notified that any disclosure, copying,
distribution or use of the contents of this message is strictly prohibited.
If you have received this message in error or are not the named recipient,
please notify us immediately by contacting the sender at the electronic mail
address noted above, and delete and destroy all copies of this message.
Thank you.

Re: Possible to limit the amount of files produced by Drill when running CREATE TABLE statements?

Posted by Arina Yelchiyeva <ar...@gmail.com>.

Consider adjusting the following config options [1]:

planner.slice_target
planner.width.max_per_node
planner.width.max_per_query

[1] https://drill.apache.org/docs/configuration-options-introduction/
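
As a rough sketch of how these options might be applied (the values below
are assumptions for illustration, not tested recommendations): the number of
output files a CTAS writes generally follows the number of parallel writer
fragments, so lowering the width options for the session before running the
statement should reduce the file count, at the cost of a slower query.

    -- Assumed values; tune for your cluster and data size.
    -- Cap the number of parallel fragments per node and per query.
    alter session set `planner.width.max_per_node` = 1;
    alter session set `planner.width.max_per_query` = 1;

    -- Same CTAS as in the original question.
    alter session set `store.format`='tsv';
    create table dfs.ucera_internal.`/my/workspace/path/tablename/tsv` as
    select col1, col2, from_unixtime(extract_date/1000) as etl_date
    from dfs.ucera_internal.`/my/workspace/path/tablename/parquet`;

Because these are session-level settings, they should not affect other
users' queries. Raising planner.slice_target is another lever that tends to
reduce parallelism, but its effect depends on the row counts the planner
estimates.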

On Wed, Sep 26, 2018 at 4:58 AM Divya Gehlot <di...@gmail.com>
wrote:

> I looked for this too and couldn't find it.
> The only workaround for now, which I also implemented, is to merge the
> files using hadoop commands (note that -getmerge writes the merged file
> to the local filesystem):
> hadoop fs -getmerge /path/to/files  /path/to/mergedfile
>
> If your data contains headers, then you might have to look at this:
> https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv/41785085#41785085
>
> Hope this helps !
>
> Thanks,
> Divya

Re: Possible to limit the amount of files produced by Drill when running CREATE TABLE statements?

Posted by Divya Gehlot <di...@gmail.com>.

I looked for this too and couldn't find it.
The only workaround for now, which I also implemented, is to merge the
files using hadoop commands (note that -getmerge writes the merged file
to the local filesystem):
hadoop fs -getmerge /path/to/files  /path/to/mergedfile

If your data contains headers, then you might have to look at this:
https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv/41785085#41785085

Hope this helps !

Thanks,
Divya
