You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by be...@laposte.net.INVALID on 2018/12/05 15:39:23 UTC

drill parquet - create table as ... partition by ... non present column

In would like to create a parquet with a partition on computed data (without to have to put the result of the computation in the parquet) :
The goal is to optimize the parquet for typical expecting queries.

Imaginary example :
CREATE TABLE `mytable` 
PARTITION BY (substr(name,1,1)) AS  
SELECT name, birthdate, birthcity
ORDER BY bithdate;

So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not in the SELECT list of CTAS

And the comment of the code of the function 
"public static RelNode qualifyPartitionCol(RelNode input, List<String> partitionColumns)" 
confirms that it's not possible actually :
" A partition column is resolved, either (1) the same column appear in the select list of CTAS or (2) CTAS has a * in select list"

But what is the reason of this limitation ?
Is there exists any tricks to do it right now, or can we expect an evolution to allow this possibilities.

I just imagine to do (with the data of the example)
CREATE TABLE `mytable` 
PARTITION BY (sname) AS  
SELECT substr(name,1,1) sname, name, birthdate, birthcity
ORDER BY bithdate;
Then, next, request each partition file to remove the useless data 
, like
CREATE TABLE `mytable_2/partition_x` 
SELECT name, birthdate, birthcity
ORDER BY bithdate;
but it's not really satisfying...

I would appreciate yours comments,
Regards,

benj

Re: drill parquet - create table as ... partition by ... non present column

Posted by Anton Gozhiy <an...@gmail.com>.
Benj,

I meant that without the metadata Drill won't recognize files as
partitions. Although that should not be a problem for the optimization
mechanisms.

As for your original question, I think that it's rather not implemented
than an intended limitation. Feel free to submit a feature request to
Apache JIRA.


On Fri, Dec 7, 2018 at 9:20 PM benj.dev <be...@laposte.net> wrote:

> Hi,
>
> Thanks for details.
> It's the point, I don't want to write additional metadata,
> but just organize the parquet file to have more useful stats.
>
> In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
> (Example SELECT a, b FROM ... GROUP BY a, b, c;)
> In the same way, I think it will be useful to have the possibilities
> to PARTITION BY and say (after the PARTITION BY) which columns put in
> the parquet file.
>
> I can imagine, that it will be difficult to add a "SELECT" close in
> PARTITION BY clauses like :
> [... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]
>
> Intermediately, allow the possibilities to "PARTITION BY" a computation
> on an existing column seem more accessible and should be useful.
> example :
> [... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]
>
> Because it's useless to store the "a < 10" but it may be useful to filer
> half of the parquet when request with condition on column a.
>
> But maybe there is a fundamental reason to not allow these facilities.
>
> Regards,
> Benj
>
> Le 06/12/2018 à 16:42, Anton Gozhiy a écrit :
> > Hi Benj,
> >
> > Creating partitions as in your first example won't work.
> >>From the docs: "During partitioning, Drill creates separate files, but
> not
> > separate directories, for different partitions." (
> > https://drill.apache.org/docs/how-to-partition-data/).
> > Also, Drill doesn't write additional metadata regarding partitioning,
> when
> > it reads parquet files it determines partitions using min/max values.
> > That means that if you want for example to partition using the first
> > letter, you'll need to create a corresponding column. Or you can create
> > partitions manually as directories.
> >
> > On Wed, Dec 5, 2018 at 10:07 PM <be...@laposte.net.invalid> wrote:
> >
> >> In would like to create a parquet with a partition on computed data
> >> (without to have to put the result of the computation in the parquet) :
> >> The goal is to optimize the parquet for typical expecting queries.
> >>
> >> Imaginary example :
> >> CREATE TABLE `mytable`
> >> PARTITION BY (substr(name,1,1)) AS
> >> SELECT name, birthdate, birthcity
> >> ORDER BY bithdate;
> >>
> >> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is
> not
> >> in the SELECT list of CTAS
> >>
> >> And the comment of the code of the function
> >> "public static RelNode qualifyPartitionCol(RelNode input, List<String>
> >> partitionColumns)"
> >> confirms that it's not possible actually :
> >> " A partition column is resolved, either (1) the same column appear in
> the
> >> select list of CTAS or (2) CTAS has a * in select list"
> >>
> >> But what is the reason of this limitation ?
> >> Is there exists any tricks to do it right now, or can we expect an
> >> evolution to allow this possibilities.
> >>
> >> I just imagine to do (with the data of the example)
> >> CREATE TABLE `mytable`
> >> PARTITION BY (sname) AS
> >> SELECT substr(name,1,1) sname, name, birthdate, birthcity
> >> ORDER BY bithdate;
> >> Then, next, request each partition file to remove the useless data
> >> , like
> >> CREATE TABLE `mytable_2/partition_x`
> >> SELECT name, birthdate, birthcity
> >> ORDER BY bithdate;
> >> but it's not really satisfying...
> >>
> >> I would appreciate yours comments,
> >> Regards,
> >>
> >> benj
> >>
> >
> >
>


-- 
Sincerely, Anton Gozhiy
anton5813@gmail.com

Re: drill parquet - create table as ... partition by ... non present column

Posted by "benj.dev" <be...@laposte.net.INVALID>.
Hi,

Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.

In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.

I can imagine, that it will be difficult to add a "SELECT" close in
PARTITION BY clauses like :
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]

Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
example :
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]

Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.

But maybe there is a fundamental reason to not allow these facilities.

Regards,
Benj

Le 06/12/2018 à 16:42, Anton Gozhiy a écrit :
> Hi Benj,
> 
> Creating partitions as in your first example won't work.
>>From the docs: "During partitioning, Drill creates separate files, but not
> separate directories, for different partitions." (
> https://drill.apache.org/docs/how-to-partition-data/).
> Also, Drill doesn't write additional metadata regarding partitioning, when
> it reads parquet files it determines partitions using min/max values.
> That means that if you want for example to partition using the first
> letter, you'll need to create a corresponding column. Or you can create
> partitions manually as directories.
> 
> On Wed, Dec 5, 2018 at 10:07 PM <be...@laposte.net.invalid> wrote:
> 
>> In would like to create a parquet with a partition on computed data
>> (without to have to put the result of the computation in the parquet) :
>> The goal is to optimize the parquet for typical expecting queries.
>>
>> Imaginary example :
>> CREATE TABLE `mytable`
>> PARTITION BY (substr(name,1,1)) AS
>> SELECT name, birthdate, birthcity
>> ORDER BY bithdate;
>>
>> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
>> in the SELECT list of CTAS
>>
>> And the comment of the code of the function
>> "public static RelNode qualifyPartitionCol(RelNode input, List<String>
>> partitionColumns)"
>> confirms that it's not possible actually :
>> " A partition column is resolved, either (1) the same column appear in the
>> select list of CTAS or (2) CTAS has a * in select list"
>>
>> But what is the reason of this limitation ?
>> Is there exists any tricks to do it right now, or can we expect an
>> evolution to allow this possibilities.
>>
>> I just imagine to do (with the data of the example)
>> CREATE TABLE `mytable`
>> PARTITION BY (sname) AS
>> SELECT substr(name,1,1) sname, name, birthdate, birthcity
>> ORDER BY bithdate;
>> Then, next, request each partition file to remove the useless data
>> , like
>> CREATE TABLE `mytable_2/partition_x`
>> SELECT name, birthdate, birthcity
>> ORDER BY bithdate;
>> but it's not really satisfying...
>>
>> I would appreciate yours comments,
>> Regards,
>>
>> benj
>>
> 
> 

Re: drill parquet - create table as ... partition by ... non present column

Posted by Anton Gozhiy <an...@gmail.com>.
Hi Benj,

Creating partitions as in your first example won't work.
From the docs: "During partitioning, Drill creates separate files, but not
separate directories, for different partitions." (
https://drill.apache.org/docs/how-to-partition-data/).
Also, Drill doesn't write additional metadata regarding partitioning, when
it reads parquet files it determines partitions using min/max values.
That means that if you want for example to partition using the first
letter, you'll need to create a corresponding column. Or you can create
partitions manually as directories.

On Wed, Dec 5, 2018 at 10:07 PM <be...@laposte.net.invalid> wrote:

> In would like to create a parquet with a partition on computed data
> (without to have to put the result of the computation in the parquet) :
> The goal is to optimize the parquet for typical expecting queries.
>
> Imaginary example :
> CREATE TABLE `mytable`
> PARTITION BY (substr(name,1,1)) AS
> SELECT name, birthdate, birthcity
> ORDER BY bithdate;
>
> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
> in the SELECT list of CTAS
>
> And the comment of the code of the function
> "public static RelNode qualifyPartitionCol(RelNode input, List<String>
> partitionColumns)"
> confirms that it's not possible actually :
> " A partition column is resolved, either (1) the same column appear in the
> select list of CTAS or (2) CTAS has a * in select list"
>
> But what is the reason of this limitation ?
> Is there exists any tricks to do it right now, or can we expect an
> evolution to allow this possibilities.
>
> I just imagine to do (with the data of the example)
> CREATE TABLE `mytable`
> PARTITION BY (sname) AS
> SELECT substr(name,1,1) sname, name, birthdate, birthcity
> ORDER BY bithdate;
> Then, next, request each partition file to remove the useless data
> , like
> CREATE TABLE `mytable_2/partition_x`
> SELECT name, birthdate, birthcity
> ORDER BY bithdate;
> but it's not really satisfying...
>
> I would appreciate yours comments,
> Regards,
>
> benj
>


-- 
Sincerely, Anton Gozhiy
anton5813@gmail.com