You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Piotr Sokólski <pi...@gmail.com> on 2015/05/23 17:13:49 UTC

Estimating cost for fs tables, table names after joins and executing logical plans

Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few questions/issues:

1. For query cost estimation one usually needs some additional information about a table such as the number of rows. Is the cost estimation implemented for fs sources as well? If yes, how is the metadata extracted and cached? From my understanding some formats like parquet store it in the file footer, but what about json or csv files? Can this information be queried/retrieved somehow by the user?

2. I’ve been working with the following query:

$q = select * from region join nation on region.R_REGIONKEY = nation.N_REGIONKEY;

where region and nation are the sample data files imported into a dfs.tmp schema.
running queries like

select R_REGIONKEY from ($q);

results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries like select R_REGIONKEY from (SELECT * FROM region); work fine, as well as saving the result of the join with CREATE TABLE and then replacing $q with the saved table’s name. Why is that and what are the rules for renaming columns in join queries?

3. I’ve been trying to execute a logical plan using the web interface. It works fine with a simple scan - project query, but when trying to use the output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws the following error:

SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting property-based creators: already had [constructor for org.apache.drill.common.logical.data.Join, ...

the whole logical query and full error message are at https://gist.github.com/pyetras/bf625b6697de62284996

4. What are the supported conditions for joins? The sql interface seems to support only (e1 == e2 [AND])*, but the logical operator reference at https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7 mentions other relations and also cartesian joins. Are those simply not implemented for the sql parser or not supported in Drill at all?

Sorry for the long read and thanks for your assistance,

--
Piotr Sokólski

Re: Estimating cost for fs tables, table names after joins and executing logical plans

Posted by Aman Sinha <as...@maprtech.com>.

correction: 4 (b) query should have a non-equality condition instead of
equality:
   select n1.n_nationkey from nation n1, (select min(n_regionkey) as
min_regionkey from nation n) n2
               where n1.n_regionkey < n2.min_regionkey;

On Sat, May 23, 2015 at 10:28 AM, Aman Sinha <as...@maprtech.com> wrote:

> 1.  Yes, certainly the cost estimation is applied for file system sources
> as well.  For Parquet the row count is
>       extracted from the metadata available in the files.  For text (csv,
> json), this is estimated based on the size
>       of the files and a configurable
> *store.text.estimated_row_size_bytes* parameter.   The default value is
> 100
>       bytes per row, but you can change it.
>
> 2.  There are 2 things in your query that need special handling in Drill
> because it does not rely on a centralized
>       schema.
>       First, the presence of '*' wildcard in the SELECT list - the '*' is
> expanded at run-time unlike the schema-based
>      systems where it is expanded at the initial query planning time.
> Second, whenever there is a join between 2
>       or more tables, the columns in the query need to be qualified by the
> table name because a column  could
>       potentially be present in either of the tables.
>       The following query which explicitly specifies the qualified columns
> in SELECT clause should work:
>        select r_regionkey from (
>            select r.r_regionkey, n.n_regionkey from region r join nation n
> on r.r_regionkey = n.n_regionkey)
>
>  3.  You might want to file a JIRA with the exact reproducible test case.
>  4.   I believe the link you sent for the plan document is quite old.  The
> best place for the latest documentation is
>        drill.apache.org .
>        Non-equality joins are supported in 2 situations:
>         a)  if there is another equality join between the same 2 tables.
>  e.g
>            select n1.n_nationkey from nation n1, nation n2
>                where n1.n_nationkey = n2.n_nationkey and n1.n_regionkey <
> n2.n_regionkey
>         b) if one side of the join is a scalar subquery:
>             select n1.n_nationkey from nation n1, (select min(n_regionkey)
> as min_regionkey from nation n) n2
>                where n1.n_regionkey = n2.min_regionkey;
>
>          Pure cartesian joins are currently not supported.
>
> Aman
>
>
> On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <pi...@gmail.com>
> wrote:
>
>> Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
>> questions/issues:
>>
>> 1. For query cost estimation one usually needs some additional
>> information about a table such as the number of rows. Is the cost
>> estimation implemented for fs sources as well? If yes, how is the metadata
>> extracted and cached? From my understanding some formats like parquet store
>> it in the file footer, but what about json or csv files? Can this
>> information be queried/retrieved somehow by the user?
>>
>> 2. I’ve been working with the following query:
>>
>> $q = select * from region join nation on region.R_REGIONKEY =
>> nation.N_REGIONKEY;
>>
>> where region and nation are the sample data files imported into a dfs.tmp
>> schema.
>> running queries like
>>
>> select R_REGIONKEY from ($q);
>>
>> results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries
>> like select R_REGIONKEY from (SELECT * FROM region); work fine, as well as
>> saving the result of the join with CREATE TABLE and then replacing $q with
>> the saved table’s name. Why is that and what are the rules for renaming
>> columns in join queries?
>>
>> 3. I’ve been trying to execute a logical plan using the web interface. It
>> works fine with a simple scan - project query, but when trying to use the
>> output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
>> the following error:
>>
>> SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
>> property-based creators: already had [constructor for
>> org.apache.drill.common.logical.data.Join,  ...
>>
>> the whole logical query and full error message are at
>> https://gist.github.com/pyetras/bf625b6697de62284996
>>
>> 4. What are the supported conditions for joins? The sql interface seems
>> to support only (e1 == e2 [AND])*, but the logical operator reference at
>> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
>> mentions other relations and also cartesian joins. Are those simply not
>> implemented for the sql parser or not supported in Drill at all?
>>
>> Sorry for the long read and thanks for your assistance,
>>
>> --
>> Piotr Sokólski
>>
>>
>

Re: Estimating cost for fs tables, table names after joins and executing logical plans

Posted by Jacques Nadeau <ja...@apache.org>.

One other note:

The default explain plan is a physical plan.  If you want to get a logical
plan,  you need to do EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR ...

If you're still getting the error around plan consumption, please file a
jira.
On May 23, 2015 10:28 AM, "Aman Sinha" <as...@maprtech.com> wrote:

> 1.  Yes, certainly the cost estimation is applied for file system sources
> as well.  For Parquet the row count is
>       extracted from the metadata available in the files.  For text (csv,
> json), this is estimated based on the size
>       of the files and a configurable *store.text.estimated_row_size_bytes*
> parameter.   The default value is 100
>       bytes per row, but you can change it.
>
> 2.  There are 2 things in your query that need special handling in Drill
> because it does not rely on a centralized
>       schema.
>       First, the presence of '*' wildcard in the SELECT list - the '*' is
> expanded at run-time unlike the schema-based
>      systems where it is expanded at the initial query planning time.
> Second, whenever there is a join between 2
>       or more tables, the columns in the query need to be qualified by the
> table name because a column  could
>       potentially be present in either of the tables.
>       The following query which explicitly specifies the qualified columns
> in SELECT clause should work:
>        select r_regionkey from (
>            select r.r_regionkey, n.n_regionkey from region r join nation n
> on r.r_regionkey = n.n_regionkey)
>
>  3.  You might want to file a JIRA with the exact reproducible test case.
>  4.   I believe the link you sent for the plan document is quite old.  The
> best place for the latest documentation is
>        drill.apache.org .
>        Non-equality joins are supported in 2 situations:
>         a)  if there is another equality join between the same 2 tables.
>  e.g
>            select n1.n_nationkey from nation n1, nation n2
>                where n1.n_nationkey = n2.n_nationkey and n1.n_regionkey <
> n2.n_regionkey
>         b) if one side of the join is a scalar subquery:
>             select n1.n_nationkey from nation n1, (select min(n_regionkey)
> as min_regionkey from nation n) n2
>                where n1.n_regionkey = n2.min_regionkey;
>
>          Pure cartesian joins are currently not supported.
>
> Aman
>
>
> On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <pi...@gmail.com>
> wrote:
>
> > Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
> > questions/issues:
> >
> > 1. For query cost estimation one usually needs some additional
> information
> > about a table such as the number of rows. Is the cost estimation
> > implemented for fs sources as well? If yes, how is the metadata extracted
> > and cached? From my understanding some formats like parquet store it in
> the
> > file footer, but what about json or csv files? Can this information be
> > queried/retrieved somehow by the user?
> >
> > 2. I’ve been working with the following query:
> >
> > $q = select * from region join nation on region.R_REGIONKEY =
> > nation.N_REGIONKEY;
> >
> > where region and nation are the sample data files imported into a dfs.tmp
> > schema.
> > running queries like
> >
> > select R_REGIONKEY from ($q);
> >
> > results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries
> > like select R_REGIONKEY from (SELECT * FROM region); work fine, as well
> as
> > saving the result of the join with CREATE TABLE and then replacing $q
> with
> > the saved table’s name. Why is that and what are the rules for renaming
> > columns in join queries?
> >
> > 3. I’ve been trying to execute a logical plan using the web interface. It
> > works fine with a simple scan - project query, but when trying to use the
> > output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
> > the following error:
> >
> > SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
> > property-based creators: already had [constructor for
> > org.apache.drill.common.logical.data.Join,  ...
> >
> > the whole logical query and full error message are at
> > https://gist.github.com/pyetras/bf625b6697de62284996
> >
> > 4. What are the supported conditions for joins? The sql interface seems
> to
> > support only (e1 == e2 [AND])*, but the logical operator reference at
> >
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
> > mentions other relations and also cartesian joins. Are those simply not
> > implemented for the sql parser or not supported in Drill at all?
> >
> > Sorry for the long read and thanks for your assistance,
> >
> > --
> > Piotr Sokólski
> >
> >
>

Re: Estimating cost for fs tables, table names after joins and executing logical plans

Posted by Aman Sinha <as...@maprtech.com>.

1.  Yes, certainly the cost estimation is applied for file system sources
as well.  For Parquet the row count is
      extracted from the metadata available in the files.  For text (csv,
json), this is estimated based on the size
      of the files and a configurable *store.text.estimated_row_size_bytes*
parameter.   The default value is 100
      bytes per row, but you can change it.

2.  There are 2 things in your query that need special handling in Drill
because it does not rely on a centralized
      schema.
      First, the presence of '*' wildcard in the SELECT list - the '*' is
expanded at run-time unlike the schema-based
     systems where it is expanded at the initial query planning time.
Second, whenever there is a join between 2
      or more tables, the columns in the query need to be qualified by the
table name because a column  could
      potentially be present in either of the tables.
      The following query which explicitly specifies the qualified columns
in SELECT clause should work:
       select r_regionkey from (
           select r.r_regionkey, n.n_regionkey from region r join nation n
on r.r_regionkey = n.n_regionkey)

 3.  You might want to file a JIRA with the exact reproducible test case.
 4.   I believe the link you sent for the plan document is quite old.  The
best place for the latest documentation is
       drill.apache.org .
       Non-equality joins are supported in 2 situations:
        a)  if there is another equality join between the same 2 tables.
 e.g
           select n1.n_nationkey from nation n1, nation n2
               where n1.n_nationkey = n2.n_nationkey and n1.n_regionkey <
n2.n_regionkey
        b) if one side of the join is a scalar subquery:
            select n1.n_nationkey from nation n1, (select min(n_regionkey)
as min_regionkey from nation n) n2
               where n1.n_regionkey = n2.min_regionkey;

         Pure cartesian joins are currently not supported.

Aman


On Sat, May 23, 2015 at 8:13 AM, Piotr Sokólski <pi...@gmail.com>
wrote:

> Hi, I’ve been playing a bit with v1.0.0 and stumbled upon a few
> questions/issues:
>
> 1. For query cost estimation one usually needs some additional information
> about a table such as the number of rows. Is the cost estimation
> implemented for fs sources as well? If yes, how is the metadata extracted
> and cached? From my understanding some formats like parquet store it in the
> file footer, but what about json or csv files? Can this information be
> queried/retrieved somehow by the user?
>
> 2. I’ve been working with the following query:
>
> $q = select * from region join nation on region.R_REGIONKEY =
> nation.N_REGIONKEY;
>
> where region and nation are the sample data files imported into a dfs.tmp
> schema.
> running queries like
>
> select R_REGIONKEY from ($q);
>
> results in an error "Column 'R_REGIONKEY' is ambiguous”. However queries
> like select R_REGIONKEY from (SELECT * FROM region); work fine, as well as
> saving the result of the join with CREATE TABLE and then replacing $q with
> the saved table’s name. Why is that and what are the rules for renaming
> columns in join queries?
>
> 3. I’ve been trying to execute a logical plan using the web interface. It
> works fine with a simple scan - project query, but when trying to use the
> output of EXPLAIN … FOR $q (with resultMode changed to “EXEC”) it throws
> the following error:
>
> SYSTEM ERROR: java.lang.IllegalArgumentException: Conflicting
> property-based creators: already had [constructor for
> org.apache.drill.common.logical.data.Join,  ...
>
> the whole logical query and full error message are at
> https://gist.github.com/pyetras/bf625b6697de62284996
>
> 4. What are the supported conditions for joins? The sql interface seems to
> support only (e1 == e2 [AND])*, but the logical operator reference at
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/mobilebasic?pli=1#cmnt7
> mentions other relations and also cartesian joins. Are those simply not
> implemented for the sql parser or not supported in Drill at all?
>
> Sorry for the long read and thanks for your assistance,
>
> --
> Piotr Sokólski
>
>