You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Christopher Matta <cm...@mapr.com> on 2015/03/27 18:49:21 UTC

Trying to wrap my head around using WHERE with flatten

I’m using the yelp academic data set <https://www.yelp.com/academic_dataset>
for my testing purposes.

Say I want to list all businesses in the “Pets” category, this query
doesn’t work:

0: jdbc:drill:zk=sen11:5181,sen12:5181> select `name`,
flatten(`categories`) from
maprfs.cmatta.`yelp/yelp_academic_dataset_business.json` WHERE
flatten(`categories`) = 'Pets' limit 10;
Query failed: Query stopped., Failure while trying to materialize
incoming schema.  Errors:

Error in expression at index -1.  Error: Missing function
implementation: [flatten(VARCHAR-REPEATED)].  Full expression:
--UNKNOWN EXPRESSION--.. [ daf9f62a-f39b-4745-a0e3-0160875f5cb7 on
se-node13.se.lab:31010 ]

Error: exception while executing query: Failure while executing query.
(state=,code=0)

But this query with a sub-select does work:

0: jdbc:drill:zk=sen11:5181,sen12:5181> select flat.`name`,
flat.`categories` from (select `name`, flatten(`categories`) as
categories from
maprfs.cmatta.`yelp/yelp_academic_dataset_business.json`) as flat
WHERE lower(flat.`categories`) = 'pets' limit 10;
+------------+------------+
|    name    | categories |
+------------+------------+
| Loving Hands Pet Care | Pets       |
| Amec Mid-City Animal Hospital | Pets       |
| PetSmart   | Pets       |
| A Dog's Life Photography | Pets       |
| Goober Pet Direct | Pets       |
| Pet Planet | Pets       |
| All-Star Animal Hospital | Pets       |
| Team Canine, Inc | Pets       |
| Foothills Pet Resort | Pets       |
| Petco      | Pets       |
+------------+------------+
10 rows selected (0.141 seconds)

Now that I can pick out which businesses are in the “Pets” category, I may
want to get the average star rating and order them by the number of
reviews. However, using the above technique in an aggregation query doesn’t
seem to work:

0: jdbc:drill:zk=sen11:5181,sen12:5181> select
. . . . . . . . . . . . . . . . . . . >        a.`name`,
. . . . . . . . . . . . . . . . . . . >        count(b.`text`),
. . . . . . . . . . . . . . . . . . . >        avg(b.`stars`)
. . . . . . . . . . . . . . . . . . . >    FROM (
. . . . . . . . . . . . . . . . . . . >      SELECT `business_id`,
`name`, flatten(`categories`) as `category`
. . . . . . . . . . . . . . . . . . . >      FROM
maprfs.`cmatta`.`yelp/yelp_academic_dataset_business.json`) a
. . . . . . . . . . . . . . . . . . . >    JOIN
maprfs.`cmatta`.`yelp/yelp_academic_dataset_review.json` b
. . . . . . . . . . . . . . . . . . . >        ON a.`business_id` =
b.`business_id`
. . . . . . . . . . . . . . . . . . . >    WHERE a.`category` = 'Pets'
. . . . . . . . . . . . . . . . . . . >    GROUP BY a.`name`
. . . . . . . . . . . . . . . . . . . >    ORDER BY count(*) DESC
. . . . . . . . . . . . . . . . . . . >    LIMIT 10;
Query failed: Query failed: Failure while running fragment., Failure
while trying to materialize incoming schema.  Errors:

Error in expression at index 2.  Error: Missing function
implementation: [flatten(VARCHAR-REPEATED)].  Full expression: null..
[ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]
[ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]

Error: exception while executing query: Failure while executing query.
(state=,code=0)

What does *Error: Missing function implementation:
[flatten(VARCHAR-REPEATED)]* in this context? Am I going about this the
wrong way?

Chris Matta
cmatta@mapr.com
215-701-3146

Re: Trying to wrap my head around using WHERE with flatten

Posted by Jason Altekruse <al...@gmail.com>.

The error message here indicates a planning issue. For query parsing and a
portion of validation and planning, flatten is treated like a standard
function. The actual implementation of flatten is a physical operator,
because Drill does not yet support the concept of table functions, an
interface for functions to produce a table rather than just a scalar value
as a result. To plan the flatten operation we have a rewrite rule that
takes a project operation with a reference to the flatten 'function' and
rewrites it to the flatten operator followed by a new project. What this
error message means is that this rule as not properly applied, in the
project operator it tried looking up a function called flatten that takes a
repeated type as input, but this function does not exist.

If you are working with 0.7, I would recommend trying to run the query on
the 0.8 release candidate currently up for vote on the list. I know there
were a couple of planning fixes related to flatten that may address this
issue. If you still see the problem after upgrading, or you are already
using one of the 0.8 snapshot builds you can go ahead and file a JIRA for
this issue.

I will file a JIRA myself for producing a better error message if this kind
of planning issue comes up the the future.

Thanks,
Jason Altekruse

On Fri, Mar 27, 2015 at 11:01 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> I would recommend to not use a count(*) but rather pick a column to use
> for count.
>
> The missing function implementation is interesting though, would be good
> to understand exactly what that means.
>
> —Andries
>
>
> On Mar 27, 2015, at 10:49 AM, Christopher Matta <cm...@mapr.com> wrote:
>
> > I’m using the yelp academic data set <
> https://www.yelp.com/academic_dataset>
> > for my testing purposes.
> >
> > Say I want to list all businesses in the “Pets” category, this query
> > doesn’t work:
> >
> > 0: jdbc:drill:zk=sen11:5181,sen12:5181> select `name`,
> > flatten(`categories`) from
> > maprfs.cmatta.`yelp/yelp_academic_dataset_business.json` WHERE
> > flatten(`categories`) = 'Pets' limit 10;
> > Query failed: Query stopped., Failure while trying to materialize
> > incoming schema.  Errors:
> >
> > Error in expression at index -1.  Error: Missing function
> > implementation: [flatten(VARCHAR-REPEATED)].  Full expression:
> > --UNKNOWN EXPRESSION--.. [ daf9f62a-f39b-4745-a0e3-0160875f5cb7 on
> > se-node13.se.lab:31010 ]
> >
> > Error: exception while executing query: Failure while executing query.
> > (state=,code=0)
> >
> > But this query with a sub-select does work:
> >
> > 0: jdbc:drill:zk=sen11:5181,sen12:5181> select flat.`name`,
> > flat.`categories` from (select `name`, flatten(`categories`) as
> > categories from
> > maprfs.cmatta.`yelp/yelp_academic_dataset_business.json`) as flat
> > WHERE lower(flat.`categories`) = 'pets' limit 10;
> > +------------+------------+
> > |    name    | categories |
> > +------------+------------+
> > | Loving Hands Pet Care | Pets       |
> > | Amec Mid-City Animal Hospital | Pets       |
> > | PetSmart   | Pets       |
> > | A Dog's Life Photography | Pets       |
> > | Goober Pet Direct | Pets       |
> > | Pet Planet | Pets       |
> > | All-Star Animal Hospital | Pets       |
> > | Team Canine, Inc | Pets       |
> > | Foothills Pet Resort | Pets       |
> > | Petco      | Pets       |
> > +------------+------------+
> > 10 rows selected (0.141 seconds)
> >
> > Now that I can pick out which businesses are in the “Pets” category, I
> may
> > want to get the average star rating and order them by the number of
> > reviews. However, using the above technique in an aggregation query
> doesn’t
> > seem to work:
> >
> > 0: jdbc:drill:zk=sen11:5181,sen12:5181> select
> > . . . . . . . . . . . . . . . . . . . >        a.`name`,
> > . . . . . . . . . . . . . . . . . . . >        count(b.`text`),
> > . . . . . . . . . . . . . . . . . . . >        avg(b.`stars`)
> > . . . . . . . . . . . . . . . . . . . >    FROM (
> > . . . . . . . . . . . . . . . . . . . >      SELECT `business_id`,
> > `name`, flatten(`categories`) as `category`
> > . . . . . . . . . . . . . . . . . . . >      FROM
> > maprfs.`cmatta`.`yelp/yelp_academic_dataset_business.json`) a
> > . . . . . . . . . . . . . . . . . . . >    JOIN
> > maprfs.`cmatta`.`yelp/yelp_academic_dataset_review.json` b
> > . . . . . . . . . . . . . . . . . . . >        ON a.`business_id` =
> > b.`business_id`
> > . . . . . . . . . . . . . . . . . . . >    WHERE a.`category` = 'Pets'
> > . . . . . . . . . . . . . . . . . . . >    GROUP BY a.`name`
> > . . . . . . . . . . . . . . . . . . . >    ORDER BY count(*) DESC
> > . . . . . . . . . . . . . . . . . . . >    LIMIT 10;
> > Query failed: Query failed: Failure while running fragment., Failure
> > while trying to materialize incoming schema.  Errors:
> >
> > Error in expression at index 2.  Error: Missing function
> > implementation: [flatten(VARCHAR-REPEATED)].  Full expression: null..
> > [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]
> > [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]
> >
> > Error: exception while executing query: Failure while executing query.
> > (state=,code=0)
> >
> > What does *Error: Missing function implementation:
> > [flatten(VARCHAR-REPEATED)]* in this context? Am I going about this the
> > wrong way?
> >
> > Chris Matta
> > cmatta@mapr.com
> > 215-701-3146
> > 
>
>

Re: Trying to wrap my head around using WHERE with flatten

Posted by Andries Engelbrecht <ae...@maprtech.com>.

I would recommend to not use a count(*) but rather pick a column to use for count.

The missing function implementation is interesting though, would be good to understand exactly what that means.

—Andries


On Mar 27, 2015, at 10:49 AM, Christopher Matta <cm...@mapr.com> wrote:

> I’m using the yelp academic data set <https://www.yelp.com/academic_dataset>
> for my testing purposes.
> 
> Say I want to list all businesses in the “Pets” category, this query
> doesn’t work:
> 
> 0: jdbc:drill:zk=sen11:5181,sen12:5181> select `name`,
> flatten(`categories`) from
> maprfs.cmatta.`yelp/yelp_academic_dataset_business.json` WHERE
> flatten(`categories`) = 'Pets' limit 10;
> Query failed: Query stopped., Failure while trying to materialize
> incoming schema.  Errors:
> 
> Error in expression at index -1.  Error: Missing function
> implementation: [flatten(VARCHAR-REPEATED)].  Full expression:
> --UNKNOWN EXPRESSION--.. [ daf9f62a-f39b-4745-a0e3-0160875f5cb7 on
> se-node13.se.lab:31010 ]
> 
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> 
> But this query with a sub-select does work:
> 
> 0: jdbc:drill:zk=sen11:5181,sen12:5181> select flat.`name`,
> flat.`categories` from (select `name`, flatten(`categories`) as
> categories from
> maprfs.cmatta.`yelp/yelp_academic_dataset_business.json`) as flat
> WHERE lower(flat.`categories`) = 'pets' limit 10;
> +------------+------------+
> |    name    | categories |
> +------------+------------+
> | Loving Hands Pet Care | Pets       |
> | Amec Mid-City Animal Hospital | Pets       |
> | PetSmart   | Pets       |
> | A Dog's Life Photography | Pets       |
> | Goober Pet Direct | Pets       |
> | Pet Planet | Pets       |
> | All-Star Animal Hospital | Pets       |
> | Team Canine, Inc | Pets       |
> | Foothills Pet Resort | Pets       |
> | Petco      | Pets       |
> +------------+------------+
> 10 rows selected (0.141 seconds)
> 
> Now that I can pick out which businesses are in the “Pets” category, I may
> want to get the average star rating and order them by the number of
> reviews. However, using the above technique in an aggregation query doesn’t
> seem to work:
> 
> 0: jdbc:drill:zk=sen11:5181,sen12:5181> select
> . . . . . . . . . . . . . . . . . . . >        a.`name`,
> . . . . . . . . . . . . . . . . . . . >        count(b.`text`),
> . . . . . . . . . . . . . . . . . . . >        avg(b.`stars`)
> . . . . . . . . . . . . . . . . . . . >    FROM (
> . . . . . . . . . . . . . . . . . . . >      SELECT `business_id`,
> `name`, flatten(`categories`) as `category`
> . . . . . . . . . . . . . . . . . . . >      FROM
> maprfs.`cmatta`.`yelp/yelp_academic_dataset_business.json`) a
> . . . . . . . . . . . . . . . . . . . >    JOIN
> maprfs.`cmatta`.`yelp/yelp_academic_dataset_review.json` b
> . . . . . . . . . . . . . . . . . . . >        ON a.`business_id` =
> b.`business_id`
> . . . . . . . . . . . . . . . . . . . >    WHERE a.`category` = 'Pets'
> . . . . . . . . . . . . . . . . . . . >    GROUP BY a.`name`
> . . . . . . . . . . . . . . . . . . . >    ORDER BY count(*) DESC
> . . . . . . . . . . . . . . . . . . . >    LIMIT 10;
> Query failed: Query failed: Failure while running fragment., Failure
> while trying to materialize incoming schema.  Errors:
> 
> Error in expression at index 2.  Error: Missing function
> implementation: [flatten(VARCHAR-REPEATED)].  Full expression: null..
> [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]
> [ c28a76ef-ef20-426a-a47e-908e9b68b875 on se-node13.se.lab:31010 ]
> 
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> 
> What does *Error: Missing function implementation:
> [flatten(VARCHAR-REPEATED)]* in this context? Am I going about this the
> wrong way?
> 
> Chris Matta
> cmatta@mapr.com
> 215-701-3146
>