Posted to dev@drill.apache.org by Julian Hyde <ju...@gmail.com> on 2013/01/29 00:40:28 UTC

Purpose of Scan.selection and Operator.ref?

Hello drillers,

I'm still puzzling over the purpose of the "selection" attribute of the "Scan" operator and the "ref" attribute of various operators such as "Scan", "Transform", "Group".

I notice that "selection" is not used (which is good, since there is no "activity" attribute in donuts.json).

I understand that "ref" chooses the output expression(s) of each operator, and see those expressions are necessary. But I don't understand why every "ref" in simple_plan.json is prefixed with "donuts".

My understanding is that each operator's input and output is a JSON array. The elements of that array (the "rows" in SQL parlance) are usually JSON objects (i.e. records with named fields) but might sometimes be scalars or arrays.

The output of the "aggregate" operator in simple_plan.json would be something like

[
  {
    "donuts": {
      "sales" : 1099.22,
      "typeCount" : 1,
      "quantity" : 10000,
      "ppu" : 0.11
    }
  },
  {
    "donuts": {
      "sales" : 109.71,
      "typeCount" : 2,
      "quantity" : 159,
      "ppu" : 0.69
    }
  },
  {
    "donuts": {
      "sales" : 184.25,
      "typeCount" : 2,
      "quantity" : 335,
      "ppu" : 0.55
    }
  }
]

The output is a list of objects, each of which has just one field "donuts", whose value is an object. The only purpose of the "donuts" prefix is to increase the nesting level. And other operators do the same thing. It would seem to me more natural to just use one level of nesting:

[
  {
    "sales" : 1099.22,
    "typeCount" : 1,
    "quantity" : 10000,
    "ppu" : 0.11
  },
  ...
]

Of course it's not wrong to do this, but I wanted to ask why someone would choose an extra level of nesting. Or to check whether my understanding was wrong. (I'm pondering how to make a SQL front-end generate something like simple_plan.json and right now I can see no reason why it would generate ref values with a "donuts." prefix.)

Is the intent of "selection" to remove a level of nesting when reading a source?

Julian

Re: Purpose of Scan.selection and Operator.ref?

Posted by Jacques Nadeau <ja...@gmail.com>.
On Mon, Jan 28, 2013 at 3:40 PM, Julian Hyde <ju...@gmail.com> wrote:

> Hello drillers,
>
> I'm still puzzling over the purpose of the "selection" attribute of the
> "Scan" operator and the "ref" attribute of various operators such as
> "Scan", "Transform", "Group".
>
> I notice that "selection" is not used (which is good, since there is no
> "activity" attribute in donuts.json).
>

Selection's purpose is to carry the portions of the query that are pushed
down to the storage engine.  An example might be a table name in HBase.
In the case of files, no selection criteria are necessary, so selection
should be an empty object {}.
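
As a rough illustration of that division of labour, here is a minimal
Python sketch.  It is not Drill code: the TableEngine and FileEngine
classes, the scan helper and the "table" key inside the selection are all
hypothetical.  The only ideas taken from the explanation above are that the
storage engine interprets the selection and that a file source gets {}.

```python
# Hypothetical sketch: these classes are illustrative, not Drill's storage engine API.

class TableEngine:
    """Stands in for a store such as HBase that holds several tables."""

    def __init__(self, tables):
        self.tables = tables

    def read(self, selection):
        # The engine, not the scan operator, interprets the selection.
        # The "table" key is an assumption made for this sketch.
        return self.tables[selection["table"]]


class FileEngine:
    """Stands in for a single file; there is nothing to subdivide."""

    def __init__(self, records):
        self.records = records

    def read(self, selection):
        # An empty selection {} is expected here and is simply ignored.
        return self.records


def scan(engine, selection, ref):
    """Read the selected portion of the source and nest each record under ref."""
    return [{ref: record} for record in engine.read(selection)]


store = TableEngine({"donuts": [{"ppu": 0.55}], "activity": [{"user": "a"}]})
from_table = scan(store, {"table": "donuts"}, "donuts")
from_file = scan(FileEngine([{"ppu": 0.69}]), {}, "donuts")
print(from_table, from_file)
```
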


>
> I understand that "ref" chooses the output expression(s) of each operator,
> and see those expressions are necessary. But I don't understand why every
> "ref" in simple_plan.json is prefixed with "donuts".
>
> My understanding is that each operator's input and output is a JSON array.
> The elements of that array (the "rows" in SQL parlance) are usually JSON
> objects (i.e. records with named fields) but might sometimes be scalars or
> arrays.
>
> The output of the "aggregate" operator in simple_plan.json would be
> something like
>
> [
>   {
>     "donuts": {
>       "sales" : 1099.22,
>       "typeCount" : 1,
>       "quantity" : 10000,
>       "ppu" : 0.11
>     }
>   },
>   {
>     "donuts": {
>       "sales" : 109.71,
>       "typeCount" : 2,
>       "quantity" : 159,
>       "ppu" : 0.69
>     }
>   },
>   {
>     "donuts": {
>       "sales" : 184.25,
>       "typeCount" : 2,
>       "quantity" : 335,
>       "ppu" : 0.55
>     }
>   }
> ]
>
> The output is a list of objects, each of which has just one field
> "donuts", whose value is an object. The only purpose of the "donuts" prefix
> is to increase the nesting level. And other operators do the same thing. It
> would seem to me more natural to just use one level of nesting:
>
> [
>   {
>     "sales" : 1099.22,
>     "typeCount" : 1,
>     "quantity" : 10000,
>     "ppu" : 0.11
>   },
>   ...
> ]
>
> Of course it's not wrong to do this, but I wanted to ask why someone would
> choose an extra level of nesting. Or to check whether my understanding was
> wrong. (I'm pondering how to make a SQL front-end generate something like
> simple_plan.json and right now I can see no reason why it would generate
> ref values with a "donuts." prefix.)
>
>
The nesting happens at the scan level.  Its purpose is to manage
namespaces, much like table names in traditional SQL.  The best example of
its use is when two scan sources are joined: the two sets of fields can
still be referenced separately after the join.
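
To make the join case concrete, here is a minimal Python sketch, not Drill
code.  The scan and join helpers, the second "orders" source and all of the
field values are made up for illustration; the only thing taken from the
plan is the idea that a scan nests each record under its ref.

```python
# Hypothetical sketch: scan() and join() are illustrative helpers, not Drill APIs.

def scan(records, ref):
    """Nest every source record under the scan's ref, e.g. {"donuts": {...}}."""
    return [{ref: record} for record in records]


def join(left, right, left_key, right_key):
    """Naive nested-loop equi-join; each key is a (namespace, field) pair."""
    joined = []
    for l in left:
        for r in right:
            if l[left_key[0]][left_key[1]] == r[right_key[0]][right_key[1]]:
                # The namespaces differ, so merging cannot clobber fields.
                joined.append({**l, **r})
    return joined


# Made-up data: both sources have an "id" field.
donuts = scan([{"id": 1, "ppu": 0.55}, {"id": 2, "ppu": 0.69}], "donuts")
orders = scan([{"id": 10, "donutId": 1, "quantity": 335}], "orders")

rows = join(donuts, orders, ("donuts", "id"), ("orders", "donutId"))
print(rows)
```

Both "id" fields survive the join because each one sits under its own
prefix; with flat records, one would have overwritten the other.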

The original aggregation example in donuts/simple_plan kept the donuts
name so that everything was held in the same namespace.  There is no
requirement for that.  It could just as well have written everything to
the root-level namespace.


> Is the intent of "selection" to remove a level of nesting when reading a
> source?
>

The project operator is what is used to remove a level of nesting.  As
mentioned above, the selection is for subdividing a data source.
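
As a small illustration of what removing a level of nesting means for the
records in your example, here is a Python sketch.  The project function is
a hypothetical stand-in, not the actual operator or its plan syntax.

```python
# Hypothetical sketch: project() is an illustrative helper, not Drill's operator.

def project(rows, namespace):
    """Replace each row with the object stored under `namespace`."""
    return [row[namespace] for row in rows]


# Two of the records from the aggregate output quoted above.
nested = [
    {"donuts": {"sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69}},
    {"donuts": {"sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55}},
]

flat = project(nested, "donuts")
print(flat[0])
```
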



> Julian